Text Embedding in Ruby
Text embeddings are the building block behind “semantic” features: search that works even when users don’t type the exact keywords, clustering similar documents, and quick similarity checks.
In this post, we’ll run Google’s EmbeddingGemma 300M locally in Ruby using ONNX Runtime (native) and Hugging Face Tokenizers (native). The goal isn’t to re-teach Ruby basics - it’s to show the few places people usually get surprised: model lifecycle/memory, task-specific prompting, and the small bits of vector math that matter in production.
One quick note before we start: this isn’t a “Ruby vs Python vs $LANG” post.
- Python is excellent for research and experimentation.
- Java/Go/Node are common in production services.
- Ruby shines when you want to ship product fast with a small surface area.
ONNX is useful precisely because it’s language-agnostic: you can keep the model format stable and choose the runtime that fits your app. Here we’re optimizing for a very specific scenario: you already have a Ruby codebase (probably with Rails, Sidekiq workers etc.) and you want embeddings without adding a second stack.
About EmbeddingGemma
EmbeddingGemma is Google’s state-of-the-art embedding model built from the Gemma 3 family. It’s designed for a wide range of semantic tasks including retrieval, classification, clustering, and similarity matching.
Key characteristics:
- 300M parameters - Compact enough for production deployment
- 768-dimensional embeddings - Rich semantic representations
- 100+ languages supported - Multilingual capabilities
- Maximum input context length of 2048 tokens
- Open license - Provided with open weights for responsible commercial use (Gemma License)
What You’ll Build
By the end of this guide, you’ll have a working Ruby implementation that can:
- Convert text to 768-dimensional semantic embeddings
- Compute similarity between documents
- Rank documents by semantic relevance
- Reduce embedding dimensions efficiently (Matryoshka Representation Learning)
- Handle batch processing with automatic padding
Understanding the Components
Let’s break down what we need to build:
┌─────────────────────────────────────────────────────────────────┐
│ Your Application │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ EmbeddingGemma Class │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────────┐ │
│ │ Model │ │ Tokenizer │ │ Similarity & Ranking │ │
│ │ Loader │ │ Wrapper │ │ Functions │ │
│ └─────────────┘ └──────────────┘ └────────────────────────┘ │
└───────────────────────────────┬─────────────────────────────────┘
│
┌───────────┴───────────┐
▼ ▼
┌───────────────────┐ ┌─────────────────────┐
│ ONNX Runtime │ │ HuggingFace │
│ (C++ bindings) │ │ Tokenizers (Rust) │
└───────────────────┘ └─────────────────────┘
The rest of this article focuses on the “thin waist” between Ruby and native ML libraries:
- `onnxruntime` (ankane/onnxruntime-ruby) - ONNX Runtime bindings
- `tokenizers` (ankane/tokenizers-ruby) - Hugging Face Tokenizers bindings
We’ll build this layer by layer, starting with the foundation.
Step 1: Setting Up the Project
First, create a new directory for your project and set up the gem dependencies. If you’re adding this to an existing app, the only thing that really matters is: these gems include native components. That usually means you want reproducible versions (lockfile) and a CI image that can compile native extensions.
mkdir ruby-embeddings
cd ruby-embeddings
# Initialize bundle
bundle init
# Add required gems
bundle add onnxruntime tokenizers
bundle add minitest --group=test
Your Gemfile should now look like:
# frozen_string_literal: true
source "https://rubygems.org"
gem "onnxruntime", "~> 0.10.1"
gem "tokenizers", "~> 0.6.3"
gem "minitest", "~> 6.0", group: :test
If you hit build errors installing native gems, it’s not “Ruby being bad” - it’s the reality of binding to fast C++/Rust libraries. Fix it once in your CI image, then forget it.
Step 2: Downloading the Model from Hugging Face
Before we can use the embedding model, we need to download it from Hugging Face. The model is hosted at: https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX
Required Files
You need to download these files from the repository:
| File | Size | Description |
|---|---|---|
| `onnx/model_q4.onnx` | ~507 KB | Main ONNX model (Q4 quantized) |
| `onnx/model_q4.onnx_data` | ~188 MB | Model weights data |
| `tokenizer.json` | ~19 MB | Tokenizer vocabulary and merges |
| `tokenizer_config.json` | ~1.1 MB | Tokenizer configuration |
| `config.json` | ~2 KB | Model metadata |
Total size: ~209 MB
For local development, Git LFS is usually the easiest. For automation (CI, containers, repeatable builds), a script you can run idempotently is nicer.
Download Method 1: Using Git LFS
Note: the onnx/ directory in this repo contains multiple exported variants, so a full Git LFS clone can be several GB. If you only want the small Q4 model + tokenizer, the script below downloads just the files you need.
# Install git-lfs if not already installed
# Ubuntu/Debian: sudo apt install git-lfs
# macOS: brew install git-lfs
git lfs install
git clone https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX
# The model will be in the embeddinggemma-300m-ONNX directory
Download Method 2: Using a Ruby Script (Reproducible)
If you want something you can commit (and rerun in CI), a tiny Ruby downloader is hard to beat.
Practical notes:
- Hugging Face may redirect downloads (the script follows redirects).
- In CI, you might want to cache the downloaded models/ directory to avoid pulling ~200 MB on every run.
- If you're shipping a gem/app, think about whether you want to bundle weights, download at install time, or download at first run.
# download_model.rb
require 'net/http'
require 'fileutils'
require 'uri'
BASE_URL = "https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/resolve/main"
MODEL_DIR = "models/embeddinggemma"
FILES = %w[
onnx/model_q4.onnx
onnx/model_q4.onnx_data
tokenizer.json
tokenizer_config.json
config.json
]
FileUtils.mkdir_p(MODEL_DIR)
FileUtils.mkdir_p("#{MODEL_DIR}/onnx")
FILES.each do |file|
dest = File.join(MODEL_DIR, file)
if File.exist?(dest)
puts "✓ Already exists: #{file}"
next
end
puts "Downloading #{file}..."
url = "#{BASE_URL}/#{file}"
uri = URI(url)
Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
request = Net::HTTP::Get.new(uri)
http.request(request) do |response|
case response
when Net::HTTPRedirection
# Follow HuggingFace redirects
new_uri = URI(response['location'])
unless new_uri.absolute?
new_uri = URI("#{uri.scheme}://#{uri.host}#{response['location']}")
end
Net::HTTP.start(new_uri.host, new_uri.port, use_ssl: true) do |redirect_http|
redirect_response = redirect_http.request(Net::HTTP::Get.new(new_uri))
File.open(dest, 'wb') { |f| f.write(redirect_response.body) }
end
when Net::HTTPSuccess
File.open(dest, 'wb') { |f| f.write(response.body) }
else
raise "Failed to download: #{response.code} #{response.message}"
end
end
end
size = (File.size(dest).to_f / 1024 / 1024).round(2)
puts " ✓ Downloaded #{file} (#{size} MB)"
end
puts "\n✓ Download complete!"
puts "Model files located in: #{MODEL_DIR}"
Run the script:
ruby download_model.rb
Organizing Your Files
After downloading, your project structure should look like:
ruby-embeddings/
├── download_model.rb
├── Gemfile
├── Gemfile.lock
└── models/
└── embeddinggemma/
├── config.json
├── onnx/
│ ├── model_q4.onnx
│ └── model_q4.onnx_data
├── tokenizer_config.json
└── tokenizer.json
The `models/` directory mirrors the Hugging Face repository layout, keeping the ONNX files in the `onnx/` subdirectory.
Model Variants
The onnx/ folder currently contains multiple exported variants. You only need one matching pair (.onnx + .onnx_data). In broad strokes:
- `model_q4.onnx` + `model_q4.onnx_data` (~197 MB weights) - smallest, good default for CPU
- `model_fp16.onnx` + `model_fp16.onnx_data` (~617 MB weights) - larger, FP16 variant
- `model.onnx` + `model.onnx_data` (~1.23 GB weights) - largest "full" export
You may also see additional variants like model_q4f16.onnx, model_no_gather_q4.onnx, or model_quantized.onnx intended for different runtimes or compatibility. If you’re just getting started, stick with model_q4.onnx.
Step 3: Understanding ONNX Models
ONNX is the “shipping container” format for models. The big win is practical: we can run a high-quality embedding model in a Ruby process without introducing a second runtime just to host model inference. That doesn’t make other stacks “worse” - it just keeps your deployment simpler when Ruby is already your home base.
The trade-off is that you, the application developer, now own a few low-level concerns that higher-level frameworks often hide:
- Inputs must match the model’s expected shapes and dtypes.
- Model initialization is expensive and involves native resources.
- You need to be intentional about lifecycle (threads/processes, caching, warmup).
The flow is still simple:
Text → Tokenizer → input_ids + attention_mask → ONNX Runtime → embedding vector
Before we write any wrapper code, inspect the model’s inputs/outputs. This saves you from “guess and debug” later:
# explore_model.rb
require 'onnxruntime'
# Point to your downloaded model
model_path = 'models/embeddinggemma/onnx/model_q4.onnx'
model = OnnxRuntime::Model.new(model_path)
puts "Model Inputs:"
model.inputs.each do |input|
puts " Name: #{input[:name]}"
puts " Type: #{input[:type]}"
puts " Shape: #{input[:shape].inspect}"
end
puts "\nModel Outputs:"
model.outputs.each do |output|
puts " Name: #{output[:name]}"
puts " Type: #{output[:type]}"
puts " Shape: #{output[:shape].inspect}"
end
Typical output for the EmbeddingGemma model:
Model Inputs:
Name: input_ids
Type: tensor(int64)
Shape: ["batch_size", "sequence_length"]
Name: attention_mask
Type: tensor(int64)
Shape: ["batch_size", "total_sequence_length"]
Model Outputs:
Name: last_hidden_state
Type: tensor(float)
Shape: ["batch_size", "sequence_length", 768]
Name: sentence_embedding
Type: tensor(float)
Shape: ["batch_size", 768]
Two gotchas worth calling out early:
- Many models expect `int64` inputs (Ruby integers are fine, but don't accidentally pass floats).
- Loading a model with `OnnxRuntime::Model.new(...)` is not a cheap object allocation - it maps/parses model data and allocates native resources. Treat it like opening a database connection: do it once, then reuse.
Step 4: Building the Tokenizer
The tokenizer step is mostly plumbing, but it’s also where a lot of subtle bugs sneak in. The model wants a rectangular batch: every input needs the same sequence length, with an attention_mask telling the model which tokens are “real” vs padding.
This concept exists in every ecosystem (Python, JS, Rust): tokenization is deterministic, but the defaults differ. When results look weird, it’s often because padding/truncation differed between “how you think it works” and “what your tokenizer actually did”.
We’ll use the tokenizers gem (Rust under the hood) so tokenization isn’t your performance bottleneck.
# lib/tokenizer.rb
require "tokenizers"
class Tokenizer
attr_reader :tokenizer
def initialize(tokenizer_path)
@tokenizer = Tokenizers::Tokenizer.from_file(tokenizer_path)
# Enable padding for this tokenizer
@tokenizer.enable_padding
end
def tokenize(texts)
texts = [ texts ] unless texts.is_a?(Array)
# Encode all texts with padding
result = @tokenizer.encode_batch(texts)
# Extract input_ids and attention_mask
input_ids = result.map(&:ids)
attention_mask = result.map(&:attention_mask)
{
input_ids: input_ids,
attention_mask: attention_mask
}
end
end
Tricky cases to keep in mind:
- Batching: encoding one string at a time is dramatically slower than `encode_batch`.
- Padding strategy: you generally want "pad to longest in batch", not "pad to max length" for every request.
- Long inputs: if you ingest untrusted text (web pages, logs), decide up front whether to truncate or reject inputs past the model's context window (a truncating variant is sketched below).
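If you decide to truncate, the tokenizer can enforce the cap for you. Here is a minimal sketch, assuming the enable_truncation(max_length) API exposed by tokenizers-ruby; the 2048-token cap comes from EmbeddingGemma's context length mentioned earlier:
# lib/tokenizer.rb (variant that also caps input length)
require "tokenizers"
class Tokenizer
  # EmbeddingGemma's maximum context length (see the model characteristics above).
  MAX_LENGTH = 2048
  attr_reader :tokenizer
  def initialize(tokenizer_path, max_length: MAX_LENGTH)
    @tokenizer = Tokenizers::Tokenizer.from_file(tokenizer_path)
    @tokenizer.enable_padding                 # pad to the longest sequence in the batch
    @tokenizer.enable_truncation(max_length)  # hard-cap long/untrusted inputs
  end
  def tokenize(texts)
    texts = [ texts ] unless texts.is_a?(Array)
    encodings = @tokenizer.encode_batch(texts)
    {
      input_ids: encodings.map(&:ids),
      attention_mask: encodings.map(&:attention_mask)
    }
  end
end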
Step 5: Building the Core Embedding Class
Now let’s build the main embedding class that ties everything together:
# lib/embedding_model.rb
require_relative 'tokenizer'
require 'onnxruntime'
class EmbeddingModel
EMBEDDING_DIM = 768
DEFAULT_MODEL_DIR = 'models/embeddinggemma'
PREFIXES = {
query: "task: search result | query: ",
document: "title: none | text: ",
classification: "task: classification | query: ",
clustering: "task: clustering | query: ",
qa: "task: question answering | query: ",
similarity: "task: sentence similarity | query: "
}.freeze
attr_reader :model, :tokenizer
def initialize(model_dir: DEFAULT_MODEL_DIR)
# Load ONNX model
@model_path = File.join(model_dir, 'onnx/model_q4.onnx')
@model = OnnxRuntime::Model.new(@model_path)
# Cache input/output names
@input_names = @model.inputs.map { |i| i[:name] }
@output_names = @model.outputs.map { |o| o[:name] }
# Initialize tokenizer
tokenizer_path = File.join(model_dir, 'tokenizer.json')
@tokenizer = Tokenizer.new(tokenizer_path)
end
# Generate embeddings for one or more texts
def embed(texts, task: :query)
texts = [texts] unless texts.is_a?(Array)
return [] if texts.empty?
# Add task-specific prefix
prefix = PREFIXES[task] || PREFIXES[:query]
prefixed_texts = texts.map { |t| prefix + t }
# Tokenize
tokens = @tokenizer.tokenize(prefixed_texts)
# Prepare input for ONNX model
inputs = {
@input_names[0] => tokens[:input_ids],
@input_names[1] => tokens[:attention_mask]
}
# Run inference
outputs = @model.predict(inputs)
# Extract embeddings
extract_embedding(outputs)
end
# Generate embedding for single text
def embed_single(text, task: :query)
embed(text, task: task).first
end
private
def extract_embedding(outputs)
embedding = if outputs.key?("sentence_embedding")
outputs["sentence_embedding"]
else
outputs[@output_names.first]
end
embedding
end
end
Prompt Instructions for EmbeddingGemma
EmbeddingGemma uses task-specific prompt prefixes to optimize embeddings for different use cases. These prefixes are part of the model’s training and significantly improve performance. Here’s what happens internally:
# For a search query:
model.embed_single("machine learning", task: :query)
# Internally becomes: "task: search result | query: machine learning"
# For indexing a document:
model.embed_single("Ruby is a programming language", task: :document)
# Internally becomes: "title: none | text: Ruby is a programming language"
Available task types:
| Task | Prefix | Use Case |
|---|---|---|
| `:query` | `task: search result \| query: ` | Search/retrieval queries |
| `:document` | `title: none \| text: ` | Indexing documents without titles |
| `:classification` | `task: classification \| query: ` | Text classification |
| `:clustering` | `task: clustering \| query: ` | Document clustering |
| `:qa` | `task: question answering \| query: ` | Question answering |
| `:similarity` | `task: sentence similarity \| query: ` | Semantic similarity |
Always use the appropriate task type for best results. Using query embeddings for documents (or vice versa) will produce suboptimal similarity scores.
This is one of those “it works but it’s wrong” failure modes. In a real app, it shows up as inconsistent search results that are hard to debug because nothing crashes.
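One way to see the effect on your own data (an illustrative check, not part of the tutorial's pipeline): embed the same document once with the correct :document prefix and once, incorrectly, with the :query prefix, then compare how each scores against a real query using the cosine_similarity helper we add in Step 7. Exact numbers vary by machine, but the two scores will generally differ.
model = EmbeddingModel.new
query = "How do I install Ruby gems?"
doc   = "Gems are installed with the gem install command or via Bundler."
query_emb    = model.embed_single(query, task: :query)
doc_as_doc   = model.embed_single(doc, task: :document)  # correct prefix for indexing
doc_as_query = model.embed_single(doc, task: :query)     # wrong prefix for a document
puts model.cosine_similarity(query_emb, doc_as_doc)
puts model.cosine_similarity(query_emb, doc_as_query)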
For more details, see the EmbeddingGemma Model Card.
Step 6: Adding Model Caching
Loading ONNX models is expensive - and doing it per request is one of the fastest ways to turn a “works on my laptop” demo into a production outage.
In Ruby web servers like Puma, multiple threads can try to initialize the model at the same time. Without a lock, you can end up loading the same model multiple times concurrently (slow) and spiking memory (dangerous). A Mutex around the cache prevents that.
One more lifecycle detail that trips people up: a cache is per-process. If you run 4 Puma workers, you will likely have 4 separate model instances in memory (one per worker). That can be totally fine - just budget for it. If memory is tight, prefer fewer processes with more threads, or run embeddings in a dedicated worker/service where you control concurrency.
Let’s add class-level caching:
# lib/embedding_model.rb (updated)
class EmbeddingModel
# ... existing code ...
# Class-level cache
@model_cache = {}
@cache_mutex = Mutex.new
class << self
def cached_model(model_path)
@cache_mutex.synchronize do
@model_cache[model_path] ||= OnnxRuntime::Model.new(model_path)
end
end
def clear_cache
@cache_mutex.synchronize do
@model_cache.clear
end
end
end
# Update initialize to use cached model
def initialize(model_dir: DEFAULT_MODEL_DIR)
@model_path = File.join(model_dir, "onnx/model_q4.onnx")
@model = self.class.cached_model(@model_path)
@input_names = @model.inputs.map { |i| i[:name] }
@output_names = @model.outputs.map { |o| o[:name] }
tokenizer_path = File.join(model_dir, "tokenizer.json")
@tokenizer = Tokenizer.new(tokenizer_path)
end
end
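A quick console check (optional) that the cache actually shares one native session: two instances built from the same model directory should hold the exact same OnnxRuntime::Model object.
require_relative "lib/embedding_model"
a = EmbeddingModel.new
b = EmbeddingModel.new
puts a.model.equal?(b.model)  # => true (same cached native session)
EmbeddingModel.clear_cache    # release the cached models when you need to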
Step 7: Implementing Similarity Functions
Now let’s add methods to compute similarity between embeddings:
Most embedding systems use cosine similarity because embeddings are meaningful primarily by direction (semantic “angle”), not absolute magnitude. If you use Euclidean distance on raw vectors, results can be skewed by vector norms.
One more practical note: if you ever truncate embeddings (MRL) or apply your own pooling, re-normalize afterwards - otherwise downstream scoring (especially dot-product shortcuts and cross-dimension comparisons) becomes inconsistent.
When you start debugging relevance, it helps to remember what cosine similarity looks like in practice:
- The mathematical range is [-1, 1].
- For many modern text embedding models, most “unrelated” pairs land near 0, and related pairs drift upward.
- Don’t overfit to a magic threshold across domains - always evaluate on your own data.
# lib/embedding_model.rb (add these methods)
# Compute cosine similarity between two embeddings
# Returns: Float between -1 and 1 (typically 0 to 1 for embeddings)
def cosine_similarity(vec1, vec2)
# Same vector optimization
return 1.0 if vec1.equal?(vec2)
# Compute dot product and norms
dot_product = 0.0
norm1_sq = 0.0
norm2_sq = 0.0
vec1.each_with_index do |v1, i|
v2 = vec2[i]
dot_product += v1 * v2
norm1_sq += v1 * v1
norm2_sq += v2 * v2
end
# Handle zero vectors
return 0.0 if norm1_sq.zero? || norm2_sq.zero?
# Cosine similarity: dot / (norm1 * norm2)
dot_product / Math.sqrt(norm1_sq * norm2_sq)
end
# Compute similarities between query and multiple documents
def similarities(query_embedding, document_embeddings)
document_embeddings.map do |doc_emb|
cosine_similarity(query_embedding, doc_emb)
end
end
# Rank documents by relevance to query
# Returns: Array of hashes with index, document, and score
def rank(query, documents)
# Generate embeddings with appropriate tasks
query_emb = embed_single(query, task: :query)
doc_embs = embed(documents, task: :document)
# Calculate similarity scores
scores = similarities(query_emb, doc_embs)
# Build results
results = scores.each_with_index.map do |score, idx|
{
index: idx,
document: documents[idx],
score: score
}
end
# Sort by score (highest first)
results.sort_by { |r| -r[:score] }
end
Step 8: Implementing Matryoshka Representation Learning
One of the most useful “systems” features of modern embedding models is MRL: you can truncate embeddings to smaller dimensions and still get decent results. This is a real lever in production: smaller vectors mean less RAM, less network bandwidth, and cheaper storage - at the cost of some retrieval quality.
This is also a nice example of “ML meets systems”: the model work was done upstream, but you (application dev) decide how to spend the budget. If you’re storing millions of vectors, dropping from 768 → 256 dims is often the difference between “fits in Postgres + pgvector” and “needs a dedicated vector store”.
# lib/embedding_model.rb (add this method)
# Truncate embedding to smaller dimension using MRL
# Dimensions: 768 (full) -> 512 -> 256 -> 128
# Accuracy loss: 0 -> -0.44 -> -1.47 -> -2.92 points
def truncate_embedding(embedding, dim: 256)
raise ArgumentError, "dim must be <= #{EMBEDDING_DIM}" if dim > EMBEDDING_DIM
# Return as-is if no truncation needed
return embedding.dup if dim >= embedding.length
# Truncate to first N dimensions
truncated = embedding[0, dim]
# Re-normalize (truncation changes the norm)
norm_sq = truncated.sum { |x| x * x }
return truncated if norm_sq.zero?
norm = Math.sqrt(norm_sq)
truncated.map { |x| x / norm }
end
Why re-normalize after truncation?
Truncation changes the vector's norm. Cosine similarity normalizes internally, so it tolerates this, but as soon as you score with a plain dot product (as the optimized index in Step 11 does) or compare vectors of mixed dimensions, non-unit vectors give inconsistent scores. Re-normalizing after truncation keeps similarity scores comparable regardless of dimension.
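To get a feel for how much you trade at each dimension on your own data (illustrative; absolute scores vary by machine and model variant), compare the full-dimension similarity against the truncated ones:
model = EmbeddingModel.new
a = model.embed_single("Ruby web frameworks", task: :similarity)
b = model.embed_single("Web development with Ruby", task: :similarity)
full = model.cosine_similarity(a, b)
[ 512, 256, 128 ].each do |dim|
  a_t = model.truncate_embedding(a, dim: dim)
  b_t = model.truncate_embedding(b, dim: dim)
  puts "dim=#{dim}: #{model.cosine_similarity(a_t, b_t).round(4)} (full: #{full.round(4)})"
end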
Step 9: Putting It All Together
Here’s the complete implementation for the EmbeddingGemma 300M model. Keep in mind that this is tutorial code; it doesn’t cover every edge case you’d handle in production.
# lib/embedding_model.rb
# frozen_string_literal: true
require_relative "tokenizer"
require "onnxruntime"
class EmbeddingModel
EMBEDDING_DIM = 768
DEFAULT_MODEL_DIR = "models/embeddinggemma"
PREFIXES = {
query: "task: search result | query: ",
document: "title: none | text: ",
classification: "task: classification | query: ",
clustering: "task: clustering | query: ",
qa: "task: question answering | query: ",
similarity: "task: sentence similarity | query: "
}.freeze
@model_cache = {}
@cache_mutex = Mutex.new
class << self
def cached_model(model_path)
@cache_mutex.synchronize do
@model_cache[model_path] ||= OnnxRuntime::Model.new(model_path)
end
end
def clear_cache
@cache_mutex.synchronize do
@model_cache.clear
end
end
end
attr_reader :model, :tokenizer
def initialize(model_dir: DEFAULT_MODEL_DIR)
@model_path = File.join(model_dir, "onnx/model_q4.onnx")
@model = self.class.cached_model(@model_path)
@input_names = @model.inputs.map { |i| i[:name] }
@output_names = @model.outputs.map { |o| o[:name] }
tokenizer_path = File.join(model_dir, "tokenizer.json")
@tokenizer = Tokenizer.new(tokenizer_path)
end
def embed(texts, task: :query)
texts = [ texts ] unless texts.is_a?(Array)
return [] if texts.empty?
prefix = PREFIXES[task] || PREFIXES[:query]
prefixed_texts = texts.map { |t| prefix + t }
tokens = @tokenizer.tokenize(prefixed_texts)
inputs = {
@input_names[0] => tokens[:input_ids],
@input_names[1] => tokens[:attention_mask]
}
outputs = @model.predict(inputs)
extract_embedding(outputs)
end
def embed_single(text, task: :query)
embed(text, task: task).first
end
def cosine_similarity(vec1, vec2)
return 1.0 if vec1.equal?(vec2)
dot_product = 0.0
norm1_sq = 0.0
norm2_sq = 0.0
vec1.each_with_index do |v1, i|
v2 = vec2[i]
dot_product += v1 * v2
norm1_sq += v1 * v1
norm2_sq += v2 * v2
end
return 0.0 if norm1_sq.zero? || norm2_sq.zero?
dot_product / Math.sqrt(norm1_sq * norm2_sq)
end
def similarities(query_embedding, document_embeddings)
document_embeddings.map do |doc_emb|
cosine_similarity(query_embedding, doc_emb)
end
end
def rank(query, documents)
query_emb = embed_single(query, task: :query)
doc_embs = embed(documents, task: :document)
scores = similarities(query_emb, doc_embs)
scores.each_with_index.map do |score, idx|
{ index: idx, document: documents[idx], score: score }
end.sort_by { |r| -r[:score] }
end
def truncate_embedding(embedding, dim: 256)
raise ArgumentError, "dim must be <= #{EMBEDDING_DIM}" if dim > EMBEDDING_DIM
return embedding.dup if dim >= embedding.length
truncated = embedding[0, dim]
norm_sq = truncated.sum { |x| x * x }
return truncated if norm_sq.zero?
norm = Math.sqrt(norm_sq)
truncated.map { |x| x / norm }
end
private
def extract_embedding(outputs)
embedding = if outputs.key?("sentence_embedding")
outputs["sentence_embedding"]
else
outputs[@output_names.first]
end
embedding
end
end
Step 10: Building a Complete Application
Let’s build a practical but very simple semantic search application:
# app.rb
require_relative "lib/embedding_model"
puts "Loading model..."
model = EmbeddingModel.new # Uses default model_dir: "models/embeddinggemma"
puts "Model loaded!\n\n"
# Example 1: Semantic Search
puts "=" * 60
puts "Example 1: Semantic Search"
puts "=" * 60
documents = [
"Ruby is a dynamic, open source programming language created by Yukihiro Matsumoto",
"Python is a high-level programming language popular in data science",
"JavaScript is primarily used for web development",
"Rails is a web framework written in Ruby",
"Queen likes wearing jewelry with small ruby gems"
]
query = "Tell me about Ruby programming language"
puts "Query: #{query}\n\n"
results = model.rank(query, documents)
puts "Results:"
results.each_with_index do |result, i|
puts "#{i + 1}. [#{result[:score].round(3)}] #{result[:document]}"
end
# Example 2: Document Similarity
puts "\n" + "=" * 60
puts "Example 2: Document Similarity"
puts "=" * 60
text1 = "Ruby gems are packages for Ruby libraries"
text2 = "Ruby packages are called gems"
text3 = "Python uses pip for package management"
text4 = "Cats like to watch fish swimming in the aquarium"
emb1 = model.embed_single(text1)
emb2 = model.embed_single(text2)
emb3 = model.embed_single(text3)
emb4 = model.embed_single(text4)
sim12 = model.cosine_similarity(emb1, emb2)
sim13 = model.cosine_similarity(emb1, emb3)
sim14 = model.cosine_similarity(emb1, emb4)
puts "Text 1: #{text1}"
puts "Text 2: #{text2}"
puts "Text 3: #{text3}"
puts "Text 4: #{text4}"
puts "\nSimilarity (1, 2): #{sim12.round(4)}"
puts "Similarity (1, 3): #{sim13.round(4)}"
puts "Similarity (1, 4): #{sim14.round(4)}"
# Example 3: Efficient Storage with MRL
puts "\n" + "=" * 60
puts "Example 3: Matryoshka Representation Learning"
puts "=" * 60
text = "Machine learning is a subset of artificial intelligence"
full_emb = model.embed_single(text)
puts "Full embedding: #{full_emb.size} dimensions"
[ 512, 256, 128 ].each do |dim|
truncated = model.truncate_embedding(full_emb, dim: dim)
puts "Truncated to #{dim} dimensions: #{truncated.size} dimensions"
end
# Example 4: Building a Simple Recommender
puts "\n" + "=" * 60
puts "Example 4: Document Recommender"
puts "=" * 60
class DocumentRecommender
def initialize(documents, model)
@model = model
@documents = documents
@embeddings = model.embed(documents, task: :document)
end
def recommend(query, top_k: 3)
query_emb = @model.embed_single(query, task: :query)
scores = @embeddings.map.with_index do |doc_emb, idx|
{
document: @documents[idx],
score: @model.cosine_similarity(query_emb, doc_emb)
}
end
scores.sort_by { |s| -s[:score] }.first(top_k)
end
end
corpus = [
"Ruby on Rails is a web application framework",
"Sinatra is a lightweight Ruby web framework",
"Django is a Python web framework",
"Flask is a microframework for Python"
]
recommender = DocumentRecommender.new(corpus, model)
puts "Query: 'Ruby web development'\n\n"
recommendations = recommender.recommend("Ruby web development")
recommendations.each_with_index do |rec, i|
puts "#{i + 1}. [#{rec[:score].round(3)}] #{rec[:document]}"
end
If you run it (ruby app.rb), you’ll see output like:
Loading model...
Model loaded!
============================================================
Example 1: Semantic Search
============================================================
Query: Tell me about Ruby programming language
Results:
1. [0.622] Ruby is a dynamic, open source programming language created by Yukihiro Matsumoto
2. [0.505] Rails is a web framework written in Ruby
3. [0.278] Queen likes wearing jewelry with small ruby gems
4. [0.254] JavaScript is primarily used for web development
5. [0.247] Python is a high-level programming language popular in data science
============================================================
Example 2: Document Similarity
============================================================
Text 1: Ruby gems are packages for Ruby libraries
Text 2: Ruby packages are called gems
Text 3: Python uses pip for package management
Text 4: Cats like to watch fish swimming in the aquarium
Similarity (1, 2): 0.9161
Similarity (1, 3): 0.4411
Similarity (1, 4): 0.079
============================================================
Example 3: Matryoshka Representation Learning
============================================================
Full embedding: 768 dimensions
Truncated to 512 dimensions: 512 dimensions
Truncated to 256 dimensions: 256 dimensions
Truncated to 128 dimensions: 128 dimensions
============================================================
Example 4: Document Recommender
============================================================
Query: 'Ruby web development'
1. [0.619] Ruby on Rails is a web application framework
2. [0.55] Sinatra is a lightweight Ruby web framework
3. [0.368] Django is a Python web framework
How to read this:
- Semantic search (Example 1): the “jewelry” sentence contains the token ruby, so it shows up (3rd), but it scores far below the actual programming-language results. This is the kind of ambiguity you want in your smoke test because it validates the embeddings are capturing meaning, not just keyword overlap.
- Similarity (Example 2): paraphrases usually land close (0.9161). A nearby-but-different topic can still be moderately similar (0.4411). Truly unrelated text should be near 0 (0.079). Treat the absolute values as heuristics — what you really care about is separation and ordering.
- MRL truncation (Example 3): the model gives you 768 dims, but you can store 256/128 dims for cheaper indexes. Expect some recall loss; the point is you can choose the trade-off per index/tier.
- Recommender (Example 4): pre-embedding the corpus with `task: :document` keeps query-time cheap, and you can see it ranks Ruby web docs above Python web docs.
Note: exact scores can vary across model variants/quantization, CPU instruction sets, and runtime versions. Use these numbers for sanity checks and relative ranking, not as a cross-machine “golden” baseline.
Step 11: Performance Optimization
Batch Processing
Always prefer batching, but keep two constraints in mind:
- Batch size is a latency + memory dial. Bigger batches improve throughput but increase tail latency and can blow up RAM if your inputs are long.
- Padding is real work. If you mix tiny strings and giant paragraphs in the same batch, you pay for the longest sequence.
The simple win is: batch, and cap batch size.
def embed_in_batches(model, texts, task:, batch_size: 32)
texts.each_slice(batch_size).flat_map do |chunk|
model.embed(chunk, task: task)
end
end
# Slow: one model call per string
embeddings = texts.map { |t| model.embed_single(t, task: :document) }
# Better: a few bigger model calls
embeddings = embed_in_batches(model, texts, task: :document, batch_size: 32)
If you ingest highly variable text lengths (titles + long pages), a pragmatic trick is to batch by length to reduce padding waste. Don’t over-engineer it; even a coarse bucket (“short” vs “long”) helps.
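Here is one way to do that coarse bucketing, as a sketch; the 200-character threshold is arbitrary and something you'd tune for your corpus. The only subtlety is restoring the original order at the end.
# Bucket texts by rough length so short strings don't pay padding cost for
# long ones, then restore the original order of the embeddings.
def embed_bucketed(model, texts, task:, batch_size: 32, short_threshold: 200)
  indexed = texts.each_with_index.to_a
  short, long = indexed.partition { |text, _| text.length < short_threshold }
  embeddings = Array.new(texts.length)
  [ short, long ].each do |bucket|
    bucket.each_slice(batch_size) do |chunk|
      model.embed(chunk.map(&:first), task: task).each_with_index do |emb, i|
        embeddings[chunk[i].last] = emb
      end
    end
  end
  embeddings
end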
Pre-compute Embeddings
If your corpus is static (docs, help center, product catalog), the fastest query is the one where you do no document-side inference.
Two practical upgrades to the “store embeddings and scan them” approach:
- Store truncated vectors (e.g. 256 dims) using MRL: smaller index, faster scoring.
- Normalize once, then score with a dot product: for unit vectors, cosine similarity equals dot product.
Here’s a small in-memory index that does both and avoids sorting the full corpus when you only need top-k.
class SemanticIndex
def initialize(documents, model, dim: 256)
@model = model
@documents = documents
@dim = dim
raw = model.embed(documents, task: :document)
@embeddings = raw.map { |e| @model.truncate_embedding(e, dim: @dim) }
end
def search(query, top_k: 5)
q = @model.embed_single(query, task: :query)
q = @model.truncate_embedding(q, dim: @dim)
top = [] # [{ score:, index: }]
@embeddings.each_with_index do |doc_emb, idx|
score = dot(q, doc_emb)
if top.length < top_k
top << { score: score, index: idx }
top.sort_by! { |r| r[:score] } # smallest first
next
end
next if score <= top.first[:score]
top[0] = { score: score, index: idx }
top.sort_by! { |r| r[:score] }
end
top.reverse.map { |r| { score: r[:score], document: @documents[r[:index]] } }
end
private
def dot(a, b)
sum = 0.0
a.each_with_index { |av, i| sum += av * b[i] }
sum
end
end
For a toy demo, keeping this in memory is fine. For anything real:
- Persist embeddings (and their `dim`) so you don't rebuild on every boot - a minimal sketch follows below.
- Consider a vector index (pgvector / dedicated store) once $n$ is large enough that $O(n)$ scans hurt.
The meta-rule: optimize the shape of work first (batch, cache, precompute), then optimize the math.
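For the "persist so you don't rebuild" point, even a flat file gets you surprisingly far before you need a database. A minimal sketch (the tmp/embeddings.json path and the count-based invalidation are placeholders; a real cache should also fingerprint the documents):
require "json"
require "fileutils"
CACHE_PATH = "tmp/embeddings.json"
def load_or_build_embeddings(model, documents, dim: 256)
  if File.exist?(CACHE_PATH)
    cached = JSON.parse(File.read(CACHE_PATH))
    # Only reuse the cache if it was built with the same dim and corpus size.
    return cached["embeddings"] if cached["dim"] == dim && cached["count"] == documents.length
  end
  embeddings = model.embed(documents, task: :document)
                    .map { |e| model.truncate_embedding(e, dim: dim) }
  FileUtils.mkdir_p(File.dirname(CACHE_PATH))
  File.write(CACHE_PATH, JSON.generate({ "dim" => dim, "count" => documents.length, "embeddings" => embeddings }))
  embeddings
end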
Step 12: Testing Your Implementation
When testing embedding code, focus on invariants. Most failures here aren’t “Ruby is broken”; they’re “the output has the right shape but the wrong semantics”.
Useful invariants to assert:
- Embedding dimensionality is always 768 (or your expected dimension after truncation).
- Cosine similarity is symmetric: `sim(a, b) ≈ sim(b, a)`.
- Truncated embeddings are re-normalized (unit norm) if you rely on cosine similarity or dot-product scoring.
- Ranking results are well-formed and ordered: each item has `index`, `document`, and `score`, and results are sorted by descending score.
These checks catch the most common mistakes: accidentally using last_hidden_state instead of the pooled embedding, forgetting to re-normalize after truncation, and subtle shape/ordering bugs in batching.
If you want one “semantic” smoke test, keep it coarse and deterministic: verify that a query about “Ruby programming” ranks Ruby-language documents above unrelated ones (avoid asserting exact floating-point scores across machines/versions).
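Here's what those invariants can look like as a Minitest file. This is a sketch: it loads the real model (so treat it as a slow, integration-style test), and the require path assumes the lib/ layout used in this tutorial.
# test/embedding_model_test.rb
require "minitest/autorun"
require_relative "../lib/embedding_model"
class EmbeddingModelTest < Minitest::Test
  def setup
    @model = EmbeddingModel.new
  end
  def test_embedding_dimension
    emb = @model.embed_single("hello world")
    assert_equal EmbeddingModel::EMBEDDING_DIM, emb.length
  end
  def test_cosine_similarity_is_symmetric
    a = @model.embed_single("ruby gems")
    b = @model.embed_single("python packages")
    assert_in_delta @model.cosine_similarity(a, b), @model.cosine_similarity(b, a), 1e-9
  end
  def test_truncated_embedding_is_unit_norm
    emb = @model.embed_single("ruby gems")
    truncated = @model.truncate_embedding(emb, dim: 256)
    norm = Math.sqrt(truncated.sum { |x| x * x })
    assert_in_delta 1.0, norm, 1e-6
  end
  def test_rank_orders_by_descending_score
    docs = [ "Ruby is a programming language", "Cats enjoy sleeping in boxes" ]
    results = @model.rank("Ruby programming", docs)
    assert_equal results, results.sort_by { |r| -r[:score] }
    assert results.first[:document].include?("Ruby")
  end
end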
Step 13: Deployment Considerations - Memory Management
Loading the ONNX model, tokenizer, and weight files on every request is expensive: repeated disk I/O, deserialization, and native resource allocation drive up latency, CPU usage, and memory churn. Cache the loaded model and tokenizer (lazy-load once), provide a controlled `clear_cache` method (e.g. `EmbeddingModel.clear_cache`) to release resources, and guard initialization with a Mutex for thread safety. For predictable latency, consider preloading at startup (a Puma sketch follows the service class below) or using worker processes/pools under high concurrency.
Operationally, this looks the same no matter what language you use:
- If you run multiple processes, expect memory usage to multiply.
- If you run multiple threads, you need to think about thread-safety around initialization and shared state.
- If you care about tail latency, warm up the model (load once) during boot rather than on the first user request.
There isn’t one “right” answer here - it depends on your traffic shape and budget. The important part is to make the trade-off explicit instead of letting the default server config choose for you.
class EmbeddingService
def initialize(model_dir: 'models/embeddinggemma')
@model_dir = model_dir
@model = nil
end
def model
@model ||= EmbeddingModel.new(model_dir: @model_dir)
end
def reset_model
EmbeddingModel.clear_cache
@model = nil
end
end
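If you serve traffic with Puma, a common way to make the warmup explicit is to load the model when each worker boots, so the first user request doesn't pay for it. A sketch (the relative require path is an assumption; adjust it to your app's layout, and remember each worker holds its own copy):
# config/puma.rb (excerpt)
on_worker_boot do
  require_relative "../lib/embedding_model"
  # Pay the model-loading cost at boot, once per worker process,
  # instead of on the first embedding request.
  EmbeddingModel.new
end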
Where to Go From Here
If you got this far, you already have the hard part: a clean bridge between Ruby and a native inference runtime. From here, the interesting work is product work and operations work.
Here are a few directions that are both useful and educational:
1. Choose an embedding “shape” for your system
- If you embed a handful of strings per request: keep it in-process and cache the model.
- If you embed large batches (indexing) or want isolation: run embeddings in a dedicated worker/service.
- If you’re cost-optimizing storage: use MRL truncation (e.g. 256 dims) and be consistent across indexing + querying.
2. Add an evaluation loop (even a tiny one)
- Collect ~20 real queries and the documents you expect to match.
- Measure whether the top-k results look right.
- This is more valuable than tweaking thresholds blindly.
3. Harden the pipeline
- Put explicit bounds on input size.
- Decide how you handle failures (retry vs fail fast).
- Warm up at boot if you care about first-request latency.
4. Scale storage and retrieval
- For small/medium projects, Postgres + pgvector is a great starting point.
- For larger corpora or specialized needs, look at dedicated vector stores such as Milvus or Weaviate (both available as managed services).
- (The “right” database depends more on ops constraints than on Ruby vs any other language.)
5. Experiment with deployment options
- ONNX Runtime can run on CPU very well; GPU can help when you have bigger models or higher throughput.
- If you do add a remote embedding API fallback, treat it as an operational trade-off (latency, privacy, cost), not a moral one.
Key Takeaways
- ONNX + Ruby is a pragmatic combo when Ruby is already your deployment target.
- `onnxruntime` + `tokenizers` do the heavy lifting: Ruby stays the orchestration layer, native code does the hot path.
- Task prefixes are a correctness requirement, not an optimization.
- Cosine similarity is simple, but not “free”: normalization and consistency (especially with truncation) matter.
- Batching and caching are the first performance wins before you reach for GPUs.
The broader point: you don’t need to pick a “one true language” for ML. You can learn the concepts and apply them in Ruby, while still benefiting from the same model formats and runtimes used across ecosystems.
Further Reading
Gems by @ankane:
- ankane/onnxruntime-ruby - Ruby bindings for ONNX Runtime
- ankane/tokenizers-ruby - Ruby bindings for HuggingFace Tokenizers
EmbeddingGemma Resources:
- EmbeddingGemma Model Card - Official model documentation
- Gemma Terms of Use - Licensing and usage terms
Model & Research:
- Matryoshka Representation Learning: https://arxiv.org/abs/2205.13147
- HuggingFace Tokenizers (Rust): https://github.com/huggingface/tokenizers
- ONNX Runtime Documentation: https://onnxruntime.ai/docs/
If you have questions, spot an error, or run into a bug while following this tutorial, you can reach me on X: @kskrzypinski.
Photo by Annie Spratt