Text Embedding in Ruby
Text embeddings are the building block behind “semantic” features: search that works even when users don’t type the exact keywords, clustering similar documents, and quick similarity checks.
In this post, we’ll run Google’s EmbeddingGemma 300M locally in Ruby using ONNX Runtime (native) and Hugging Face Tokenizers (native). The goal isn’t to re-teach Ruby basics - it’s to show the few places people usually get surprised: model lifecycle/memory, task-specific prompting, and the small bits of vector math that matter in production.
One quick note before we start: this isn’t a “Ruby vs Python vs $LANG” post.
- Python is excellent for research and experimentation.
- Java/Go/Node are common in production services.
- Ruby shines when you want to ship product fast with a small surface area.
ONNX is useful precisely because it’s language-agnostic: you can keep the model format stable and choose the runtime that fits your app. Here we’re optimizing for a very specific scenario: you already have a Ruby codebase (probably with Rails, Sidekiq workers etc.) and you want embeddings without adding a second stack.
About EmbeddingGemma
EmbeddingGemma is Google’s state-of-the-art embedding model built from the Gemma 3 family. It’s designed for a wide range of semantic tasks including retrieval, classification, clustering, and similarity matching.
Key characteristics:
- 300M parameters - Compact enough for production deployment
- 768-dimensional embeddings - Rich semantic representations
- 100+ languages supported - Multilingual capabilities
- Maximum input context length of 2048 tokens
- Open license - Provided with open weights for responsible commercial use (Gemma License)
What You’ll Build
By the end of this guide, you’ll have a working Ruby implementation that can:
- Convert text to 768-dimensional semantic embeddings
- Compute similarity between documents
- Rank documents by semantic relevance
- Reduce embedding dimensions efficiently (Matryoshka Representation Learning)
- Handle batch processing with automatic padding
Understanding the Components
Let’s break down what we need to build:
┌─────────────────────────────────────────────────────────────────┐
│ Your Application │
└───────────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ EmbeddingGemma Class │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────────┐ │
│ │ Model │ │ Tokenizer │ │ Similarity & Ranking │ │
│ │ Loader │ │ Wrapper │ │ Functions │ │
│ └─────────────┘ └──────────────┘ └────────────────────────┘ │
└───────────────────────────────┬─────────────────────────────────┘
│
┌───────────┴───────────┐
▼ ▼
┌───────────────────┐ ┌─────────────────────┐
│ ONNX Runtime │ │ HuggingFace │
│ (C++ bindings) │ │ Tokenizers (Rust) │
└───────────────────┘ └─────────────────────┘
The rest of this article focuses on the “thin waist” between Ruby and native ML libraries:
- `onnxruntime` (ankane/onnxruntime-ruby) - ONNX Runtime bindings
- `tokenizers` (ankane/tokenizers-ruby) - Hugging Face Tokenizers bindings
We’ll build this layer by layer, starting with the foundation.
Step 1: Setting Up the Project
First, create a new directory for your project and set up the gem dependencies. If you’re adding this to an existing app, the only thing that really matters is: these gems include native components. That usually means you want reproducible versions (lockfile) and a CI image that can compile native extensions.
mkdir ruby-embeddings
cd ruby-embeddings
# Initialize bundle
bundle init
# Add required gems
bundle add onnxruntime tokenizers
bundle add minitest --group=test
Your Gemfile should now look like:
# frozen_string_literal: true
source "https://rubygems.org"
gem "onnxruntime", "~> 0.10.1"
gem "tokenizers", "~> 0.6.3"
gem "minitest", "~> 6.0", group: :test
If you hit build errors installing native gems, it’s not “Ruby being bad” - it’s the reality of binding to fast C++/Rust libraries. Fix it once in your CI image, then forget it.
Step 2: Downloading the Model from Hugging Face
Before we can use the embedding model, we need to download it from Hugging Face. The model is hosted at: https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX
Required Files
You need to download these files from the repository:
| File | Size | Description |
|---|---|---|
| `onnx/model_q4.onnx` | ~507 KB | Main ONNX model (Q4 quantized) |
| `onnx/model_q4.onnx_data` | ~188 MB | Model weights data |
| `tokenizer.json` | ~19 MB | Tokenizer vocabulary and merges |
| `tokenizer_config.json` | ~1.1 MB | Tokenizer configuration |
| `config.json` | ~2 KB | Model metadata |
Total size: ~209 MB
For local development, Git LFS is usually the easiest. For automation (CI, containers, repeatable builds), a script you can run idempotently is nicer.
Download Method 1: Using Git LFS
Note: the onnx/ directory in this repo contains multiple exported variants, so a full Git LFS clone can be several GB. If you only want the small Q4 model + tokenizer, the script below downloads just the files you need.
# Install git-lfs if not already installed
# Ubuntu/Debian: sudo apt install git-lfs
# macOS: brew install git-lfs
git lfs install
git clone https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX
# The model will be in the embeddinggemma-300m-ONNX directory
Download Method 2: Using a Ruby Script (Reproducible)
If you want something you can commit (and rerun in CI), a tiny Ruby downloader is hard to beat.
Practical notes:
- Hugging Face may redirect downloads (the script follows redirects).
- In CI, you might want to cache the downloaded models/ directory to avoid pulling ~200 MB on every run.
- If you're shipping a gem/app, think about whether you want to bundle weights, download at install time, or download at first run.
# download_model.rb
require 'net/http'
require 'fileutils'
require 'uri'
BASE_URL = "https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/resolve/main"
MODEL_DIR = "models/embeddinggemma"
FILES = %w[
onnx/model_q4.onnx
onnx/model_q4.onnx_data
tokenizer.json
tokenizer_config.json
config.json
]
FileUtils.mkdir_p(MODEL_DIR)
FileUtils.mkdir_p("#{MODEL_DIR}/onnx")
FILES.each do |file|
dest = File.join(MODEL_DIR, file)
if File.exist?(dest)
puts "✓ Already exists: #{file}"
next
end
puts "Downloading #{file}..."
url = "#{BASE_URL}/#{file}"
uri = URI(url)
Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
request = Net::HTTP::Get.new(uri)
http.request(request) do |response|
case response
when Net::HTTPRedirection
# Follow HuggingFace redirects
new_uri = URI(response['location'])
unless new_uri.absolute?
new_uri = URI("#{uri.scheme}://#{uri.host}#{response['location']}")
end
Net::HTTP.start(new_uri.host, new_uri.port, use_ssl: true) do |redirect_http|
redirect_response = redirect_http.request(Net::HTTP::Get.new(new_uri))
File.open(dest, 'wb') { |f| f.write(redirect_response.body) }
end
when Net::HTTPSuccess
File.open(dest, 'wb') { |f| f.write(response.body) }
else
raise "Failed to download: #{response.code} #{response.message}"
end
end
end
size = (File.size(dest).to_f / 1024 / 1024).round(2)
puts " ✓ Downloaded #{file} (#{size} MB)"
end
puts "\n✓ Download complete!"
puts "Model files located in: #{MODEL_DIR}"
Run the script:
ruby download_model.rb
Organizing Your Files
After downloading, your project structure should look like:
ruby-embeddings/
├── download_model.rb
├── Gemfile
├── Gemfile.lock
└── models/
└── embeddinggemma/
├── config.json
├── onnx/
│ ├── model_q4.onnx
│ └── model_q4.onnx_data
├── tokenizer_config.json
└── tokenizer.json
The `models/` directory mirrors the Hugging Face repository layout, keeping the ONNX files in the `onnx/` subdirectory.
Model Variants
The onnx/ folder currently contains multiple exported variants. You only need one matching pair (.onnx + .onnx_data). In broad strokes:
- `model_q4.onnx` + `model_q4.onnx_data` (~197 MB weights) - smallest, good default for CPU
- `model_fp16.onnx` + `model_fp16.onnx_data` (~617 MB weights) - larger, FP16 variant
- `model.onnx` + `model.onnx_data` (~1.23 GB weights) - largest "full" export
You may also see additional variants like model_q4f16.onnx, model_no_gather_q4.onnx, or model_quantized.onnx intended for different runtimes or compatibility. If you’re just getting started, stick with model_q4.onnx.
Step 3: Understanding ONNX Models
ONNX is the “shipping container” format for models. The big win is practical: we can run a high-quality embedding model in a Ruby process without introducing a second runtime just to host model inference. That doesn’t make other stacks “worse” - it just keeps your deployment simpler when Ruby is already your home base.
The trade-off is that you, the application developer, now own a few low-level concerns that higher-level frameworks often hide:
- Inputs must match the model’s expected shapes and dtypes.
- Model initialization is expensive and involves native resources.
- You need to be intentional about lifecycle (threads/processes, caching, warmup).
The flow is still simple:
Text → Tokenizer → input_ids + attention_mask → ONNX Runtime → embedding vector
Before we write any wrapper code, inspect the model’s inputs/outputs. This saves you from “guess and debug” later:
# explore_model.rb
require 'onnxruntime'
# Point to your downloaded model
model_path = 'models/embeddinggemma/onnx/model_q4.onnx'
model = OnnxRuntime::Model.new(model_path)
puts "Model Inputs:"
model.inputs.each do |input|
puts " Name: #{input[:name]}"
puts " Type: #{input[:type]}"
puts " Shape: #{input[:shape].inspect}"
end
puts "\nModel Outputs:"
model.outputs.each do |output|
puts " Name: #{output[:name]}"
puts " Type: #{output[:type]}"
puts " Shape: #{output[:shape].inspect}"
end
Typical output for the EmbeddingGemma model:
Model Inputs:
Name: input_ids
Type: tensor(int64)
Shape: ["batch_size", "sequence_length"]
Name: attention_mask
Type: tensor(int64)
Shape: ["batch_size", "total_sequence_length"]
Model Outputs:
Name: last_hidden_state
Type: tensor(float)
Shape: ["batch_size", "sequence_length", 768]
Name: sentence_embedding
Type: tensor(float)
Shape: ["batch_size", 768]
Two gotchas worth calling out early:
- Many models expect `int64` inputs (Ruby integers are fine, but don't accidentally pass floats).
- Loading a model with `OnnxRuntime::Model.new(...)` is not a cheap object allocation - it maps/parses model data and allocates native resources. Treat it like opening a database connection: do it once, then reuse.
Step 4: Building the Tokenizer
The tokenizer step is mostly plumbing, but it’s also where a lot of subtle bugs sneak in. The model wants a rectangular batch: every input needs the same sequence length, with an attention_mask telling the model which tokens are “real” vs padding.
This concept exists in every ecosystem (Python, JS, Rust): tokenization is deterministic, but the defaults differ. When results look weird, it’s often because padding/truncation differed between “how you think it works” and “what your tokenizer actually did”.
We’ll use the tokenizers gem (Rust under the hood) so tokenization isn’t your performance bottleneck.
# lib/tokenizer.rb
require "tokenizers"
class Tokenizer
attr_reader :tokenizer
def initialize(tokenizer_path)
@tokenizer = Tokenizers::Tokenizer.from_file(tokenizer_path)
# Enable padding for this tokenizer
@tokenizer.enable_padding
end
def tokenize(texts)
texts = [ texts ] unless texts.is_a?(Array)
# Encode all texts with padding
result = @tokenizer.encode_batch(texts)
# Extract input_ids and attention_mask
input_ids = result.map(&:ids)
attention_mask = result.map(&:attention_mask)
{
input_ids: input_ids,
attention_mask: attention_mask
}
end
end
Tricky cases to keep in mind:
- Batching: encoding one string at a time is dramatically slower than `encode_batch`.
- Padding strategy: you generally want "pad to longest in batch", not "pad to max length" for every request.
- Long inputs: if you ingest untrusted text (web pages, logs), decide up front whether to truncate or reject inputs past the model's context window (a truncating variant is sketched below).
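If you decide to truncate, the tokenizer can enforce the cap for you. Here is a minimal sketch, assuming the enable_truncation(max_length) API exposed by tokenizers-ruby; the 2048-token cap comes from EmbeddingGemma's context length mentioned earlier:
# lib/tokenizer.rb (variant that also caps input length)
require "tokenizers"
class Tokenizer
  # EmbeddingGemma's maximum context length (see the model characteristics above).
  MAX_LENGTH = 2048
  attr_reader :tokenizer
  def initialize(tokenizer_path, max_length: MAX_LENGTH)
    @tokenizer = Tokenizers::Tokenizer.from_file(tokenizer_path)
    @tokenizer.enable_padding                 # pad to the longest sequence in the batch
    @tokenizer.enable_truncation(max_length)  # hard-cap long/untrusted inputs
  end
  def tokenize(texts)
    texts = [ texts ] unless texts.is_a?(Array)
    encodings = @tokenizer.encode_batch(texts)
    {
      input_ids: encodings.map(&:ids),
      attention_mask: encodings.map(&:attention_mask)
    }
  end
end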
Step 5: Building the Core Embedding Class
Now let’s build the main embedding class that ties everything together:
# lib/embedding_model.rb
require_relative 'tokenizer'
require 'onnxruntime'
class EmbeddingModel
EMBEDDING_DIM = 768
DEFAULT_MODEL_DIR = 'models/embeddinggemma'
PREFIXES = {
query: "task: search result | query: ",
document: "title: none | text: ",
classification: "task: classification | query: ",
clustering: "task: clustering | query: ",
qa: "task: question answering | query: ",
similarity: "task: sentence similarity | query: "
}.freeze
attr_reader :model, :tokenizer
def initialize(model_dir: DEFAULT_MODEL_DIR)
# Load ONNX model
@model_path = File.join(model_dir, 'onnx/model_q4.onnx')
@model = OnnxRuntime::Model.new(@model_path)
# Cache input/output names
@input_names = @model.inputs.map { |i| i[:name] }
@output_names = @model.outputs.map { |o| o[:name] }
# Initialize tokenizer
tokenizer_path = File.join(model_dir, 'tokenizer.json')
@tokenizer = Tokenizer.new(tokenizer_path)
end
# Generate embeddings for one or more texts
def embed(texts, task: :query)
texts = [texts] unless texts.is_a?(Array)
return [] if texts.empty?
# Add task-specific prefix
prefix = PREFIXES[task] || PREFIXES[:query]
prefixed_texts = texts.map { |t| prefix + t }
# Tokenize
tokens = @tokenizer.tokenize(prefixed_texts)
# Prepare input for ONNX model
inputs = {
@input_names[0] => tokens[:input_ids],
@input_names[1] => tokens[:attention_mask]
}
# Run inference
outputs = @model.predict(inputs)
# Extract embeddings
extract_embedding(outputs)
end
# Generate embedding for single text
def embed_single(text, task: :query)
embed(text, task: task).first
end
private
def extract_embedding(outputs)
embedding = if outputs.key?("sentence_embedding")
outputs["sentence_embedding"]
else
outputs[@output_names.first]
end
embedding
end
end
Prompt Instructions for EmbeddingGemma
EmbeddingGemma uses task-specific prompt prefixes to optimize embeddings for different use cases. These prefixes are part of the model’s training and significantly improve performance. Here’s what happens internally:
# For a search query:
model.embed_single("machine learning", task: :query)
# Internally becomes: "task: search result | query: machine learning"
# For indexing a document:
model.embed_single("Ruby is a programming language", task: :document)
# Internally becomes: "title: none | text: Ruby is a programming language"
Available task types:
| Task | Prefix | Use Case |
|---|---|---|
| `:query` | `task: search result \| query: ` | Search/retrieval queries |
| `:document` | `title: none \| text: ` | Indexing documents without titles |
| `:classification` | `task: classification \| query: ` | Text classification |
| `:clustering` | `task: clustering \| query: ` | Document clustering |
| `:qa` | `task: question answering \| query: ` | Question answering |
| `:similarity` | `task: sentence similarity \| query: ` | Semantic similarity |
Always use the appropriate task type for best results. Using query embeddings for documents (or vice versa) will produce suboptimal similarity scores.
This is one of those “it works but it’s wrong” failure modes. In a real app, it shows up as inconsistent search results that are hard to debug because nothing crashes.
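One way to see the effect on your own data (an illustrative check, not part of the tutorial's pipeline): embed the same document once with the correct :document prefix and once, incorrectly, with the :query prefix, then compare how each scores against a real query using the cosine_similarity helper we add in Step 7. Exact numbers vary by machine, but the two scores will generally differ.
model = EmbeddingModel.new
query = "How do I install Ruby gems?"
doc   = "Gems are installed with the gem install command or via Bundler."
query_emb    = model.embed_single(query, task: :query)
doc_as_doc   = model.embed_single(doc, task: :document)  # correct prefix for indexing
doc_as_query = model.embed_single(doc, task: :query)     # wrong prefix for a document
puts model.cosine_similarity(query_emb, doc_as_doc)
puts model.cosine_similarity(query_emb, doc_as_query)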
For more details, see the EmbeddingGemma Model Card.
Step 6: Adding Model Caching
Loading ONNX models is expensive - and doing it per request is one of the fastest ways to turn a “works on my laptop” demo into a production outage.
In Ruby web servers like Puma, multiple threads can try to initialize the model at the same time. Without a lock, you can end up loading the same model multiple times concurrently (slow) and spiking memory (dangerous). A Mutex around the cache prevents that.
One more lifecycle detail that trips people up: a cache is per-process. If you run 4 Puma workers, you will likely have 4 separate model instances in memory (one per worker). That can be totally fine - just budget for it. If memory is tight, prefer fewer processes with more threads, or run embeddings in a dedicated worker/service where you control concurrency.
Let’s add class-level caching:
# lib/embedding_model.rb (updated)
class EmbeddingModel
# ... existing code ...
# Class-level cache
@model_cache = {}
@cache_mutex = Mutex.new
class << self
def cached_model(model_path)
@cache_mutex.synchronize do
@model_cache[model_path] ||= OnnxRuntime::Model.new(model_path)
end
end
def clear_cache
@cache_mutex.synchronize do
@model_cache.clear
end
end
end
# Update initialize to use cached model
def initialize(model_dir: DEFAULT_MODEL_DIR)
@model_path = File.join(model_dir, "onnx/model_q4.onnx")
@model = self.class.cached_model(@model_path)
@input_names = @model.inputs.map { |i| i[:name] }
@output_names = @model.outputs.map { |o| o[:name] }
tokenizer_path = File.join(model_dir, "tokenizer.json")
@tokenizer = Tokenizer.new(tokenizer_path)
end
end
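A quick console check (optional) that the cache actually shares one native session: two instances built from the same model directory should hold the exact same OnnxRuntime::Model object.
require_relative "lib/embedding_model"
a = EmbeddingModel.new
b = EmbeddingModel.new
puts a.model.equal?(b.model)  # => true (same cached native session)
EmbeddingModel.clear_cache    # release the cached models when you need to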
Step 7: Implementing Similarity Functions
Now let’s add methods to compute similarity between embeddings:
Most embedding systems use cosine similarity because embeddings are meaningful primarily by direction (semantic “angle”), not absolute magnitude. If you use Euclidean distance on raw vectors, results can be skewed by vector norms.
One more practical note: if you ever truncate embeddings (MRL) or apply your own pooling, re-normalize afterwards - otherwise downstream scoring (especially dot-product shortcuts and cross-dimension comparisons) becomes inconsistent.
When you start debugging relevance, it helps to remember what cosine similarity looks like in practice:
- The mathematical range is [-1, 1].
- For many modern text embedding models, most “unrelated” pairs land near 0, and related pairs drift upward.
- Don’t overfit to a magic threshold across domains - always evaluate on your own data.
# lib/embedding_model.rb (add these methods)
# Compute cosine similarity between two embeddings
# Returns: Float between -1 and 1 (typically 0 to 1 for embeddings)
def cosine_similarity(vec1, vec2)
# Same vector optimization
return 1.0 if vec1.equal?(vec2)
# Compute dot product and norms
dot_product = 0.0
norm1_sq = 0.0
norm2_sq = 0.0
vec1.each_with_index do |v1, i|
v2 = vec2[i]
dot_product += v1 * v2
norm1_sq += v1 * v1
norm2_sq += v2 * v2
end
# Handle zero vectors
return 0.0 if norm1_sq.zero? || norm2_sq.zero?
# Cosine similarity: dot / (norm1 * norm2)
dot_product / Math.sqrt(norm1_sq * norm2_sq)
end
# Compute similarities between query and multiple documents
def similarities(query_embedding, document_embeddings)
document_embeddings.map do |doc_emb|
cosine_similarity(query_embedding, doc_emb)
end
end
# Rank documents by relevance to query
# Returns: Array of hashes with index, document, and score
def rank(query, documents)
# Generate embeddings with appropriate tasks
query_emb = embed_single(query, task: :query)
doc_embs = embed(documents, task: :document)
# Calculate similarity scores
scores = similarities(query_emb, doc_embs)
# Build results
results = scores.each_with_index.map do |score, idx|
{
index: idx,
document: documents[idx],
score: score
}
end
# Sort by score (highest first)
results.sort_by { |r| -r[:score] }
end
Step 8: Implementing Matryoshka Representation Learning
One of the most useful “systems” features of modern embedding models is MRL: you can truncate embeddings to smaller dimensions and still get decent results. This is a real lever in production: smaller vectors mean less RAM, less network bandwidth, and cheaper storage - at the cost of some retrieval quality.
This is also a nice example of “ML meets systems”: the model work was done upstream, but you (application dev) decide how to spend the budget. If you’re storing millions of vectors, dropping from 768 → 256 dims is often the difference between “fits in Postgres + pgvector” and “needs a dedicated vector store”.
# lib/embedding_model.rb (add this method)
# Truncate embedding to smaller dimension using MRL
# Dimensions: 768 (full) -> 512 -> 256 -> 128
# Accuracy loss: 0 -> -0.44 -> -1.47 -> -2.92 points
def truncate_embedding(embedding, dim: 256)
raise ArgumentError, "dim must be <= #{EMBEDDING_DIM}" if dim > EMBEDDING_DIM
# Return as-is if no truncation needed
return embedding.dup if dim >= embedding.length
# Truncate to first N dimensions
truncated = embedding[0, dim]
# Re-normalize (truncation changes the norm)
norm_sq = truncated.sum { |x| x * x }
return truncated if norm_sq.zero?
norm = Math.sqrt(norm_sq)
truncated.map { |x| x / norm }
end
Why re-normalize after truncation?
Truncation changes the vector's norm. Cosine similarity normalizes internally, so it tolerates this, but as soon as you score with a plain dot product (as the optimized index in Step 11 does) or compare vectors of mixed dimensions, non-unit vectors give inconsistent scores. Re-normalizing after truncation keeps similarity scores comparable regardless of dimension.
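To get a feel for how much you trade at each dimension on your own data (illustrative; absolute scores vary by machine and model variant), compare the full-dimension similarity against the truncated ones:
model = EmbeddingModel.new
a = model.embed_single("Ruby web frameworks", task: :similarity)
b = model.embed_single("Web development with Ruby", task: :similarity)
full = model.cosine_similarity(a, b)
[ 512, 256, 128 ].each do |dim|
  a_t = model.truncate_embedding(a, dim: dim)
  b_t = model.truncate_embedding(b, dim: dim)
  puts "dim=#{dim}: #{model.cosine_similarity(a_t, b_t).round(4)} (full: #{full.round(4)})"
end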
Step 9: Putting It All Together
Here’s the complete implementation for the EmbeddingGemma 300M model. Keep in mind that this is tutorial code; it doesn’t cover every edge case you’d handle in production.
# lib/embedding_model.rb
# frozen_string_literal: true
require_relative "tokenizer"
require "onnxruntime"
class EmbeddingModel
EMBEDDING_DIM = 768
DEFAULT_MODEL_DIR = "models/embeddinggemma"
PREFIXES = {
query: "task: search result | query: ",
document: "title: none | text: ",
classification: "task: classification | query: ",
clustering: "task: clustering | query: ",
qa: "task: question answering | query: ",
similarity: "task: sentence similarity | query: "
}.freeze
@model_cache = {}
@cache_mutex = Mutex.new
class << self
def cached_model(model_path)
@cache_mutex.synchronize do
@model_cache[model_path] ||= OnnxRuntime::Model.new(model_path)
end
end
def clear_cache
@cache_mutex.synchronize do
@model_cache.clear
end
end
end
attr_reader :model, :tokenizer
def initialize(model_dir: DEFAULT_MODEL_DIR)
@model_path = File.join(model_dir, "onnx/model_q4.onnx")
@model = self.class.cached_model(@model_path)
@input_names = @model.inputs.map { |i| i[:name] }
@output_names = @model.outputs.map { |o| o[:name] }
tokenizer_path = File.join(model_dir, "tokenizer.json")
@tokenizer = Tokenizer.new(tokenizer_path)
end
def embed(texts, task: :query)
texts = [ texts ] unless texts.is_a?(Array)
return [] if texts.empty?
prefix = PREFIXES[task] || PREFIXES[:query]
prefixed_texts = texts.map { |t| prefix + t }
tokens = @tokenizer.tokenize(prefixed_texts)
inputs = {
@input_names[0] => tokens[:input_ids],
@input_names[1] => tokens[:attention_mask]
}
outputs = @model.predict(inputs)
extract_embedding(outputs)
end
def embed_single(text, task: :query)
embed(text, task: task).first
end
def cosine_similarity(vec1, vec2)
return 1.0 if vec1.equal?(vec2)
dot_product = 0.0
norm1_sq = 0.0
norm2_sq = 0.0
vec1.each_with_index do |v1, i|
v2 = vec2[i]
dot_product += v1 * v2
norm1_sq += v1 * v1
norm2_sq += v2 * v2
end
return 0.0 if norm1_sq.zero? || norm2_sq.zero?
dot_product / Math.sqrt(norm1_sq * norm2_sq)
end
def similarities(query_embedding, document_embeddings)
document_embeddings.map do |doc_emb|
cosine_similarity(query_embedding, doc_emb)
end
end
def rank(query, documents)
query_emb = embed_single(query, task: :query)
doc_embs = embed(documents, task: :document)
scores = similarities(query_emb, doc_embs)
scores.each_with_index.map do |score, idx|
{ index: idx, document: documents[idx], score: score }
end.sort_by { |r| -r[:score] }
end
def truncate_embedding(embedding, dim: 256)
raise ArgumentError, "dim must be <= #{EMBEDDING_DIM}" if dim > EMBEDDING_DIM
return embedding.dup if dim >= embedding.length
truncated = embedding[0, dim]
norm_sq = truncated.sum { |x| x * x }
return truncated if norm_sq.zero?
norm = Math.sqrt(norm_sq)
truncated.map { |x| x / norm }
end
private
def extract_embedding(outputs)
embedding = if outputs.key?("sentence_embedding")
outputs["sentence_embedding"]
else
outputs[@output_names.first]
end
embedding
end
end
Step 10: Building a Complete Application
Let’s build a practical but very simple semantic search application:
# app.rb
require_relative "lib/embedding_model"
puts "Loading model..."
model = EmbeddingModel.new # Uses default model_dir: "models/embeddinggemma"
puts "Model loaded!\n\n"
# Example 1: Semantic Search
puts "=" * 60
puts "Example 1: Semantic Search"
puts "=" * 60
documents = [
"Ruby is a dynamic, open source programming language created by Yukihiro Matsumoto",
"Python is a high-level programming language popular in data science",
"JavaScript is primarily used for web development",
"Rails is a web framework written in Ruby",
"Queen likes wearing jewelry with small ruby gems"
]
query = "Tell me about Ruby programming language"
puts "Query: #{query}\n\n"
results = model.rank(query, documents)
puts "Results:"
results.each_with_index do |result, i|
puts "#{i + 1}. [#{result[:score].round(3)}] #{result[:document]}"
end
# Example 2: Document Similarity
puts "\n" + "=" * 60
puts "Example 2: Document Similarity"
puts "=" * 60
text1 = "Ruby gems are packages for Ruby libraries"
text2 = "Ruby packages are called gems"
text3 = "Python uses pip for package management"
text4 = "Cats like to watch fish swimming in the aquarium"
emb1 = model.embed_single(text1)
emb2 = model.embed_single(text2)
emb3 = model.embed_single(text3)
emb4 = model.embed_single(text4)
sim12 = model.cosine_similarity(emb1, emb2)
sim13 = model.cosine_similarity(emb1, emb3)
sim14 = model.cosine_similarity(emb1, emb4)
puts "Text 1: #{text1}"
puts "Text 2: #{text2}"
puts "Text 3: #{text3}"
puts "Text 4: #{text4}"
puts "\nSimilarity (1, 2): #{sim12.round(4)}"
puts "Similarity (1, 3): #{sim13.round(4)}"
puts "Similarity (1, 4): #{sim14.round(4)}"
# Example 3: Efficient Storage with MRL
puts "\n" + "=" * 60
puts "Example 3: Matryoshka Representation Learning"
puts "=" * 60
text = "Machine learning is a subset of artificial intelligence"
full_emb = model.embed_single(text)
puts "Full embedding: #{full_emb.size} dimensions"
[ 512, 256, 128 ].each do |dim|
truncated = model.truncate_embedding(full_emb, dim: dim)
puts "Truncated to #{dim} dimensions: #{truncated.size} dimensions"
end
# Example 4: Building a Simple Recommender
puts "\n" + "=" * 60
puts "Example 4: Document Recommender"
puts "=" * 60
class DocumentRecommender
def initialize(documents, model)
@model = model
@documents = documents
@embeddings = model.embed(documents, task: :document)
end
def recommend(query, top_k: 3)
query_emb = @model.embed_single(query, task: :query)
scores = @embeddings.map.with_index do |doc_emb, idx|
{
document: @documents[idx],
score: @model.cosine_similarity(query_emb, doc_emb)
}
end
scores.sort_by { |s| -s[:score] }.first(top_k)
end
end
corpus = [
"Ruby on Rails is a web application framework",
"Sinatra is a lightweight Ruby web framework",
"Django is a Python web framework",
"Flask is a microframework for Python"
]
recommender = DocumentRecommender.new(corpus, model)
puts "Query: 'Ruby web development'\n\n"
recommendations = recommender.recommend("Ruby web development")
recommendations.each_with_index do |rec, i|
puts "#{i + 1}. [#{rec[:score].round(3)}] #{rec[:document]}"
end
If you run it (ruby app.rb), you’ll see output like:
Loading model...
Model loaded!
============================================================
Example 1: Semantic Search
============================================================
Query: Tell me about Ruby programming language
Results:
1. [0.622] Ruby is a dynamic, open source programming language created by Yukihiro Matsumoto
2. [0.505] Rails is a web framework written in Ruby
3. [0.278] Queen likes wearing jewelry with small ruby gems
4. [0.254] JavaScript is primarily used for web development
5. [0.247] Python is a high-level programming language popular in data science
============================================================
Example 2: Document Similarity
============================================================
Text 1: Ruby gems are packages for Ruby libraries
Text 2: Ruby packages are called gems
Text 3: Python uses pip for package management
Text 4: Cats like to watch fish swimming in the aquarium
Similarity (1, 2): 0.9161
Similarity (1, 3): 0.4411
Similarity (1, 4): 0.079
============================================================
Example 3: Matryoshka Representation Learning
============================================================
Full embedding: 768 dimensions
Truncated to 512 dimensions: 512 dimensions
Truncated to 256 dimensions: 256 dimensions
Truncated to 128 dimensions: 128 dimensions
============================================================
Example 4: Document Recommender
============================================================
Query: 'Ruby web development'
1. [0.619] Ruby on Rails is a web application framework
2. [0.55] Sinatra is a lightweight Ruby web framework
3. [0.368] Django is a Python web framework
How to read this:
- Semantic search (Example 1): the “jewelry” sentence contains the token ruby, so it shows up (3rd), but it scores far below the actual programming-language results. This is the kind of ambiguity you want in your smoke test because it validates the embeddings are capturing meaning, not just keyword overlap.
- Similarity (Example 2): paraphrases usually land close (0.9161). A nearby-but-different topic can still be moderately similar (0.4411). Truly unrelated text should be near 0 (0.079). Treat the absolute values as heuristics — what you really care about is separation and ordering.
- MRL truncation (Example 3): the model gives you 768 dims, but you can store 256/128 dims for cheaper indexes. Expect some recall loss; the point is you can choose the trade-off per index/tier.
- Recommender (Example 4): pre-embedding the corpus with `task: :document` keeps query-time cheap, and you can see it ranks Ruby web docs above Python web docs.
Note: exact scores can vary across model variants/quantization, CPU instruction sets, and runtime versions. Use these numbers for sanity checks and relative ranking, not as a cross-machine “golden” baseline.
Step 11: Performance Optimization
Batch Processing
Always prefer batching, but keep two constraints in mind:
- Batch size is a latency + memory dial. Bigger batches improve throughput but increase tail latency and can blow up RAM if your inputs are long.
- Padding is real work. If you mix tiny strings and giant paragraphs in the same batch, you pay for the longest sequence.
The simple win is: batch, and cap batch size.
def embed_in_batches(model, texts, task:, batch_size: 32)
texts.each_slice(batch_size).flat_map do |chunk|
model.embed(chunk, task: task)
end
end
# Slow: one model call per string
embeddings = texts.map { |t| model.embed_single(t, task: :document) }
# Better: a few bigger model calls
embeddings = embed_in_batches(model, texts, task: :document, batch_size: 32)
If you ingest highly variable text lengths (titles + long pages), a pragmatic trick is to batch by length to reduce padding waste. Don’t over-engineer it; even a coarse bucket (“short” vs “long”) helps.
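Here is one way to do that coarse bucketing, as a sketch; the 200-character threshold is arbitrary and something you'd tune for your corpus. The only subtlety is restoring the original order at the end.
# Bucket texts by rough length so short strings don't pay padding cost for
# long ones, then restore the original order of the embeddings.
def embed_bucketed(model, texts, task:, batch_size: 32, short_threshold: 200)
  indexed = texts.each_with_index.to_a
  short, long = indexed.partition { |text, _| text.length < short_threshold }
  embeddings = Array.new(texts.length)
  [ short, long ].each do |bucket|
    bucket.each_slice(batch_size) do |chunk|
      model.embed(chunk.map(&:first), task: task).each_with_index do |emb, i|
        embeddings[chunk[i].last] = emb
      end
    end
  end
  embeddings
end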
Pre-compute Embeddings
If your corpus is static (docs, help center, product catalog), the fastest query is the one where you do no document-side inference.
Two practical upgrades to the “store embeddings and scan them” approach:
- Store truncated vectors (e.g. 256 dims) using MRL: smaller index, faster scoring.
- Normalize once, then score with a dot product: for unit vectors, cosine similarity equals dot product.
Here’s a small in-memory index that does both and avoids sorting the full corpus when you only need top-k.
class SemanticIndex
def initialize(documents, model, dim: 256)
@model = model
@documents = documents
@dim = dim
raw = model.embed(documents, task: :document)
@embeddings = raw.map { |e| @model.truncate_embedding(e, dim: @dim) }
end
def search(query, top_k: 5)
q = @model.embed_single(query, task: :query)
q = @model.truncate_embedding(q, dim: @dim)
top = [] # [{ score:, index: }]
@embeddings.each_with_index do |doc_emb, idx|
score = dot(q, doc_emb)
if top.length < top_k
top << { score: score, index: idx }
top.sort_by! { |r| r[:score] } # smallest first
next
end
next if score <= top.first[:score]
top[0] = { score: score, index: idx }
top.sort_by! { |r| r[:score] }
end
top.reverse.map { |r| { score: r[:score], document: @documents[r[:index]] } }
end
private
def dot(a, b)
sum = 0.0
a.each_with_index { |av, i| sum += av * b[i] }
sum
end
end
For a toy demo, keeping this in memory is fine. For anything real:
- Persist embeddings (and their `dim`) so you don't rebuild on every boot - a minimal sketch follows below.
- Consider a vector index (pgvector / dedicated store) once $n$ is large enough that $O(n)$ scans hurt.
The meta-rule: optimize the shape of work first (batch, cache, precompute), then optimize the math.
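For the "persist so you don't rebuild" point, even a flat file gets you surprisingly far before you need a database. A minimal sketch (the tmp/embeddings.json path and the count-based invalidation are placeholders; a real cache should also fingerprint the documents):
require "json"
require "fileutils"
CACHE_PATH = "tmp/embeddings.json"
def load_or_build_embeddings(model, documents, dim: 256)
  if File.exist?(CACHE_PATH)
    cached = JSON.parse(File.read(CACHE_PATH))
    # Only reuse the cache if it was built with the same dim and corpus size.
    return cached["embeddings"] if cached["dim"] == dim && cached["count"] == documents.length
  end
  embeddings = model.embed(documents, task: :document)
                    .map { |e| model.truncate_embedding(e, dim: dim) }
  FileUtils.mkdir_p(File.dirname(CACHE_PATH))
  File.write(CACHE_PATH, JSON.generate({ "dim" => dim, "count" => documents.length, "embeddings" => embeddings }))
  embeddings
end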
Step 12: Testing Your Implementation
When testing embedding code, focus on invariants. Most failures here aren’t “Ruby is broken”; they’re “the output has the right shape but the wrong semantics”.
Useful invariants to assert:
- Embedding dimensionality is always 768 (or your expected dimension after truncation).
- Cosine similarity is symmetric: `sim(a, b) ≈ sim(b, a)`.
- Truncated embeddings are re-normalized (unit norm) if you rely on cosine similarity or dot-product scoring.
- Ranking results are well-formed and ordered: each item has `index`, `document`, and `score`, and results are sorted by descending score.
These checks catch the most common mistakes: accidentally using last_hidden_state instead of the pooled embedding, forgetting to re-normalize after truncation, and subtle shape/ordering bugs in batching.
If you want one “semantic” smoke test, keep it coarse and deterministic: verify that a query about “Ruby programming” ranks Ruby-language documents above unrelated ones (avoid asserting exact floating-point scores across machines/versions).
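Here's what those invariants can look like as a Minitest file. This is a sketch: it loads the real model (so treat it as a slow, integration-style test), and the require path assumes the lib/ layout used in this tutorial.
# test/embedding_model_test.rb
require "minitest/autorun"
require_relative "../lib/embedding_model"
class EmbeddingModelTest < Minitest::Test
  def setup
    @model = EmbeddingModel.new
  end
  def test_embedding_dimension
    emb = @model.embed_single("hello world")
    assert_equal EmbeddingModel::EMBEDDING_DIM, emb.length
  end
  def test_cosine_similarity_is_symmetric
    a = @model.embed_single("ruby gems")
    b = @model.embed_single("python packages")
    assert_in_delta @model.cosine_similarity(a, b), @model.cosine_similarity(b, a), 1e-9
  end
  def test_truncated_embedding_is_unit_norm
    emb = @model.embed_single("ruby gems")
    truncated = @model.truncate_embedding(emb, dim: 256)
    norm = Math.sqrt(truncated.sum { |x| x * x })
    assert_in_delta 1.0, norm, 1e-6
  end
  def test_rank_orders_by_descending_score
    docs = [ "Ruby is a programming language", "Cats enjoy sleeping in boxes" ]
    results = @model.rank("Ruby programming", docs)
    assert_equal results, results.sort_by { |r| -r[:score] }
    assert results.first[:document].include?("Ruby")
  end
end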
Step 13: Deployment Considerations - Memory Management
Loading the ONNX model, tokenizer, and weight files on every request is expensive: repeated disk I/O, deserialization, and native resource allocation drive up latency, CPU usage, and memory churn. Cache the loaded model and tokenizer (lazy-load once), provide a controlled `clear_cache` method (e.g. `EmbeddingModel.clear_cache`) to release resources, and guard initialization with a Mutex for thread safety. For predictable latency, consider preloading at startup (a Puma sketch follows the service class below) or using worker processes/pools under high concurrency.
Operationally, this looks the same no matter what language you use:
- If you run multiple processes, expect memory usage to multiply.
- If you run multiple threads, you need to think about thread-safety around initialization and shared state.
- If you care about tail latency, warm up the model (load once) during boot rather than on the first user request.
There isn’t one “right” answer here - it depends on your traffic shape and budget. The important part is to make the trade-off explicit instead of letting the default server config choose for you.
class EmbeddingService
def initialize(model_dir: 'models/embeddinggemma')
@model_dir = model_dir
@model = nil
end
def model
@model ||= EmbeddingModel.new(model_dir: @model_dir)
end
def reset_model
EmbeddingModel.clear_cache
@model = nil
end
end
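If you serve traffic with Puma, a common way to make the warmup explicit is to load the model when each worker boots, so the first user request doesn't pay for it. A sketch (the relative require path is an assumption; adjust it to your app's layout, and remember each worker holds its own copy):
# config/puma.rb (excerpt)
on_worker_boot do
  require_relative "../lib/embedding_model"
  # Pay the model-loading cost at boot, once per worker process,
  # instead of on the first embedding request.
  EmbeddingModel.new
end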
Where to Go From Here
If you got this far, you already have the hard part: a clean bridge between Ruby and a native inference runtime. From here, the interesting work is product work and operations work.
Here are a few directions that are both useful and educational:
1. Choose an embedding “shape” for your system
- If you embed a handful of strings per request: keep it in-process and cache the model.
- If you embed large batches (indexing) or want isolation: run embeddings in a dedicated worker/service.
- If you’re cost-optimizing storage: use MRL truncation (e.g. 256 dims) and be consistent across indexing + querying.
2. Add an evaluation loop (even a tiny one)
- Collect ~20 real queries and the documents you expect to match.
- Measure whether the top-k results look right.
- This is more valuable than tweaking thresholds blindly.
3. Harden the pipeline
- Put explicit bounds on input size.
- Decide how you handle failures (retry vs fail fast).
- Warm up at boot if you care about first-request latency.
4. Scale storage and retrieval
- For small/medium projects, Postgres + pgvector is a great starting point.
- For larger corpora or specialized needs, look at dedicated vector stores such as Milvus or Weaviate (both available as managed services).
- (The “right” database depends more on ops constraints than on Ruby vs any other language.)
5. Experiment with deployment options
- ONNX Runtime can run on CPU very well; GPU can help when you have bigger models or higher throughput.
- If you do add a remote embedding API fallback, treat it as an operational trade-off (latency, privacy, cost), not a moral one.
Key Takeaways
- ONNX + Ruby is a pragmatic combo when Ruby is already your deployment target.
- `onnxruntime` + `tokenizers` do the heavy lifting: Ruby stays the orchestration layer, native code does the hot path.
- Task prefixes are a correctness requirement, not an optimization.
- Cosine similarity is simple, but not “free”: normalization and consistency (especially with truncation) matter.
- Batching and caching are the first performance wins before you reach for GPUs.
The broader point: you don’t need to pick a “one true language” for ML. You can learn the concepts and apply them in Ruby, while still benefiting from the same model formats and runtimes used across ecosystems.
Further Reading
Gems by @ankane:
- ankane/onnxruntime-ruby - Ruby bindings for ONNX Runtime
- ankane/tokenizers-ruby - Ruby bindings for HuggingFace Tokenizers
EmbeddingGemma Resources:
- EmbeddingGemma Model Card - Official model documentation
- Gemma Terms of Use - Licensing and usage terms
Model & Research:
- Matryoshka Representation Learning: https://arxiv.org/abs/2205.13147
- HuggingFace Tokenizers (Rust): https://github.com/huggingface/tokenizers
- ONNX Runtime Documentation: https://onnxruntime.ai/docs/
If you have questions, spot an error, or run into a bug while following this tutorial, you can reach me on X: @kskrzypinski.
Photo by Annie Spratt