10 min read · ~1-2 hour project · ~$15 · Beginner

Overview

In this example, we’re going to demonstrate how to easily embed over 4M document chunks to create a searchable index. The documents we’ll be embedding are the entire corpus of Apple’s patent literature. By the end of this guide we’ll be able to search for things like “wireless charging technology”, “biometric authentication methods”, or even complex queries like “patents related to reducing battery consumption in mobile devices” - and get relevant patent results in milliseconds.

Why Embeddings and Vector Search Matter

Traditional keyword search breaks down when users don’t know the exact terminology - searching for “making phones last longer” won’t find documents about “battery optimization.” Embeddings solve this by converting text into vectors that better capture semantic meaning, enabling search that actually understands intent rather than just matching strings. What once required teams of search quality experts can now be implemented in an afternoon for around $15, as we’ll demonstrate by making 30,000 Apple patents (split into over 4M document chunks) semantically searchable.

Data Source

The source for the corpus of patent documents will be Google BigQuery, where we can query for the full text and all relevant metadata. The results of that query (the base for our embeddings) can be found in this HuggingFace Dataset.
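The exact query isn’t reproduced here, but as a rough sketch, a full-text pull could look something like the snippet below. The table and field names are assumptions based on the public patents-public-data.patents.publications dataset, so check them against the dataset schema before running anything:
# Rough sketch of a full-text query against the public Google Patents
# dataset; field names here are assumptions, and the query we actually
# ran may differ (its results are in the HuggingFace dataset linked above)
query_full_text = """
SELECT
  p.publication_number,
  (SELECT text FROM UNNEST(p.title_localized) WHERE language = 'en' LIMIT 1) AS title,
  (SELECT text FROM UNNEST(p.abstract_localized) WHERE language = 'en' LIMIT 1) AS abstract,
  (SELECT text FROM UNNEST(p.claims_localized) WHERE language = 'en' LIMIT 1) AS claims,
  (SELECT text FROM UNNEST(p.description_localized) WHERE language = 'en' LIMIT 1) AS description,
  p.filing_date,
  p.publication_date
FROM `patents-public-data.patents.publications` AS p,
  UNNEST(p.assignee_harmonized) AS assignee
WHERE assignee.name LIKE '%APPLE INC%'
"""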
We’ll store the results from that query into a Polars DataFrame using the below snippet. Polars is great for efficiently manipulating large datasets.
import json
import os

import polars as pl
from google.cloud import bigquery
from google.oauth2 import service_account

# Authenticate
credentials = service_account.Credentials.from_service_account_info(
    json.loads(os.environ['SERVICE_ACCOUNT_JSON'])
)
project_id = 'xxxxx'
client = bigquery.Client(project=project_id, credentials=credentials)

# Run query
query_job = client.query(query_full_text)
results = query_job.result()
patents_df = pl.from_arrow(results.to_arrow())

Document Chunking

After retrieving and storing the patent results, we’ll need to split all the relevant sections into smaller chunks of text.

Why Chunking is Essential

Chunking is essential when working with embeddings, mainly because of semantic precision: each chunk should carry a single, clearly interpretable piece of contextual meaning. Smaller chunks allow for more precise retrieval of relevant information, but there is a critical balance to strike.
Too small chunks (e.g., 50-100 tokens):
  • ✅ High precision - returns exactly what matches
  • ❌ Loss of contextual meaning - “battery life” might miss “wasn’t that good” in the next sentence
  • ❌ Fragmented results - you might need to retrieve multiple chunks to get a complete idea
  • ❌ More vectors to store and search (higher costs)
Too large chunks (e.g., 2000+ tokens):
  • ✅ Rich context preserved - full patent claims or entire technical descriptions
  • ✅ Fewer vectors to manage (lower costs)
  • ❌ Diluted relevance - a chunk about “display technology” might rank poorly for “OLED” even if it contains relevant OLED information buried within
  • ❌ Multiple concepts per chunk - retrieval becomes less precise
The sweet spot (typically 200-800 tokens - see the token-count sketch after this list) depends on:
  • Your content type (patent claims are self-contained; descriptions are narrative)
  • Search intent (looking for specific facts vs. understanding concepts)
  • Embedding model characteristics (some models better preserve semantics in longer sequences)
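To sanity-check where your chunks land in that range, it helps to count tokens with the same tokenizer the embedding model uses. Here’s a minimal sketch using Hugging Face’s transformers tokenizer for Qwen3-Embedding-0.6B, the model we end up using later in this guide (the sample string is just a placeholder):
from transformers import AutoTokenizer

# Tokenizer matching the embedding model we use later in this guide
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B")

def token_len(text: str) -> int:
    # Count tokens the way the embedding model will see them
    return len(tokenizer.encode(text, add_special_tokens=False))

print(token_len("A wireless charging coil assembly comprising a ferrite core..."))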

Chunking Strategy

For this example, we’re using a simple fixed-size chunking strategy with overlap, but there are many approaches to consider:
  • Fixed-size chunking: Simple and predictable, splits text every N characters/tokens
  • Semantic chunking: Uses NLP to find natural boundaries (sentences, paragraphs, sections)
  • Recursive chunking: Hierarchically splits documents while preserving structure
  • Corpus-specific chunking: For patents, you might chunk by claims, abstract, description sections; for code, you might designate chunks by function boundaries
You can view the implementation of our chunking strategy in the collapsible below. We chose to write this ourselves for simplicity, but we know teams like Unstructured also do a great job here!
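As a reference point, a minimal fixed-size chunker with overlap could look something like this sketch. The chunk_text helper, the 1,000-character window with 200-character overlap, and the column names are illustrative assumptions rather than our exact implementation - the important part is producing the doc_id, patent_id, section, text, and metadata fields used throughout the rest of this guide:
import uuid

def chunk_text(text, chunk_size=1000, overlap=200):
    # Slide a fixed-size window across the text with some overlap so
    # ideas that span a boundary still appear together in one chunk
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def prepare_patent_for_embedding(row):
    # Turn a single patent row into a list of chunk-level documents,
    # keeping enough metadata to attribute each chunk back to its
    # patent and section later on
    documents = []
    for section in ['title', 'abstract', 'claims', 'description']:
        text = row.get(section) or ''
        if not text:
            continue
        for i, chunk in enumerate(chunk_text(text)):
            documents.append({
                'doc_id': str(uuid.uuid4()),
                'patent_id': row['publication_number'],
                'section': section,
                'text': chunk,
                'metadata': {'chunk_index': i},
            })
    return documents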
# Process patents for embedding
all_documents = []
for row in patents_df.iter_rows(named=True):
    all_documents.extend(prepare_patent_for_embedding(row))

Preparing Data for Sutro

Once we have our patent documents split into chunked sections, we can pass them to Sutro to transform them into embeddings! Since we’re working with such a large amount of data, we’re going to take advantage of Sutro’s Datasets feature. This allows us to upload our dataset once (split across multiple files if necessary). We’re also going to do some minor data partitioning and split our single dataframe into multiple ~500MB Parquet files, which will make the uploading process more robust. Let’s do that first:
docs_df = pl.DataFrame({
    'doc_id': [doc['doc_id'] for doc in all_documents],
    'patent_id': [doc['patent_id'] for doc in all_documents],
    'section': [doc['section'] for doc in all_documents],
    'text': [doc['text'] for doc in all_documents],
    'metadata': [json.dumps(doc['metadata']) for doc in all_documents]
})

# Create output directory
output_dir = "apple_patents_documents_parquet"
os.makedirs(output_dir, exist_ok=True)

# Calculate rows per chunk for ~500MB files
total_size_mb = docs_df.estimated_size() / (1024 * 1024)
num_files = max(1, int(total_size_mb / 500) + 1)
rows_per_chunk = len(docs_df) // num_files + 1

# Split and save
for i in range(0, len(docs_df), rows_per_chunk):
    chunk = docs_df[i:i+rows_per_chunk]
    filename = f"{output_dir}/documents_part_{i//rows_per_chunk:03d}.parquet"
    chunk.write_parquet(filename, compression='snappy')

print(f"Saved {len(docs_df)} documents across {num_files} files to {output_dir}/")
Now we’ll create and upload to a Dataset. These two lines of code upload all the data necessary to run our embedding job.
dataset_id = sutro.create_dataset()
sutro.upload_to_dataset(dataset_id, output_dir)

Running the Embedding Job

Once we have our dataset fully loaded, we can use it to run our embedding job. For this example, we chose to use Qwen3-Embedding-0.6B; at only 595M parameters it’s lightweight and cheap to run, yet it still performs very well on relevant tasks like retrieval and re-ranking. See the MTEB leaderboard for a more in-depth, numerical comparison. Now we’ll kick off our embedding job with the code below:
job_id = sutro.infer(
    dataset_id,
    column="text",
    model="qwen-3-embedding-0.6b",
    job_priority=1,
)

sutro.await_job_completion(job_id, obtain_results=False)
Note: We’re using job_priority=1 here. Sutro currently has a notion of job priorities, which is essentially how we designate job SLAs. Currently, we have two priorities:
  • Priority 0: Prototyping jobs (for small scale testing, targeted at <= 10m completion time)
  • Priority 1: Production oriented jobs (large scale jobs, targeted at 1hr completion time)
More details about job priorities can be found here.
Once the job has started, we wait for it to complete using await_job_completion(...). This will periodically poll the job’s status until it’s complete; alternatively, we can monitor it via Sutro’s web UI. We also disabled automatically fetching the results; instead, we can just download our original Dataset with the job results column appended.
results_dir='/mnt/cooper-notebook-volume/apple_patents_embeddings'
os.makedirs(results_dir, exist_ok=True)
sutro.download_from_dataset(dataset_id, output_path=results_dir)
job_results_df = pl.read_parquet(results_dir)

print(f"Total patents: {len(job_results_df)}")
print(f"Columns: {job_results_df.columns}")
# Total patents: 4039988
# Columns: ['text', 'job-76844041-b2bf-4248-9603-b7f750231b34']
If you want to play around with the embeddings yourself, the entire set can be found here on HuggingFace!

Loading into the Vector Database

Now we get to upload the 4M embeddings we just pulled down from Sutro into a vector database, so that we can search over the entire corpus. When uploading, we want to preserve all the metadata associated with each chunk, so that we can correctly attribute a retrieved vector to the right patent and section within the patent.
vector_col_name = 'job-76844041-b2bf-4248-9603-b7f750231b34'

# Combine the embeddings with docs_df containing attributing IDs and metadata
combined_df = docs_df.with_columns(
    job_results_df[vector_col_name]
)
We chose Qdrant as our vector DB for this example; it’s performant and easy enough to get started with. However, there are many options out there that are well adapted to different use cases, notable ones being:
  • TurboPuffer - Great for multi-tenant architectures with many tenants
  • Chroma - Simple and developer-friendly
  • pgvector - If you’re already using PostgreSQL
Since we have such a large dataset, we want to upload using batches:
import json

from qdrant_client import QdrantClient, models
from tqdm import tqdm

vector_col_name = 'job-76844041-b2bf-4248-9603-b7f750231b34'
# We're using an in-memory DB here for convenience, but Qdrant Cloud
# can be faster to upload to and will persist the embeddings as well
client = QdrantClient(":memory:")
collection_name = "apple_patents_collection"

# Create the collection, inferring vector size from the first
# row of the DataFrame
client.recreate_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=len(combined_df[vector_col_name][0]),
        distance=models.Distance.COSINE
    ),
)

BATCH_SIZE = 8192

# Loop through the DataFrame in batches
for i in tqdm(range(0, len(combined_df), BATCH_SIZE), desc="Uploading to Qdrant"):
    batch_df = combined_df.slice(i, BATCH_SIZE)

    points = [
        models.PointStruct(
            id=row['doc_id'],
            vector=row[vector_col_name],
            payload={
                'patent_id': row['patent_id'],
                'section': row['section'],
                'text': row['text'],
                **json.loads(row['metadata'])
            }
        ) for row in batch_df.to_dicts()
    ]

    client.upload_points(
        collection_name=collection_name,
        points=points,
        wait=True,
        parallel=4
    )

print(f"Finished uploading all {len(combined_df)} points.")

Searching Your Patents

Now that we have our embeddings loaded, we can search over them!
def search_patents(query_text, top_k=5):
    # Pick your real time provider of choice, we have heard great
    # things (latency & consistency wise) about Vertex, but Baseten
    # is the easier choice
    # https://www.baseten.co/library/qwen3-06b-embedding/
    query_embedding = real_time_api(
        text=query_text,
        model="qwen-3-embedding-0.6b"
    )

    # Search in Qdrant
    results = client.search(
        collection_name=collection_name,
        query_vector=query_embedding,
        limit=top_k
    )

    for result in results:
        print(f"Patent: {result.payload['patent_id']}")
        print(f"Section: {result.payload['section']}")
        print(f"Score: {result.score:.3f}")
        print(f"Text: {result.payload['text'][:200]}...")
        print("-" * 80)

# Let's test it out!
search_patents("wireless charging efficiency improvements")

Example Queries We Tried

  • wireless charging efficiency improvements
  • biometric authentication using facial recognition
  • battery thermal management
  • haptic feedback for touch interfaces

Interestingly, none of the results for our queries seem very good! We imagine this is mainly due to a few things:
  1. The language used in our queries is very different from the language used in the patent documents, so the similarity between the query-document pairs is generally not great. There are well-known fixes to this problem; commonly, HyDE is used to generate queries that read more like the language in a real document, and thus retrieve better results for the same source query.
  2. Under-retrieving: we’re currently only retrieving the first 5 documents, which is not very many; if we retrieved more documents, it’s likely we’d have more relevant snippets in our results.
  3. Not re-ranking: combining a higher top_k with a re-ranking step can lead to finding the most relevant set of documents. These two techniques used together can be very powerful, and the combination is common among the folks we talk to who use vector search in production (see the sketch after this list).
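To make points 1-3 concrete, here’s a rough sketch of how a HyDE-style query, over-retrieval, and re-ranking could be layered on top of the search setup above. real_time_api is the same placeholder as before; call_your_llm and the cross-encoder re-ranker model are hypothetical stand-ins you’d swap for your own LLM provider and re-ranking model of choice:
from sentence_transformers import CrossEncoder

# Example re-ranking model; swap in whichever re-ranker you prefer
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def generate_hypothetical_patent_text(query_text):
    # Hypothetical HyDE helper: ask an LLM to write a short passage in
    # patent-style language that answers the query. call_your_llm is a
    # stand-in for your LLM provider of choice.
    prompt = (
        "Write one paragraph, in the style of a patent description, "
        f"about: {query_text}"
    )
    return call_your_llm(prompt)

def search_patents_v2(query_text, top_k=5, over_retrieve=50):
    # 1) HyDE: embed a hypothetical patent-style passage instead of the
    #    raw query, so the vector sits closer to document language
    hyde_text = generate_hypothetical_patent_text(query_text)
    query_embedding = real_time_api(text=hyde_text, model="qwen-3-embedding-0.6b")

    # 2) Over-retrieve: pull back many more candidates than we need
    candidates = client.search(
        collection_name=collection_name,
        query_vector=query_embedding,
        limit=over_retrieve,
    )

    # 3) Re-rank (query, chunk) pairs and keep the best few
    scores = reranker.predict(
        [(query_text, c.payload["text"]) for c in candidates]
    )
    reranked = sorted(zip(candidates, scores), key=lambda s: s[1], reverse=True)
    return [c for c, _ in reranked[:top_k]]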

Scale, Cost & Speed Breakdown

Scale

  • Chunk count: 4.04M
  • Input token count: 879.5M

Cost Breakdown

  • BigQuery query: ~$6
  • Sutro embedding generation: $8.80
  • Total: ~$14.80

Time

  • Job completion time: 44 minutes

Conclusion

In this guide, we’ve demonstrated how Sutro makes it trivial to:
  • Go from source documents to a searchable index in under 2 hours
  • Generate high-quality embeddings using state-of-the-art models
  • Build a semantic search system that can easily be productionized
The entire pipeline - from data extraction to searchable index - can be run from a Jupyter notebook and costs under $20. Sutro handled all the worker fan-out, inference, and fault tolerance automatically.

Next Steps

  • Try different embedding models for your use case
  • Experiment with different techniques to improve retrieval quality
    • Hybrid search (combining embeddings with keyword search)
    • HyDE
    • Over retrieval and re-ranking
  • Productionize this workflow as part of an event-driven pipeline that creates new indices for every X event (say, a new user signing up)

Resources