DEJAN: Chrome’s Chunking & Embeddings Engine
- th3s3rp4nt
- Aug 21, 2025
- 2 min read
Updated: Sept 13, 2025
Dan Petrovic analyzed Google Chrome and identified the exact chunking mechanism and embedding logic it uses; with the decoded vector embeddings he can now browse, search, and cluster his own search history.
Key Takeaways:
In August 2024, Google officially announced Chrome’s AI-powered history search feature, allowing users to find previously visited pages
Analysis of the Chromium source code reveals the intricate technical details of this system, from its document processing algorithms to its vector storage mechanisms.
Web pages are chunked from DOM nodes/segments into passages, which are then transformed into vector embeddings directly on the client's device, inside the Chrome browser
The findings indicate: semantic HTML and clearly structured, concise paragraphs are essential for the algorithms to understand and summarize a page's content well (and to include all paragraphs as relevant)
Details:
The foundation, the DocumentChunker algorithm: a recursive tree-walking algorithm that processes the DOM structure of web pages into passages of at most 200 words each by default (third_party/blink/renderer/modules/content_extraction/document_chunker.h)
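The idea behind such a recursive tree walk can be sketched as follows. This is a simplified illustration in Python, not Blink's actual C++ implementation; the DOM-like dict structure and all function names are hypothetical, with only the 200-word default taken from the article.

```python
MAX_WORDS_PER_PASSAGE = 200  # default reported in the article

def chunk_node(node, passages, current):
    """Depth-first walk over a DOM-like tree, greedily packing words into passages."""
    for word in node.get("text", "").split():
        if len(current) >= MAX_WORDS_PER_PASSAGE:
            passages.append(" ".join(current))
            current.clear()
        current.append(word)
    for child in node.get("children", []):
        chunk_node(child, passages, current)

def chunk_document(root):
    """Chunk an entire document tree into a list of passage strings."""
    passages, current = [], []
    chunk_node(root, passages, current)
    if current:  # flush the last partially filled passage
        passages.append(" ".join(current))
    return passages
```

A 252-word document, for instance, would yield one full 200-word passage plus a 52-word remainder; the real DocumentChunker additionally applies the per-page passage cap and minimum-length filter listed in the table below.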
Embedding generation: each text passage is converted into a 1540-dimensional vector using Google's proprietary embedding models. These vectors capture semantic meaning in a high-dimensional space, enabling similarity searches that go beyond simple keyword matching.
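Similarity search over such vectors is typically done with cosine similarity; the following minimal sketch shows the principle (the function names and the tiny 2-dimensional vectors are illustrative, not Chrome's internals).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_passages(query_vec, indexed_passages, k=3):
    """Return the k (passage, vector) pairs most similar to the query vector."""
    return sorted(indexed_passages,
                  key=lambda item: cosine_similarity(query_vec, item[1]),
                  reverse=True)[:k]
```

Because cosine similarity compares directions rather than exact values, a query embedding can match passages that share meaning but no keywords.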
The generated embeddings are stored in Chrome’s history database within a specialized embeddings_blob field. This storage mechanism uses several layers of optimization: the embeddings are first serialized using Protocol Buffers, then compressed using gzip compression, and finally encrypted using Chrome’s OS-level encryption services before being written to the SQLite database.
Embeddings are indexed by URL ID and visit ID, enabling efficient retrieval during search operations. The system maintains a separate passages table that stores original text content alongside references to corresponding embeddings.
Chrome's "Answerer" for browsing one's search history: the system represents a form of personalized retrieval-augmented generation (RAG), where the user's own browsing history serves as the knowledge base. (The Answerer first identifies relevant passages through embedding similarity search, then aggregates them until a minimum word-count threshold is met (1,000 words by default). This aggregated content serves as context for generating comprehensive answers to user queries.)
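The aggregation step of that RAG flow is simple enough to sketch: take passages in relevance order until the word budget is met. Only the 1,000-word default comes from the article; the function itself is an illustrative assumption.

```python
MIN_CONTEXT_WORDS = 1000  # Answerer default reported in the article

def aggregate_context(ranked_passages, min_words=MIN_CONTEXT_WORDS):
    """Concatenate passages (already sorted by relevance) until the word budget is met."""
    context, total = [], 0
    for passage in ranked_passages:
        context.append(passage)
        total += len(passage.split())
        if total >= min_words:
            break
    return " ".join(context)
```

With 400-word passages, for example, three passages (1,200 words) would be aggregated before the context is handed to the answer-generation model.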
| Key feature | What for? | Limit (default) |
| --- | --- | --- |
| max_words_per_aggregate_passage | Controls passage length | 200 |
| max_passages_per_page | Limits passages per page | 30 |
| search_passage_minimum_word_count | Minimum passage length | 5 |
| passage_extraction_delay | Delay after page load to get a fully loaded DOM | 5000 ms |
| ml_answerer_min_score | Minimum confidence for AI answers | |
| content_visibility_threshold | Safety filtering threshold | |
| search_score_threshold | Relevance threshold for search results | |
Further fyi:
"Chrome’s latest update incorporates a new text embedding model that is 57% smaller (35.14MB vs 81.91MB) than its predecessor while maintaining virtually identical performance in semantic search tasks.
The size reduction was achieved primarily through quantization of the embedding matrix from float32 to int8 precision, with no measurable degradation in embedding quality or search ranking."
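Symmetric int8 quantization, the technique the quoted note credits for the size reduction, maps each float32 value to an integer in [-127, 127] plus a scale factor; storing 1 byte instead of 4 per value is where most of the savings come from. The sketch below is a generic illustration with made-up values, not the actual quantization scheme of Chrome's model.

```python
def quantize_int8(row):
    """Symmetric per-row quantization: floats -> ints in [-127, 127] plus a scale."""
    scale = max(abs(v) for v in row) / 127.0 or 1.0  # avoid div-by-zero on all-zero rows
    return [round(v / scale) for v in row], scale

def dequantize(qrow, scale):
    """Approximate reconstruction of the original floats."""
    return [q * scale for q in qrow]

row = [0.8, -0.4, 0.1, 0.0]
qrow, scale = quantize_int8(row)
approx = dequantize(qrow, scale)
# Each reconstructed value differs from the original by at most scale / 2.
```

Ranking by cosine similarity is largely insensitive to this per-value rounding, which is consistent with the reported lack of measurable quality degradation.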