
DEJAN: Chrome’s Chunking & Embeddings Engine

  • th3s3rp4nt
  • Aug 21, 2025
  • 2 min read

Updated: Sep 13, 2025

Dan Petrovic analyzed Google Chrome and identified the exact chunking mechanism and embedding logic it uses, which allowed him to browse, search and cluster his own search history through decoded vector embeddings.


Key Takeaways:

  • In August 2024, Google officially announced Chrome’s AI-powered history search feature, which lets users find previously visited pages with natural-language queries.

  • Analysis of the Chromium source code reveals the intricate technical details of this system, from its document processing algorithms to its vector storage mechanisms.

  • Web pages are chunked from DOM nodes/segments into passages, which are then transformed into vector embeddings in the Chrome browser, directly on the client’s device.

  • The findings indicate that semantic HTML and clearly structured, concise paragraphs are essential for the algorithm to understand and summarize a page’s content well (and to treat all of its paragraphs as relevant).


Details:

  • The foundation: the DocumentChunker algorithm, a recursive tree-walking algorithm that processes the DOM structure of a web page into passages with a default maximum of 200 words each (third_party/blink/renderer/modules/content_extraction/document_chunker.h); a sketch of the idea follows after this list.

  • Embedding generation: The embedding process converts each text passage into a 1540-dimensional vector using Google’s proprietary embedding models. These vectors capture semantic meaning in a high-dimensional space, enabling similarity searches that go beyond simple keyword matching.

    The generated embeddings are stored in Chrome’s history database within a specialized embeddings_blob field. This storage mechanism uses several layers of optimization: the embeddings are first serialized using Protocol Buffers, then compressed using gzip compression, and finally encrypted using Chrome’s OS-level encryption services before being written to the SQLite database.

    Embeddings are indexed by URL ID and visit ID, enabling efficient retrieval during search operations. The system maintains a separate passages table that stores the original text content alongside references to the corresponding embeddings. A sketch of this storage flow follows after this list.

  • Chrome’s "Answerer" for browsing one’s search history: a system that represents a form of personalized retrieval-augmented generation (RAG), where the user’s own browsing history serves as the knowledge base. The Answerer first identifies relevant passages through embedding similarity search, then aggregates these passages until a minimum word count threshold is met (1000 words by default). This aggregated content serves as context for generating comprehensive answers to user queries; see the retrieval sketch after this list.
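
For illustration, here is a minimal Python sketch of the kind of recursive tree walk the DocumentChunker performs. This is not Chrome’s C++ implementation: the Node structure and the greedy word budget are simplifying assumptions, and only the 200-word default comes from the source file named above.

```python
from dataclasses import dataclass, field

MAX_WORDS_PER_PASSAGE = 200  # default passage limit reported for Chrome's DocumentChunker


@dataclass
class Node:
    """Simplified stand-in for a DOM node: its own text plus child nodes."""
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def chunk(node: Node, passages: list[str], current: list[str]) -> None:
    """Recursively walk the tree, packing node text into passages of at most
    MAX_WORDS_PER_PASSAGE words. A single oversized node is not split here."""
    words = node.text.split()
    if current and len(current) + len(words) > MAX_WORDS_PER_PASSAGE:
        passages.append(" ".join(current))
        current.clear()
    current.extend(words)
    for child in node.children:
        chunk(child, passages, current)


def document_to_passages(root: Node) -> list[str]:
    passages: list[str] = []
    current: list[str] = []
    chunk(root, passages, current)
    if current:
        passages.append(" ".join(current))
    return passages


# Example: a tiny "page" with a heading and two long paragraphs
page = Node(children=[
    Node("Chrome history search"),
    Node("This paragraph describes the feature. " * 30),
    Node("A second paragraph with more details. " * 30),
])
print(len(document_to_passages(page)), "passages")
```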

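To make the storage layering concrete, here is a small Python sketch of the described pipeline: serialize, compress, encrypt, and store in SQLite, keyed by URL ID and visit ID. The table layout, the packed float array standing in for the Protocol Buffer message, and the no-op encrypt() stub replacing Chrome’s OS-level encryption service are assumptions for illustration only.

```python
import gzip
import sqlite3
import struct


def encrypt(blob: bytes) -> bytes:
    """Stand-in for Chrome's OS-level encryption service; a real implementation
    would call into the platform keychain/DPAPI. Assumed no-op in this sketch."""
    return blob


def embedding_to_blob(vector: list[float]) -> bytes:
    # A packed float array stands in for the Protocol Buffer message,
    # then the result is gzip-compressed and "encrypted", as described above.
    serialized = struct.pack(f"{len(vector)}f", *vector)
    return encrypt(gzip.compress(serialized))


def blob_to_embedding(blob: bytes) -> list[float]:
    raw = gzip.decompress(blob)  # decryption would come first in the real flow
    return list(struct.unpack(f"{len(raw) // 4}f", raw))


# Hypothetical schema mirroring the described layout: passages keep the text,
# embeddings keep the blob, both keyed by url_id and visit_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE passages   (url_id INTEGER, visit_id INTEGER, passage_text TEXT);
CREATE TABLE embeddings (url_id INTEGER, visit_id INTEGER, embeddings_blob BLOB);
""")

vector = [0.1] * 1540  # the article reports 1540-dimensional vectors
conn.execute("INSERT INTO passages VALUES (?, ?, ?)", (1, 42, "Example passage text"))
conn.execute("INSERT INTO embeddings VALUES (?, ?, ?)", (1, 42, embedding_to_blob(vector)))

row = conn.execute(
    "SELECT embeddings_blob FROM embeddings WHERE url_id = 1 AND visit_id = 42"
).fetchone()
print(len(blob_to_embedding(row[0])))  # -> 1540
```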

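And a minimal sketch of the retrieval step behind the Answerer: rank passages by cosine similarity to the query embedding, then aggregate the top passages until the 1000-word context minimum is reached. Only the 1000-word default and the retrieve-then-aggregate flow come from the article; the function shape, scoring details, and embedding inputs are assumptions.

```python
import math

MIN_CONTEXT_WORDS = 1000  # default minimum aggregate word count reported for the Answerer


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_answer_context(query_embedding: list[float],
                         passages: list[tuple[str, list[float]]]) -> str:
    """Rank (text, embedding) pairs by similarity to the query, then aggregate
    the best passages until the minimum context word count is reached."""
    ranked = sorted(passages,
                    key=lambda p: cosine_similarity(query_embedding, p[1]),
                    reverse=True)
    selected: list[str] = []
    word_count = 0
    for text, _embedding in ranked:
        selected.append(text)
        word_count += len(text.split())
        if word_count >= MIN_CONTEXT_WORDS:
            break
    return "\n\n".join(selected)  # passed to the answer-generating model as context
```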

Key Feature                        | What for?                                       | Limit (default)
max_words_per_aggregate_passage    | Controls passage length                         | 200
max_passages_per_page              | Limits passages per page                        | 30
search_passage_minimum_word_count  | Minimum passage length                          | 5
passage_extraction_delay           | Delay after page load to get a fully loaded DOM | 5000 ms
ml_answerer_min_score              | Minimum confidence for AI answers               |
content_visibility_threshold       | Safety filtering threshold                      |
search_score_threshold             | Relevance threshold for search results          |


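Read together, the extraction parameters suggest a filtering step of roughly the following shape. This is a hypothetical sketch, assuming the limits are applied as simple filters; the parameter names and defaults come from the table above (thresholds without a listed default are omitted), while the surrounding function is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class PassageExtractionParams:
    # Defaults taken from the table above.
    max_words_per_aggregate_passage: int = 200
    max_passages_per_page: int = 30
    search_passage_minimum_word_count: int = 5
    passage_extraction_delay_ms: int = 5000  # wait this long after load before extracting


def select_passages(raw_passages: list[str], params: PassageExtractionParams) -> list[str]:
    """Drop passages that are too short, then cap how many are kept per page."""
    kept = [p for p in raw_passages
            if len(p.split()) >= params.search_passage_minimum_word_count]
    return kept[:params.max_passages_per_page]
```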

Further fyi:

  • "Chrome’s latest update incorporates a new text embedding model that is 57% smaller (35.14MB vs 81.91MB) than its predecessor while maintaining virtually identical performance in semantic search tasks.

  • The size reduction was achieved primarily through quantization of the embedding matrix from float32 to int8 precision, with no measurable degradation in embedding quality or search ranking."
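
As a generic illustration of quantizing an embedding matrix from float32 to int8: the symmetric per-tensor scaling used here is an assumption, since the article only states that the matrix was quantized to int8. The 4x reduction in bytes per weight is where most of the reported size saving comes from.

```python
import numpy as np


def quantize_int8(matrix: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: map float32 values to int8 with one scale."""
    scale = float(np.abs(matrix).max()) / 127.0
    quantized = np.clip(np.round(matrix / scale), -127, 127).astype(np.int8)
    return quantized, scale


def dequantize(quantized: np.ndarray, scale: float) -> np.ndarray:
    return quantized.astype(np.float32) * scale


# float32 uses 4 bytes per weight, int8 uses 1 byte per weight
weights = np.random.randn(10_000, 1_540).astype(np.float32)
q, scale = quantize_int8(weights)
print(weights.nbytes / q.nbytes)  # -> 4.0
```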




