
DEJAN & Salomon: How ChatGPT sees the Web & its hidden Cache

  • Author: th3s3rp4nt
  • Nov 25, 2025
  • 3 min read

Updated: Nov 30, 2025

Key Takeaways:

  • DEJAN/Dan Petrovic analyzed how ChatGPT retrieves websites for grounding and illustrated the process with one of his own pages, using the Web Search tool in the Assistants API.

  • How it works in a nutshell:

    1. It initially retrieves a small data object from the web search results (title, description, 1-3 sentences, retrieval ID).

    2. It can then look at certain windows/passages and even follow links. It uses an open() call to access the page at a certain line and a click() call to start the same process for a URL mentioned there.

      (The context/window size is a setting for the assistant to choose: low, medium, or high.)

    3. It cannot reconstruct the full page or long passages: retrieval and output limits force it to summarize.

    4. Pages arrive as plain text only: no HTML and no JSON-LD markup.

  • Chris Long explains why SEO/GEO or rather well structured content works well:

    1. Theoretically, AI search might prioritize the top of the page. That's why including key content [at the top of the page] is important since it's more likely to be included in a window. 

    2. Structured content like bullets, lists and tables works well because of this approach. If one of the "windows" captures content in this format, it gives GPT strong context for the entire page.

    3. Spreading structured content around the page works well too. If you have multiple structured formats sprinkled throughout the content, you're ensuring that your content has a much stronger chance of getting included in one of these "windows".

  • Jerome Salomon found that ChatGPT has a hidden cache: once a page has been visited with "external_web_access":true, ChatGPT does not need to fetch it again and can still answer about it with "external_web_access":false.
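Chris Long's point about spreading structured content across the page can be illustrated with a small sketch. It assumes fixed, non-overlapping windows of four lines and a crude marker check for list or table lines; both are hypothetical simplifications for illustration, not how ChatGPT actually segments a page.

```python
def windows_with_structure(lines, window_size):
    """Split a page into fixed-size line windows and flag each window
    that contains at least one list-like or table-like line."""
    markers = ("•", "-", "*", "|")  # hypothetical heuristic for structured lines
    flags = []
    for start in range(0, len(lines), window_size):
        window = lines[start:start + window_size]
        flags.append(any(line.lstrip().startswith(markers) for line in window))
    return flags


# One bullet at the top of the page vs. bullets spread through the page.
page_top_only = ["• key fact"] + ["prose"] * 11
page_spread = (["• key fact"] + ["prose"] * 3) * 3

print(windows_with_structure(page_top_only, 4))  # [True, False, False]
print(windows_with_structure(page_spread, 4))    # [True, True, True]
```

In this toy setup, whichever window gets opened on the second page contains structured content, while on the first page only the top window does.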



How ChatGPT grounding / page retrieval works:

  1. Initial web search result as a small structured object. When GPT requests a web search result, it receives a small structured object:

    1. Title

    2. URL

    3. Short text snippet (1–3 sentences)

    4. Optional metadata such as date or score

    5. A unique internal ID (turn0search0, etc.)

      This is all the grounding GPT gets initially. It does not receive:

      1. Full pages

      2. Raw HTML

      3. Full article content

      4. Site navigation or structure

  2. Dive into page sections via a sliding-window browsing pattern. Each snippet comes with a retrieval ID. GPT can request more with:

    1. open()

      Fetches a larger slice of text from the same page, centered around a line number. This is how GPT “scrolls.”

    2. click()

      Follows an outgoing link from the snippet. The new page is fetched as another snippet, using the same rules as the original search.

  3. Demonstrated example:

    1. First snippet from web search

    2. First open() call reveals the start of the article:

      • Title

      • Date

      • First paragraph

      • Some introductory context

    3. Expanding deeper (line 30, line 60, etc.)

      Each expansion retrieves more of the page:

      • Body sections

      • Headings

      • Explanatory paragraphs

      • Lists and examples

      But each expansion is still only a window, never the full page.

    4. Switching to High context makes each window taller, so expansions return:

      • Longer excerpts

      • More adjacent paragraphs

      • Larger text blocks per request

      But even on High, expansions eventually hit tool caps. Each window is a plaintext extraction.
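The browsing pattern described above can be sketched in a few lines. This is a simplified model, not the actual tool: the `SearchResult` class, the `open_window` function, and the window heights per context setting are all hypothetical, since the real limits are not published.

```python
from dataclasses import dataclass

# Hypothetical window heights (in lines) per context setting.
WINDOW_LINES = {"low": 10, "medium": 30, "high": 60}


@dataclass
class SearchResult:
    ref_id: str   # internal ID such as "turn0search0"
    title: str
    url: str
    snippet: str  # 1-3 sentences of plain text


def open_window(page_lines, line, context="medium"):
    """Return a plain-text slice of the page centered on `line` (0-based),
    clipped to the start and end of the page."""
    height = WINDOW_LINES[context]
    start = max(0, line - height // 2)
    end = min(len(page_lines), start + height)
    return page_lines[start:end]


# A stand-in page of 200 plain-text lines.
page = [f"line {i}" for i in range(200)]
result = SearchResult("turn0search0", "Example title",
                      "https://example.com/post", "A short snippet.")

top = open_window(page, 0, "low")       # start of the article
deeper = open_window(page, 30, "low")   # "scrolling" to line 30
taller = open_window(page, 30, "high")  # same position, taller window

print(result.ref_id, len(top), deeper[0], len(taller))
```

Even with the tallest setting, each call returns only a slice, so reading the whole page would take many separate expansions.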




Salomon's experiment design:

One of the tests I ran multiple times consisted of the following steps:

1) Request a summary of the content of a URL with “external_web_access”: false

2) Repeat with “external_web_access”: true

3) Repeat again with “external_web_access”: false

4) And a last try with “external_web_access”: true
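As a rough sketch, the four requests could be prepared as follows. Only the `external_web_access` flag and its values come from the experiment; the surrounding payload layout, the model name, and the example URL are assumptions for illustration, and the code only builds the request bodies without sending them.

```python
import json


def build_request(url, external_web_access, model="gpt-5"):
    """Build (but do not send) a request body that toggles live web access."""
    return {
        "model": model,  # assumption: placeholder model name
        "tools": [
            # assumption: tool entry layout; only the flag itself is from the source
            {"type": "web_search", "external_web_access": external_web_access}
        ],
        "input": f"Summarize the content of {url}",
    }


# Steps 1-4 of the experiment: off, on, off, on.
steps = [False, True, False, True]
bodies = [build_request("https://example.com/test-page", flag) for flag in steps]

for number, body in enumerate(bodies, start=1):
    print(number, json.dumps(body["tools"]))
```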


On step 1 the page was not cached/indexed, so I didn’t get the summary.

On step 2, with web access enabled, I spotted ChatGPT-User fetching the page.

On step 3 the page was now cached/indexed.

On step 4, ChatGPT-User didn’t come back to fetch the page again.
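These four observations are consistent with a simple fetch-once cache. The toy model below reproduces them; it is an interpretation of the observed behavior, not OpenAI's implementation, and the class and method names are made up for illustration.

```python
class ToyGroundingCache:
    """Toy model of the observed behavior: a page is fetched live once,
    then served from a cache regardless of the web access flag."""

    def __init__(self):
        self.cache = {}      # url -> stored page text
        self.fetch_log = []  # live fetches (stand-in for ChatGPT-User hits in server logs)

    def summarize(self, url, external_web_access):
        if url in self.cache:
            # Cached: no new fetch, whatever the flag says.
            return f"summary of {self.cache[url]}"
        if not external_web_access:
            # Not cached and no live access: no summary possible.
            return None
        self.fetch_log.append(url)
        self.cache[url] = f"content of {url}"
        return f"summary of {self.cache[url]}"


toy = ToyGroundingCache()
url = "https://example.com/test-page"
outcomes = [toy.summarize(url, flag) is not None
            for flag in (False, True, False, True)]

print(outcomes)            # [False, True, True, True]
print(len(toy.fetch_log))  # 1
```

The single entry in `fetch_log` corresponds to the one ChatGPT-User visit seen on step 2.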




JS Rendering of LLM Bots:



© 2026 David Epding.


David Epding is a GEO & SEO, Data Analytics and Automation Manager with more than 10 years of experience in technical SEO, broad expertise in LLMs, and many years of experience in data analysis.
