Google DeepMind: Gemma 4 rollout and TurboQuant compression magic
- Apr 9
- 2 min read
Key Takeaways:

Google introduced Gemma 4, a state-of-the-art open-source model (Apache 2.0 license) with a 256K context window that can run locally thanks to its small memory footprint:
31B - full reasoning, runs on a workstation
26B MoE - fast, activates only 3.8B params per inference
E4B / E2B - runs on a phone, offline, with real-time audio + vision
Google published their new compression algorithm TurboQuant that opens the door to new local developments and use cases:
TurboQuant: a new compression algorithm that reduces LLM key-value (KV) cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency
How? It uses PolarQuant compression and QJL:
PolarQuant addresses the memory overhead problem with a completely different approach. Instead of describing a memory vector in standard Cartesian coordinates (i.e., X, Y, Z) that give the distance along each axis, PolarQuant converts the vector into polar coordinates: a magnitude and an angle. This is comparable to replacing "Go 3 blocks East, 4 blocks North" with "Go 5 blocks total at a 37-degree angle"
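The blocks example maps directly onto a coordinate conversion. Here is a minimal didactic sketch of that conversion in Python (this illustrates the coordinate change only, not the actual PolarQuant quantization scheme; measuring the angle from North matches the "37 degrees" in the example):

```python
import math

def to_polar(east, north):
    """Convert a 2-D Cartesian vector to (distance, bearing-from-North)."""
    r = math.hypot(east, north)              # total distance ("5 blocks")
    theta = math.degrees(math.atan2(east, north))  # angle from North ("37 degrees")
    return r, theta

def to_cartesian(r, theta):
    """Inverse conversion: recover the East/North components."""
    rad = math.radians(theta)
    return r * math.sin(rad), r * math.cos(rad)

r, theta = to_polar(3, 4)  # "3 blocks East, 4 blocks North"
print(f"{r:.0f} blocks at a {theta:.0f}-degree angle")
```

Storing (magnitude, angle) instead of raw components is what lets a quantizer spend its bits differently on each part of the representation.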
QJL uses a mathematical technique called the Johnson-Lindenstrauss Transform to shrink complex, high-dimensional data while preserving the essential distances and relationships between data points
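The core of a Johnson-Lindenstrauss transform is a random projection to a much lower dimension that approximately preserves pairwise distances. A minimal sketch (illustrative only; QJL additionally quantizes the projected values, which is not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Project n vectors from dimension d down to dimension k with a random
# Gaussian matrix scaled by 1/sqrt(k); JL theory says pairwise distances
# are preserved up to a small distortion with high probability.
d, k, n = 4096, 256, 10
X = rng.standard_normal((n, d))          # high-dimensional data
P = rng.standard_normal((k, d)) / np.sqrt(k)  # random projection matrix
Y = X @ P.T                               # reduced data, shape (n, k)

# Compare one pairwise distance before and after projection
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(f"original distance {orig:.1f}, projected {proj:.1f}")
```

The distortion shrinks as k grows, which is the trade-off knob between memory savings and fidelity.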
Dan Petrovic has already tested it and confirms it works: "The paper's theoretical guarantees hold up completely in practice. Zero accuracy loss, zero speed loss, fraction of the memory."
Now you can build:
Replace OpenAI/Gemini API calls, especially for routine agent tasks (summarization, classification, simple replies)
Handle sensitive-data workflows where you can't send data to a cloud API
Ship offline features on mobile devices
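Whether a model fits locally comes down to KV-cache arithmetic, which is where the claimed "at least 6x" reduction bites. A hedged back-of-envelope sketch (the layer/head/dimension numbers below are illustrative assumptions, not Gemma 4's published architecture):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Rough fp16 KV-cache size: 2x for separate key and value tensors."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Assumed architecture, for illustration only
full = kv_cache_bytes(layers=48, kv_heads=8, head_dim=128, seq_len=256_000)
compressed = full / 6  # the claimed "at least 6x" reduction

print(f"fp16 KV cache at 256K context: {full / 2**30:.2f} GiB")
print(f"after 6x compression:          {compressed / 2**30:.2f} GiB")
```

Under these assumptions the full-precision cache alone would exceed typical workstation GPU memory, while the compressed cache fits comfortably, which is why the compression, not the weights, is often the unlock for long-context local inference.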


Sources:


