Google DeepMind: Gemma 4 rollout and TurboQuant compression magic
- Apr 9
- 2 min read
Key Takeaways:

Google introduced Gemma 4, a state-of-the-art open-source model (Apache 2.0 license) with a 256K context window that can run locally thanks to its small memory footprint:
31B - full reasoning, runs on a workstation
26B MoE - fast, activates only 3.8B params per inference
E4B / E2B - runs on a phone, offline, with real-time audio + vision
Google published their new compression algorithm TurboQuant that opens the door to new local developments and use cases:
TurboQuant: a new compression algorithm that reduces LLM key-value (KV) cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency
How? It uses PolarQuant compression and QJL:
PolarQuant addresses the memory overhead problem with a completely different approach. Instead of describing a memory vector in standard Cartesian coordinates (i.e., X, Y, Z) that give the distance along each axis, PolarQuant converts the vector into polar coordinates: a magnitude and an angle. This is comparable to replacing "Go 3 blocks East, 4 blocks North" with "Go 5 blocks total at a 37-degree angle"
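The blocks example maps directly onto a coordinate conversion. Here is a minimal didactic sketch of that conversion in Python (this illustrates the coordinate change only, not the actual PolarQuant quantization scheme; measuring the angle from North matches the "37 degrees" in the example):

```python
import math

def to_polar(east, north):
    """Convert a 2-D Cartesian vector to (distance, bearing-from-North)."""
    r = math.hypot(east, north)              # total distance ("5 blocks")
    theta = math.degrees(math.atan2(east, north))  # angle from North ("37 degrees")
    return r, theta

def to_cartesian(r, theta):
    """Inverse conversion: recover the East/North components."""
    rad = math.radians(theta)
    return r * math.sin(rad), r * math.cos(rad)

r, theta = to_polar(3, 4)  # "3 blocks East, 4 blocks North"
print(f"{r:.0f} blocks at a {theta:.0f}-degree angle")
```

Storing (magnitude, angle) instead of raw components is what lets a quantizer spend its bits differently on each part of the representation.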
QJL uses a mathematical technique called the Johnson-Lindenstrauss Transform to shrink complex, high-dimensional data while preserving the essential distances and relationships between data points
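The core of a Johnson-Lindenstrauss transform is a random projection to a much lower dimension that approximately preserves pairwise distances. A minimal sketch (illustrative only; QJL additionally quantizes the projected values, which is not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Project n vectors from dimension d down to dimension k with a random
# Gaussian matrix scaled by 1/sqrt(k); JL theory says pairwise distances
# are preserved up to a small distortion with high probability.
d, k, n = 4096, 256, 10
X = rng.standard_normal((n, d))          # high-dimensional data
P = rng.standard_normal((k, d)) / np.sqrt(k)  # random projection matrix
Y = X @ P.T                               # reduced data, shape (n, k)

# Compare one pairwise distance before and after projection
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(f"original distance {orig:.1f}, projected {proj:.1f}")
```

The distortion shrinks as k grows, which is the trade-off knob between memory savings and fidelity.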
Dan Petrovic has already tested it and confirms it works: "The paper's theoretical guarantees hold up completely in practice. Zero accuracy loss, zero speed loss, fraction of the memory."
Now you can build:
Replace OpenAI/Gemini API calls, especially for routine agent tasks (summarization, classification, simple replies)
Handle sensitive-data workflows where you can't send data to a cloud API
Ship offline features on mobile devices
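Whether a model fits locally comes down to KV-cache arithmetic, which is where the claimed "at least 6x" reduction bites. A hedged back-of-envelope sketch (the layer/head/dimension numbers below are illustrative assumptions, not Gemma 4's published architecture):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Rough fp16 KV-cache size: 2x for separate key and value tensors."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Assumed architecture, for illustration only
full = kv_cache_bytes(layers=48, kv_heads=8, head_dim=128, seq_len=256_000)
compressed = full / 6  # the claimed "at least 6x" reduction

print(f"fp16 KV cache at 256K context: {full / 2**30:.2f} GiB")
print(f"after 6x compression:          {compressed / 2**30:.2f} GiB")
```

Under these assumptions the full-precision cache alone would exceed typical workstation GPU memory, while the compressed cache fits comfortably, which is why the compression, not the weights, is often the unlock for long-context local inference.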


Sources:


