cenconq25/delta-compress-llm: Exploiting temporal coherence in LLM inference: delta encoding for KV cache compression and weight-skip prediction. Achieves F16-quality KV cache at Q4_0 compression ratios with zero perplexity loss on llama.cpp.
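The core idea of delta encoding for KV cache compression can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: it assumes NumPy arrays as stand-ins for KV cache tensors, and the helper names (`quantize_q4`, `delta_encode`, etc.) are hypothetical. Because temporally adjacent cache states are highly correlated, their differences are small in magnitude, so a coarse 4-bit quantizer applied to the *deltas* introduces far less error than the same quantizer applied to the absolute values:

```python
import numpy as np

def quantize_q4(x):
    """Symmetric 4-bit quantization: map values to integers in [-8, 7].
    (Hypothetical helper; real Q4_0 in llama.cpp uses per-block scales.)"""
    scale = float(np.abs(x).max()) / 7.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def delta_encode(frames):
    """Store the first cache snapshot in full precision; encode each
    later snapshot as a quantized delta against the *reconstruction*
    of the previous one, so quantization error cannot accumulate."""
    base = frames[0].astype(np.float32)
    encoded = []
    prev = base
    for f in frames[1:]:
        q, s = quantize_q4(f.astype(np.float32) - prev)
        encoded.append((q, s))
        prev = prev + dequantize(q, s)  # mirror the decoder's state
    return base, encoded

def delta_decode(base, encoded):
    out = [base]
    prev = base
    for q, s in encoded:
        prev = prev + dequantize(q, s)
        out.append(prev)
    return out
```

Tracking the decoder's reconstruction (`prev`) inside the encoder is the standard closed-loop trick from DPCM coding: each delta is measured against what the decoder will actually have, so per-step quantization error stays bounded instead of drifting over long sequences.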