cenconq25/delta-compress-llm: Exploiting temporal coherence in LLM inference: delta encoding for KV cache compression and weight-skip prediction. Achieves F16-quality KV cache at Q4_0 compression ratios with zero perplexity loss on llama.cpp.
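The core idea of delta encoding for KV cache compression can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: it assumes NumPy arrays as stand-ins for KV cache tensors, and the helper names (`quantize_q4`, `delta_encode`, etc.) are hypothetical. Because temporally adjacent cache states are highly correlated, their differences are small in magnitude, so a coarse 4-bit quantizer applied to the *deltas* introduces far less error than the same quantizer applied to the absolute values:

```python
import numpy as np

def quantize_q4(x):
    """Symmetric 4-bit quantization: map values to integers in [-8, 7].
    (Hypothetical helper; real Q4_0 in llama.cpp uses per-block scales.)"""
    scale = float(np.abs(x).max()) / 7.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def delta_encode(frames):
    """Store the first cache snapshot in full precision; encode each
    later snapshot as a quantized delta against the *reconstruction*
    of the previous one, so quantization error cannot accumulate."""
    base = frames[0].astype(np.float32)
    encoded = []
    prev = base
    for f in frames[1:]:
        q, s = quantize_q4(f.astype(np.float32) - prev)
        encoded.append((q, s))
        prev = prev + dequantize(q, s)  # mirror the decoder's state
    return base, encoded

def delta_decode(base, encoded):
    out = [base]
    prev = base
    for q, s in encoded:
        prev = prev + dequantize(q, s)
        out.append(prev)
    return out
```

Tracking the decoder's reconstruction (`prev`) inside the encoder is the standard closed-loop trick from DPCM coding: each delta is measured against what the decoder will actually have, so per-step quantization error stays bounded instead of drifting over long sequences.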