Techniques for KV Cache Optimization in Large Language Models, accessed July 20, 2025, https://www.omrimallis.com/posts/techniques-for-kv-cache-optimization/
A Survey on Large Language Model Acceleration based on KV Cache Management - arXiv, accessed July 20, 2025, https://arxiv.org/html/2412.19442v2
LLM Transformer Model Visually Explained - Polo Club of Data Science, accessed July 20, 2025, https://poloclub.github.io/transformer-explainer/
Introduction to Transformers and Attention Mechanisms | by Rakshit Kalra - Medium, accessed July 20, 2025, https://medium.com/@kalra.rakshit/introduction-to-transformers-and-attention-mechanisms-c29d252ea2c5
LLM Inference Series: 4. KV caching, a deeper look | by Pierre Lienhart | Medium, accessed July 20, 2025, https://medium.com/@plienhar/llm-inference-series-4-kv-caching-a-deeper-look-4ba9a77746c8
Understanding and Coding the KV Cache in LLMs from Scratch - Sebastian Raschka, accessed July 20, 2025, https://sebastianraschka.com/blog/2025/coding-the-kv-cache-in-llms.html
arxiv.org, accessed July 20, 2025, https://arxiv.org/html/2407.18003v4
KV Caching in LLMs, explained visually - Daily Dose of Data Science, accessed July 20, 2025, https://www.dailydoseofds.com/p/kv-caching-in-llms-explained-visually/
Key Value Cache | Continuum Labs, accessed July 20, 2025, https://training.continuumlabs.ai/inference/why-is-inference-important/key-value-cache
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression, accessed July 20, 2025, https://arxiv.org/html/2412.03213v2
How Caching Helps Improve the LLM Inference? - Association of Data Scientists, accessed July 20, 2025, https://adasci.org/how-caching-helps-improve-the-llm-inference/
KV Cache is huge and bottlenecks LLM inference. We quantize them to 2bit in a finetuning-free + plug-and-play fashion. - Reddit, accessed July 20, 2025, https://www.reddit.com/r/LocalLLaMA/comments/1ap3bkt/kv_cache_is_huge_and_bottlenecks_llm_inference_we/
Reimagining Memory Access for LLM Inference: Compression-Aware Memory Controller Design - arXiv, accessed July 20, 2025, https://arxiv.org/html/2503.18869v2
Pushing up to the Limit of Memory Bandwidth and Capacity Utilization for Efficient LLM Decoding on Embedded FPGA - arXiv, accessed July 20, 2025, https://arxiv.org/html/2502.10659v1
How Transformers Work: A Detailed Exploration of Transformer Architecture - DataCamp, accessed July 20, 2025, https://www.datacamp.com/tutorial/how-transformers-work
A Beginner's Guide to Self-Attention in Transformers | by Nacho Zobian | Medium, accessed July 20, 2025, https://medium.com/@nachozobian/a-beginners-guide-to-self-attention-in-transformers-baf71a971efd
Simplifying transformer architecture: a beginner's guide to understanding AI magic - Compute with Hivenet, accessed July 20, 2025, https://compute.hivenet.com/post/simplifying-transformer-architecture-a-beginners-guide-to-understanding-ai-magic
The Transformer Attention Mechanism - MachineLearningMastery.com, accessed July 20, 2025, https://machinelearningmastery.com/the-transformer-attention-mechanism/
Is inferencing memory bandwidth limited? : r/LocalLLaMA - Reddit, accessed July 20, 2025, https://www.reddit.com/r/LocalLLaMA/comments/1brcnps/is_inferencing_memory_bandwidth_limited/
KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation - arXiv, accessed July 20, 2025, https://arxiv.org/html/2411.17089v2
How an LLM Really Works: Costs, Infrastructure, and the Technical Choices Behind Big Language Models - Red Hot Cyber, accessed July 20, 2025, https://www.redhotcyber.com/en/post/how-an-llm-really-works-costs-infrastructure-and-the-technical-choices-behind-big-language-models/
Efficient Long-Context LLM Inference via KV Cache Clustering - arXiv, accessed July 20, 2025, https://arxiv.org/html/2506.11418v1
Your 1M+ Context Window LLM Is Less Powerful Than You Think | Towards Data Science, accessed July 20, 2025, https://towardsdatascience.com/your-1m-context-window-llm-is-less-powerful-than-you-think/
Long-Context Windows in Large Language Models: Applications in Comprehension and Code | by Adnan Masood, PhD. | Medium, accessed July 20, 2025, https://medium.com/@adnanmasood/long-context-windows-in-large-language-models-applications-in-comprehension-and-code-03bf4027066f
Context-Engineering Challenges & Best-Practices | by Ali Arsanjani | Jul, 2025 | Medium, accessed July 20, 2025, https://dr-arsanjani.medium.com/context-engineering-challenges-best-practices-8e4b5252f94f
KV Cache Optimization: A Deep Dive into PagedAttention & FlashAttention | by M - Medium, accessed July 20, 2025, https://medium.com/foundation-models-deep-dive/kv-cache-guide-part-4-of-5-system-superpowers-framework-realities-kv-cache-in-action-6fb4fb575cf8
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention, accessed July 20, 2025, https://blog.vllm.ai/2023/06/20/vllm.html
Introduction to vLLM and PagedAttention | Runpod Blog, accessed July 20, 2025, https://www.runpod.io/blog/introduction-to-vllm-and-pagedattention
A Survey on Large Language Model Acceleration based on KV Cache Management - arXiv, accessed July 20, 2025, https://arxiv.org/pdf/2412.19442
Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving - arXiv, accessed July 20, 2025, https://arxiv.org/pdf/2503.24000
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs - arXiv, accessed July 20, 2025, https://arxiv.org/pdf/2310.01801
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs - arXiv, accessed July 20, 2025, https://arxiv.org/html/2310.01801v4
[2506.11418] Efficient Long-Context LLM Inference via KV Cache Clustering - arXiv, accessed July 20, 2025, https://arxiv.org/abs/2506.11418
SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching - arXiv, accessed July 20, 2025, https://arxiv.org/html/2504.00970v1
[vLLM vs TensorRT-LLM] #8. KV Cache Quantization - SqueezeBits, accessed July 20, 2025, https://blog.squeezebits.com/vllm-vs-tensorrtllm-8-kv-cache-quantization-35079
PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs - arXiv, accessed July 20, 2025, https://arxiv.org/html/2505.18610v1
Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models - arXiv, accessed July 20, 2025, https://arxiv.org/abs/2501.19392
[2403.04643] QAQ: Quality Adaptive Quantization for LLM KV Cache - arXiv, accessed July 20, 2025, https://arxiv.org/abs/2403.04643
Daily Papers - Hugging Face, accessed July 20, 2025, https://huggingface.co/papers?q=KV-cache
vLLM and PagedAttention: A Comprehensive Overview | by Abonia Sojasingarayar, accessed July 20, 2025, https://medium.com/@abonia/vllm-and-pagedattention-a-comprehensive-overview-20046d8d0c61
FreeKV: Boosting KV Cache Retrieval for Efficient LLM ... - arXiv, accessed July 20, 2025, https://arxiv.org/pdf/2505.13109
An I/O Characterizing Study of Offloading LLM Models and KV Caches to NVMe SSD - Atlarge Research, accessed July 20, 2025, https://atlarge-research.com/pdfs/2025-cheops-llm.pdf
Automatic Prefix Caching - vLLM, accessed July 20, 2025, https://docs.vllm.ai/en/stable/design/v1/prefix_caching.html
vLLM Paged Attention, accessed July 20, 2025, https://docs.vllm.ai/en/v0.9.1/design/kernel/paged_attention.html
[RFC]: KV Cache Offloading for Cross-Engine KV Reuse · Issue #14724 · vllm-project/vllm, accessed July 20, 2025, https://github.com/vllm-project/vllm/issues/14724
[RFC]: KV-Cache Management Standardization for Interop #20492 - GitHub, accessed July 20, 2025, https://github.com/vllm-project/vllm/issues/20492
llm-d/llm-d-kv-cache-manager - GitHub, accessed July 20, 2025, https://github.com/llm-d/llm-d-kv-cache-manager
NVIDIA/TensorRT-LLM - GitHub, accessed July 20, 2025, https://github.com/NVIDIA/TensorRT-LLM
Introducing New KV Cache Reuse Optimizations in NVIDIA TensorRT-LLM, accessed July 20, 2025, https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/
Caching - Hugging Face, accessed July 20, 2025, https://huggingface.co/docs/transformers/cache_explanation
Best Practices for Generation with Cache - Hugging Face, accessed July 20, 2025, https://huggingface.co/docs/transformers/v4.44.0/kv_cache
transformers/src/transformers/cache_utils.py at main · huggingface/transformers - GitHub, accessed July 20, 2025, https://github.com/huggingface/transformers/blob/main/src/transformers/cache_utils.py
Multi-backend support - Hugging Face, accessed July 20, 2025, https://huggingface.co/docs/text-generation-inference/multi_backend_support
Llamacpp Backend - Hugging Face, accessed July 20, 2025, https://huggingface.co/docs/text-generation-inference/backends/llamacpp
LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management - arXiv, accessed July 20, 2025, https://arxiv.org/html/2410.00428v1