The AI Performance Engineering Meetup
- This is a recap of an online meetup
- The meetup has been running for a while, and many previous talks are available on the AI Performance Engineering Meetup YouTube channel.
- The meetup is organized by Chris Fregly and others.
- Chris Fregly:
  - is the author of the book AI Performance Engineering,
  - runs a blog on Medium, and
  - maintains a GitHub repository with code examples from the book, plus code and slides from the talks. On first inspection, though, the repo covers only a few of the talks.
This post covers the second talk in the meetup:
KV Cache Efficiency + Context “Platform” Engineering
This presentation will include demos and code with a focus on improving KV-cache hit rates as well as introducing a methodology called Context “Platform” Engineering to design and optimize AI infrastructure for Agent Swarm Context at scale. Context Platform Engineering was recently featured in the CES2026 keynote by Jensen Huang, CEO of NVIDIA. This presentation is related to a recent AIE CODE Summit talk in December 2025.
The talk was very short. However, if like me you are new to this area, it certainly delivers some new insights about scaling.
Context Platform Engineering can lead to higher KV-cache hit rates for inference workloads.
The problem is that at scale the KV-cache hit rate drops, leading to higher latency and lower throughput. This is because cached entries get evicted when memory runs out.
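To make the eviction effect concrete, here is a minimal toy simulation. It is my own sketch, not code from the talk: it assumes an LRU eviction policy, treats each session's KV blocks as a single cache entry, and serves sessions in round-robin order. The names `LRUPrefixCache` and `simulate` are hypothetical.

```python
from collections import OrderedDict


class LRUPrefixCache:
    """Toy LRU cache: one entry stands in for one session's KV blocks (an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, key):
        """Return True on a hit; on a miss, insert the key and evict the oldest entry if full."""
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return False

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def simulate(num_sessions, capacity, turns):
    """Serve `turns` rounds of round-robin requests and return the cache hit rate."""
    cache = LRUPrefixCache(capacity)
    for _ in range(turns):
        for session in range(num_sessions):
            cache.lookup(session)
    return cache.hit_rate()


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(f"sessions={n:2d} capacity=8 hit_rate={simulate(n, capacity=8, turns=10):.2f}")
```

In this toy setup the hit rate stays at 0.90 while the sessions fit in the cache, then collapses to 0.00 once there are more sessions than slots, because every entry is evicted before it is reused. Real inference servers cache at block granularity and see less regular traffic, so the cliff is softer in practice, but the mechanism is the same.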
WEKAIO has a solution that uses a combination of software and hardware to improve the cache hit rates.
On Hopper, the throughput to CPU memory is limited compared with the throughput available over the network.
Resources
Citation
@online{bochman2026,
author = {Bochman, Oren},
title = {KV {Cache} {Efficiency} \& {Context} {Platform}
{Engineering}},
date = {2026-01-19},
url = {https://orenbochman.github.io/posts/2026/2026-01-19-kv-cache-efficiency/},
langid = {en}
}