SARA: Sharded Activation Reduction Architecture for Multi-CPU LLM Inference
GitHub: source code
Introduction
Most LLM infrastructure is built for the datacenter problem: many users, large batches, and maximum aggregate throughput. Robotics is a different problem.
A robot usually has only one real client: itself. At any moment the model is generating one stream of thought: internal monologue, planning, tool selection, or outward dialogue. In that setting, the key metric is not requests per second. It is token latency for a single sequential stream.
At the same time, affordable robotic systems often have access to plentiful, inexpensive CPU resources; what they rarely have is the budget for a large, expensive, power-hungry GPU. That motivated us to build SARA, the Sharded Activation Reduction Architecture: a distributed inference path that uses multiple CPUs to reduce single-stream token latency.
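To make the name concrete, here is a minimal sketch of the kind of computation it suggests: shard a layer's weight matrix across workers, let each compute a partial activation from its slice of the input, then reduce the partials into the full result. This is an illustrative assumption on our part (the shard axis, worker count, and variable names are hypothetical, not SARA's actual implementation), but it shows why the reduction recovers exactly the unsharded output.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_workers = 8, 4, 2  # hypothetical sizes for illustration

x = rng.standard_normal(d_in)           # input activation vector
W = rng.standard_normal((d_in, d_out))  # full weight matrix of one layer

# Shard W row-wise, and split x to match, one slice per CPU worker.
w_shards = np.array_split(W, n_workers, axis=0)
x_parts = np.array_split(x, n_workers)

# Each worker computes a partial activation over its shard...
partials = [xp @ ws for xp, ws in zip(x_parts, w_shards)]

# ...and a single reduction (an element-wise sum) yields the layer output.
y = np.sum(partials, axis=0)

assert np.allclose(y, x @ W)  # identical to the unsharded computation
```

In a real multi-CPU deployment the partials would be computed on separate machines and the sum would be a network reduction; the per-token latency win comes from each worker touching only a fraction of the weights.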