Syntheva Robotics Blog

Post

Posts with type Post by Syntheva Robotics Blog
  • Posted on

    SARA: Sharded Activation Reduction Architecture for Multi-CPU LLM Inference

    GitHub: source code

    Introduction

    Most LLM infrastructure is built for the datacenter problem: many users, large batches, and maximum aggregate throughput. Robotics is a different problem.

    A robot usually has only one real client: itself. At any moment the model is generating one stream of thought: internal monologue, planning, tool selection, or outward dialogue. In that setting, the key metric is not requests per second. It is token latency for a single sequential stream.

    At the same time, affordable robotic systems often have access to inexpensive CPU resources. What affordable robots often do not have is a large, expensive, and power-hungry GPU budget. That motivated us to build SARA, the Sharded Activation Reduction Architecture: a distributed inference path that uses multiple CPUs to reduce single-stream token latency.