Syntheva Robotics Blog
SARA: Sharded Activation Reduction Architecture for Multi-CPU LLM Inference

GitHub: source code

Introduction

Most LLM infrastructure is built for the datacenter problem: many users, large batches, and maximum aggregate throughput. Robotics is a different problem.

A robot usually has only one real client: itself. At any moment the model is generating one stream of thought: internal monologue, planning, tool selection, or outward dialogue. In that setting, the key metric is not requests per second. It is token latency for a single sequential stream.

At the same time, affordable robotic systems often have access to inexpensive CPU resources; what they usually lack is the budget for a large, expensive, power-hungry GPU. That motivated us to build SARA, the Sharded Activation Reduction Architecture: a distributed inference path that uses multiple CPUs to reduce single-stream token latency.

SARA is designed for a narrow but important deployment regime:

  • one active LLM client
  • autoregressive generation, where each token depends on the previous one
  • multiple cheap CPUs available
  • low operational complexity preferred over datacenter-scale orchestration

The result is a system that is small, explicit, and practical. Instead of distributing an entire general-purpose serving stack, SARA distributes just the parts of transformer inference that naturally decompose into additive partial results.

Why robotics needs a different optimization target

In a server setting, batching is king. If many requests arrive together, the serving system can amortize expensive operations and push aggregate throughput much higher. That is exactly why systems such as vLLM emphasize continuous batching, high-throughput serving, and large-scale distributed parallelism.

A robotic brain rarely lives in that world. Most of the time there is exactly one stream that matters: the robot's own current reasoning loop. That makes the cost of waiting for the next token painfully visible. Slower token generation directly affects:

  • internal planning latency
  • deliberation during tool use
  • response time in spoken dialogue

For this use case, spare CPUs are more useful than multi-user schedulers. The problem becomes:

Can we turn multiple low-cost CPU shards into one faster sequential inference path for a single model stream?

SARA answers yes, by sharding attention and FFN work across ranks and reducing the resulting partial contributions back into the residual stream.

The core idea

SARA exploits a simple fact about transformer blocks: some expensive parts can be partitioned into independent local computations whose results add back together after a linear projection.

For each transformer layer, SARA splits:

  • attention work by KV heads, with the corresponding query heads assigned through the query-to-KV replication factor
  • feed-forward work by FFN channels

Each rank computes only its local slice. The master then sums the full-width residual contributions from all ranks.
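As a concrete illustration of the attention split, here is a minimal sketch of how KV heads might be assigned to ranks, with query heads following through the replication factor. The dimensions are Mistral-7B-style (32 query heads, 8 KV heads); the helper names are ours for illustration, not taken from the SARA source.

```python
# Sketch: assigning KV heads (and their query heads) to ranks.
# Dimensions are Mistral-7B-style; the function is illustrative,
# not SARA's actual code.

def shard_heads(n_kv_heads: int, n_q_heads: int, n_ranks: int):
    """Return, per rank, the KV-head and query-head indices it owns."""
    assert n_kv_heads % n_ranks == 0, "static even sharding"
    rep = n_q_heads // n_kv_heads          # query-to-KV replication factor
    per_rank = n_kv_heads // n_ranks
    shards = []
    for r in range(n_ranks):
        kv = list(range(r * per_rank, (r + 1) * per_rank))
        q = [h * rep + i for h in kv for i in range(rep)]
        shards.append({"kv_heads": kv, "q_heads": q})
    return shards

shards = shard_heads(n_kv_heads=8, n_q_heads=32, n_ranks=2)
print(shards[0]["kv_heads"])    # rank 0 owns KV heads [0, 1, 2, 3]
print(shards[0]["q_heads"])     # ...and query heads 0..15
```

Because each query head attends only to its own KV head's cache, each rank can run its slice of attention without seeing any other rank's KV state.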

At a high level, one token step looks like this:

x = token_embedding(token)

for each layer l:
    x_hat = RMSNorm_att(x)
    broadcast Q8(x_hat) to all ranks
    each rank r computes its local attention partial p_att[l, r]
    x = x + sum_r p_att[l, r]

    x_hat = RMSNorm_ffn(x)
    broadcast Q8(x_hat) to all ranks
    each rank r computes its local FFN partial p_ffn[l, r]
    x = x + sum_r p_ffn[l, r]

logits = W_out * RMSNorm_out(x)
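The additive property that makes this loop valid can be checked numerically. The pure-Python toy below, assuming a SwiGLU-style FFN as used in Mistral, verifies that summing per-rank channel partials reproduces the full FFN output exactly; all matrices and dimensions are illustrative.

```python
# Toy check that FFN channel sharding produces additive partials:
# splitting the hidden (FFN) dimension across ranks and summing the
# per-rank down-projections reproduces the full output.
# Assumes a SwiGLU-style FFN; values are random toy data.
import math
import random

random.seed(0)
d, ffn, ranks = 4, 8, 2

def matvec(W, x):                  # W: rows x len(x)
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def silu(v):
    return [u / (1.0 + math.exp(-u)) for u in v]

W_gate = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(ffn)]
W_up   = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(ffn)]
W_down = [[random.uniform(-1, 1) for _ in range(ffn)] for _ in range(d)]
x = [random.uniform(-1, 1) for _ in range(d)]

# Full (single-node) SwiGLU FFN.
h = [g * u for g, u in zip(silu(matvec(W_gate, x)), matvec(W_up, x))]
full = matvec(W_down, h)

# Sharded: each rank owns a contiguous slice of FFN channels plus the
# matching columns of W_down, and emits a full-width partial.
per = ffn // ranks
partial_sum = [0.0] * d
for r in range(ranks):
    lo, hi = r * per, (r + 1) * per
    h_r = [g * u for g, u in zip(silu(matvec(W_gate[lo:hi], x)),
                                 matvec(W_up[lo:hi], x))]
    W_down_r = [row[lo:hi] for row in W_down]
    p = matvec(W_down_r, h_r)      # full-width residual contribution
    partial_sum = [a + b for a, b in zip(partial_sum, p)]

assert all(abs(a - b) < 1e-9 for a, b in zip(full, partial_sum))
```

The same argument applies to attention: each head's output passes through its slice of the output projection, so per-rank head groups also yield full-width partials that simply add.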

The name "Sharded Activation Reduction Architecture" comes directly from this loop:

  1. the current activation is broadcast in quantized form, saving precious bandwidth
  2. each shard computes a local contribution
  3. the contributions are reduced on the master

The master is not just a coordinator. It also owns rank 0 and computes its own shard locally, then adds the workers' returned partials.
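The Q8(x_hat) step in the loop can be sketched as blockwise symmetric int8 quantization: values are grouped into fixed-size blocks, each stored as int8 codes plus one float scale. We assume a GGML-style Q8_0 layout (32-value blocks); the helpers below are illustrative, not SARA's actual serialization code.

```python
# Sketch of a blockwise symmetric int8 quantizer in the spirit of
# GGML's Q8_0 (32-value blocks, one scale per block). Illustrative
# only; not SARA's wire format.
import random

BLOCK = 32

def q8_quantize(xs):
    """Return (scales, codes): one float scale per 32-value block."""
    scales, codes = [], []
    for i in range(0, len(xs), BLOCK):
        block = xs[i:i + BLOCK]
        amax = max(abs(v) for v in block)
        scale = amax / 127.0 if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-127, min(127, round(v / scale))) for v in block)
    return scales, codes

def q8_dequantize(scales, codes):
    return [codes[i] * scales[i // BLOCK] for i in range(len(codes))]

random.seed(1)
x = [random.uniform(-2, 2) for _ in range(128)]
scales, codes = q8_quantize(x)
x_hat = q8_dequantize(scales, codes)

# Roughly 4x smaller than fp32 on the wire, at a small relative error.
err = max(abs(a - b) for a, b in zip(x, x_hat))
assert err < 0.02
```

The same encoding works in both directions: the master broadcasts quantized activations, and workers return quantized full-width partials.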

What SARA is, and what it is not

SARA is deliberately opinionated.

It is:

  • single-stream oriented
  • CPU-first
  • latency-focused
  • static-shard
  • explicit master-worker reduction

It is not:

  • a multi-tenant serving platform
  • a general remote device abstraction
  • a datacenter scheduler
  • a system designed primarily around GPU saturation

That distinction matters, because many existing LLM systems solve a different optimization problem.

Comparison with existing approaches

What SARA shares with other distributed inference systems

SARA is not alien to the existing literature or ecosystem. It shares important ground with model-parallel inference systems:

  • work is partitioned across devices or processes
  • local partial results are combined into a single forward pass
  • synchronization still exists at token boundaries
  • performance depends on the balance between compute and communication

In that sense, SARA belongs to the same broad family as tensor-parallel inference.

Where SARA differs from general-purpose GPU serving stacks

Systems such as vLLM are designed around high-throughput serving, continuous batching, and rich distributed parallelism for many incoming requests. That is the right answer for datacenter serving. It is not automatically the right answer for a robot with one active reasoning stream and a pile of affordable CPUs.

SARA takes the opposite path:

  • no continuous batching
  • no multi-user scheduler
  • no assumption that the best hardware is a large GPU
  • no attempt to solve every deployment mode at once

The goal is narrower and, for robotics, often more relevant: reduce token latency for one stream by using multiple CPU shards effectively.

Where SARA differs from llama.cpp RPC

The closest comparison in spirit is probably llama.cpp RPC, but the designs are still quite different.

llama.cpp RPC exposes remote ggml devices and, by default, distributes model weights and the KV cache across local and remote devices according to available memory. Its own README currently describes the RPC backend as a proof-of-concept and warns that it is fragile and insecure on open networks.

SARA takes a much narrower route. Each participant can load the model locally, the shard split is fixed up front, and runtime communication consists only of quantized activations and quantized reduced partials for attention and FFN. There is no general remote device layer to manage, no remote tensor cache to reason about during token generation, and no broad placement policy to tune.

That narrower scope is why we believe SARA is simpler, more robust, and more efficient for this specific use case. It is not trying to be a universal inference fabric. It is trying to make one robot think faster with a few cooperating CPUs.

A fairer way to phrase the claim is this:

While llama.cpp authors are understandably cautious about their complex RPC system, we believe SARA is the better fit for narrow single-user robotic brain deployments because its protocol surface, sharding policy, and runtime dataflow are all much smaller and more explicit.

That is a use-case claim, not a universal one. And for deployment work, use-case fit matters more than fashionable generality.

Efficiency and benchmark interpretation

We benchmarked SARA in the strictest possible way: against the same machine.

Method

Baseline:

OMP_NUM_THREADS=10 ./ligguf-distrib -m Mistral-7B-Instruct-v0.3-Q8_0.gguf -n 64 hi

Distributed:

# worker
OMP_NUM_THREADS=5 ./ligguf-distrib -m Mistral-7B-Instruct-v0.3-Q8_0.gguf -W 1/2 -M 19095

# master
OMP_NUM_THREADS=5 ./ligguf-distrib -m Mistral-7B-Instruct-v0.3-Q8_0.gguf -w 127.0.0.1:19095 -n 64 hi

All tok/s values are generation-only values reported by the binary.
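Turning the two reported generation-only tok/s values into the ratios discussed below is a single division. The numbers in this snippet are illustrative placeholders for the shape of the calculation, not our measured results.

```python
# Interpreting the benchmark: distributed tok/s over baseline tok/s.
# The inputs here are illustrative placeholders, NOT measured values.

def speed_ratio(distributed_tok_s: float, baseline_tok_s: float) -> float:
    return distributed_tok_s / baseline_tok_s

ratio = speed_ratio(distributed_tok_s=9.5, baseline_tok_s=10.0)
overhead = 1.0 - ratio    # share of the compute budget lost to sharding
print(f"ratio={ratio:.2f}, overhead={overhead:.0%}")
```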

Why a ratio around 0.95 is already excellent

This point is crucial.

The distributed 5+5 setup and the single-node 10-thread setup use the same physical machine and, effectively, the same total core budget. That means the theoretical upper bound is 1.0x. The distributed run cannot be sustainably faster than that, because no extra compute is being created; we are only changing how the same compute is organized.

So if the distributed configuration lands around 0.95x, that does not indicate failure. It indicates that the cost of sharding, transport, and reduction is very small. On the same machine, that is exactly the result we want.

One run slightly above 1.0x should be treated as measurement noise or host-side timing fluctuation, not as a physically meaningful speedup beyond the available silicon.

What the repeated runs show

Across repeated 64-token runs, SARA stayed close to the single-node baseline, with the meaningful ratios clustering in roughly the low-to-mid 90% range and an isolated above-parity outlier that should not be over-interpreted.

That is a strong result, because it demonstrates that:

  • the communication path is lightweight enough
  • the reduction scheme does not destroy the value of the extra CPU shard
  • same-machine overhead is close to negligible in practice

This is the right first milestone. Before a distributed design can produce real gains across multiple machines, it first has to prove that it does not waste performance on a same-machine split. SARA passes that test.

Why near-parity on one machine matters

A same-machine benchmark is not supposed to show miraculous speedups. It is supposed to answer a harder question:

If we split the work and add coordination overhead, how much of the original compute budget do we keep?

SARA keeps most of it.

That is important because real deployment gains come when the system can recruit additional physical CPUs beyond the baseline machine. If the architecture already wastes too much performance on localhost, it will collapse once networked across boards. If it remains near parity on the same host, then multi-host deployments become realistic.

That is exactly the message you want before moving from experiment to deployment.

Practical advantages for robotic systems

For robotic brain workloads, SARA has several attractive properties.

1. It targets the right metric

The design is about sequential token latency for one active stream, not aggregate datacenter throughput.

2. It uses cheap hardware well

A few ordinary CPUs can cooperate to reduce latency without requiring a large GPU. That matters for cost, power, thermals, and deployment flexibility.

3. It keeps the runtime small

The protocol is tiny. The control flow is easy to trace. There are very few moving parts compared with a general-purpose serving stack.

4. It is explicit

The sharding scheme is visible in the code. The reduction points are visible in the code. The transport format is visible in the code. That transparency is valuable when the LLM is part of a larger real-time robotic system.

5. It is stable enough to matter

The repeated runs do not show a fragile toy. They show a system that stays close to the baseline even under a harsh same-machine comparison. That is already a deployment-relevant result.

Limitations and future work

SARA is promising, but it is not magic.

Current tradeoffs include:

  • static even sharding rather than adaptive load balancing
  • a master-coordinated reduction rather than a more general collective
  • synchronization twice per layer, once for attention and once for FFN
  • local model availability on each participant
  • performance sensitivity to interconnect latency and CPU balance
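To put the interconnect sensitivity in perspective, a back-of-envelope estimate of per-token broadcast traffic is useful. We assume Mistral-7B-style dimensions (32 layers, hidden size 4096) and a Q8_0-style encoding of one int8 byte per value plus a 2-byte scale per 32-value block; the exact SARA wire format may differ.

```python
# Back-of-envelope: activation broadcast volume per generated token.
# Assumes Mistral-7B-style dimensions and a Q8_0-style encoding
# (1 byte per value + one 2-byte scale per 32-value block).
# Sizing estimate only; the real wire format may differ.

n_layers = 32
hidden   = 4096
block    = 32
bytes_per_activation = hidden + (hidden // block) * 2   # codes + scales

# Two broadcasts per layer: one before attention, one before the FFN.
# Returned partials add a comparable amount per worker on the way back.
per_token = 2 * n_layers * bytes_per_activation
print(f"{per_token / 1024:.0f} KiB broadcast per token")   # 272 KiB
```

A few hundred KiB per token is trivial on localhost and modest on a wired LAN, which is consistent with the near-parity same-machine results; on a slow or congested link, however, latency rather than bandwidth becomes the binding constraint, since the two synchronizations per layer sit on the critical path.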

Those are acceptable tradeoffs for the intended deployment target. In fact, many of them are the reason the implementation remains so understandable. But they also point toward future work:

  • better handling of heterogeneous nodes
  • more careful pipelining of communication and compute
  • smarter shard sizing
  • larger-scale multi-host measurements
  • broader sampling and decoding support on top of the same core dataflow

Conclusion

SARA is a focused answer to a focused problem.

In robotics, the LLM is usually serving one client: the robot itself. That makes single-stream token latency the metric that matters. SARA attacks that metric directly by sharding attention heads and FFN channels across multiple CPUs, shipping quantized activations, and reducing full-width partial results back into the residual stream.

The implementation is simple by design. The protocol is tiny. The dataflow is explicit. The benchmark result is exactly the encouraging kind: on the same machine, where 1.0x is the hard ceiling, SARA remains close to parity. That means the distributed path is not eating the value it is supposed to unlock.

We therefore view SARA as a practical architecture for low-cost robotic brain deployment. It is not a general-purpose serving empire, and it does not need to be. It is a compact, stable, and efficient way to use multiple CPUs to make one robot think faster.