Loom is a computer architecture for looped transformers. Its weights are derived analytically, requiring no training data and no gradient descent. Programs are written in C, compiled to a 22-opcode ISA (including indirect memory write), and executed as iterated matrix multiplications through 8 fixed-weight transformer layers. The entire machine state fits in a fixed-size bipolar tensor. The architecture scales from 146×512 (7.4 MB, 320 instruction slots) through 155×1024 (4.7M parameters, 928 slots) to 164×2048 (9.0M parameters, 1,792 slots).
Giannou et al., 2023, describe a 13-layer construction for SUBLEQ; their released implementation uses 10 layers for the same single instruction. Our architecture fits 22 opcodes in 8 layers. Four design choices keep the layer count below the baseline despite the richer ISA:
1. Fused instruction decode. Giannou et al. use separate layers for instruction fetching and operand routing. We fold opcode decoding into the memory-read layer, L2, via a 3-head attention stage followed by an FFN with opcode-gated routing. The FFN examines the most significant bits of the instruction address to distinguish extended-mode opcodes at addresses 0 to 31 from SUBLEQ at addresses ≥32, then copies operands to the appropriate scratchpad positions for each operation.
2. Opcode-as-operand-routing. All 22 operations reduce to operand preparation for a shared subtract core. ADD a,b, for instance, loads mem[a] into the minuend and −mem[b] into the subtrahend, so that the subtraction yields addition. AND, OR, and XOR are realized through delta corrections applied to the subtrahend (a reference-semantics sketch follows this list).
3. Direct borrow-chain subtraction. L4 replaces the classical three-layer two's-complement approach (flip the bits, add 1, then add) with a single layer that computes a−b directly, using a 6-threshold-per-bit ReLU pattern with built-in carry propagation.
4. Merged branch + PC update. L6 merges the branch-flag computation and the PC increment into one layer, since the two are independent FFN computations: the branch flag reads the scratchpad and opcode bits, while PC+1 reads only the PC bits. Scratchpad correction is folded into L3's FFN alongside the indirect read.
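To make the routing concrete, here is a minimal reference-semantics sketch in C, not the weight construction itself: decode selects a mode from the instruction address, operand preparation fills a minuend/subtrahend pair, and a single subtract core with a borrow chain does all arithmetic. The word width, memory size, opcode subset, the slot-index-to-opcode mapping, and the exact form of the AND delta correction are illustrative assumptions; only the address-range decode, ADD via a negated subtrahend, and SUBLEQ follow the text directly.

```c
/*
 * Reference-semantics sketch, not the analytical weight construction.
 * Word width, memory size, and the opcode subset are illustrative.
 */
#include <stdint.h>

#define WORD_BITS 16              /* assumed word width */
typedef uint16_t word;

static word mem[1024];            /* assumed flat memory; indices assumed in range */

/* Fused decode (design choice 1): the instruction address selects the mode:
 * addresses 0..31 are extended-mode opcodes, addresses >= 32 mean SUBLEQ. */
typedef enum { OP_SUBLEQ = 32, OP_ADD = 0, OP_AND = 1 } opcode; /* illustrative subset */

static opcode decode(word instr_addr)
{
    if (instr_addr >= 32)
        return OP_SUBLEQ;
    return (opcode)instr_addr;    /* illustrative: slot index = opcode */
}

/* Shared subtract core (design choice 3): a - b computed bit by bit with a
 * borrow chain, the role played by Loom's single ALU layer L4. */
static word subtract_core(word a, word b)
{
    word diff = 0;
    unsigned borrow = 0;
    for (int i = 0; i < WORD_BITS; i++) {
        unsigned ai = (a >> i) & 1, bi = (b >> i) & 1;
        unsigned d  = ai ^ bi ^ borrow;
        borrow      = (!ai & bi) | (!ai & borrow) | (bi & borrow);
        diff |= (word)d << i;
    }
    return diff;
}

/* Operand routing (design choice 2): each opcode only chooses what the
 * minuend and subtrahend are; the ALU itself never changes. */
static word execute(word instr_addr, word a, word b)
{
    word minuend, subtrahend;
    switch (decode(instr_addr)) {
    case OP_ADD:                       /* a + b == a - (-b) */
        minuend    = mem[a];
        subtrahend = (word)(-mem[b]);
        break;
    case OP_AND:                       /* delta correction on the subtrahend:
                                          a - (a & ~b) == a & b (assumed form) */
        minuend    = mem[a];
        subtrahend = (word)(mem[a] & ~mem[b]);
        break;
    default:                           /* SUBLEQ: mem[a] - mem[b] */
        minuend    = mem[a];
        subtrahend = mem[b];
        break;
    }
    return subtract_core(minuend, subtrahend);
}
```

The shape of the reduction is the point: decode picks a route, routing fills two operand slots, and one fixed subtract core serves every opcode, which is what keeps the layer count at eight despite the 22-opcode ISA.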
L3 serves two purposes: its 2-head attention handles LOAD for pointer dereference and FIND for content-addressable search, while its FFN clamps soft-attention outputs back to exact bipolar values before the ALU stage. The 3-head attention routing in L2 produces outputs that deviate further from ±1 than in the simpler single-instruction case, making this intermediate correction necessary for numerical stability.
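The clamp can be pictured as a hard-tanh built from two ReLUs; the sketch below illustrates the idea under an assumed margin m, not the actual FFN weights or the margin Loom uses.

```c
/* Clamp to exact bipolar values using two ReLUs:
 *   clamp(x) = relu(x + m)/m - relu(x - m)/m - 1
 * For x >= m the output is exactly +1, for x <= -m exactly -1, with a
 * linear ramp in between. With m = 0.5 (assumed margin), any soft-attention
 * output within 0.5 of the intended bipolar value snaps back to +1 or -1. */
static double relu(double x) { return x > 0.0 ? x : 0.0; }

static double bipolar_clamp(double x, double m)
{
    return relu(x + m) / m - relu(x - m) / m - 1.0;
}
```

Because any attention output within the margin of its intended ±1 value is restored exactly, soft-attention error cannot accumulate across loop iterations.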
| | This Work | Percepta (Tzamos et al., 2026) | Autoregressive LLM |
|---|---|---|---|
| Paradigm | Looped, fixed state | Autoregressive, trace | Autoregressive, tokens |
| State | Fixed-size tensor, 155×1024 | Growing execution trace | Growing token sequence |
| Cost per step | O(1), same 8 layers | O(log t) via HullKVCache | O(t) or O(log t) |
| Weights | Analytically computed | Constructed or trained (unspecified) | Trained on data |
| Architecture | 8 layers, d=155, custom ISA | 7 layers, d=36, 18 2D heads | Varies, billions of params |
| Program location | In the state tensor | Interpreter in weights; program as input tokens | In the prompt |
| Programs | C → custom ISA → state tensor | C/C++ → WASM bytecode → input tokens | Natural-language prompts |
| Determinism | Exact, no sampling | Exact, greedy decoding | Probabilistic |
| Memory | Constant, fixed tensor | Grows with execution time | Grows with context |
Looped Transformers as Programmable Computers by Giannou et al., 2023, showed that a fixed-depth transformer with analytically constructed weights can execute the SUBLEQ instruction. Our work addresses what follows: how to architect a multi-operation ISA on this substrate, how to compile high-level programs to it, and how the resulting architecture scales.
Extensions of the programmable-computer framework. Looped ReLU MLPs May Be All You Need as Practical Programmable Computers by Liang et al., 2024, shows that a 23-layer ReLU-MLP can also emulate a programmable computer with analytical weights, using MLPs instead of transformers. Simulation of Graph Algorithms with Looped Transformers by Giannou et al., 2024, extends the original framework with a dual-memory SUBLEQ variant for graph problems but remains single-instruction. Neural Algorithmic Reasoning for Hypergraphs with Looped Transformers by Li et al., 2025, extends looped transformer algorithmic reasoning from graphs to hypergraphs. On Expressive Power of Looped Transformers by Xu and Sato, ICML 2025, establishes formal approximation rates for looped transformers and proposes timestep encoding to enhance expressivity.
Exact neural computation. Learning to Add, Multiply, and Execute Algorithmic Instructions Exactly with Neural Networks by Back de Luca et al., NeurIPS 2025, proves that two-layer networks can learn to execute binary addition, multiplication, and Subtract-and-Branch-if-Negative exactly, using the NTK framework with trained rather than analytically constructed networks. Algorithmic Language Models with Neurally Compiled Libraries by Saldyt and Kambhampati, 2024, augments LLaMA3 with memory, registers, and basic operations, compiling algorithms into differentiable libraries.
Analytical weight construction and compilation. Tracr by Lindner et al., 2023, compiles RASP programs to transformer weights for interpretability research. Thinking Like Transformers by Weiss et al., 2021, introduced RASP as a programming language that maps to transformer primitives. Both construct weights analytically but target interpretability, not general-purpose computation. ALTA by Shaw et al., 2024, extends RASP/Tracr with loops and compiles to Universal Transformers. Transformers are Efficient Compilers, Provably by Bai et al., 2024, shows that trained transformers can perform compilation. Weights to Code, 2026, works in the reverse direction, extracting executable programs from trained transformer weights.
Transformer Turing completeness. Autoregressive Large Language Models are Computationally Universal by Schuurmans et al., 2024, shows that autoregressive decoding itself realizes universal computation via Lag systems. Constant Bit-size Transformers Are Turing Complete by Li and Wang, 2025, proves constant-parameter transformers are Turing complete with sufficient context. Softmax Transformers are Turing-Complete by Jiang et al., 2025, settles the open question for softmax attention. Efficient Turing Machine Simulation with Transformers by Li and Wang, 2025, shows that sparse attention with fixed geometric offsets suffices for efficient universal computation.
Trained neural computers. Can LLMs Be Computers? by Tzamos et al. at Percepta, 2026, implements a WebAssembly interpreter inside a 7-layer autoregressive transformer with d=36 and 18 two-dimensional attention heads. The model generates execution traces token-by-token. HullKVCache uses 2D attention heads to build convex hulls over prior tokens, giving O(log t) lookups instead of linear scans, yielding 30k tok/s throughput and multi-million-step executions. The weight construction method is not fully specified in the blog post; the authors describe both "implementing" the interpreter in weights and the model "learning" it. The approach is architecturally distinct from ours: in Percepta, the execution logic, a WASM interpreter, is encoded in the weights, with programs supplied as input tokens and executed as a growing autoregressive trace. In Loom, the weights are program-independent. Programs live in the state tensor, and the same fixed-weight model executes any compiled program at O(1) per step with no state growth.
Program execution and simulation. Universal Length Generalization with Turing Programs by Hou et al., 2024, decomposes algorithms into Turing machine steps as chain-of-thought, constructing RASP programs that simulate arbitrary TMs. Code Simulation as a Proxy for High-order Tasks in Large Language Models by La Malfa et al., 2025, studies LLMs simulating code execution as a reasoning proxy.