Loom: A Scalable Computer Architecture for Looped Transformers

Mehmet Kerem Turkcan

Loom is a computer architecture for looped transformers. Its weights are derived analytically, requiring no training data and no gradient descent. Programs are written in C, compiled to a 22-opcode ISA (including indirect memory write), and executed as iterated matrix multiplications through 8 fixed-weight transformer layers. The entire machine state fits in a fixed-size bipolar tensor. The architecture scales from 146×512 (7.4 MB, 320 instruction slots) through 155×1024 (4.7M parameters, 928 slots) to 164×2048 (9.0M parameters, 1,792 slots).

Demos
Sorting + Architecture Viz
Bubble sort with real-time visualization of all 8 transformer layers. Watch activations flow through the neural architecture as each comparison executes.
Visual Real Activations 3D
Sorting Visualizer
Bubble sort where every comparison and swap is a sequence of transformer forward passes.
Visual Animation
Snake Game
Every tick of this playable game, from movement to collision to food spawning, is computed by the transformer. About 84 steps per frame.
Game 210 Instructions
Game of Life
Conway's rules in C, executed on the larger 164×2048 model. 570 instructions. Click cells, step through generations.
Interactive Larger Model 570 Instructions
DOOM E1M1
Playable E1M1 with BSP rendering from the real WAD file. Enemy AI state machines run as compiled C on the ISA interpreter.
DOOM BSP Renderer Playable WAD File
Doom Deathmatch
DOOM WAD textures, sprites, and HUD with game logic executed entirely on the transformer. 531 instructions, about 500 steps per tick.
DOOM Deathmatch 531 Instructions
9×9 Sudoku Solver
Full 9×9 Sudoku with backtracking on the compact 146×512 model (8.8 MB). ISA interpreter or ONNX transformer.
Solver Backtracking 284 Instructions 146×512
C Debugger + Architecture Viz
Step through compiled C with real-time visualization of all 8 transformer layers. Source highlighting, variable watch, and neural activation flow.
Debugger Real Activations 3D
C Debugger
Step through compiled C one instruction at a time. Watch variables, source highlighting, and the state tensor update live.
Step Source Map Variables
Live C REPL
Write C, compile it with Pyodide, run it on the transformer. The full compile-to-run loop happens entirely client-side.
Pyodide ONNX
How Does It Work?
Opcode Data Flow
See how each of the 22 opcodes routes data through 8 transformer layers. Per-layer mechanism badges, state tensor layout, instruction format.
Architecture 8 Layers 22 Opcodes
Attention Patterns
Interactive guide to the five attention mechanisms: PC fetch, 3-head operand read, pointer dereference, content-addressable search, and write-via-attention.
Architecture Q=K Symmetric 5 Patterns
Contributions
Architecture
State X ∈ {−1, +1}^(d×n) — a bipolar tensor holding the entire machine state, e.g. 155×1024 or 164×2048.

Rows 0–29     Instruction encoding           # 3 × log₂ 1024 = 30 rows for [a, b, c] fields
Rows 30–37    Memory, 64 slots               # 8-bit signed integers in bipolar encoding
Rows 38–83    Scratchpad                     # internal routing between layers
Rows 84–93    Program Counter                # 10-bit binary address
Rows 94–103   Position Encoding              # column identity for attention addressing
Rows 104–145  Buffer, 3N + N + log n = 42    # buf_a 8, buf_b 8, buf_c 8, find_temp 8, load_temp 10
Rows 146–153  Address Tags                   # memory column identity for LOAD/FIND
Row 154       Indicator                      # marks scratchpad columns

One forward pass = one instruction:

L1  Fetch instruction at PC          # attention over position encoding
L2  Read operands + decode opcode    # 3-head attention, opcode-gated FFN routing
L3  Indirect read + correction       # LOAD/FIND attention + snap to ±1
L4  Subtract, direct borrow chain    # single-layer a−b via 6-threshold pattern
L5  Write result to memory           # attention-based memory store
L6  Branch flag + PC increment       # merged condition check + PC+1
L7  Branch select                    # choose branch target or PC+1
L8  Error correction                 # clamp values to ±1
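As a concrete reference for the memory rows, here is a sketch of one plausible bipolar encoding of an 8-bit signed integer. The LSB-first bit order and two's-complement convention are assumptions for illustration, not read off the released weights:

```python
def to_bipolar(value, bits=8):
    """Encode a signed integer as a ±1 bit vector (two's complement, LSB first)."""
    u = value & ((1 << bits) - 1)  # two's-complement wrap into `bits` bits
    return [1 if (u >> i) & 1 else -1 for i in range(bits)]

def from_bipolar(vec):
    """Decode a ±1 bit vector (LSB first) back to a signed integer."""
    u = sum(1 << i for i, b in enumerate(vec) if b == 1)
    if vec[-1] == 1:               # sign bit set -> negative value
        u -= 1 << len(vec)
    return u
```

In this convention a memory slot holding 0 is the all −1 column, and the round trip is exact for any value in [−128, 127].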
Layer Design

Giannou et al., 2023, describe a 13-layer construction for SUBLEQ; their released implementation uses 10 layers for the same single instruction. Our architecture fits 22 opcodes in 8 layers. Four design choices keep the layer count below the baseline despite the richer ISA:

1. Fused instruction decode. Giannou et al. use separate layers for instruction fetching and operand routing. We fold opcode decoding into the memory-read layer, L2, via a 3-head attention stage followed by an FFN with opcode-gated routing. The FFN examines the most significant bits of the instruction address to distinguish extended-mode opcodes at addresses 0 to 31 from SUBLEQ at addresses ≥32, then copies operands to the appropriate scratchpad positions for each operation.
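The address-based mode split can be sketched as follows; the 5-bit boundary follows from "extended at 0 to 31, SUBLEQ at ≥32", but the function name and interface are illustrative, not the project's API:

```python
def decode_mode(addr):
    """Mode select by the most significant bits of a 10-bit instruction
    address: addresses 0..31 (all high bits zero) select extended-mode
    opcodes; anything with a bit >= 32 set is plain SUBLEQ."""
    high_bits = addr >> 5          # top 5 of the 10 address bits
    return "extended" if high_bits == 0 else "subleq"
```

The FFN only needs to test whether the high address bits are all zero, which is why decode folds cheaply into the memory-read layer.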

2. Opcode-as-operand-routing. All 22 operations reduce to operand preparation for a shared subtract core. ADD a,b, for instance, loads mem[a] into the minuend and −mem[b] into the subtrahend so that subtraction yields addition. AND, OR, and XOR are realized through delta corrections applied to the subtrahend.
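The ADD example above can be sketched directly; plain integers stand in for the bipolar-encoded values, and subtract_core is an illustrative stand-in for the shared L4 subtractor:

```python
def subtract_core(minuend, subtrahend):
    # The shared ALU computes minuend - subtrahend; every opcode reuses it.
    return minuend - subtrahend

def op_add(mem, a, b):
    # ADD a,b routes mem[a] into the minuend and -mem[b] into the
    # subtrahend, so a - (-b) = a + b falls out of the subtract core.
    return subtract_core(mem[a], -mem[b])
```

Each opcode differs only in how it prepares the two inputs, so no per-opcode arithmetic layer is needed.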

3. Direct borrow-chain subtraction. L4 replaces the classical 3-layer two's complement approach, which flips bits, adds 1, then adds, with a single layer that computes a−b directly using a 6-threshold-per-bit ReLU pattern with built-in carry propagation.
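For reference, here is what L4's output must equal, written as an ordinary borrow chain over ±1-encoded bits (LSB first, an assumed convention). The actual layer computes the same function in one shot with six ReLU thresholds per bit and built-in carry propagation rather than a sequential loop:

```python
def borrow_chain_sub(a_bits, b_bits):
    """Bitwise a - b over ±1 bit vectors (LSB first) with explicit
    borrow propagation -- a plain reference for what L4 computes."""
    out, borrow = [], 0
    for a, b in zip(a_bits, b_bits):
        ai, bi = (a + 1) // 2, (b + 1) // 2  # map ±1 -> {0, 1}
        d = ai - bi - borrow                 # per-bit difference
        out.append(1 if d & 1 else -1)       # result bit back to ±1
        borrow = 1 if d < 0 else 0           # borrow into the next bit
    return out
```

Replacing flip-add-add with this direct form is what collapses three layers into one.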

4. Merged branch + PC update. L6 merges the branch-flag computation and PC increment into one layer, since these are independent FFN computations: flag reads scratchpad and opcodes, while PC+1 reads PC bits. Scratchpad correction is folded into L3's FFN alongside indirect read.
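A minimal sketch of the merged L6/L7 behavior, assuming a 10-bit wraparound PC and an illustrative flag input; the point is that the flag and PC+1 read disjoint parts of the state, so they fit in one layer:

```python
def branch_and_step(flag_taken, pc, target, pc_bits=10):
    """L6 computes the branch flag and PC+1 as independent FFN paths;
    L7 then selects between the branch target and the incremented PC.
    The wraparound at 2**pc_bits is an assumption of this sketch."""
    pc_next = (pc + 1) & ((1 << pc_bits) - 1)  # PC+1, independent of the flag
    return target if flag_taken else pc_next
```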

L3 serves two purposes: its 2-head attention handles LOAD for pointer dereference and FIND for content-addressable search, while its FFN clamps soft-attention outputs to exact bipolar values before the ALU stage. The 3-head attention routing in L2 produces outputs that deviate further from ±1 than the simpler single-instruction case, making this intermediate correction necessary for numerical stability.
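The ±1 clamp itself can be sketched with two ReLUs; the gain constant is an assumption of this sketch (the real correction FFN has analytically chosen weights), and inputs are presumed bounded away from 0:

```python
def relu(x):
    return max(x, 0.0)

def snap(x, gain=100.0):
    """Hard-limit a soft-attention output to ±1 with two ReLUs:
    a steep hard-tanh that saturates for any |x| >= 1/gain."""
    return relu(gain * x + 1.0) - relu(gain * x - 1.0) - 1.0
```

Any value whose sign survives the soft-attention mixing is restored to an exact ±1 before the ALU stage sees it.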

Comparison
                 | This Work                     | Percepta (Tzamos et al., 2026)                   | Autoregressive LLM
Paradigm         | Looped, fixed state           | Autoregressive, trace                            | Autoregressive, tokens
State            | Fixed-size tensor, 155×1024   | Growing execution trace                          | Growing token sequence
Cost per step    | O(1), same 8 layers           | O(log t) via HullKVCache                         | O(t) or O(log t)
Weights          | Analytically computed         | Constructed or trained, unspecified              | Trained on data
Architecture     | 8 layers, d=155, custom ISA   | 7 layers, d=36, 18 2D heads                      | Varies, billions of params
Program location | In the state tensor           | Interpreter in weights; program as input tokens  | In the prompt
Programs         | C → custom ISA → state tensor | C/C++ → WASM bytecode → input tokens             | Natural language prompts
Determinism      | Exact, no sampling            | Exact, greedy decoding                           | Probabilistic
Memory           | Constant, fixed tensor        | Grows with execution time                        | Grows with context
Related Work

Looped Transformers as Programmable Computers by Giannou et al., 2023, showed that a fixed-depth transformer with analytically constructed weights can execute the SUBLEQ instruction. Our work addresses what follows: how to architect a multi-operation ISA on this substrate, how to compile high-level programs to it, and how the resulting architecture scales.

Extensions of the programmable-computer framework. Looped ReLU MLPs May Be All You Need as Practical Programmable Computers by Liang et al., 2024, shows that a 23-layer ReLU-MLP can also emulate a programmable computer with analytical weights, using MLPs instead of transformers. Simulation of Graph Algorithms with Looped Transformers by Giannou et al., 2024, extends the original framework with a dual-memory SUBLEQ variant for graph problems but remains single-instruction. Neural Algorithmic Reasoning for Hypergraphs with Looped Transformers by Li et al., 2025, extends looped transformer algorithmic reasoning from graphs to hypergraphs. On Expressive Power of Looped Transformers by Xu and Sato, ICML 2025, establishes formal approximation rates for looped transformers and proposes timestep encoding to enhance expressivity.

Exact neural computation. Learning to Add, Multiply, and Execute Algorithmic Instructions Exactly with Neural Networks by Back de Luca et al., NeurIPS 2025, proves that two-layer networks can learn to execute binary addition, multiplication, and Subtract-and-Branch-if-Negative exactly, using the NTK framework with trained rather than analytically constructed networks. Algorithmic Language Models with Neurally Compiled Libraries by Saldyt and Kambhampati, 2024, augments LLaMA3 with memory, registers, and basic operations, compiling algorithms into differentiable libraries.

Analytical weight construction and compilation. Tracr by Lindner et al., 2023, compiles RASP programs to transformer weights for interpretability research. Thinking Like Transformers by Weiss et al., 2021, introduced RASP as a programming language that maps to transformer primitives. Both construct weights analytically but target interpretability, not general-purpose computation. ALTA by Shaw et al., 2024, extends RASP/Tracr with loops and compiles to Universal Transformers. Transformers are Efficient Compilers, Provably by Bai et al., 2024, shows that trained transformers can perform compilation. Weights to Code, 2026, works in the reverse direction, extracting executable programs from trained transformer weights.

Transformer Turing completeness. Autoregressive Large Language Models are Computationally Universal by Schuurmans et al., 2024, shows that autoregressive decoding itself realizes universal computation via Lag systems. Constant Bit-size Transformers Are Turing Complete by Li and Wang, 2025, proves constant-parameter transformers are Turing complete with sufficient context. Softmax Transformers are Turing-Complete by Jiang et al., 2025, settles the open question for softmax attention. Efficient Turing Machine Simulation with Transformers by Li and Wang, 2025, shows that sparse attention with fixed geometric offsets suffices for efficient universal computation.

Trained neural computers. Can LLMs Be Computers? by Tzamos et al. at Percepta, 2026, implements a WebAssembly interpreter inside a 7-layer autoregressive transformer with d=36 and 18 two-dimensional attention heads. The model generates execution traces token-by-token. HullKVCache uses 2D attention heads to build convex hulls over prior tokens, giving O(log t) lookups instead of linear scans, yielding 30k tok/s throughput and multi-million-step executions. The weight construction method is not fully specified in the blog post; the authors describe both "implementing" the interpreter in weights and the model "learning" it. The approach is architecturally distinct from ours: in Percepta, the execution logic, a WASM interpreter, is encoded in the weights, with programs supplied as input tokens and executed as a growing autoregressive trace. In Loom, the weights are program-independent. Programs live in the state tensor, and the same fixed-weight model executes any compiled program at O(1) per step with no state growth.

Program execution and simulation. Universal Length Generalization with Turing Programs by Hou et al., 2024, decomposes algorithms into Turing machine steps as chain-of-thought, constructing RASP programs that simulate arbitrary TMs. Code Simulation as a Proxy for High-order Tasks in Large Language Models by La Malfa et al., 2025, studies LLMs simulating code execution as a reasoning proxy.

All model weights are analytically derived. No training data, no gradient descent, no GPU hours. The optimized ONNX model is 7.4 MB at 146×512 (with fused KᵀQ and onnxsim). The standard model is about 16 MB at 155×1024. The Game of Life and Sudoku models are 28 to 31 MB at 164×2048. Demos require a modern browser with WebAssembly support.