The Tetris on the Home Page

There’s a small Tetris board on the home page of this blog, playing itself. The caption offers “take a turn →”, and if you click it the bot hands you the keyboard; press escape and it takes over again, calmly fixing whatever mess you left. (The same widget haunts the 404 page, where it has more room to brood.)

This post is about what’s inside it: a guideline-accurate Tetris engine in Rust, a search bot that out-digs Cold Clear 2 in the one comparison I could make rigorous, a neural network that did not survive its own benchmark, and an architecture for running tree search inside a 60 Hz frame budget without dropping a frame. The whole thing arrives in your browser as 114 KB of WebAssembly.

an AI plays.

Three weekends in 2023, then silence

In March 2023 I wanted to learn Bevy , and a Tetris clone seemed like a good excuse. The early commit log is honest about how that went: “fix: ghost block glitch”, “fix: wall kicks (untested)”, “fix: score and some sounds”. Twelve commits, a wasm deployment on April Fools’ Day, and then nothing. The commit histogram has a three-year hole in it.

When I came back in April 2026, the first commit was a four-major-version Bevy upgrade. The second thing I did was more telling: instead of adding features, I started pulling every game rule out of the ECS into a module that knows nothing about rendering. Almost everything I ended up caring about followed from that one move.

The engine is a function

The core crate, tetr-core, is just under fourteen thousand lines of Rust with zero Bevy types in it. The engine is a pure function: (seed, input frames) -> (state, events). It never reads a wall clock, and every random draw comes from one seeded RNG. Run the same seed with the same inputs and you get a byte-identical game, on macOS, on Linux, or inside a browser.

That purity is the load-bearing decision of the whole project. It makes replays trivial and lockstep multiplayer possible someday, and it made the part that actually mattered cheap: headless AI evaluation, thousands of games in a benchmark harness with no window, no frame loop, and no flakiness.

The rules themselves are guideline Tetris, taken seriously: SRS rotation with the full five-test wall kicks, a seeded 7-bag randomizer, hold, Back-to-Back, combos, and three lock-down modes. The acceptance tests are named after the spec they enforce (acceptance_kick.rs, acceptance_seven_bag.rs, acceptance_t_spin.rs), and they cover the corners people usually skip, like the guideline §7.5 exception where a T piece kicked into a slot by the fifth wall-kick test earns a full T-spin rather than a mini.

One more rule, self-imposed: the AI plays through the exact same input surface as a human. It emits key presses with DAS timing; it cannot teleport pieces. Whatever it achieves, it achieves under the same physics you do.

Teaching it to play

The bot is an evaluator plus a search, and the seam between them is the most useful idea I borrowed from Cold Clear : scores split into a Value (how good is this resting board: holes, transitions, wells, height) and a Reward (what did this particular move earn: clear type, spin, Back-to-Back), with combo and B2B state threaded along the search path so rewards stay honest in chains.

Two hand-written evaluators sit behind that interface: the classic Dellacherie-Thiery linear features, and later a verbatim port of Cold Clear 2’s freestyle evaluation with its published weights. Three searches sit on the other side: a one-ply greedy, a batch beam, and a best-first graph search with a transposition table. A unit test pins the beam at depth 1 to reproduce greedy exactly, so the tiers can’t silently drift apart.

The finding I didn’t expect: the evaluator sets the ceiling, and the search only decides how efficiently you reach it. Best-first scales smoothly with node budget right up to the eval’s optimum, around 0.67 attack per piece with the tuned weights, and then goes flat. Searching to depth 8 actually scored below depth 6 with a wider beam. Every “stronger search” experiment converged to the same number; the number only moved when I changed what the evaluator wanted.

Benchmarking against a real opponent

I wanted to know how the bot compared to Cold Clear 2 , MinusKelvin’s open-source bot and the strongest one I could actually run. CC2 publishes no performance numbers, so I had to generate the baseline myself: a referee that speaks the Tetris Bot Protocol to the CC2 binary over stdio, feeds both bots the same seeded bag, and scores everything with my engine’s attack tables.

The first metric was attack per piece from an empty board. My bot scored 1.70. CC2 scored 0.69. I would have loved to believe that, and it was nonsense. The metric was gameable: with no incoming garbage, an empty-board APP contest rewards combo-spam, and my bot had learned to farm the meter while CC2 played real B2B and T-spin attack that survives actual pressure. A search will farm any number you let it farm. I threw the metric out.

The honest replacement is a downstack race: bury both bots under nine rows of holey garbage and count pieces until it’s clean. Combo-farming doesn’t help you there. Over 24 seeds, my bot cleared in 13.6 pieces against CC2’s 16.6, roughly 18% better, and giving CC2 four times the thinking time made it slightly worse, so the gap is a difference in what the two bots want, not in how long they search. CC2 optimizes attack; my evaluator digs cleaner.

Head-to-head versus play was the obvious next claim, and I had to refuse it. My naive harness said I won 3-0, but the win was an artifact: the bot protocol has no message for incoming garbage (you can read CC2’s message enum and see it), so every garbage injection forces a stop-and-restart that throws away CC2’s search tree. Benchmarking a bot you’ve lobotomized isn’t a result. The fix was to port CC2’s evaluator natively into my framework, so both brains run on the same engine, the same search, and the same garbage: fair by construction. The port is faithful: it digs nine rows in about 17 pieces, matching what the real binary did over the protocol. In that arena, my tuned evaluator wins both axes at every search depth I tested, taking 14 of 16 versus games at the settings that ship.

Ten times faster was the goal; seven is what I banked

At launch, the best-first attack bot needed 115-158 ms per piece. I wanted 10x, and I started the way everyone starts: by guessing. I guessed the evaluator’s arithmetic was the cost. Wrong: it measured at 0.4 µs, faster than the supposedly simpler linear eval. I guessed the per-node board clone. Wrong again, about 3% of per-child time. Two confident refactor plans died in the profiler before I wrote any code.

What the criterion benchmarks and macOS sample actually showed: line-clear detection re-scanning the whole board on every lock (5-24 µs, scaling with stack height), the standard library’s SipHash dominating the transposition table, and collision checks crawling through a cell grid during move generation. The cure for all three was the same data structure: a bitboard. The search state became a Copy struct with one u64 per column, so forking a node is a stack copy with no allocation, full rows fall out of an AND-fold, and collision is a single bit test. Move generation was made generic over an Occupancy trait so it runs directly on bits while the SRS kick tables stay in the engine, never re-encoded, so the search literally cannot disagree with the game about rotation.

Every change had to pass one gate: attack-per-piece numbers byte-identical before and after. The results, per scenario:

Board	Before	After	Speedup
Clean	115 ms/piece	25 ms/piece	4.6x
Nine rows of cheese	156 ms/piece	22 ms/piece	7.1x
Heavy faucet garbage	158 ms/piece	22 ms/piece	7.2x

The differential tests that held the bitboard to the engine’s exact behavior caught three subtle landmines, and one of them turned out to be a real engine bug: a row completed entirely inside the hidden buffer zone above the visible field was being scored but never removed. The bitboard work paid for itself in correctness before it paid in speed.

The final 2.2x to a clean 10x would have required SIMD, threads (determinism is load-bearing, and wasm threads are their own adventure), or a cheaper move generator that gives up tucks and T-spins, which is to say the bot’s whole point. I declined and banked the 7x.

Who actually wrote this

The 2023 commits are mine, keystroke by keystroke. Most of the 2026 ones are not. I work on AI coding agents during the day, and for the comeback I mostly directed one instead of typing: Claude wrote the bulk of the engine rewrite, the search, the benchmark harness, and the embed, across long sessions where my job was to pick the goals, argue about the design, and read every diff. The commit log gives it away if you know what to look for. A human rarely names a commit “extract shared RootKey for stale-run + transposition identity (zero-slop batch 11/N)”.

The code generation is table stakes by now. What I enjoyed more was letting the agent run experiments. The benchmark harness reduces a bot configuration to a single number, and a single number is a fitness function, so I pointed an autonomous hill-climbing loop at the evaluator’s weights: perturb them, play a batch of headless games, keep the change if the number improves, repeat until the time budget expires. It worked. The climbed weights lifted every scenario in the behavior suite, helped most exactly where play is hardest (a 35% gain under the heaviest garbage pressure), and they are what the bot on the home page plays today. The loop also rediscovered classic tuning failures without being told about them. Climbing on the downstack fitness produced a brilliant digger that then lost versus matches. Climbing on the noisy versus fitness reported a +0.50 margin that came out to exactly +0.00 on held-out seeds. And after I had to kill two multi-hour runs by hand, every climb got a hard time budget baked in.

What the loop could not do is climb past the evaluator’s ceiling. Hill-climbing tunes toward whatever the fitness function can already see, and it stalled at the same 0.67 attack per piece as everything else. Getting further meant replacing the evaluator’s eyes entirely, which is where the neural net came in.

The value net that didn’t make it

For a while the repo contained a neural evaluator: a small MLP trained in JAX, run in the browser through Burn. It failed twice, each time more instructively.

The first failure was the obvious one. I distilled it from the weakest hand evaluator, so it faithfully learned to be exactly that strong. You don’t pass the leader by imitating the back of the pack.

The second failure was better. Retrained on games from the strong evaluator, the net had only ever seen the games of a bot that never dies. It had no examples of tall boards leading to death, so it scored dangerous stacks highly: in its training data, tall boards were always followed by lucrative digging. The beam search, dutifully maximizing the net’s opinion, steered straight into a top-out within fifteen pieces. Switching from hand features to the raw board (a 24x10 grid plus combo and B2B state, 242 inputs) fixed the model’s capacity and changed nothing about this, because the problem was distribution shift, not expressiveness: the search drives toward boards the net has never seen, where it extrapolates garbage. Even DAgger-style retraining on the net-bot’s own games just made it monotonically more pessimistic: validation error went 12.9, 16.2, 21.9 across iterations.

The conclusion, written up in a postmortem that now lives in the repo where the code used to: offline value regression from one policy’s data caps out at that policy. Getting past the hand evaluator needs actual self-play RL, which is a different project. I deleted the entire stack (the trainer, the inference crate, and Burn’s dependency tree, which had been most of my wasm size budget) and committed docs/value-net-postmortem.md in its place. Best trade of the project.

A search that fits in a frame

All of this has to run on a browser’s main thread. The shipped bot’s search took 60-120 ms per decision when run to completion, which is four to seven dropped frames every single piece. Web Workers would mean SharedArrayBuffer and cross-origin isolation headers; for a bot whose whole budget is one human reaction window, that’s a lot of machinery to avoid an architectural fix.

The architectural fix is to stop letting the search touch time at all. Work and time are separate currencies. The search session, the Mind, measures work in nodes and exposes four verbs: reroot() to aim at a new position, think(quantum) to spend a fixed slice of nodes, best() to read the current answer at any moment, and nodes_expanded() to meter it. It never reads a clock. Where thinking happens is a venue decision made by the caller: benchmarks run a blocking venue that drains the budget in one call, and the browser runs a sliced venue that spends one 16-node quantum per frame. A controller pumps the venue every poll, buffers the finished decision, and only acts when the bot’s human-like reaction delay has elapsed.

The shipped budget is 192 nodes, and the number isn’t arbitrary: a 200 ms reaction window at 60 Hz is 12 polls, times 16 nodes per poll on wasm. The budget is exactly the window’s capacity, and a test fails if anyone raises the constant, because past that point the bot thinks slower than it reacts, which is a gameplay decision someone should make on purpose rather than by editing an integer.

Because the search spends nodes rather than milliseconds, slicing it changed nothing about its play: same node count, same decisions, byte-identical games, now spread across frames. A release-build browser trace confirmed the feel: the worst main-thread task during play is 24 ms and the p99 is 2.4 ms, where each piece used to cost a 60-120 ms stall.

114 kilobytes, committed

The full game (menus, settings, bloom, the Bevy renderer) is a couple of megabytes of wasm, and the home page loads none of it. The widget is a separate 466-line crate, tetr-embed, that wraps the engine and the bot behind a wasm-bindgen Game: tick it at 60 Hz from requestAnimationFrame, feed it key events, read cells from a cached snapshot. A small Preact component draws those cells on a canvas. The payload is 114 KB of WebAssembly and about 44 KB of JavaScript, and the artifacts are committed into the blog’s repo, so the blog builds with plain Hugo and no Rust toolchain anywhere in its CI.

Two details I’m fond of. Determinism survives the FFI boundary: the same seed, handicap, and tick sequence produce byte-identical games in the embed, which makes browser bugs reproducible in a native test. And every number JavaScript passes in gets sanitized, because of a constraint the source comment states plainly: a panic across the wasm boundary aborts the entire module, “taking down every Game on the page.”

The take-a-turn mechanic costs almost nothing, structurally. The engine doesn’t know who’s driving it. Autoplay is the AI controller; clicking the button swaps in a keyboard controller fed from DOM key state; escape swaps back. Same engine, same rules, different hands.

What’s next

Versus is the obvious milestone: the engine already has garbage queues, cancellation, and attack tables, and the anytime-search architecture was shaped so a future bot can ponder continuously and re-root when garbage lands, instead of starting from scratch each piece.

Meanwhile the bot on the home page plays with a deliberate handicap: a human-ish reaction delay and a small error rate. It is beatable. If you want to try, there’s a board at the top of this post, another loitering on the home page , and the full game, with the menus and the bloom, lives here . Source for all of it is on GitHub .