As a fervent adherent of automated testing, I should have known better, but in some fairly complex vibecoded projects of mine I found myself neglecting to add tests at some point. My intuition came from a place of accepting some loss of control, of leaning into it, one might say. After all, there are people advocating for exactly that; notoriously, Yegge. There are also people arguing for giving up on PR reviews, the new bottleneck. Be that as it may, my post-hoc reasoning was roughly that I’d have to at least look at and curate the tests, if not the code directly. And since I had no wish to do that, I might as well leave it be.
Credit to Simon Willison for setting me back on track, so to speak. His point was basically that tests simply help keep the agent in check when it touches one thing and thereby breaks another. It is the same situation as when two teams work on the same codebase: regression testing.
Here’s a bit of additional reasoning of mine: even if both tests and code have been written by an agent, there is a kind of triangulation happening, provided the code is known and proven, by usage, to be stable and bug-free. The idea is the following: you know from using the app that it works, and the tests (provided you did red/green style checking, which Willison also suggests and which is good practice anyway) reflect that working state of the system. So if the tests fail, you really should expect some part of the application to be in trouble, even if you’ve never seen the tests up close.
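To make the triangulation a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `slugify` stands in for some piece of the app, and `KNOWN_GOOD` for outputs recorded while the app was known, from usage, to work. The second test is the red half of the red/green check: it confirms that the suite can actually fail.

```python
import re


def slugify(title: str) -> str:
    """Hypothetical app function: turn a title into a URL slug."""
    lowered = title.strip().lower()
    return re.sub(r"[^a-z0-9]+", "-", lowered).strip("-")


# Hypothetical outputs recorded while the app was known, from usage, to work.
KNOWN_GOOD = {
    "Hello, World!": "hello-world",
    "  Tests as a safety net  ": "tests-as-a-safety-net",
    "v2.0 release": "v2-0-release",
}


def find_regressions(fn) -> list[str]:
    """Return the inputs whose output no longer matches the recorded state."""
    return [text for text, expected in KNOWN_GOOD.items() if fn(text) != expected]


def test_slugify_matches_known_good_state():
    # Green: the current code still reproduces the recorded behaviour.
    assert find_regressions(slugify) == []


def test_suite_can_go_red():
    # Red: a deliberately broken variant must be flagged for every input.
    broken = lambda title: title.lower()
    assert find_regressions(broken) == list(KNOWN_GOOD)
```

Nobody needs to have read `KNOWN_GOOD` closely for a failure to carry information: if the first test goes red after an agent's change, something that used to work has moved.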
Also, I found that Opus 4.6 had a tendency to change one particular spot in my codebase whenever I touched it. My ideas of what I wanted to achieve there were obviously a little different from what it knew from its training data. Tests, as is well known, can also communicate intent. So in this case a few tests help cement the desired behaviour. This case, then, clearly calls for manual curation of the tests.
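As an illustration of a test that communicates intent, here is a made-up Python example; it is not the actual spot in the codebase. It assumes a hypothetical `round_price` function where ties are meant to round up, whereas both Python's built-in `round()` and the default `decimal` context round ties to the nearest even digit. An agent "simplifying" the code towards the conventional behaviour would trip the test.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_price(amount: str) -> Decimal:
    """Hypothetical: round to cents, with ties always going up."""
    return Decimal(amount).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def test_ties_round_up_on_purpose():
    # INTENT: ties round UP here, deliberately. The more common
    # round-half-to-even behaviour would give 0.12 and 2.66 instead.
    assert round_price("0.125") == Decimal("0.13")
    assert round_price("2.665") == Decimal("2.67")
```

The test name and the comment carry as much of the message as the assertions do, which is why a test like this is worth writing or at least reviewing by hand.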