research loops and self-improvement.
SUMMARY
In this solo project, I run experiments on recursive self-improvement with coding agents. I believe the ingredients are already in place, and what remains is to explore them.
I test this by investigating automated ML research, taking inspiration from Karpathy's autoresearch loop and from work by Prime Intellect. A detailed description of the experiments is here.

There are 4 degrees of freedom in these experiments: (1) the coding agent (the optimizer), (2) the program.md instructions (the config, hyperparameters, etc.), (3) the codebase, split into paths the agent can modify (the search space and initial conditions) and paths it cannot (e.g. the loss function), and (4) the memory the agent curates across iterations (artifacts and checkpoints).
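To make the four degrees of freedom concrete, here is a minimal sketch of how one experimental condition could be described in code. The class name `LoopSetup`, its fields, and the example paths are all hypothetical and are not taken from the actual harness; the only point is to show the split between paths the agent may edit and paths it may not.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class LoopSetup:
    """Hypothetical container for the four degrees of freedom."""
    agent: str                        # (1) the coding agent, i.e. the optimizer
    instructions_path: str            # (2) program.md, the loop "config"
    editable_paths: tuple[str, ...]   # (3a) paths the agent may modify (search space)
    frozen_paths: tuple[str, ...]     # (3b) paths it may not touch (e.g. the loss)
    memory: list[str] = field(default_factory=list)  # (4) notes kept across iterations

    def can_modify(self, path: str) -> bool:
        """Return True if `path` lies under an editable path and under no frozen path."""
        target = PurePosixPath(path)

        def under(root: str) -> bool:
            root_path = PurePosixPath(root)
            return target == root_path or root_path in target.parents

        if any(under(root) for root in self.frozen_paths):
            return False  # frozen paths always win over editable ones
        return any(under(root) for root in self.editable_paths)


# Example with made-up paths: the agent may edit train.py and src/, but not src/eval/.
setup = LoopSetup(
    agent="some-agent",
    instructions_path="program.md",
    editable_paths=("train.py", "src"),
    frozen_paths=("src/eval",),
)
```

Letting frozen paths take precedence means an evaluation directory nested inside an otherwise editable tree stays protected, which is the property that matters if the loss function must remain untouched.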
Below I list the experiment batches, with takeaways and next steps. Feel free to leave anonymous feedback here. I want to thank some colleagues at Tufa Labs for their suggestions and feedback on the initial state of this project.
This first batch of experiments serves to stress-test the research loop hands-on and validate the experimental infrastructure ahead of future work. Concretely, I address the following research question using the research loop codebase and harness I developed:
Research Question: Can an AI agent run a small research loop reliably? Can it make real progress?
Takeaway 1. Yes, some agents outperform random search, suggesting the loop has genuine research value.

Figure 1. Validation bpb across an autoresearch-style research loop, with each condition repeated 10 times. Random refers to a baseline policy that selects a random hyperparameter value at each research loop step. For each condition: the solid line shows the median across all 10 runs at that step; the shaded region spans the 25th–75th percentile range, and the dashed line tracks the single best run across all 10 repeats, with the star marking the lowest validation bpb achieved.
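The summary statistics in the caption of Figure 1 can be sketched as follows. This is an illustration only, not the plotting code used for the figure: I assume each run is stored as a list of validation bpb values, one per loop step, and the function name `summarize_condition` is hypothetical. Quartiles are computed with the standard library's inclusive method; the figure may use a different interpolation.

```python
import statistics


def summarize_condition(runs):
    """Summarize one condition.

    runs: list of runs, each a list of validation bpb values (one per loop step).
    Returns per-step median and 25th/75th percentiles across runs, plus the
    single best run and the location of its lowest value (the "star").
    """
    n_steps = min(len(run) for run in runs)  # truncate to the shortest run
    median, q25, q75 = [], [], []
    for step in range(n_steps):
        values = [run[step] for run in runs]
        low, mid, high = statistics.quantiles(values, n=4, method="inclusive")
        q25.append(low)
        median.append(mid)
        q75.append(high)

    best_run = min(runs, key=min)            # run that reaches the lowest bpb
    best_value = min(best_run)
    best_step = best_run.index(best_value)
    return {
        "median": median,
        "q25": q25,
        "q75": q75,
        "best_run": best_run,
        "best_step": best_step,
        "best_value": best_value,
    }


# Toy data, not real results: three runs of three steps each.
toy_runs = [[1.0, 0.9, 0.8], [1.0, 0.95, 0.9], [1.0, 0.85, 0.7]]
summary = summarize_condition(toy_runs)
```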
Takeaway 2. This loop shows a clear hierarchy across agents.

Figure 2. Evaluation run summaries. Each dot represents a single run; box plots show the median with the 25th–75th percentile range. Left: Percentage of accepted changes relative to the total number of experiments. Center: Number of new Codex sessions opened after the first one within a given experiment (by construction, Random always scores zero as Codex was not used in that condition). Right: Number of valid experiments per run (i.e. experiments that completed without crashing or errors); the horizontal dashed line indicates the theoretical maximum number of 5-minute runs achievable within a 4h15m experiment window, excluding initial overhead.
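As a quick check on two of the quantities in Figure 2, the sketch below computes the ceiling on the number of fixed-length runs in a window and the acceptance percentage. The function names are hypothetical, and how the initial overhead is handled in the figure is an assumption here: with no overhead subtracted, a 4h15m window (255 minutes) holds 255 / 5 = 51 five-minute runs.

```python
def max_runs(window_minutes, run_minutes, overhead_minutes=0):
    """Upper bound on the number of complete runs that fit in the window."""
    return (window_minutes - overhead_minutes) // run_minutes


def acceptance_rate(n_accepted, n_experiments):
    """Accepted changes as a percentage of all experiments in a run."""
    if n_experiments == 0:
        return 0.0
    return 100.0 * n_accepted / n_experiments


window = 4 * 60 + 15           # 4h15m expressed in minutes
ceiling = max_runs(window, 5)  # 51 when no overhead is subtracted
```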
Takeaway 3. A higher rate of accepted changes does not necessarily translate to lower validation bpb (trying harder $\neq$ trying smarter).

Figure 3. For each hyperparameter category (x-axis, ordered by average val bpb improvement), the bars show the share of runs in which at least one experiment targeting that category achieved a lower val bpb than the previous best. The annotations above each bar indicate the average val bpb reduction obtained when that category did improve (larger values = greater improvement). Plot inspired by Elie Bakouch.
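The two quantities in Figure 3 can be computed roughly as below. This is a sketch under stated assumptions rather than the analysis code behind the figure: each run is represented as a starting bpb plus an ordered list of `(category, val_bpb)` experiments, the share is taken over all runs in the condition, and the function name `category_stats` is hypothetical.

```python
from collections import defaultdict


def category_stats(runs):
    """Per-category improvement statistics.

    runs: list of (baseline_bpb, [(category, val_bpb), ...]) in execution order.
    Returns {category: (share_of_runs_improved, mean_reduction_or_None)}.
    """
    improved_runs = defaultdict(int)   # category -> number of runs it improved in
    reductions = defaultdict(list)     # category -> bpb reductions when it improved
    seen = set()

    for baseline, experiments in runs:
        best = baseline
        improved_here = set()
        for category, bpb in experiments:
            seen.add(category)
            if bpb < best:             # beats the previous best of this run
                improved_here.add(category)
                reductions[category].append(best - bpb)
                best = bpb
        for category in improved_here:
            improved_runs[category] += 1

    stats = {}
    for category in seen:
        share = improved_runs[category] / len(runs)
        values = reductions[category]
        mean_reduction = sum(values) / len(values) if values else None
        stats[category] = (share, mean_reduction)
    return stats


# Toy data, not real results.
toy = [
    (1.0, [("lr", 0.9), ("batch", 0.95), ("lr", 0.85)]),
    (1.0, [("batch", 0.98), ("lr", 1.1)]),
]
toy_stats = category_stats(toy)
```

Note that a category can have a high share with a small average reduction, or the reverse, which is why the figure reports both.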
Final evaluation. Agents can achieve significant improvements in final validation loss within my custom harness, though this holds for some agents and not others. The harness reveals clear differences in agentic behavior, and a surprising effectiveness of gpt-5.3-codex-spark, at least within this research loop window. The interesting question is what happens beyond it. If you are interested, I try to document everything as thoroughly as possible. You can find all the agent traces, the experiments tried in each run, and all the artifacts in here.
Testing agents in new search spaces.