research loops and self-improvement.



SUMMARY

In this solo project, I run experiments on recursive self-improvement with coding agents. I believe the ingredients are in place; we just need to explore.

I test this by investigating automated ML research, taking inspiration from Karpathy's classic autoresearch loop and from work by Prime Intellect. A detailed description of the experiments is here.
There are 4 degrees of freedom in these experiments: (1) the coding agent (the optimizer), (2) the program.md instructions (the config, hyperparameters, etc.), (3) the codebase, split into paths the agent can modify (the search space and initial conditions) and paths it cannot (e.g. the loss function), and (4) the memory the agent curates across iterations (artifacts and checkpoints).
I set up a harness where a coding agent receives instructions via a markdown file, acts on a codebase, and is evaluated against an immutable test suite. It can curate artifacts and compound knowledge across multiple research loop iterations. Throughout this project, I test variations across each degree of freedom and investigate research questions as evidence accumulates. The project is structured in batches of experiments, each defined by: (1) a question being addressed, (2) an experimental setup and the number of runs required, and (3) evaluation criteria. I document experimentation, results, and technical details as I go. The format may evolve, as this is itself an experimental setup. I am essentially a self-improving agent running experiments on self-improving agents. I love the dualism.
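As a rough illustration of the setup described above, here is a minimal Python sketch of such a loop. It is not the project's harness: `run_loop`, `LoopState`, `toy_evaluate`, and `random_policy` are hypothetical names, the "codebase" is reduced to a small hyperparameter dictionary, and the evaluator is a toy function standing in for a fixed training-and-evaluation script that the proposer cannot modify.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class LoopState:
    config: dict        # editable part of the "codebase" (here: hyperparameters)
    best_bpb: float     # best validation bpb seen so far
    memory: list = field(default_factory=list)  # notes carried across iterations


def run_loop(propose, evaluate, init_config, steps):
    """Propose a change, score it with a fixed evaluator, keep it only if it helps."""
    state = LoopState(config=dict(init_config), best_bpb=evaluate(init_config))
    for step in range(steps):
        candidate = propose(state)          # the optimizer (an agent or a baseline)
        bpb = evaluate(candidate)           # the proposer never edits this function
        accepted = bpb < state.best_bpb
        if accepted:
            state.config, state.best_bpb = candidate, bpb
        # Every attempt is logged, so later proposals can use the history.
        state.memory.append(
            {"step": step, "candidate": candidate, "bpb": bpb, "accepted": accepted}
        )
    return state


def toy_evaluate(config):
    """Toy stand-in for training + validation (assumption): lowest at lr=0.01, depth=8."""
    return 1.0 + (math.log10(config["lr"]) + 2.0) ** 2 + 0.01 * (config["depth"] - 8) ** 2


def random_policy(rng):
    """Baseline proposer: resample one hyperparameter at random each step."""
    def propose(state):
        candidate = dict(state.config)
        if rng.choice(["lr", "depth"]) == "lr":
            candidate["lr"] = 10 ** rng.uniform(-4, -1)
        else:
            candidate["depth"] = rng.randint(2, 16)
        return candidate
    return propose


state = run_loop(random_policy(random.Random(0)), toy_evaluate, {"lr": 0.1, "depth": 4}, steps=20)
print(state.best_bpb, state.config)
```

In this sketch a coding agent would be just another `propose` callable that reads `state.memory` before suggesting a change, which is one way to picture how the four degrees of freedom fit together.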

Below I list the experiment batches, with takeaways and next steps. Feel free to leave anonymous feedback here. I want to thank some colleagues at Tufa Labs for their suggestions and feedback on the initial state of this project.

the unexpected effectiveness of gpt-5.3-spark.
19.05.26

This first batch of experiments serves to stress-test the research loop hands-on and validate the experimental infrastructure ahead of future work. Concretely, I address the following research question using the research loop codebase and harness I developed:
Research Question: Can an AI agent run a small research loop reliably? Can it make real progress?
Takeaway 1. Yes, some agents outperform random search, suggesting the loop has genuine research value.
Validation bits per byte best-so-far curve with raw mean band for successful gpt-5.3-codex-spark standard runs.
Figure 1. Validation bpb across an autoresearch-style research loop, with each condition repeated 10 times. Random refers to a baseline policy that selects a random hyperparameter value at each research loop step. For each condition: the solid line shows the median across all 10 runs at that step; the shaded region spans the 25th–75th percentile range, and the dashed line tracks the single best run across all 10 repeats, with the star marking the lowest validation bpb achieved.
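The summary statistics named in the caption can be sketched as follows. This is an illustration rather than the plotting code behind the figure: `best_so_far` and `summarize` are hypothetical helpers, the three short runs are made-up numbers, and I am assuming each run's curve is its running minimum and that the "single best run" is the one ending with the lowest value.

```python
from itertools import accumulate
from statistics import median, quantiles


def best_so_far(values):
    """Running minimum: the best validation bpb reached up to each step."""
    return list(accumulate(values, min))


def summarize(runs):
    """Per-step median and 25th/75th percentiles across runs, plus the best run."""
    curves = [best_so_far(run) for run in runs]
    med, q25, q75 = [], [], []
    for t in range(len(curves[0])):
        column = [curve[t] for curve in curves]
        # "inclusive" treats the runs as the full population being described.
        q1, _, q3 = quantiles(column, n=4, method="inclusive")
        med.append(median(column))
        q25.append(q1)
        q75.append(q3)
    # Assumption: the best run is the one whose curve ends lowest.
    best_run = min(range(len(curves)), key=lambda i: curves[i][-1])
    return {
        "median": med,
        "q25": q25,
        "q75": q75,
        "best_run": best_run,
        "best_curve": curves[best_run],
        "star": curves[best_run][-1],   # lowest validation bpb achieved
    }


# Made-up values for three runs of four steps each.
runs = [
    [1.30, 1.25, 1.28, 1.20],
    [1.32, 1.31, 1.22, 1.24],
    [1.29, 1.30, 1.27, 1.26],
]
summary = summarize(runs)
print(summary["median"], summary["star"])
```

Using the median and interquartile range rather than the mean keeps a single lucky or crashed run from dominating the band, which matters with only 10 repeats per condition.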
The harness highlights clear differences in model trajectories: gpt-5.5 and gpt-5.3-codex-spark both exhibit genuine downward val bpb trends, while gpt-5.3-codex surprisingly underperforms random. At this early stage, gpt-5.3-codex-spark is largely on par with gpt-5.5. This might change with more iterations: the original Karpathy autoresearch loop ran for ~80 steps, and other studies have pushed this to hundreds of runs. I expect the gap between models to widen non-linearly, perhaps with gpt-5.5 pulling ahead once the loop runs long enough to reveal its ceiling.

Takeaway 2. This loop shows a clear hierarchy across agents.
Run summary boxplots for successful gpt-5.3-codex-spark research-loop runs.
Figure 2. Evaluation run summaries. Each dot represents a single run; box plots show the median with the 25th–75th percentile range. Left: Percentage of accepted changes relative to the total number of experiments. Center: Number of new Codex sessions opened after the first one within a given experiment (by construction, Random always scores zero as Codex was not used in that condition). Right: Number of valid experiments per run (i.e. experiments that completed without crashing or errors); the horizontal dashed line indicates the theoretical maximum number of 5-minute runs achievable within a 4h15m experiment window, excluding initial overhead.
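Two of the quantities in this caption are simple arithmetic, sketched below. The helper names `max_runs` and `accepted_share` are hypothetical, and I am assuming the ceiling is just the window length divided by the run length; the size of the initial overhead is not stated, so it is left as a parameter that defaults to zero.

```python
def max_runs(window_minutes, run_minutes, overhead_minutes=0):
    """Upper bound on back-to-back fixed-length runs that fit in the window."""
    return (window_minutes - overhead_minutes) // run_minutes


def accepted_share(accepted_flags):
    """Percentage of experiments whose change was accepted."""
    if not accepted_flags:
        return 0.0
    return 100.0 * sum(accepted_flags) / len(accepted_flags)


window = 4 * 60 + 15            # 4h15m expressed in minutes
print(max_runs(window, 5))      # ceiling when overhead is ignored
print(accepted_share([True, False, False, True]))
```

With overhead ignored, a 255-minute window holds at most 51 five-minute runs; any time the agent spends reasoning between runs pushes the realized count below that ceiling.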
Within this harness, moving from gpt-5.3-codex to gpt-5.5, agents propose more effective changes and sustain experiments for longer before ending a session. gpt-5.3-codex-spark unsurprisingly runs the most experiments on average, as its lighter reasoning overhead leaves more budget for actual experimentation, and it is also notably more resilient and effective than standard gpt-5.3-codex.

Takeaway 3. A higher rate of accepted changes does not necessarily translate to lower validation bpb (trying harder $\neq$ trying smarter).
Standard evaluation idea share for successful gpt-5.3-codex-spark research-loop runs.
Figure 3. For each hyperparameter category (x-axis, ordered by average val bpb improvement), the bars show the share of runs in which at least one experiment targeting that category achieved a lower val bpb than the previous best. The annotations above each bar indicate the average val bpb reduction obtained when that category did improve (larger values = greater improvement). Plot inspired by Elie Bakouch.
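The two per-category quantities in this caption could be computed along these lines. This is a sketch under assumptions, not the analysis code behind the figure: `category_stats` is a hypothetical helper, each experiment is reduced to a `(category, val_bpb)` pair, and the "previous best" is taken to be the running best within the run, starting from a shared baseline value.

```python
from collections import defaultdict


def category_stats(runs, baseline_bpb):
    """Share of runs where a category improved on the previous best, and mean gain."""
    improved_runs = defaultdict(set)    # category -> ids of runs where it helped
    reductions = defaultdict(list)      # category -> bpb reductions when it helped
    for run_id, experiments in enumerate(runs):
        best = baseline_bpb
        for category, bpb in experiments:
            if bpb < best:
                improved_runs[category].add(run_id)
                reductions[category].append(best - bpb)
                best = bpb
    # Categories that never improved in any run get no entry.
    return {
        category: {
            "share": len(run_ids) / len(runs),
            "mean_reduction": sum(reductions[category]) / len(reductions[category]),
        }
        for category, run_ids in improved_runs.items()
    }


# Made-up values: two runs, each a sequence of (category, val bpb) experiments.
runs = [
    [("lr", 1.28), ("depth", 1.20), ("lr", 1.25)],
    [("lr", 1.31), ("depth", 1.29)],
]
print(category_stats(runs, baseline_bpb=1.30))
```

Separating how often a category helps from how much it helps is what lets a rarely proposed idea, such as a depth change, stand out despite a low proposal count.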
While gpt-5.3-codex mostly explores learning rate tuning, both gpt-5.3-codex-spark and gpt-5.5 also propose architectural and structural changes. Notably, gpt-5.3-codex-spark's suggestions tend to be more effective at reducing validation loss. The best-performing idea so far is increasing model depth, which this model proposes less frequently but with greater impact.

Final evaluation. Agents can achieve significant improvements in final validation loss within my custom harness, though this holds for some agents and not others. The harness reveals clear differences in agentic behavior, and a surprising effectiveness of gpt-5.3-codex-spark, at least within this research loop window. The interesting question is what happens beyond it. If you are interested, I try to document everything as thoroughly as possible. You can find all the agent traces, the experiments tried in each run, and all the artifacts here.


coming soon.
??.??.26

Testing agents in new search spaces.