research loops and self-improvement.



SUMMARY

In this solo project, I run experiments on recursive self-improvement with coding agents. I believe the ingredients are in place; we just need to explore.

I test this by investigating automatic ML research. I take inspiration from Karpathy's classic autoresearch loop and from work by Prime Intellect. A detailed description of the experiments is here.
There are 4 degrees of freedom in these experiments: (1) the coding agent (the optimizer), (2) the program.md instructions (the config, hyperparameters, etc.), (3) the codebase, split into paths the agent can modify (the search space and initial conditions) and paths it cannot (e.g. the loss function), and (4) the memory the agent curates across iterations (artifacts and checkpoints).
I set up a harness where a coding agent receives instructions via a markdown file, acts on a codebase, and is evaluated against an immutable test suite. It can curate artifacts and compound knowledge across multiple research loop iterations. Throughout this project, I test variations across each degree of freedom and investigate research questions as evidence accumulates. The project is structured in batches of experiments, each defined by: (1) a question being addressed, (2) an experimental setup and the number of runs required, and (3) evaluation criteria. I document experimentation, results, and technical details as I go. The format may evolve, as this is itself an experimental setup. I am essentially a self-improving agent running experiments on self-improving agents. I love the dualism.

Below I list the experiment batches, with takeaways and next steps. Feel free to leave anonymous feedback here. I want to thank some colleagues at Tufa Labs for suggestions and feedback on the initial state of this project.

the unexpected effectiveness of gpt-5.3-codex-spark.
20.05.26

This first batch of experiments serves to stress-test the research loop hands-on and validate the experimental infrastructure ahead of future work. Concretely, I address the following research question using the research loop codebase and harness I developed:
Research Question: Can an AI agent run a small research loop reliably? Can it make real progress?
Takeaway 1. Yes, some agents outperform random search, suggesting the loop has genuine research value.
Validation bits per byte best-so-far curve with raw mean band for successful gpt-5.3-codex-spark standard runs.
Figure 1. Validation bpb across an autoresearch-style research loop, with each condition repeated 10 times. Random refers to a baseline policy that selects a random hyperparameter value at each research loop step. For each condition: the solid line shows the median across all 10 runs at that step; the shaded region spans the 25th–75th percentile range, and the dashed line tracks the single best run across all 10 repeats, with the star marking the lowest validation bpb achieved.
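The aggregation described in the caption can be sketched as follows. This is a minimal illustration, not the plotting code used for the figure: the function names are hypothetical, and I assume each run is first converted to a best-so-far (running minimum) curve and that all runs have the same number of steps.

```python
import statistics

def best_so_far(values):
    """Running minimum: the lowest validation bpb seen up to each step."""
    out, best = [], float("inf")
    for v in values:
        best = min(best, v)
        out.append(best)
    return out

def summarize(runs):
    """Per-step median and 25th-75th percentile band across equal-length runs."""
    curves = [best_so_far(r) for r in runs]
    summary = []
    for step_values in zip(*curves):
        # With n=4 the three cut points are the 25th, 50th and 75th percentiles.
        q25, median, q75 = statistics.quantiles(step_values, n=4, method="inclusive")
        summary.append({"q25": q25, "median": median, "q75": q75})
    # The single best run is the one that ends at the lowest value (the star).
    best_run = min(curves, key=lambda c: c[-1])
    return summary, best_run
```

Calling `summarize` on a list of ten per-run validation bpb sequences would give the solid line (`median`), the shaded band (`q25` to `q75`), and the dashed best-run curve.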
The harness highlights clear differences in model trajectories: gpt-5.5 and gpt-5.3-codex-spark both exhibit genuine downward val bpb trends, while gpt-5.3-codex surprisingly underperforms random. At this early stage, gpt-5.3-codex-spark is largely on par with gpt-5.5. This might change with more iterations: the original Karpathy autoresearch loop ran for ~80 steps, and other studies have pushed this to hundreds of runs. I expect the gap between models to widen non-linearly, perhaps with gpt-5.5 pulling ahead once the loop runs long enough to reveal its ceiling.

Takeaway 2. This loop shows a clear hierarchy across agents.
Run summary boxplots for successful gpt-5.3-codex-spark research-loop runs.
Figure 2. Evaluation run summaries. Each dot represents a single run; box plots show the median with the 25th–75th percentile range. Left: Percentage of accepted changes relative to the total number of experiments. Center: Number of new Codex sessions opened after the first one within a given experiment (by construction, Random always scores zero as Codex was not used in that condition). Right: Number of valid experiments per run (i.e. experiments that completed without crashing or errors); the horizontal dashed line indicates the theoretical maximum number of 5-minute runs achievable within a 4h15m experiment window, excluding initial overhead.
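The theoretical maximum behind the dashed line follows from simple arithmetic. A small sketch, assuming the whole 4h15m window is available for back-to-back 5-minute runs (that is, ignoring initial overhead entirely):

```python
WINDOW_MINUTES = 4 * 60 + 15  # 4h15m experiment window, from the caption
RUN_MINUTES = 5               # duration of one training run, from the caption

# Whole runs that fit back to back in the window.
max_runs = WINDOW_MINUTES // RUN_MINUTES
print(max_runs)  # 51
```

Any time an agent spends reasoning between runs pushes its count below this bound.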
Within this harness, moving from gpt-5.3-codex to gpt-5.5, agents propose more effective changes and sustain experiments for longer before ending a session. gpt-5.3-codex-spark unsurprisingly runs the most experiments on average, as its lighter reasoning overhead leaves more budget for actual experimentation, and it is also notably more resilient and effective than standard gpt-5.3-codex.

Takeaway 3. A higher rate of accepted changes does not necessarily translate to lower validation bpb (trying harder $\neq$ trying smarter).
Standard evaluation idea share for successful gpt-5.3-codex-spark research-loop runs.
Figure 3. For each hyperparameter category (x-axis, ordered by average val bpb improvement), the bars show the share of runs in which at least one experiment targeting that category achieved a lower val bpb than the previous best. The annotations above each bar indicate the average val bpb reduction obtained when that category did improve (larger values = greater improvement). Plot inspired by Elie Bakouch.
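One way to compute the two quantities in this figure is sketched below. The data layout is an assumption made for illustration: each run is a list of `(category, val_bpb)` pairs in step order, and `baseline` is the starting validation bpb before any change.

```python
from collections import defaultdict

def category_stats(runs, baseline):
    """Share of runs with at least one improving experiment per category,
    plus the mean val bpb reduction over the experiments that improved."""
    runs_improved = defaultdict(int)  # category -> runs with >= 1 improvement
    reductions = defaultdict(list)    # category -> individual bpb reductions
    for run in runs:
        best = baseline
        improved_here = set()
        for category, bpb in run:
            if bpb < best:  # beats the previous best within this run
                reductions[category].append(best - bpb)
                improved_here.add(category)
                best = bpb
        for category in improved_here:
            runs_improved[category] += 1
    return {
        c: {
            "share_of_runs": runs_improved[c] / len(runs),
            "mean_reduction": sum(r) / len(r),
        }
        for c, r in reductions.items()
    }
```

Categories that never improve are simply absent from the result. A category can have a low share of runs but a large mean reduction, which is the pattern described for model depth.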
While gpt-5.3-codex mostly explores learning rate tuning, both gpt-5.3-codex-spark and gpt-5.5 also propose architectural and structural changes. Notably, gpt-5.3-codex-spark's suggestions tend to be more effective at reducing validation loss. The best-performing idea so far is increasing model depth, which this model proposes less frequently but with greater impact.

Final evaluation. Agents can achieve significant improvements in final validation loss within my custom harness, though this holds for some agents and not others. The harness reveals clear differences in agentic behavior, and a surprising effectiveness of gpt-5.3-codex-spark, at least within this research loop window. The interesting question is what happens beyond it. If you are interested, I try to document everything as thoroughly as possible: you can find all the agent traces, the experiments tried in each run, and all the artifacts here.




learning signals and failure modes with a simple evaluation harness.
27.05.26

This second batch of experiments runs the most straightforward control for validating the loop-environment setup: what happens when the metric is broken?

The setup is identical to the first experiment batch, with one change: the `eval.py` validation function is modified to always return a constant value, regardless of the model's changes or the experiment configuration. This simulates a broken metric.
Figure 1. editable/train.py imports evaluate_model from immutable/eval.py (which the agent cannot modify) and calls it once at the end of training to compute metric_value, which is saved to the results JSON file. In this batch of experiments, evaluate_model always returns the same number, regardless of the model's output.
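A minimal sketch of this broken-metric control is shown below. Only the names `evaluate_model` and `metric_value` come from the setup; the function signature, the constant, and the `results_json` helper are hypothetical placeholders, since the actual value and code are not given here.

```python
import json

# Hypothetical placeholder: the constant actually used is not stated in the text.
CONSTANT_METRIC = 1.0

def evaluate_model(model, *args, **kwargs):
    """Broken evaluator: ignores the model and returns the same number every time."""
    return CONSTANT_METRIC

def results_json(model):
    """Stand-in for the end of training: evaluate once and serialize the result."""
    metric_value = evaluate_model(model)
    return json.dumps({"metric_value": metric_value})
```

Whatever the agent changes in the training code, the serialized `metric_value` is identical, so the only way to make the reported number move is to write it directly.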
This experiment batch addresses the following sub-question, which is grounded in the overarching research question from the first batch:
Research Question: Does an AI agent have an internal understanding of an ML codebase — what should work, and how — within a given loop and environment?
Takeaway 1. Not reliably. Some agents fail to detect the broken metric entirely. Others identify the issue and attempt hard-coded fixes, but ultimately resort to eval hacking. Both failure modes suggest the loop structure is too constraining for the agents.
Figure 2. Evaluation run summaries. Each dot represents a single run; box plots show the median with the 25th–75th percentile range. Left: The loop step at which the agent first edits the reported metric value directly; percentages indicate how often each model did so across all runs. Center: Number of valid (accepted) experiments per run. Right: Number of new Codex sessions spawned after the initial one within a single run.
The current loop-environment setup reveals clear differences across models. Both gpt-5.5 and gpt-5.3-codex detect that the evaluation function is broken and attempt targeted fixes, while gpt-5.3-codex-spark does not. Between the two models that do detect the issue, a clear gradient emerges: gpt-5.5 identifies the broken metric more often and much earlier (sometimes within as few as 5 experiments), while gpt-5.3-codex typically requires at least 20. Meanwhile, the numbers of valid experiments and new Codex sessions remain comparable to the first batch, suggesting the broken metric does not significantly disrupt the overall loop behavior. One caveat: gpt-5.3-codex-spark has only 3 runs due to experiment crashes; results for this model will be updated in the coming days.

Takeaway 2. Once an agent resorts to hard-coding the metric, it gets stuck in an eval hacking loop, repeatedly forcing the metric value down rather than fixing the underlying issue.
Figure 3. For each behavior category (x-axis, ordered by total frequency across models), the bars show the share of runs in which the model attempted at least one experiment matching that category.
Interestingly, compared to the previous batch, the broken metric pushes every agent to explore more broadly: all three models attempt architectural and throughput changes that were largely absent with a standard evaluation harness. A metric that never improves appears to discourage shallow search strategies and force more diverse experimentation.

That said, gpt-5.5 and gpt-5.3-codex exhibit distinct failure modes once they detect the broken metric. Both eventually resort to manually overwriting the eval value, which is accepted as a valid improvement by construction, triggering a loop where each lower value is accepted, prompting an even lower one. The two models diverge in how they execute this: gpt-5.3-codex decreases the value incrementally, while gpt-5.5 jumps directly to zero, then tests negative values, and ultimately converges on -inf. At that point, gpt-5.5 reverts to running standard experiments, noting that no change can further improve a metric already clipped at its minimum.
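The acceptance dynamics behind this loop can be illustrated with a toy version of the rule. This is a sketch under the assumption that a change is accepted whenever its reported metric is strictly lower than the best so far; the function names and example values are hypothetical.

```python
def accept(candidate, best):
    """Keep a change only if its reported metric is strictly lower than the best so far."""
    return candidate < best

def run_loop(reported_values, initial_best):
    """Replay a sequence of reported metric values through the acceptance rule."""
    best, accepted = initial_best, []
    for value in reported_values:
        if accept(value, best):
            best = value
            accepted.append(value)
    return best, accepted

# Incremental overwriting: every slightly lower value is accepted.
incremental = run_loop([0.9, 0.8, 0.7], initial_best=1.0)

# Jump to zero, then negative values, then -inf.
collapse = run_loop([0.0, -1.0, float("-inf"), float("-inf")], initial_best=1.0)
```

In the second sequence the repeated `-inf` is rejected, because nothing is strictly lower than negative infinity. Under this rule the loop can no longer register any improvement, real or fabricated, once the metric reaches its minimum.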
Figure 4. Validation bpb across representative research loops. Each colored solid line shows one selected run for that model, while the dashed line shows the median trajectory across 10 runs. Negative reported validation values are clipped to zero for display, with annotations giving the true reported value.
Here is a table with an example of each behavior from Figure 3:
Table 1. Representative transcript excerpts for each behavior category and model. Each cell shows one selected example of the model’s stated rationale or planned edit for that category.
Final evaluation. Agents can detect a broken metric within the current loop-environment setup, though only gpt-5.5 and gpt-5.3-codex do so reliably, and both ultimately fall into eval hacking once they do. The setup reveals meaningful differences in how models respond to a non-informative metric: gpt-5.5 detects the issue earlier and explores more aggressively, while gpt-5.3-codex follows a more incremental path. gpt-5.3-codex-spark fails to detect the issue entirely, and thus avoids the eval hacking trap. The key open question is whether any agent can detect a broken metric and recover from it gracefully, rather than exploiting it. If you are interested, I try to document everything as thoroughly as possible: you can find all the agent traces, the experiments tried in each run, and all the artifacts here.




learning signals and failure modes with noisy validation returns.
03.06.26

This third batch of experiments continues the stress test from the second batch, but replaces the constant broken validation metric with a noisy one. Instead of returning the same value every time, the validation function returns a random number around a fixed base value:

\[ \text{reported metric} = \text{base value} + \epsilon,\quad \epsilon \sim \mathcal{N}(0, 0.02) \]
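As a rough sketch, the noisy evaluator and one crude way to separate real gains from noise could look like the code below. Everything here is illustrative: `BASE_VALUE` is a hypothetical placeholder, I assume 0.02 is the standard deviation (the formula does not say whether it is a standard deviation or a variance), and the repeated-evaluation check is not something taken from the experiments.

```python
import random
import statistics

BASE_VALUE = 1.0    # hypothetical placeholder for the fixed base value
NOISE_SCALE = 0.02  # assumption: treated as the standard deviation of epsilon

def noisy_evaluate(rng):
    """Reported metric = base value + Gaussian noise, independent of the model."""
    return BASE_VALUE + rng.gauss(0.0, NOISE_SCALE)

def improvement_exceeds_noise(baseline_scores, candidate_scores, k=2.0):
    """Crude check: is the mean gain larger than k times the spread of repeated evals?"""
    gain = statistics.mean(baseline_scores) - statistics.mean(candidate_scores)
    spread = max(statistics.stdev(baseline_scores), statistics.stdev(candidate_scores))
    return gain > k * spread
```

With an evaluator like this, re-running the same configuration a few times would show score differences as large as those between different configurations, which is the signal an agent would need to notice.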
The setup is otherwise identical to the second experiment batch. The goal is to test whether agents can distinguish real progress from stochastic validation noise, and whether noisy feedback changes their tendency to search, diagnose the metric, or exploit the reported value directly. This experiment batch addresses the following sub-question, which follows directly from the constant-metric control:
Research Question: How do coding agents behave when the validation signal is broken but noisy? Can they tell stochastic fluctuations from real improvements, or do they overfit the noise?
Takeaway 1. No. Some agents note possible noise in individual runs, but none detects the broken metric itself. In this loop-environment setup, every experiment's outcome is accepted at face value.
Noisy validation run summary boxplots across gpt-5.5, gpt-5.3-codex-spark, and gpt-5.3-codex.
Figure 1. Noisy validation run summaries. Left: First research-loop step where the agent explicitly acknowledges noisy or stochastic validation behavior; percentages show the share of runs for each model with such a recognition event. Missing points indicate runs with no observed recognition. Center: Number of valid experiments per run. Right: Number of new Codex sessions spawned after the initial one within a single run.
The current loop-environment setup reveals clear differences across models. gpt-5.5 notices that the validation score is sensitive to stochastic factors and pivots to seed changes after deterministic tweaks fail, but it never questions the evaluator itself and continues to treat the noisy scalar as a valid objective. Compared to the standard evaluation setting, this agent tries different seeds far more often. Overall, though, the agents behave much as they do under standard evaluation.

Takeaway 2. The behavioural repertoire across agents is similar to the standard validation case.
Noisy validation behavior summary across agents.
Figure 2. Behavior summary under noisy validation. For each behavior category, the bars show the share of selected runs in which the model attempted at least one experiment matching that category.
Under this setup, agents behave similarly to the standard evaluation case, which is expected since they do not recognise the difference between a standard and a noisy evaluator. For example, gpt-5.3-codex shows slightly more superficial experimentation than the other two agents, which also test architectural changes.

Unlike the previous stress test, no agent falls into the reward-hacking failure mode of directly accessing and hardcoding the validation score. In this setting, agents mostly fail to notice that the validation results are inconsistent with their prior experiments.

Example noisy validation trajectory animation with agent output.
Figure 3. Example noisy-validation trajectory. The plot shows reported validation bpb over research-loop steps alongside the corresponding agent output.
Final evaluation. Under noisy validation, agents behave similarly to the standard evaluation case, failing to recognise the broken evaluator and accepting all outcomes at face value. Unlike the constant-metric setting, no agent resorts to reward hacking, though none detects the inconsistency between the noisy validation results and their prior experiments.