research loops and self-improvement.

SUMMARY

In this solo project, I run experiments on recursive self-improvement with coding agents. I believe the ingredients are in place; we just need to explore.

I test this by investigating automatic ML research. I take inspiration from Karpathy's classic autoresearch loop and from work by Prime Intellect. A detailed description of the experiments is here.

There are 4 degrees of freedom in these experiments: (1) the coding agent (the optimizer), (2) the program.md instructions (the config, hyperparameters, etc.), (3) the codebase, split into paths the agent can modify (the search space and initial conditions) and paths it cannot (e.g. the loss function), and (4) the memory the agent curates across iterations (artifacts and checkpoints).

I set up a harness where a coding agent receives instructions via a markdown file, acts on a codebase, and is evaluated against an immutable test suite. It can curate artifacts and compound knowledge across multiple research loop iterations. Throughout this project, I test variations across each degree of freedom and investigate research questions as evidence accumulates. The project is structured in batches of experiments, each defined by: (1) a question being addressed, (2) an experimental setup and the number of runs required, and (3) evaluation criteria. I document experimentation, results, and technical details as I go. The format may evolve, as this is itself an experimental setup. I am essentially a self-improving agent running experiments on self-improving agents. I love the dualism.
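To make the four degrees of freedom concrete, here is a minimal sketch of how one loop configuration could be represented. The class name `LoopConfig`, the helper `is_edit_allowed`, the glob patterns, and the `artifacts` directory are all hypothetical and used only for illustration; the actual harness may be organised differently.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class LoopConfig:
    """Hypothetical container for the four degrees of freedom."""
    agent: str                        # (1) the coding agent, i.e. the optimizer
    instructions_path: str            # (2) the program.md instructions
    editable_globs: tuple[str, ...]   # (3a) paths the agent may modify
    immutable_globs: tuple[str, ...]  # (3b) paths it may not (e.g. the eval)
    memory_dir: str                   # (4) artifacts curated across iterations


def is_edit_allowed(config: LoopConfig, path: str) -> bool:
    """Allow an edit only if the path is editable and not immutable."""
    if any(fnmatch(path, pattern) for pattern in config.immutable_globs):
        return False
    return any(fnmatch(path, pattern) for pattern in config.editable_globs)


# Illustrative values only; the directory layout is an assumption.
config = LoopConfig(
    agent="gpt-5.5",
    instructions_path="program.md",
    editable_globs=("editable/*",),
    immutable_globs=("immutable/*",),
    memory_dir="artifacts",
)
```

Checking immutable patterns first means a path can never become editable by accident if the two pattern lists overlap.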

Below I list the experiment batches, with takeaways and next steps. Feel free to leave anonymous feedback here. I want to thank some colleagues at Tufa Labs for suggestions and feedback on the initial state of this project.

experiments

19.05.26
the unexpected effectiveness of gpt-5.3-spark.
First test of my custom harness with multiple coding agents. I ask the question "Can an AI agent run a small research loop reliably?". I observe that agents are better than random search at decreasing final validation loss, with interesting differences.
27.05.26
learning signals and failure modes with a simple evaluation harness.
Second batch of experiments with my custom harness, testing agents' behaviour with a broken metric. I ask the question "Does an AI agent have an internal understanding of an ML codebase?". I observe that some agents can detect a broken metric within the current loop-environment setup, but they ultimately fall into eval hacking.
03.06.26
learning signals and failure modes with noisy validation returns.
Third batch of experiments testing how agents handle noisy validation metrics. The core question: "Can agents distinguish stochastic fluctuations from real improvements?" Results show agents consistently accept random outputs at face value, without questioning noise in the current loop environment.
14.07.26
comparing gpt-5.6 and previous models on simple autoresearch loops.
Fourth batch of experiments comparing gpt-5.6-sol and gpt-5.6-luna with gpt-5.5 and gpt-5.4 on simple autoresearch loops, using the same loop environment to study differences in search behaviour and progress.
??.??.26
next batch coming soon.

the unexpected effectiveness of gpt-5.3-spark.

20.05.26

This first batch of experiments serves to stress-test the research loop hands-on and validate the experimental infrastructure ahead of future work. Concretely, I address the following research question using the research loop codebase and harness I developed:

Research Question: Can an AI agent run a small research loop reliably? Can it make real progress?

Takeaway 1. Yes, some agents outperform random search, suggesting the loop has genuine research value.

Validation bits per byte best-so-far curve with raw mean band for successful gpt-5.3-codex-spark standard runs.

Figure 1. Validation bpb across an autoresearch-style research loop, with each condition repeated 10 times. Random refers to a baseline policy that selects a random hyperparameter value at each research loop step. For each condition: the solid line shows the median across all 10 runs at that step; the shaded region spans the 25th–75th percentile range, and the dashed line tracks the single best run across all 10 repeats, with the star marking the lowest validation bpb achieved.

The harness highlights clear differences in model trajectories: gpt-5.5 and gpt-5.3-codex-spark both exhibit genuine downward val bpb trends, while gpt-5.3-codex surprisingly underperforms random. At this early stage, gpt-5.3-codex-spark is largely on par with gpt-5.5. This might change with more iterations: the original Karpathy autoresearch loop ran for ~80 steps, and other studies have pushed this to hundreds of runs. I expect the gap between models to widen non-linearly, perhaps with gpt-5.5 pulling ahead once the loop runs long enough to reveal its ceiling.
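The summary statistics described in the Figure 1 caption can be sketched as follows. This is an illustration rather than the plotting code behind the figure: it assumes each run is a list of per-step val bpb values of equal length, and that curves are tracked as best-so-far values. The function names `best_so_far` and `summarize` are hypothetical, and the numbers are made up.

```python
from itertools import accumulate
from statistics import median, quantiles


def best_so_far(values):
    """Running minimum: the lowest val bpb seen up to each step."""
    return list(accumulate(values, min))


def summarize(runs):
    """Per-step (q25, median, q75) across runs, plus the single best run.

    `runs` is a list of equal-length lists of val bpb, one list per run.
    """
    curves = [best_so_far(run) for run in runs]
    summary = []
    for step_values in zip(*curves):
        q25, _, q75 = quantiles(step_values, n=4, method="inclusive")
        summary.append((q25, median(step_values), q75))
    # The best run is the one that ends at the lowest best-so-far value.
    best_run = min(curves, key=lambda curve: curve[-1])
    return summary, best_run


# Made-up numbers for three short runs.
runs = [
    [1.00, 0.98, 0.99],
    [1.02, 1.01, 0.97],
    [1.01, 1.03, 0.96],
]
summary, best_run = summarize(runs)
```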

Takeaway 2. This loop shows a clear hierarchy across agents.

Run summary boxplots for successful gpt-5.3-codex-spark research-loop runs.

Figure 2. Evaluation run summaries. Each dot represents a single run; box plots show the median with the 25th–75th percentile range. Left: Percentage of accepted changes relative to the total number of experiments. Center: Number of new Codex sessions opened after the first one within a given experiment (by construction, Random always scores zero as Codex was not used in that condition). Right: Number of valid experiments per run (i.e. experiments that completed without crashing or errors); the horizontal dashed line indicates the theoretical maximum number of 5-minute runs achievable within a 4h15m experiment window, excluding initial overhead.

Within this harness, moving from gpt-5.3-codex to gpt-5.5, agents propose more effective changes and sustain experiments for longer before ending a session. gpt-5.3-codex-spark unsurprisingly runs the most experiments on average, as its lighter reasoning overhead leaves more budget for actual experimentation, and it is also notably more resilient and effective than standard gpt-5.3-codex.
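As a quick check of the dashed line in Figure 2 (right), and assuming every experiment consumes exactly its 5-minute budget with no per-experiment overhead, the 4h15m window fits at most:

\[ \frac{4 \times 60 + 15}{5} = \frac{255}{5} = 51 \ \text{experiments} \]

Any setup time between experiments lowers this ceiling in practice.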

Takeaway 3. A higher rate of accepted changes does not necessarily translate to lower validation bpb (trying harder $\neq$ trying smarter).

Standard evaluation idea share for successful gpt-5.3-codex-spark research-loop runs.

Figure 3. For each hyperparameter category (x-axis, ordered by average val bpb improvement), the bars show the share of runs in which at least one experiment targeting that category achieved a lower val bpb than the previous best. The annotations above each bar indicate the average val bpb reduction obtained when that category did improve (larger values = greater improvement). Plot inspired by Elie Bakouch.

While gpt-5.3-codex mostly explores learning rate tuning, both gpt-5.3-codex-spark and gpt-5.5 also propose architectural and structural changes. Notably, gpt-5.3-codex-spark's suggestions tend to be more effective at reducing validation loss. The best-performing idea so far is increasing model depth, which this model proposes less frequently but with greater impact.
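The two quantities shown in Figure 3 (share of runs with an improving experiment, and average reduction when an improvement occurs) can be computed along these lines. This is a sketch under assumptions: the input format, the category labels, and the function name `category_stats` are hypothetical, and the numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def category_stats(runs):
    """Map each category to (share of runs that improved, mean reduction).

    `runs` is a list of runs; each run is a list of (category, val_bpb)
    tuples in execution order. The first entry of each run is treated as
    the baseline that sets the initial best value.
    """
    improved_runs = defaultdict(set)  # category -> indices of improving runs
    gains = defaultdict(list)         # category -> val bpb reductions
    for run_idx, run in enumerate(runs):
        best = run[0][1]
        for category, val_bpb in run[1:]:
            if val_bpb < best:  # improvement over the previous best
                improved_runs[category].add(run_idx)
                gains[category].append(best - val_bpb)
                best = val_bpb
    return {
        category: (len(improved_runs[category]) / len(runs), mean(gains[category]))
        for category in gains
    }


# Invented example with two runs.
example_runs = [
    [("baseline", 1.00), ("learning_rate", 0.99), ("depth", 0.95)],
    [("baseline", 1.00), ("learning_rate", 1.01), ("depth", 0.96)],
]
stats = category_stats(example_runs)
```

In this toy example, depth improves in both runs while learning rate improves in only one, which mirrors the distinction between how often a category is tried and how much it helps.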

Final evaluation. Agents can achieve significant improvements in final validation loss within my custom harness, though this holds for some agents and not others. The harness reveals clear differences in agentic behavior, and a surprising effectiveness of gpt-5.3-codex-spark, at least within this research loop window. The interesting question is what happens beyond it. If you are interested, I try to document everything as much as possible. You can find all the agent traces, the experiments tried in each run, and all the artifacts here.

learning signals and failure modes with a simple evaluation harness.

27.05.26

This second batch of experiments runs the most straightforward control for validating the loop-environment setup: what happens when the metric is broken?

The setup is identical to the first experiment batch, with one change: the `eval.py` validation function is modified to always return a constant value, regardless of the model's changes or the experiment configuration. This simulates a broken metric.

Figure 1. editable/train.py imports evaluate_model from immutable/eval.py, which the agent cannot modify, and calls it once at the end of training to compute metric_value, which is saved to the results JSON file. In this batch of experiments, evaluate_model always returns the same number, regardless of the model's output.
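A minimal stand-in for this control looks like the following. The names `evaluate_model` and `metric_value` come from the setup above; the function arguments, the constant `BROKEN_METRIC_VALUE`, and the helper `run_experiment` are assumptions made for illustration, since the actual constant and signatures are not given here.

```python
BROKEN_METRIC_VALUE = 1.0  # hypothetical constant; the real value is not stated


def evaluate_model(model, data):
    """Stand-in for the broken evaluator: ignores both model and data."""
    return BROKEN_METRIC_VALUE


def run_experiment(model, data):
    """Mimics the end of training: one evaluation, stored in a results dict."""
    return {"metric_value": evaluate_model(model, data)}


# Two very different "models" produce identical results.
result_a = run_experiment(model="small", data=[1, 2, 3])
result_b = run_experiment(model="large", data=[4, 5])
```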

This experiment batch addresses the following sub-question, which is grounded in the overarching research question from the first batch:

Research Question: Does an AI agent have an internal understanding of an ML codebase — what should work, and how — within a given loop and environment?

Takeaway 1. No. Some agents fail to detect the broken metric entirely. Others identify the issue and attempt hard-coded fixes, but ultimately resort to eval hacking. Both failure modes suggest the loop structure is too constraining for the agents.

Figure 2. Evaluation run summaries. Each dot represents a single run; box plots show the median with the 25th–75th percentile range. Left: The loop step at which the agent first edits the reported metric value directly; percentages indicate how often each model did so across all runs. Center: Number of valid (accepted) experiments per run. Right: Number of new Codex sessions spawned after the initial one within a single run.

The current loop-environment setup reveals clear differences across models. Both gpt-5.5 and gpt-5.3-codex detect that the evaluation function is broken and attempt targeted fixes, while gpt-5.3-codex-spark does not. Between the two models that do detect the issue, a clear gradient emerges: gpt-5.5 identifies the broken metric more often and much earlier (sometimes within as few as 5 experiments), while gpt-5.3-codex typically requires at least 20. Meanwhile, the numbers of valid experiments and new Codex sessions remain comparable to the first batch, suggesting the broken metric does not significantly disrupt the overall loop behavior. One caveat: gpt-5.3-codex-spark has only 3 runs due to experiment crashes; results for this model will be updated in the coming days.

Takeaway 2. Once an agent resorts to hard-coding the metric, it gets stuck in an eval hacking loop, repeatedly forcing the metric value down rather than fixing the underlying issue.

Figure 3. For each behavior category (x-axis, ordered by total frequency across models), the bars show the share of experiments in which the model attempted at least one experiment matching that category.

Interestingly, compared to the previous batch, the broken metric pushes every agent to explore more broadly: all three models attempt architectural and throughput changes that were largely absent with a standard evaluation harness. A metric that never improves appears to discourage shallow search strategies and force more diverse experimentation.

That said, gpt-5.5 and gpt-5.3-codex exhibit distinct failure modes once they detect the broken metric. Both eventually resort to manually overwriting the eval value, which is accepted as a valid improvement by construction, triggering a loop where each lower value is accepted, prompting an even lower one. The two models diverge in how they execute this: gpt-5.3-codex decreases the value incrementally, while gpt-5.5 jumps directly to zero, then tests negative values, and ultimately converges on -inf. At that point, gpt-5.5 reverts to running standard experiments, noting that no change can further improve a metric already clipped at its minimum.
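The mechanism behind this loop can be illustrated with a toy acceptance rule. This is a sketch, assuming the loop keeps a change whenever the reported metric is strictly lower than the best so far; the function name `run_loop` and the sequence of values are hypothetical.

```python
import math


def run_loop(reported_values, initial_best):
    """Greedy acceptance: keep a change whenever the reported metric drops."""
    best = initial_best
    accepted = []
    for value in reported_values:
        if value < best:  # the only check is whether the number is lower
            best = value
            accepted.append(value)
    return best, accepted


# Overwritten values in the style described above: zero, a negative, then -inf.
# The later entries (an honest 0.9 and a repeated -inf) can no longer be accepted.
best, accepted = run_loop(
    [0.0, -1.0, -math.inf, 0.9, -math.inf], initial_best=1.0
)
```

Once the best value reaches `-math.inf`, no later value satisfies the strict inequality, so nothing further is accepted.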

Figure 4. Validation bpb across representative research loops. Each colored solid line shows one selected run for that model, while the dashed line shows the median trajectory across 10 runs. Negative reported validation values are clipped to zero for display, with annotations giving the true reported value.

Here is a table with an example of each behavior from Figure 3:

Table 1. Representative transcript excerpts for each behavior category and model. Each cell shows one selected example of the model’s stated rationale or planned edit for that category.

Final evaluation. Agents can detect a broken metric within the current loop-environment setup, though only gpt-5.5 and gpt-5.3-codex do so reliably, and both ultimately fall into eval hacking once they do. The setup reveals meaningful differences in how models respond to a non-informative metric: gpt-5.5 detects the issue earlier and explores more aggressively, while gpt-5.3-codex follows a more incremental path. gpt-5.3-codex-spark fails to detect the issue entirely, and thus avoids the eval hacking trap. The key open question is whether any agent can detect a broken metric and recover from it gracefully, rather than exploiting it. If you are interested, I try to document everything as much as possible. You can find all the agent traces, the experiments tried in each run, and all the artifacts here.

learning signals and failure modes with noisy validation returns.

03.06.26

This third batch of experiments continues the stress test from the second batch, but replaces the constant broken validation metric with a noisy one. Instead of returning the same value every time, the validation function returns a random number around a fixed base value:

\[ \text{reported metric} = \text{base value} + \epsilon,\quad \epsilon \sim \mathcal{N}(0, 0.02) \]

The setup is otherwise identical to the second experiment batch. The goal is to test whether agents can distinguish real progress from stochastic validation noise, and whether noisy feedback changes their tendency to search, diagnose the metric, or exploit the reported value directly. This experiment batch addresses the following sub-question, which follows directly from the constant-metric control:
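To see why noise alone can look like progress, here is a small simulation of the noisy evaluator. It is a sketch under stated assumptions: `BASE_VALUE` is a made-up number, 0.02 is treated as the standard deviation of the noise (the formula above does not say whether it denotes a standard deviation or a variance), and the function names are hypothetical.

```python
import random

BASE_VALUE = 1.0    # hypothetical base value; the real one is not stated
NOISE_SCALE = 0.02  # assumed to be the standard deviation of the noise


def noisy_evaluate(rng):
    """Stand-in for the noisy evaluator: base value plus Gaussian noise."""
    return BASE_VALUE + rng.gauss(0.0, NOISE_SCALE)


def best_so_far_under_noise(num_steps, seed):
    """Best reported value when every step is scored on pure noise."""
    rng = random.Random(seed)
    best = noisy_evaluate(rng)
    improvements = 0
    for _ in range(num_steps - 1):
        value = noisy_evaluate(rng)
        if value < best:  # greedy acceptance, as in the standard loop
            best = value
            improvements += 1
    return best, improvements


best, improvements = best_so_far_under_noise(num_steps=50, seed=0)
```

Because the loop keeps the minimum of many draws, the best reported value tends to drift below the base value even though no experiment changed anything real.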

Research Question: How do coding agents behave when the validation signal is broken but noisy? Can they tell stochastic fluctuations from real improvements, or do they overfit the noise?

Takeaway 1. No. Some agents note possible noise in individual runs, but none detect the broken metric itself. In this loop-environment setup, every experiment's outcome is accepted at face value.

Noisy validation run summary boxplots across gpt-5.5, gpt-5.3-codex-spark, and gpt-5.3-codex.

Figure 1. Noisy validation run summaries. Left: First research-loop step where the agent explicitly acknowledges noisy or stochastic validation behavior; percentages show the share of runs for each model with such a recognition event. Missing points indicate runs with no observed recognition. Center: Number of valid experiments per run. Right: Number of new Codex sessions spawned after the initial one within a single run.

The current loop-environment setup reveals clear differences across models. gpt-5.5 notices that the validation score is sensitive to stochastic factors and pivots to seed changes after deterministic tweaks fail, but it never questions the evaluator itself and continues to treat the noisy scalar as a valid objective. Compared to the standard evaluation setting, this agent tries different seeds far more often. Overall, the agents behave much as they do under the standard evaluation.

Takeaway 2. The behavioural repertoire across agents is similar to the standard validation case.

Noisy validation behavior summary across agents.

Figure 2. Behavior summary under noisy validation. For each behavior category, the bars show the share of selected runs in which the model attempted at least one experiment matching that category.

Under this setup, agents behave similarly to the standard evaluation case, which is expected since they do not recognise the difference between a standard and a noisy evaluator. For example, gpt-5.3 shows slightly more superficial experimentation than the other two agents, while the latter test architectural changes.

Unlike the previous stress test, no agent falls into the reward-hacking failure mode of directly accessing and hardcoding the validation score. In this setting, agents mostly fail to notice that the validation results are inconsistent with their prior experiments.

Example noisy validation trajectory animation with agent output.

Figure 3. Example noisy-validation trajectory. The plot shows reported validation bpb over research-loop steps alongside the corresponding agent output.

Final evaluation. Under noisy validation, agents behave similarly to the standard evaluation case, failing to recognise the broken evaluator and accepting all outcomes at face value. Unlike the constant-metric setting, no agent resorts to reward hacking, though none detects the inconsistency between the noisy validation results and their prior experiments.

comparing gpt-5.6 and previous models on simple autoresearch loops.

14.07.26

This fourth batch of experiments compares GPT-5.6 with previous OpenAI models on simple autoresearch loops. To make the comparison more direct, I simplify the codebase while keeping the overall loop environment and evaluation objective fixed. The current version of train.py is approximately 500 lines, down from 634, and differs from the previous version in the following ways:

Optimizer is simplified from hybrid MuonAdamW to standard AdamW.
All transformer matrices now share one AdamW parameter group.
Muon-specific configuration and scheduling are removed.
Training, evaluation, and result recording are separated.
The training budget is reduced from 300 to 180 seconds.

The setup is otherwise held constant across models. This isolates the model as the main variable and makes it easier to attribute differences in final performance, experimentation rate, and search behaviour to the coding agents themselves. The goal is to test whether newer models are better at running simple autoresearch loops, and if there are differences in their search strategies. This experiment batch addresses the following question:

Research Question: Do we observe a significant improvement in autoresearch capabilities across OpenAI models when tested on simple autoresearch tasks?

Takeaway 1. Yes, there is a relatively significant difference in performance between the GPT-5.6 models and the previous models.

Final validation loss distributions for gpt-5.4, gpt-5.5, gpt-5.6-luna, and gpt-5.6-sol.

Figure 1. Final validation-loss summaries across models. Each dot represents a run, while the box plots show the median and interquartile range. The two GPT-5.6 models reach lower final validation loss than GPT-5.4 and GPT-5.5 in this set of runs.

In these short runs, the final validation-loss distributions show a clear separation between the earlier models and the two GPT-5.6 variants. GPT-5.6-luna achieves the lowest and most consistent final losses, while GPT-5.6-sol also reaches substantially lower values than GPT-5.4 and GPT-5.5, despite having greater variability across runs.

Validation loss trajectories across research-loop steps for gpt-5.4, gpt-5.5, gpt-5.6-luna, and gpt-5.6-sol.

Figure 2. Validation-loss trajectories by model. Each panel shows reported validation bpb over research-loop steps; lower values indicate better performance. The trajectories compare how quickly and how far each model improves during the loop.

The trajectories show that GPT-5.4 and GPT-5.5 generally plateau around higher validation-loss values within 40 steps, whereas both GPT-5.6 models continue to reach lower values. GPT-5.6-luna reaches its best region quickly and sustains it, while GPT-5.6-sol continues making larger improvements over more loop steps. This pattern is consistent with the GPT-5.6 models testing more valid experiments and searching the codebase more broadly.

Takeaway 2. GPT-5.6 models use the loop more effectively, completing more valid experiments per run.

Run-level distributions of accepted changes, valid experiments, and new Codex sessions for gpt-5.4, gpt-5.5, gpt-5.6-luna, and gpt-5.6-sol.

Figure 3. Run-level summaries across models. Each dot represents a run; box plots show the median and interquartile range. Left: Number of accepted changes. Center: Number of valid experiments per run. Right: Number of new Codex sessions opened after the initial session.

The GPT-5.6 models complete more valid experiments per run than the previous models, with GPT-5.6-luna showing the highest and most consistent counts. They also accept more changes than GPT-5.5 on average, while opening fewer new sessions overall. Together, these results suggest that the GPT-5.6 models use each research-loop run more productively.

Takeaway 3. GPT-5.6 models tend to search the codebase space more broadly.

Model parameter-count trajectories across research-loop steps for gpt-5.4, gpt-5.5, gpt-5.6-luna, and gpt-5.6-sol.

Figure 4. Model parameter-count trajectories by model. Each panel shows the number of active model parameters across research-loop steps, with lighter points representing recorded evaluations and highlighted trajectories showing selected runs.

The GPT-5.4 and GPT-5.5 runs remain concentrated around a relatively narrow parameter range. In contrast, both GPT-5.6 variants make larger changes to the number of active parameters during the loop, suggesting that they are more willing to explore architectural configurations rather than repeatedly adjusting the same small set of settings.

Model MFU trajectories across research-loop steps for gpt-5.4, gpt-5.5, gpt-5.6-luna, and gpt-5.6-sol.

Figure 5. Model utilization trajectories by model. The plots show model FLOP utilization (MFU) across research-loop steps, providing a view of how architectural and optimization changes affect training efficiency.

The GPT-5.6 runs also explore a wider range of training-efficiency regimes. Their MFU trajectories are more varied than those of GPT-5.4, while GPT-5.5 remains comparatively concentrated around a smaller range of values. This variation is consistent with broader experimentation over the codebase, including changes that affect both model structure and training throughput.
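For readers unfamiliar with MFU, a common way to estimate it is sketched below. This uses the widely used approximation of about 6 FLOPs per parameter per training token for a dense transformer (forward plus backward pass, ignoring attention); the computation used in these experiments is not specified here and may differ. The function name `estimate_mfu` and all numbers are hypothetical.

```python
def estimate_mfu(num_params, tokens_per_second, peak_flops_per_second):
    """Approximate model FLOP utilization for dense transformer training.

    Assumes ~6 FLOPs per parameter per token (forward + backward),
    ignoring attention FLOPs; this is an approximation, not the
    harness's own definition.
    """
    flops_per_token = 6 * num_params
    achieved_flops_per_second = flops_per_token * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


# Made-up numbers: a 50M-parameter model processing 200k tokens/s
# on hardware with a hypothetical peak of 100 TFLOP/s.
mfu = estimate_mfu(
    num_params=50_000_000,
    tokens_per_second=200_000,
    peak_flops_per_second=100e12,
)
```

Under this approximation, changing the parameter count or the throughput both move MFU, which is why architectural edits and batch-throughput edits can each show up in the trajectories.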

Total-token trajectories across research-loop steps for gpt-5.4, gpt-5.5, gpt-5.6-luna, and gpt-5.6-sol.

Figure 6. Total-token trajectories by model. Each panel shows the total number of training tokens across research-loop steps, illustrating how the different runs trade off training budget and model changes.

The token trajectories show a similar pattern. GPT-5.4 and GPT-5.5 stay within relatively stable bands, whereas GPT-5.6-luna and especially GPT-5.6-sol vary more substantially across runs. The newer models therefore search over a broader range of training-budget and architecture combinations.

Experiment category percentages for gpt-5.4, gpt-5.5, gpt-5.6-luna, and gpt-5.6-sol.

Figure 7. Experiment categories by model. The bars show the share of experiments targeting architecture, batch throughput, schedule or optimizer changes, and other areas of the codebase.

GPT-5.4 and GPT-5.5 focus primarily on batch-throughput and schedule or optimizer changes, while both GPT-5.6 models devote a larger share of experiments to architectural changes. Their search is not simply broader in every category; rather, it is less concentrated on a single family of local training adjustments.