We need a formal theory of Agent Evals
Historically, we've distinguished between two primary computational artifacts:
Programs: Symbolic instruction followers. Focus area of Theoretical CS.
ML Models: Statistical pattern-matchers and function approximators, increasingly incorporating computational structures (e.g., Neural Turing Machines). Studied in Probability and Statistics.
But there is another kind of unit that has never been mathematically modeled and studied, because it has never been feasible to simulate one: The Programmer! Until recently, only humans could create programs or ML models from requirements.
I think we should start thinking of Agents as the base unit of neurosymbolic computation. While they might not exceed human programmers just yet, Agents do generalize and unify programs, ML models, and even programmers - by demonstrating iterative refinement, problem decomposition, and novel solution generation - into a single construct.
Think about this: the Agent is the first neurosymbolic computation unit that can generate more capable units than itself! (Compilers merely translate programs; they do not generate them from requirements.)
The reason I want us to think in this framing is that our theories and techniques for verifying the performance characteristics of programs and ML models will fall short when it comes to agents.
For symbolic programs, Lambda Calculus provided the foundational model. To manage its initial semantic permissiveness, Typed Lambda Calculus introduced contracts, offering partial, static guarantees about program behavior (e.g., string -> string). This evolved significantly:
Dependent Types (e.g., in Lean) allowed types to depend on values ((n: int) -> Vector<int, n>), enabling much stronger compile-time proofs.
Propositions-as-Types (Curry-Howard correspondence) equated proofs with programs, turning proof-checking into type-checking.
Quantitative and Linear Types (e.g., in Idris 2) have begun to model resource usage: the typechecker can now ensure that a value is used exactly as declared - zero times, exactly once, or without restriction.
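To make the dependent-types and propositions-as-types ideas above concrete, here is a minimal Lean 4 sketch. The names (Vec, zeros, addComm) are illustrative, not taken from any particular library:

```lean
-- Minimal Lean 4 sketch of the ideas above (illustrative names).

-- A length-indexed vector: the length n is part of the type itself.
abbrev Vec (α : Type) (n : Nat) : Type := { xs : List α // xs.length = n }

-- A dependent function type, echoing (n: int) -> Vector<int, n>:
-- the return type depends on the argument's value.
def zeros (n : Nat) : Vec Int n := ⟨List.replicate n 0, by simp⟩

-- Propositions-as-types: the statement a + b = b + a is a type,
-- and this term is a program that proves it; type-checking is proof-checking.
theorem addComm (a b : Nat) : a + b = b + a := Nat.add_comm a b
```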
But type theory has many limitations:
No type system can verify all program properties statically.
More importantly, much of a function's actual behavior (e.g., sort_list sorting ascending vs. descending) is conveyed by its name or documentation, not its type alone. If sort_list returns the input list as-is, or simply reverses it, the type-checker will happily pass it!
To bridge this gap, Tests (unit, integration, etc.) became the pragmatic standard for dynamic verification. While tests cannot prove the absence of bugs (Dijkstra), they cover boundary conditions and offer stronger practical guarantees than types alone.
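Here is a tiny Python sketch of that gap (the function and test names are hypothetical): the type signature is satisfied, but only a behavioral test catches the wrong implementation.

```python
from typing import List

def sort_list(xs: List[int]) -> List[int]:
    # Type-checks as List[int] -> List[int], yet does not sort:
    # the real contract lives in the name and documentation.
    return list(reversed(xs))

def test_sort_list_ascending() -> None:
    # A dynamic check the type system alone cannot express.
    assert sort_list([3, 1, 2]) == [1, 2, 3]

if __name__ == "__main__":
    try:
        test_sort_list_ascending()
        print("PASS")
    except AssertionError:
        print("FAIL: sort_list is type-correct but behaviorally wrong")
```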
But functional test scores (often pass/fail, or aggregated percentages) are generally not structured to directly drive automated, gradient-like optimization of a program's design or underlying code. And testing for other criteria - CPU/memory usage, resource consumption such as disk and network, and dependencies on other computational units (libraries, devices, infra components, etc.) - often remains a separate activity.
ML models approximate some unknown function based on a partially seen data distribution. The goal is for them to perform well on unseen data from the same distribution.
Because their behavior is learned and not explicitly programmed rule-by-rule, their performance isn't fully predictable in the same way as a traditional symbolic program. So, we use Benchmarks. We take a sample of unseen data, run the model on it, and calculate various metrics: accuracy, precision, recall, F1-score for classification; perplexity, BLEU score for language models; mean squared error for regression, etc. These scores give us a statistical measure of how well the model is likely to perform in the real world.
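For illustration, here is a small, self-contained Python sketch of benchmark-style scoring on held-out data (the labels and predictions are made up):

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical held-out labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print("accuracy:", accuracy(y_true, y_pred))
print("precision/recall/f1:", precision_recall_f1(y_true, y_pred))
```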
With code-writing agents, we finally have a computational unit that is itself capable of creating more such units. Programs could not create arbitrary programs. ML models could not create ML models. But Agents can create Agents. This is groundbreaking!
If Agents are becoming such powerful, even self-modifying entities, the separate techniques we use for evaluating programs (types, tests) and ML models (metric-based evals) start to feel insufficient for the combined system.
It's become critically important to systematically investigate and invent a formal theory of evals specifically for these Agents. We need a principled way to understand, predict, and guarantee their behavior - especially across quality (accuracy as well as other probabilistic scores) and cost (constraints, resource usage, time/space complexity, etc.).
Here's my central hypothesis: Types ⊂ Tests ⊂ Evals.
Think about it:
Types are like static evals. A type-checker "evaluates" your code against type rules and gives a binary score: pass or fail. If it passes, you have certain guarantees.
Tests are dynamic evals. You run the code (or a component) and evaluate its output against expected values. Test suites often produce a score (e.g., 95% tests passing).
ML Evals (accuracy, BLEU, etc.) are already score-based evaluations of model behavior on data.
A formal theory of Agent Evals could provide a unified framework. Whether it's checking if a generated piece of code adheres to a type contract, if a program passes a specific behavioral test, or if an LLM's natural language output meets a quality threshold, these can all be seen as different facets of "evaluating" the Agent's output or behavior.
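As a sketch of what that unification might look like in practice, here is a minimal Python interface (all names hypothetical) in which a type check, a test suite, and an ML metric are all just score-producing evals:

```python
from dataclasses import dataclass
from typing import Any, Callable, Protocol

@dataclass
class EvalResult:
    name: str
    score: float           # 1.0/0.0 for pass/fail evals, any [0, 1] value otherwise
    details: dict[str, Any]

class Eval(Protocol):
    def run(self, artifact: Any) -> EvalResult: ...

class TypeCheckEval:
    """Static eval: binary score from a type-checker-like predicate."""
    def __init__(self, predicate: Callable[[Any], bool]):
        self.predicate = predicate
    def run(self, artifact: Any) -> EvalResult:
        ok = self.predicate(artifact)
        return EvalResult("typecheck", 1.0 if ok else 0.0, {"passed": ok})

class TestSuiteEval:
    """Dynamic eval: fraction of test cases that pass."""
    def __init__(self, cases: list[Callable[[Any], bool]]):
        self.cases = cases
    def run(self, artifact: Any) -> EvalResult:
        passed = sum(1 for case in self.cases if case(artifact))
        return EvalResult("tests", passed / len(self.cases), {"passed": passed})

class MetricEval:
    """ML-style eval: a scalar metric computed over a dataset."""
    def __init__(self, dataset: list[tuple[Any, Any]], metric: Callable[[list, list], float]):
        self.dataset, self.metric = dataset, metric
    def run(self, model: Callable[[Any], Any]) -> EvalResult:
        preds = [model(x) for x, _ in self.dataset]
        golds = [y for _, y in self.dataset]
        return EvalResult("metric", self.metric(golds, preds), {"n": len(golds)})
```

Under this view, a type check is just a degenerate eval with scores in {0, 1}, while a test suite's pass rate or an F1 score are the continuous cases - which is exactly the Types ⊂ Tests ⊂ Evals hypothesis.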
This unification could go even further. Traditional software development also cares about resource usage:
FLOPS (Floating Point Operations Per Second)
Memory consumption
Disk I/O
Network bandwidth
Dependencies
These are all, in essence, metrics we evaluate. When we consider a program or ML model for deployment, we consider aspects like cost, latency and dependencies. A comprehensive theory of Agent Evals could provide a structured way to define, measure, and optimize for all these behavioral and resource metrics, whether the component being evaluated is purely symbolic, purely neural, or a complex hybrid generated by an Agent.
This isn't just about "pass/fail" anymore. It's about a rich, multi-dimensional scoring of an Agent's performance across various criteria. In fact, this kind of multi-dimensional scoring is precisely what makes AlphaEvolve and the Darwin-Gödel Machine work.
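As a toy illustration of such multi-dimensional scoring (the thresholds and weights below are entirely made up), one might combine quality and resource measurements into a single eval score with hard constraints and soft trade-offs:

```python
from dataclasses import dataclass

@dataclass
class Measurement:
    accuracy: float        # quality score in [0, 1]
    latency_ms: float      # wall-clock latency
    peak_memory_mb: float  # peak resident memory
    cost_usd: float        # e.g., cloud/API cost per request

def multi_dim_score(m: Measurement,
                    max_latency_ms: float = 200.0,
                    max_memory_mb: float = 512.0) -> float:
    # Hard constraints: violating any of them zeroes the score outright.
    if m.latency_ms > max_latency_ms or m.peak_memory_mb > max_memory_mb:
        return 0.0
    # Soft trade-off: reward quality, penalize latency and cost (illustrative weights).
    return (0.7 * m.accuracy
            - 0.2 * (m.latency_ms / max_latency_ms)
            - 0.1 * min(m.cost_usd / 0.01, 1.0))

print(multi_dim_score(Measurement(accuracy=0.92, latency_ms=120,
                                  peak_memory_mb=300, cost_usd=0.004)))
```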
If we had such a theory, what could we do?
Automated Evolutionary Search & Synthesis: Imagine a vast database of programs, ML models, and pre-existing Agents, all annotated with their comprehensive eval scores across numerous dimensions. With a formal way to define desired outcomes (as target eval scores), we could potentially unleash fully automated evolutionary search or synthesis processes. These processes would intelligently combine, modify, and generate new Agents (and thus programs and models) to find optimal solutions for complex tasks, balancing various performance criteria.
Beyond Time and Space Complexity: Traditional computer science gives us Big O notation for time and space complexity. This is invaluable. But a theory of Agent Evals could enable a much richer, automated analysis. We could analyze for "Quality Complexity" (how does the quality of output degrade with input complexity?), "Cost Complexity" (how do cloud costs scale?), "Latency Profiles," and adherence to arbitrary "Constraint Manifolds" (e.g., this Agent must never use more than X memory AND respond within Y milliseconds AND achieve Z accuracy).
Dual-Mode Execution: Symbolic & Differentiable: This is a particularly exciting frontier. There's a great paper on differentiable programming that explores how programs can be made differentiable. If our Agents can generate code, and if our Eval theory is rich enough, we could potentially have the same agent-generated code be executed in two modes (a toy sketch follows below):
Traditional Symbolic Mode: For fast, precise execution when that's what's needed.
Differentiable Mode: Where control flow elements (ifs, loops) and data structures are interpreted as their differentiable counterparts. This would allow the Agent's program-generation capabilities to be optimized via gradient descent, just like we train neural networks.
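To make the two modes concrete, here is a toy Python sketch (not tied to any specific differentiable-programming framework): the same branching logic runs once as a hard, symbolic if and once as a sigmoid-smoothed relaxation whose output is differentiable with respect to the threshold.

```python
import math

def step_symbolic(x: float, threshold: float) -> float:
    # Traditional symbolic mode: exact, discrete control flow.
    return 1.0 if x > threshold else 0.0

def step_differentiable(x: float, threshold: float, temperature: float = 0.1) -> float:
    # Differentiable counterpart: the branch becomes a smooth blend,
    # so d(output)/d(threshold) exists and can drive gradient-based tuning.
    return 1.0 / (1.0 + math.exp(-(x - threshold) / temperature))

for x in (0.2, 0.5, 0.8):
    print(x, step_symbolic(x, 0.5), round(step_differentiable(x, 0.5), 3))
```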
But as systems like AlphaEvolve and the Darwin-Gödel Machine demonstrate, coding agents are a superset of both programs and ML models. And with the ability to generate new programs and new ML models, agents can also generate new agents!
The capabilities of Agents are exploding. They're starting to write their own software, compose complex systems, and tackle problems in ways that blur the lines between traditional programming and machine learning.
To harness this power responsibly and effectively, we can't just keep bolting on our old evaluation methods. We need to step back, think deeply, and develop a formal theory of Agent Evals. It's a challenging endeavor, no doubt, but the potential payoffs - more robust, reliable, optimized, and even self-improving AI systems - are immense.
What do you think?