theory building evaluates generative AI

reflecting on Peter Naur's Programming as Theory Building

In 1985, Peter Naur wrote a fascinating paper called Programming as Theory Building. The paper’s central thesis is that the core value of programming is not the actual program or its associated documentation, but rather the theory the programmers constructed while writing that program to solve the problem. There are a few other weird corollaries, such as saying that a program is only good as long as the original writer is working on it, or that the scientific method (yes, with the article) is bogus. OK fine.

But what about theory building in the new age of generative AI?

First, a note. A theory is a set of axioms capable of generating new findings with justification that is based on the axioms. So theory = axioms + generation procedure.

Here’s my claim: Naur’s paper provides an excellent outline of how to write a rubric to grade generative AI. Naur says the value of programming is in the intellectual activity that leads to the program, with the summary of that activity being the theory about how to solve the problem. Then the program is just an imperfect symbolization of the theory. To test the theory’s ability to solve the problem, the theory must satisfy five general criteria:

  • accuracy: does the theory provide a solution satisfying the requirements?
  • applicability: does the theory work on the chosen problem and problems in its close proximity?
  • flexibility: is the theory able to cover a wider range of problems?
  • adaptability: can the theory be mutated to handle newly discovered or otherwise unsupported problems?
  • explainability: can the theory explain the solution to any supported problem from that theory’s principles?
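
The five criteria can even be written down as a literal rubric. A minimal sketch (the class name, score scale, and unweighted mean are all my hypothetical choices, not anything from Naur’s paper):

```python
from dataclasses import dataclass

# Hypothetical rubric: one score in [0, 1] per criterion from Naur's list.
@dataclass
class TheoryRubric:
    accuracy: float       # does the solution satisfy the requirements?
    applicability: float  # does it work on the chosen problem and nearby ones?
    flexibility: float    # does it cover a wider range of problems?
    adaptability: float   # can it be mutated for newly discovered problems?
    explainability: float # can it explain its solutions from its principles?

    def overall(self) -> float:
        # Unweighted mean for illustration; a real rubric would likely
        # weight the criteria differently per use case.
        scores = (self.accuracy, self.applicability, self.flexibility,
                  self.adaptability, self.explainability)
        return sum(scores) / len(scores)

rubric = TheoryRubric(accuracy=0.9, applicability=0.8, flexibility=0.6,
                      adaptability=0.7, explainability=0.5)
print(round(rubric.overall(), 2))  # mean of the five scores: 0.7
```

The point isn’t the arithmetic; it’s that each criterion becomes a named slot you are forced to fill with evidence.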

I would say that is a great way to break down how to evaluate a generative AI solution. Evaluation has always been hard, but now we can at least target each of these five more specific aspects of generative AI outputs. For example, we can use variants of fuzz testing to check flexibility. We can change the input data sources to test adaptability. We can use golden datasets to measure accuracy. See how, once we have these five, we go from “evaluate this” to “come up with questions targeting each of these five”?
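
The golden-dataset idea for accuracy can be sketched in a few lines. Everything here is hypothetical scaffolding (the `model` callable, the stand-in dataset); the shape is what matters: pair prompts with expected answers and report the fraction the model gets right.

```python
# Hypothetical golden-dataset accuracy check: run the model on each
# prompt and report the fraction of outputs matching the expected answer.
def golden_accuracy(model, golden):
    hits = sum(1 for prompt, expected in golden if model(prompt) == expected)
    return hits / len(golden)

# Stand-in "model" for illustration only (uppercases its input).
fake_model = lambda prompt: prompt.upper()

golden_set = [("abc", "ABC"), ("ok", "OK"), ("no", "nope")]
print(golden_accuracy(fake_model, golden_set))  # 2 of 3 match
```

Flexibility and adaptability checks would reuse the same loop with fuzzed prompts or swapped data sources feeding `golden_set`.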

What I like about this philosophical breakdown of evaluation is that it puts more direct words on fuzzy questions like “is AI smart?”. The engineer in me can now say that I am building an agent that works in an environment, and that in these five categories my agent is this intelligent. If my agent is great at all five, then I don’t just have a generative AI solution. I have an intelligent solution.
