Nothing fundamentally prevents an LLM from achieving this. You can ask an LLM to produce a PR, another LLM to review the PR, another to critique the review, then another to question the original issue's validity, and so on...
The reason LLMs are such a big deal is that they are humanity's first tool general enough to support recursion (besides humans, of course). If you can use an LLM, there's like a 99% chance you can program another LLM to use an LLM the same way you do:
People learn the hard way how to properly prompt an LLM agent product X to get results -> some company encodes those learnings in a system prompt -> we get a new agent product Y that can use X just like a human -> we no longer use X directly. Instead, we move up one level in the command chain and use product Y. This recursion goes on and on, until the world has no level left for us to go up to.
We have basically been watching this play out in real time with coding agents over the past few months.
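In code, the recursion looks roughly like the sketch below. Everything in it is a hedged stand-in: call_llm is a placeholder for any chat-completion API, and products X and Y just mirror the ones above, not real libraries.

```python
# Minimal sketch of the "move up one level" recursion.
# call_llm, product_x, and product_y are all invented stand-ins.

def call_llm(system: str, prompt: str) -> str:
    # Stand-in: a real implementation would call a model API here.
    return f"[model output for: {prompt[:40]}]"

def product_x(task: str) -> str:
    """The base agent a human used to prompt directly."""
    return call_llm(system="You are a coding agent.", prompt=task)

# Hard-won human prompting know-how, frozen into a system prompt.
ENCODED_LEARNINGS = (
    "When delegating to X: break the goal into small steps and "
    "demand a verifiable result for each one."
)

def product_y(goal: str) -> str:
    """One level up the command chain: Y drives X like a practiced human."""
    plan = call_llm(system=ENCODED_LEARNINGS, prompt=f"Split into steps: {goal}")
    return "\n".join(product_x(step) for step in plan.splitlines())
```

Once product_y exists, nothing stops a product_z from driving it the same way; that is the whole point of the recursion.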
I assume you ignored "teleology" because you concede the point; otherwise, feel free to take it up.
" Is there an “inventiveness test” that humans can pass but LLMs don’t?"
Of course: any topic where no training data is available and that cannot be extrapolated by simply remixing existing data. Admittedly, that is harder to test on current unknowns and unknown unknowns.
But it is trivial to test on retrospective knowledge. Just train the AI on text up to, say, 1800 and see if it can come up with antibiotics and general relativity, or if it will simply repeat outdated notions of disease theory and Newtonian gravity.
I don't think it would settle things even if we did manage to train a sufficiently large 1800-cutoff LLM.
LLMs are blank slates (like an uncultured, primitive human being; an LLM does come with built-in knowledge, but that knowledge is mostly irrelevant here). LLM output is purely a function of the input (context), so an agentic system's capabilities do not equal the underlying LLM's capabilities.
If you ask such an LLM to "overturn Newtonian physics and come up with a better theory", of course it won't hand you relativity just like that. An uneducated human has no chance of coming up with relativity either.
However, ask it this:
```
You are Einstein ...
<omitted: 10 million tokens establishing Einstein's early life and learnings>
... Recent experiments have put these ideas in doubt, ...<another bunch of tokens explaining the Michelson–Morley experiment>... Any idea why this occurs?
```
and provide it with tools to find books, speak with others, run experiments, etc. Conceivably, the result will be different.
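As an agent setup, that might look like the sketch below. The run_agent helper, the tool names, and the model name are all hypothetical, invented purely for illustration.

```python
# Sketch of the 1800-cutoff thought experiment as an agent setup.
# run_agent, TOOLS, and the model name are invented stand-ins.

def run_agent(model: str, context: str, tools: dict[str, str]) -> str:
    # Stand-in for an agent runtime that lets the model call these
    # tools in a loop until it settles on an answer.
    return f"[{model} reasoning with {len(tools)} tools...]"

TOOLS = {
    "find_books": "search the pre-1800 literature on a topic",
    "correspond": "write to a contemporary scholar and await a reply",
    "run_experiment": "describe an experiment and receive measured results",
}

CONTEXT = (
    "You are Einstein ... <tokens establishing early life and learnings> "
    "Recent experiments have put these ideas in doubt "
    "<tokens explaining the Michelson-Morley experiment> "
    "Any idea why this occurs?"
)

answer = run_agent(model="llm-pre-1800-corpus", context=CONTEXT, tools=TOOLS)
```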
Again, we pretty much see this play out in coding agents:
Claude the LLM has no prior knowledge of my codebase, so of course it has zero chance of solving a bug in it. Claude 4 is a blank slate.
Claude Code, the agentic system, can (a rough sketch of this loop follows the list):
- look at a screenshot.
- know what the overarching goal is from past interactions & various documentation it has generated about the codebase, as well as higher-level docs describing the company and products.
- realize the screenshot is showing a problem with the program.
- form hypotheses about why the bug occurs.
- verify hypotheses by observing the world ("the world" to Claude Code is the codebase it lives in, so by "observing" I mean it reads the code).
- run experiments: modify code then run a type check or unit test (although usually the final observation is outsourced to me, so I am the AI's tool as much as the other way around.)
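Strung together, that loop is roughly the sketch below. This is not Claude Code's real implementation; llm() and the tool functions are invented stand-ins.

```python
# Rough sketch of the observe / hypothesize / verify loop above.
# llm(), observe(), and experiment() are stand-ins, not real APIs.

import subprocess
from pathlib import Path

def llm(prompt: str) -> str:
    # Stand-in for a model call; a real agent would hit an API here.
    return "hypothesis: stale cache invalidation in render()"

def observe(paths: list[str]) -> str:
    # "Observing the world" for a coding agent means reading the code.
    return "\n".join(Path(p).read_text() for p in paths)

def experiment() -> bool:
    # Run a check after modifying code, e.g. a type check or unit test.
    return subprocess.run(["python", "-m", "pytest", "-q"]).returncode == 0

def debug(bug_description: str, suspect_files: list[str]) -> str:
    code = observe(suspect_files)
    hypothesis = llm(f"Bug: {bug_description}\nCode:\n{code}\nWhy does this occur?")
    patch = llm(f"Propose a minimal code change for: {hypothesis}")
    # The final observation is often outsourced to the human, as noted above.
    return patch if experiment() else "escalate to the human"
```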