Closing the Loop
Why AI excels at math and code, struggles in open-ended work, and where it will create value next
Here’s the question people actually care about right now.
Why is AI so good at some things and still so shaky at others?
Why can it do impressive math, write a lot of usable code, and then turn around and invent a source, miss the point of a memo, or produce writing that sounds smooth but empty?
The answer is simpler than it looks: some work comes with a clear scoreboard, and some work doesn’t.
That turns out to be a very useful way to think about AI. It tells you where capability will compound fastest. It helps explain why models feel almost superhuman in one corner and oddly unreliable in another. And if you run a business, it gives you a practical filter for where AI is likely to create real leverage next.
I posted an early version of this idea in late 2023. At the time, I said math would pull away first. That call ended up directionally right. What I had not explained clearly enough was why.
Part of what I was seeing, even then, was that the biggest jumps were happening first in domains where the system could be checked cleanly and corrected quickly. Where the judgment was slower, fuzzier, or more subjective, shakier behavior kept showing up. I had the pattern. I did not yet have the mechanism.
To understand that mechanism, it helps to start one level lower.
Not with AI.
With learning itself.
The loop
We all learn the same way: We try. We check. We adjust and repeat.
How this happens in practice can look quite different, though:
A musician hits a wrong note and hears it immediately.
A growth team changes a landing page and may know by this afternoon whether conversions moved.
A company that changes its brand story may not know for a quarter whether anything really landed.
Same basic loop. Very different speeds of learning.
That is the loop in the title.
Applied to AI, everything starts making more sense.
AI improves fastest when that loop is tight.
From the outside, that can feel almost exponential.
Three things a tight loop needs
If you want the more technical labels, I think of these as ground-truth clarity, feedback latency, and evaluator stability.
The plain-English version is:
First, you need a clear target.
Did the payment reconcile? Did this version beat the baseline? Those are clear targets.
“Write something memorable” is not.
Second, you need a fast verdict.
If you can tell today, you can improve tomorrow.
If you only find out six months later, the loop crawls.
Third, you need a stable judge.
A tax rule does not wake up in a different mood.
A boardroom does.
A tagline that feels sharp this quarter can feel tired next quarter, or land differently with a different audience, or suddenly sound off because the culture moved underneath it.
Put those three together and you get the thing that makes a tight loop:
the strength of the scoreboard.
Some work has a clean scoreboard.
Some work has a messy one.
And AI learns very differently depending on which kind of work it is doing.
But a strong scoreboard is not the same as a good one.
This is Goodhart’s Law in action: once a measure becomes a target, it ceases to be a good measure.
Code can pass the tests and still be bad software. A support bot can lower handle time by ending the conversation too early.
Tight loops create learning. Good scoreboards create value.
A system will often optimize the thing you asked for, not the thing you meant. Tighten the loop around the wrong metric and the system can look better while the business gets worse, or the business can improve while the metric insists the system got worse. Either way, the scoreboard and the value have come apart.
Social media is the most instructive example at scale. Platforms tightened the loop around engagement — likes, shares, time-on-page — and the system learned exactly what it was asked to learn. Content that provoked, outraged, or hooked performed better than content that informed or satisfied.
The scoreboard was tight. The learning was fast. The output was reliable. And the thing being optimized was not what anyone would have said they wanted.
The same dynamic shows up at smaller scale in almost every team that starts measuring without thinking carefully about what the measure is actually rewarding.
A support team that tracks handle time gets faster resolutions and more closed tickets — and may not notice for months that customer satisfaction quietly deteriorated. A content team that measures time-on-page watches its writers drift toward outrage and curiosity gaps, because that is what moves the number.
The loop tightens. The system learns. The scoreboard wins. The business loses.
Tight loops create learning. Good scoreboards create value. You need both.
Why math moved first
Math comes with a strong scoreboard.
The target is clear: a proof is valid or it is not.
The verdict can be mechanical: given a formal system, a checker verifies each step without waiting for anyone's opinion.
The judge is stable: mathematical truth does not shift with the news cycle, the audience, or the quarter.
That setup is powerful. Not because math is simple — it is not — but because the feedback loop is structured enough that improvement compounds. Each attempt produces a real signal. That signal goes back into the system. The system adjusts. Repeat at scale.
This suggests one thing:
Intelligence is the ability to navigate and leverage structure.
This already works remarkably well, but it does not mean that math is “solved.” Research-grade problems are still hard. OpenAI’s First Proof work makes the limits clear: correctness at the frontier is difficult to establish even with expert review, and one proof the team initially believed was likely correct was later judged otherwise.
So yes, the loop is tighter in parts of math. No, it is not closed everywhere.
Why code followed
Code is the bridge case.
It is messier than math. But in the right settings, it still has a very strong scoreboard.
When the task is narrow, the questions are simple.
Did the code run?
Did the tests pass?
Did the page get faster?
Did the cloud bill go down?
Those are useful verdicts.
That is why code has moved so quickly.
A small open-source research setup published by AI researcher and engineer Andrej Karpathy makes this logic obvious:
An AI agent changes the code, trains for five minutes, checks whether the result got better, keeps the change if it did, throws it away if it did not, and repeats overnight.
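Stripped of detail, that pattern is a few lines of hill-climbing. The sketch below is not Karpathy’s actual code; everything in it, including the hypothetical train_and_score function standing in for the five-minute training run, is an illustrative stand-in:

```python
# A minimal sketch of the overnight loop: change something, check the
# scoreboard, keep the change only if it helped. Illustrative only.
import random

def propose_change(config):
    """Try a small variation, e.g. nudge one hyperparameter."""
    candidate = dict(config)
    candidate["learning_rate"] *= random.choice([0.5, 0.8, 1.25, 2.0])
    return candidate

def improve_overnight(config, train_and_score, hours=8, minutes_per_run=5):
    """train_and_score is assumed to train briefly and return a metric."""
    best_score = train_and_score(config)            # baseline attempt
    for _ in range(hours * 60 // minutes_per_run):
        candidate = propose_change(config)          # try
        score = train_and_score(candidate)          # check
        if score > best_score:                      # keep it if it got better
            config, best_score = candidate, score
    return config, best_score                       # everything else is thrown away
```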
This is why bug fixes, test writing, migration work, refactors, and tightly scoped tickets moved first and continue to move fastest.
The task is bounded.
The scoreboard is real.
Go wider and the picture changes.
Software is not just code. It is architecture decisions, product trade-offs, unclear requirements, stakeholder tension, ugly edge cases, and things nobody thought to test.
That is where the scoreboard gets weaker.
And that is why AI can already do a surprising amount of software work and still need constant steering on the harder calls.
The bottleneck starts moving.
Less of the value sits in typing code.
More of it sits in defining the task, building the tests, and deciding what to do when the tests are incomplete or pointed at the wrong thing.
Why writing stays messy
Not all writing is created equal.
Some types have decent scoreboards. Legal and compliance writing does. Factual summaries do. Editing for grammar, structure, and clarity does. In these cases, the target is defined well enough, the verdict arrives quickly enough, and two careful reviewers would mostly agree. AI is already taking real work out of these categories.
But here is the more interesting observation: the scoreboard does not just determine how fast AI improves. It shapes the output you get.
When a system is trained on a loop that rewards smooth grammar and generic clarity, it gets very good at smooth grammar and generic clarity. That is what the judge is measuring, so that is what gets reinforced.
This is why so much AI writing sounds oddly similar. Fluency is measurable. Voice, taste, and originality are much harder to score consistently.
Literary prose, persuasive argument, brand voice, and cultural timing live in a different world. Different readers disagree about what is good. Culture moves. The same line can feel sharp in one moment and stale in the next. The success of the work may only show up much later, in signals that are hard to trace back to one paragraph or one phrase.
That same measurement gap that flattens voice also does something more dangerous: it distorts what the model treats as correct. Hallucinations are often misunderstood as a memory problem or a knowledge problem. They are closer to a feedback problem.
When the domain has a strong loop, errors get caught and corrected. The model learns that confidence without correctness is penalized.
When the loop is weak, the model has less signal distinguishing a right answer from a fluent-sounding wrong one, so it optimizes for what it can measure, which is fluency.
A model trained to be helpful learns that “I don’t know” is a bad answer. Pair that with a weak correction signal and you get confident-sounding fluency in place of honest uncertainty. The reward structure is a scoreboard choice, and it has consequences.
This is why people get burned in exactly the places they least expect it.
The output sounds confident. The citation looks plausible. The summary feels complete.
In a domain with a strong scoreboard, more of that gets caught before it reaches you. In a weakly checked workflow, it slips through looking polished.
When writing with AI, ask: where am I on the scoreboard spectrum?
What this means if you run a business
Before you ask whether AI can do a job, ask whether the work has a clear scoreboard.
A simple diagnostic: take your ten most important workflows and score each one on three questions.
Is the target clear enough that two people would agree whether the output was right?
Would you know within a day or a week — not a quarter — whether it worked?
Would the same output get the same verdict from the same reviewer six months from now?
That simple exercise will tell you a lot.
The higher the score, the more aggressive you can be with automation. The lower the score, the more you want AI in support mode and a human making the final call.
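If it helps to make the diagnostic concrete, here is one way it could look as a few lines of code. The workflows, answers, and thresholds are placeholders for illustration, not a prescription:

```python
# Score each workflow on the three questions and map the total to a posture.
# Workflow names and example answers are hypothetical.
QUESTIONS = [
    "Clear target: would two people agree whether the output was right?",
    "Fast verdict: would you know within a day or a week?",
    "Stable judge: same output, same verdict, six months from now?",
]

def automation_posture(answers):
    """answers is a list of three booleans, one per question."""
    score = sum(answers)  # one point per yes
    if score == 3:
        return "automate aggressively, audit by sampling"
    if score == 2:
        return "AI drafts, a human reviews the output"
    return "AI in support mode, a human makes the final call"

workflows = {
    "invoice reconciliation": [True, True, True],
    "landing-page tests":     [True, True, False],
    "brand positioning":      [False, False, False],
}

for name, answers in workflows.items():
    print(f"{name}: {automation_posture(answers)}")
```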
Here are three broad categories that matter across industries.
Rule-heavy operations
Think claims handling, underwriting, invoicing, procurement checks, support triage, contract review, compliance review.
These workflows look messy from the outside, but many steps inside them have clearer scoreboards than people realize.
The rules are often explicit.
The errors are often visible.
The review can happen quickly.
That is where AI can take real work out of the system, especially when exceptions get routed to a human instead of being forced through.
Experiment-heavy growth
Think landing pages, pricing tests, ad creative for performance channels, recommendation tuning, routing and scheduling, churn experiments.
When the metric is visible and the cycle is short, AI can help generate and test far more variations than a human team would on its own.
This is where the compounding effect starts to show up.
A tighter loop means more shots on goal.
More shots on goal, with feedback, means faster learning.
Judgment-heavy work
Think brand positioning, key-account strategy, hiring, major partnerships, M&A, board communication, investment decisions.
AI can still help a lot here.
It can draft.
It can research.
It can help you think.
It can surface patterns.
It can generate options.
But the final judgment stays human for longer because the scoreboard is weak and the judge keeps moving.
That is what “human in the loop” really means.
When the scoreboard is weak, the human becomes the scoreboard.
Where the scoreboard is strong, design for automation.
Where it is partial, design for human review.
Where it is weak, use AI for range and speed, but let people carry the judgment.
AI and jobs
This is also where the automation debate gets too blunt.
Jobs are bundles of tasks, and those tasks can sit in very different places on the scoreboard spectrum.
A marketer can use AI to generate variants, summarize research, and draft copy, while the final call on positioning still stays human.
A finance team can automate reconciliations and anomaly flags, while judgment about what matters and what to do next still sits with people.
Same job title. Very different scoreboards inside it.
This is why most jobs will change shape before they disappear, if they disappear at all.
The checkable tasks get automated first. The messy tasks stay with people longer. The role gets rebuilt around that boundary.
Scoreboard engineering
Most teams treat that boundary as something to discover, and the role rebuilds itself around wherever the line happens to fall.
But the boundary is not fixed. It is engineerable.
Some teams are not just finding where the scoreboard is strong. They are building one where it did not exist before.
Call it scoreboard engineering: the work of taking something fuzzy and making part of it checkable.
A content team cannot easily measure writing quality - until it builds a rubric, applies it consistently for a few months, and suddenly has signal it can act on.
A support operation cannot train AI on “good resolution” - until it defines precisely enough what a good resolution is, in a way that holds up across reviewers and across time.
A recruiting team cannot automate candidate screening - until it does the harder work of figuring out what a strong first round actually looks like.
Which signals predicted performance on the job? Which turned out to be noise? Which criteria would two different interviewers score the same way, consistently, six months apart?
That work is not glamorous. But once it is done, part of what felt like irreducible human judgment has moved across the line.
None of those examples required a new model. They required someone to do the harder upstream work: define the target, build the feedback mechanism, stabilize the judge.
The loop was always there. It just needed to be tightened.
This is the skill that will separate teams that get compounding returns from AI from teams that get intermittent ones.
Not which tools they adopt. Whether they can take fuzzy work and engineer enough structure around it that real learning becomes possible.
Break the work into smaller loops. Define what good looks like precisely enough to measure. Route the exceptions that fall outside your definition to a human. Revisit the definition as the work evolves.
That is the move.
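As a rough sketch of the routing step, under the assumption that you have already turned your definition of good into a rubric score, it could look like this (all names and the threshold are illustrative):

```python
# Handle what the definition of "good" covers, send the rest to a person,
# and log the exceptions so the definition itself can be revisited.
exceptions_log = []

def resolve_automatically(item):
    return f"auto-resolved: {item}"

def route_to_human(item):
    return f"queued for human review: {item}"

def handle(item, rubric_score, threshold=0.8):
    """Route by how well the output scores against the agreed rubric."""
    if rubric_score >= threshold:                 # inside the definition: automate
        return resolve_automatically(item)
    exceptions_log.append((item, rubric_score))   # feeds the next revision of the rubric
    return route_to_human(item)                   # outside it: a person decides

print(handle("refund request A", 0.93))
print(handle("refund request B", 0.41))
```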
Will this work for everything? No.
Will the definitional work be easy? Rarely — getting a team to agree on what “good” even means is often the hardest part.
But that struggle is the work. And as models get stronger, the teams who've done it will compound. The ones who haven't will keep getting intermittent results.
So the real question isn't whether scoreboard engineering matters. It's who in your organization is doing it.
Change the question
We have to stop asking where AI is intelligent and start asking where the scoreboard is strong.
That shift matters because it turns AI strategy into system design.
Can you make the target clearer?
Can you shorten the time to feedback?
Can you make the judge more consistent?
If you can, AI will usually do more than people expect.
If you cannot, be careful about pretending the loop is tighter than it is.
This is why capability looks lopsided right now.
Some domains are already set up for fast compounding.
Others are still waiting for somebody to engineer the scoreboard.
And that is where a lot of the next wave of advantage will come from.
Not just from better models.
From better environments around the models.
From the outside, that can look like sudden intelligence.
A lot of the time, it is really better feedback.
That is what closing the loop means.
Closing the Loop is about mental models for an AI future: practical ways to see where the technology will be strong, where it will be fragile, and what to do about it.
If this framing was useful, share it with someone building, operating, or investing in that future.


