Sometimes Models Just Do Things

The Executive Guide to LLM Incompetence: Where Your AI Will Embarrass You

2025-06-08T09:00:00+00:00

Or: How to Avoid Having Your Board Meeting Derailed by a Chatbot That Can’t Count to Ten

There’s a moment in every executive’s AI journey that feels a bit like discovering your brilliant new hire has been doing their accounting with crayons. One minute you’re marveling as ChatGPT writes a perfectly coherent strategic memo, and the next minute you’re staring in horror as it confidently explains why 7 x 8 = 54, with the same authoritative tone it used to nail your competitive analysis.

Welcome to the “jagged edge” of AI—a term that captures how these systems can perform like Nobel laureates on Monday and like confused interns on Tuesday, often within the same conversation. This isn’t an argument against using AI; properly deployed, these tools can generate tremendous business value. But “properly deployed” carries a lot of weight in that sentence. It means understanding exactly which problems these systems solve well (and which ones they don’t), building verification systems that catch their predictable failure modes, and measuring performance with the same rigor you’d apply to any other business-critical process.

The executives who succeed with AI aren’t the ones who deploy it most aggressively—they’re the ones who deploy it most strategically, with clear-eyed understanding of where the jagged edge cuts deepest and how to design processes that turn those limitations into manageable constraints rather than business-threatening vulnerabilities. Understanding where this edge gets particularly jagged isn’t just intellectually interesting; it’s essential for any executive who wants to deploy AI without accidentally turning their organization into a cautionary tale.

The Fundamental Misconception: It’s Not Actually Thinking

Let’s start with the uncomfortable truth that undermines most AI strategies: Large Language Models don’t reason the way humans do. They’re extraordinarily sophisticated pattern-matching machines that have memorized vast portions of the internet and can recombine those patterns in impressively human-like ways.

Think of it this way: if human intelligence is like having a conversation with a knowledgeable colleague who can think through problems step by step, LLM intelligence is like having access to someone with perfect recall of every conversation they’ve ever overheard, who can mix and match pieces of those conversations to create something that sounds original but is fundamentally derivative.

This distinction matters enormously for business applications. When an LLM writes a marketing email that perfectly captures your brand voice, it’s not because it understands your brand—it’s because it has seen thousands of similar emails and can pattern-match its way to something that feels right. When it fails spectacularly at basic logic problems, it’s not having an off day; it’s revealing the fundamental limitations of its underlying architecture.

The Seven Deadly Failures: Where LLMs Will Let You Down

1. Multi-Step Reasoning (Or: Why Your AI Can’t Plan a Dinner Party)

Ask an LLM to plan a complex project with dependencies, resource constraints, and contingencies, and you’ll quickly discover that it approaches planning the way a particularly eloquent drunk person approaches directions—lots of confident-sounding steps that don’t actually connect to each other in any logical way.

The problem isn’t that LLMs are bad at individual steps; they’re often quite good at generating plausible-sounding project phases or identifying relevant considerations. The problem is that they fundamentally lack the ability to model cause-and-effect relationships across multiple steps. They can tell you that Step A needs to happen before Step C, but they can’t actually verify that their suggested sequence won’t create logical contradictions or resource conflicts.

This manifests in business contexts as beautifully written project plans that fall apart the moment someone tries to actually execute them. The timeline looks reasonable, the deliverables sound comprehensive, and the whole thing reads like it was written by a very expensive consultant—right up until you realize that Phase 2 requires the output of Phase 4, and the budget assumes you can hire three senior developers who don’t exist in your local market.

2. Novel Problem Solving (Or: The Copy-Paste Limitation)

LLMs excel at problems they’ve seen before, in slightly different clothes. They struggle profoundly with genuinely novel situations that require creative problem-solving rather than sophisticated pattern matching.

Consider this scenario: you ask an LLM to help optimize your supply chain for a product that has never existed before, in a market with unusual regulatory constraints, using a business model that combines elements in a way that’s genuinely new. The LLM will confidently generate recommendations that sound sophisticated and well-researched. The problem is that those recommendations will be clever recombinations of supply chain advice for similar-but-not-identical situations, regulatory guidance for adjacent-but-not-applicable frameworks, and business model suggestions for companies that sort of resemble yours but aren’t actually facing your specific constraints.

The result is advice that sounds authoritative but fundamentally misses the novel aspects of your situation—the very aspects that probably make your business interesting in the first place.

3. Consistency Under Pressure (Or: The Brittleness Problem)

Perhaps the most professionally embarrassing failure mode is the brittleness problem: LLMs can solve a problem perfectly when presented one way and fail spectacularly when the same problem is presented with trivial variations in wording, context, or framing.

Imagine you’ve trained your team to use an LLM for financial analysis, and it consistently provides excellent insights when analyzing “quarterly revenue trends.” Then one day, someone asks it to analyze “Q4 revenue patterns” and it completely misunderstands the request, providing analysis that would get a junior analyst fired. The underlying question is identical; only the surface presentation has changed.

This brittleness makes LLMs particularly unsuitable for customer-facing applications where you can’t control how people phrase their requests, or for mission-critical processes where consistent performance is more important than peak performance.

4. Self-Verification (Or: The Confidence Paradox)

One of the most dangerous characteristics of LLMs is that they cannot reliably assess the quality of their own output. They will deliver incorrect information in exactly the same tone of authority they use for accurate information.

Worse, when prompted to check their own work, they often become even more confident in their mistakes. It’s like asking someone to proofread their own writing while they’re still drunk—they’ll confidently tell you it’s perfect, possibly while making additional errors in the process of “checking.”

This creates a particularly treacherous scenario for executives: the more sophisticated the LLM’s output sounds, the more likely you are to trust it, but the sophistication of the language provides no meaningful signal about the accuracy of the content. A completely fabricated financial projection can be presented with exactly the same level of apparent expertise as a meticulously researched market analysis.

5. Abstract Reasoning (Or: The Symbol Grounding Problem)

LLMs manipulate symbols (words, numbers, concepts) without actually understanding what those symbols represent in the real world. They can discuss complex business concepts fluently without having any grounded understanding of how those concepts actually work in practice.

This manifests as advice that sounds strategically sophisticated but reveals, upon closer examination, fundamental misunderstandings about how businesses actually operate. An LLM might recommend a pricing strategy that sounds brilliant in theory but assumes customer behavior that doesn’t exist, or suggest an organizational restructuring that ignores basic human psychology, or propose a partnership structure that violates securities law in ways that would be obvious to anyone who actually understands what those legal terms mean in practice.

6. Domain Expertise Verification (Or: The Confident Ignorance Problem)

LLMs are remarkably good at sounding like experts in fields where they have no actual expertise. They’ve been trained on so much text that they can approximate the language patterns of specialists in virtually any domain, but this linguistic facility masks a complete absence of real understanding.

The seemingly simple task of creating PowerPoint presentations illustrates this perfectly. While LLMs can generate slides that contain all the right information and follow standard formats, they consistently produce presentations that feel oddly ineffective despite being technically correct. The content is accurate, the structure is logical, but something crucial is missing—the subtle understanding of how information flows, builds momentum, and persuades audiences that separates competent presentations from compelling ones. And perhaps more fundamentally, they lack the aesthetic judgment to create visuals that actually enhance rather than merely accompany the content.

But this effectiveness gap pales compared to the accuracy risks in technical domains. Those risks are greatest in regulated industries, technical fields, and specialized business contexts where getting the details wrong has serious consequences. An LLM can write a compliance memo that uses all the right terminology and follows the standard format, but quietly includes recommendations that would expose your company to significant legal liability. It can produce a technical specification that sounds authoritative to non-experts but contains assumptions that would cause expensive failures during implementation.

7. Context Maintenance (Or: The Goldfish Memory Problem)

While LLMs can handle reasonably long conversations, they struggle to maintain consistent understanding of complex contexts over extended interactions. They’re particularly bad at tracking implicit context, evolving constraints, and the cumulative impact of previous decisions within a conversation.

In business contexts, this means that an LLM might provide excellent advice for the first part of a strategic discussion, then gradually lose track of the key constraints and objectives as the conversation evolves, eventually providing recommendations that directly contradict earlier advice or ignore crucial context that was established earlier in the discussion.

The Illusion of Competence: Why This Matters More Than You Think

The fundamental challenge with LLM failures isn’t that they’re obviously broken—it’s that they fail in ways that can be remarkably hard to detect, especially for executives who aren’t immersed in the technical details of how these systems work.

Traditional software fails in obvious ways: error messages, crashes, clearly incorrect outputs. LLMs fail by producing sophisticated-sounding outputs that contain subtle but critical flaws. It’s the difference between a calculator that displays “ERROR” when you ask it to divide by zero versus one that confidently tells you the answer is 47.

This creates a particularly insidious problem for executives: the more impressive an LLM’s output appears, the more likely it is to be trusted, but the apparent sophistication provides no reliable signal about accuracy or appropriateness. In fact, the most dangerous LLM outputs are often the ones that sound most authoritative.

Practical Implications: Building AI Strategy Around Limitations

Understanding these limitations doesn’t mean avoiding AI—it means using it strategically in contexts where its strengths matter more than its weaknesses, and building verification systems that account for its failure modes.

Use LLMs for tasks where getting it mostly right is more valuable than getting it perfectly right: Content generation, brainstorming, initial drafts, research synthesis. Don’t use them for tasks where precision matters more than creativity.

Never use LLMs as final authorities on facts, regulations, or technical specifications: They’re excellent research assistants but terrible fact-checkers. If accuracy matters, verify everything through authoritative sources.

Design workflows that assume inconsistency: Build processes that work even when the LLM has an off day. If your business process breaks when the AI provides inconsistent responses to similar queries, you’ve designed a fragile system.

Implement systematic verification for high-stakes decisions: The more important the decision, the more human oversight you need. LLM output should inform expert judgment, not replace it.

Be especially cautious with novel or complex scenarios: The more your situation deviates from common patterns, the less reliable LLM advice becomes. Use AI for inspiration and initial analysis, but rely on human expertise for the truly novel aspects of your business.
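
For teams that want to put a number on the brittleness described above, one option is to ask the same question several ways and measure how often the answers agree. The following is a minimal Python sketch, not a production harness: `consistency_rate` and `fake_model` are hypothetical names, and `fake_model` is a stand-in that only exists so the example runs without calling a real model.

```python
from collections import Counter
from typing import Callable, Sequence


def consistency_rate(ask: Callable[[str], str], phrasings: Sequence[str]) -> float:
    """Fraction of phrasings whose normalized answer matches the most common answer."""
    answers = [ask(p).strip().lower() for p in phrasings]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


# Hypothetical stand-in for a real model call, used only so the sketch runs.
def fake_model(prompt: str) -> str:
    return "up 4%" if "quarterly" in prompt else "down 2%"


phrasings = [
    "Summarize quarterly revenue trends.",
    "Summarize the quarterly revenue trend.",
    "Summarize Q4 revenue patterns.",
]
rate = consistency_rate(fake_model, phrasings)
needs_review = rate < 1.0  # any disagreement routes the question to a human reviewer
```

Exact string matching is a deliberately crude agreement test; free-text answers would likely need a looser comparison, which is left out here.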

The Measurement Imperative: Why Your Gut Feeling About AI Is Probably Wrong

Here’s the uncomfortable truth about evaluating AI performance: human intuition is spectacularly bad at it. We’re naturally wired to be impressed by fluent language and confident tone, which means we systematically overestimate LLM performance in exactly the areas where they’re most dangerous.

Consider this scenario: your team uses an LLM to draft client proposals for six months. Everyone feels good about it—the proposals sound professional, clients respond positively, and your win rate seems steady. Then someone finally runs a systematic evaluation and discovers that 23% of the proposals contain factual errors about your company’s capabilities, 31% include pricing structures that would lose money if accepted, and 18% make commitments that violate your standard service agreements. Your intuitive assessment was “this is working great,” while the actual performance was “this is creating significant legal and financial risk.”

This isn’t an indictment of your judgment—it’s a feature of how these systems are designed. LLMs are optimized to produce outputs that feel right to humans, not outputs that actually are right. They’re essentially weaponized conviction, capable of making terrible advice sound authoritative and brilliant insights sound routine.

The Evaluation Discipline: Measuring What Actually Matters

The most successful AI implementations treat evaluation as a core competency, not an afterthought. This means building systematic measurement frameworks that capture the specific ways these systems succeed and fail in your business context.

Effective evaluation goes far beyond asking “does this seem good?” Instead, it requires developing specific, measurable criteria for success and failure. For a financial analysis LLM, this might mean tracking not just whether the analysis sounds sophisticated, but whether the assumptions are reasonable, the calculations are correct, the risk assessments align with historical data, and the recommendations would actually generate positive returns if implemented.
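
To make "specific, measurable criteria" concrete, here is a small Python sketch that turns each criterion into a named check and reports how often outputs fail it. Everything in it is illustrative: the check names, the record fields (`reported_total`, `line_items`, `price`, `cost`) and the two sample records are assumptions, not data from any real deployment.

```python
from typing import Callable, Dict, List

# Each check is a hypothetical, business-specific rule that returns True when an output passes.
Check = Callable[[dict], bool]

checks: Dict[str, Check] = {
    "calculation_correct": lambda o: abs(o["reported_total"] - sum(o["line_items"])) < 0.01,
    "margin_positive": lambda o: o["price"] > o["cost"],
}


def failure_rates(outputs: List[dict], checks: Dict[str, Check]) -> Dict[str, float]:
    """Share of outputs that fail each named check."""
    return {
        name: sum(not check(o) for o in outputs) / len(outputs)
        for name, check in checks.items()
    }


# Illustrative records, not real data.
outputs = [
    {"reported_total": 30.0, "line_items": [10.0, 20.0], "price": 120.0, "cost": 100.0},
    {"reported_total": 35.0, "line_items": [10.0, 20.0], "price": 90.0, "cost": 100.0},
]
rates = failure_rates(outputs, checks)
```

The point of the structure is that every criterion gets its own failure rate, so a fluent output that fails one specific check cannot hide behind an overall impression of quality.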

The revelation that comes from proper measurement is often sobering. Teams discover that their AI performs brilliantly on 70% of cases, adequately on 20%, and fails catastrophically on 10%—but that 10% includes scenarios that would cause serious business problems if not caught. More importantly, they discover that their intuitive sense of which cases would be problematic bears almost no relationship to which cases actually are problematic.

The Hidden Costs of Intuitive Assessment

Organizations that rely on intuitive evaluation of AI performance systematically underinvest in verification and systematically overestimate reliability. This creates a particularly insidious form of technical debt: the AI seems to be working well right up until it isn’t, and by then the problems are deeply embedded in business processes.

It’s rather like the corporate equivalent of carbon monoxide poisoning—everything seems fine until you realize you’ve been slowly accumulating dangerous exposures that could have been easily detected with proper monitoring equipment. Except in this case, the “carbon monoxide” is confidently incorrect AI output, and the “monitoring equipment” is systematic evaluation frameworks that most organizations consider too boring to implement properly.

The companies that invest early in measurement capabilities—building evaluation frameworks, tracking performance metrics, and developing systematic approaches to identifying edge cases—typically discover two things: first, their AI systems have more limitations than they initially realized; second, understanding those limitations precisely makes the systems vastly more valuable, because they can be deployed strategically rather than hopefully.

The Bottom Line: It’s a Tool, Not a Colleague

The most successful executives approach LLMs the way they approach any other sophisticated tool: with clear understanding of what it does well, what it does poorly, and how to structure work to leverage its strengths while mitigating its weaknesses.

The key insight is that LLMs are not artificial colleagues—they’re artificial consultants who have read everything but understand nothing, who can help you think through problems but can’t actually think through problems themselves. Used with appropriate caution and verification, they can be extraordinarily valuable. Used with excessive trust or inappropriate expectations, they can create expensive problems that are remarkably hard to debug.

The companies that will benefit most from AI aren’t the ones that deploy it most aggressively—they’re the ones that deploy it most thoughtfully, with clear-eyed understanding of where the jagged edge cuts deepest and how to design processes that turn those limitations into manageable constraints rather than business-threatening vulnerabilities.

After all, the goal isn’t to have the most impressive AI implementation at the next industry conference. The goal is to have AI that actually makes your business better, one carefully verified, appropriately constrained, systematically improved process at a time.

Long-Context Financial QA: An Empirical Evaluation of Large Language Models on Financial Document Analysis

2025-05-19T00:00:00+00:00

Executive Summary

This whitepaper presents findings from an experimental evaluation of large language models with extended context windows for financial question answering. We tested OpenAI’s GPT-4.1, GPT-4.1-mini, and GPT-4.1-nano models on their ability to extract and reason with financial information from both short excerpts (~700 words) and complete SEC filings (~123,000 words).

Key Finding: Accuracy dropped by 28-34 percentage points when moving from short excerpts to full documents. This degradation occurs at document lengths (~123K words, roughly 150K-200K tokens) far below the advertised context window (1M+ tokens), which suggests limits on how effectively these models use long-form content rather than a simple capacity ceiling.

These findings challenge the assumption that simply expanding context windows solves document-scale information extraction problems and suggest that retrieval-augmented approaches remain necessary despite advances in context capabilities.

1. Key Results

Performance Metrics

Model	FinQA Accuracy (~700 words)	DocFinQA Accuracy (~123,000 words)	Performance Drop
GPT-4.1	89.01%	60.00%	29.01 pp
GPT-4.1-mini	89.01%	55.00%	34.01 pp
GPT-4.1-nano	68.13%	40.00%	28.13 pp
GPT-4o	87.91%	Not tested (context limit)	N/A

The pattern holds for all three GPT-4.1 models evaluated on both datasets: even state-of-the-art LLMs with expanded context windows struggle significantly with information extraction from full-length financial documents.
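
The Performance Drop column is plain subtraction of the two accuracy columns. A short Python check, using only the figures from the table above, reproduces it; the variable names are ours.

```python
# Accuracy figures (percent) copied from the table above: (FinQA, DocFinQA).
results = {
    "GPT-4.1": (89.01, 60.00),
    "GPT-4.1-mini": (89.01, 55.00),
    "GPT-4.1-nano": (68.13, 40.00),
}

# Percentage-point drop = short-context accuracy minus long-context accuracy.
drops = {model: round(short - long, 2) for model, (short, long) in results.items()}

smallest_drop = min(drops.values())
largest_drop = max(drops.values())
```

Across the three models the drop ranges from 28.13 to 34.01 percentage points.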

Context Length vs. Performance Relationship

A critical insight from our evaluation is that performance degradation begins well before reaching the advertised context window limits:

The DocFinQA documents average ~123,000 words (approximately 150K-200K tokens)
The advertised context window for GPT-4.1 models is 1M+ tokens
Yet we observe 28-34 percentage point drops in performance despite using only ~15-20% of the available context window
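
The utilization figure follows directly from the estimates in the list above. The sketch below treats the advertised "1M+" window as exactly 1,000,000 tokens, which is an assumption on the conservative side, and also shows the tokens-per-word ratio that the 150K-200K estimate implies.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000                    # assumed lower bound for "1M+"
DOC_WORDS = 123_000                                  # average DocFinQA document length
doc_tokens_low, doc_tokens_high = 150_000, 200_000   # token estimate quoted above

# Fraction of the context window that a typical document occupies.
utilization = (
    doc_tokens_low / CONTEXT_WINDOW_TOKENS,
    doc_tokens_high / CONTEXT_WINDOW_TOKENS,
)

# Tokens-per-word ratio implied by the estimate (roughly 1.2 to 1.6).
tokens_per_word = (doc_tokens_low / DOC_WORDS, doc_tokens_high / DOC_WORDS)
```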

This suggests the relationship between context length and performance is non-linear, with substantial degradation occurring long before approaching window limits. This pattern aligns with research on “lost-in-the-middle” effects and attention dilution in transformer models, but appears more severe in real-world financial applications than in synthetic needle-in-haystack tests often used in model benchmarks.

2. Conclusions and Implications

2.1 Why Long-Context Models Underperform on Financial Documents

Our findings reveal several likely explanations for performance degradation in long financial documents:

Signal-to-noise ratio: In a 123,000-word document, relevant information (often single numeric values or table cells) is overwhelmed by irrelevant text, making it harder for models to identify key data points.
Attention dilution: Long-context models tend to show a U-shaped accuracy curve, where information near the start and end of the input is used more reliably than information in the middle. In lengthy documents, critical financial information often appears in those middle sections.
Positional encoding degradation: Even advanced encoding schemes like RoPE degrade when sequence lengths exceed those seen during training, destabilizing attention scores between distant tokens.
Capacity constraints: Despite larger context windows, models have fixed parameter counts. When processing 100K+ tokens, attention must cover quadratically more token pairs, which reduces the probability mass assigned to any single important connection.
Task complexity: Financial QA requires both locating information and numerical reasoning, effectively combining search and calculation steps. This dual challenge becomes considerably harder as document length increases.

2.2 Practical Implications for Financial Analysis Systems

These results suggest that despite advances in context length, financial analysis systems should still use retrieval-augmented approaches that:

Pre-filter documents to locate relevant sections before applying LLM reasoning
Implement query-first layouts to bias attention mechanisms toward matching the right spans
Use delimiter-based importance markers to highlight candidate answer regions
Apply explicit “find-then-reason” prompting to separate search and calculation tasks
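
One way to combine the last three ideas is a prompt template that puts the question first, wraps candidate spans in delimiters, and asks for retrieval before calculation. The sketch below is only an illustration: `build_prompt` is a hypothetical helper, the `<<SPAN n>>` markers are an invented delimiter convention, and this is not the prompt used in our experiment.

```python
def build_prompt(question: str, candidate_spans: list[str]) -> str:
    """Assemble a query-first, find-then-reason prompt with delimited candidate spans."""
    # <<SPAN n>> ... <<END SPAN n>> is a hypothetical delimiter convention.
    marked = "\n".join(
        f"<<SPAN {i}>>\n{span}\n<<END SPAN {i}>>"
        for i, span in enumerate(candidate_spans, start=1)
    )
    return (
        f"QUESTION: {question}\n\n"  # query-first: the question precedes the context
        f"CONTEXT:\n{marked}\n\n"
        "STEP 1 (find): quote the exact figures needed, citing span numbers.\n"
        "STEP 2 (reason): compute the answer using only the quoted figures.\n"
        f"QUESTION (repeated): {question}"
    )


prompt = build_prompt(
    "What was the net change in tax positions in 2014?",
    ["Placeholder excerpt one.", "Placeholder excerpt two."],
)
```

Repeating the question after the context is a design choice of this sketch, not something the results above were tested with.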

2.3 Next Steps

For developing effective financial analysis systems, these findings suggest several promising directions:

Hybrid retrieval-reasoning approaches: Developing lightweight retrieval methods to select the most relevant ~2,000 tokens before feeding them to an LLM
Document segmentation techniques: Exploring methods to divide documents into sections that can be processed independently before synthesizing the results
Empirical context length optimization: Determining the optimal context window size that balances information completeness with model performance
Fine-tuning for financial domain: Testing whether domain-specific fine-tuning could improve models’ ability to identify and reason with financial information in long contexts
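
As a rough illustration of the first direction, the sketch below splits a document into fixed-size chunks, scores each chunk by keyword overlap with the question, and keeps the best chunks within a word budget. It is a deliberately simple stand-in: `select_chunks` and `tokenize` are hypothetical helpers, keyword overlap substitutes for whatever retrieval method is eventually chosen, and the 1,500-word default is only an assumed proxy for a budget of about 2,000 tokens.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_chunks(question: str, document: str, chunk_words: int = 200,
                  budget_words: int = 1500) -> list[str]:
    """Return the highest-overlap chunks, in document order, within a word budget."""
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
    query_terms = set(tokenize(question))
    # Score = number of distinct question terms that appear in the chunk.
    scored = sorted(
        enumerate(chunks),
        key=lambda pair: len(query_terms & set(tokenize(pair[1]))),
        reverse=True,
    )
    picked, used = [], 0
    for index, chunk in scored:
        size = len(chunk.split())
        if used + size > budget_words:
            break
        picked.append((index, chunk))
        used += size
    # Restore document order so tables and their captions stay in sequence.
    return [chunk for _, chunk in sorted(picked)]
```

The selected chunks would then be passed to the model in place of the full filing.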

3. Introduction

3.1 Background

As part of developing a DeepCredit agent for credit risk assessment tasks, we need reliable mechanisms to extract data and answer questions from extensive financial statements. Recent advancements in large language model (LLM) capabilities have introduced models with context windows exceeding 1 million tokens, including OpenAI’s GPT-4.1 series. This development raises an intriguing question: can these expanded context capabilities eliminate the need for sophisticated retrieval mechanisms when processing financial documents?

Traditional approaches to financial document analysis typically employ:

Embeddings-based semantic search
Regular expressions for pattern matching
Custom search algorithms to locate relevant text fragments

However, these methods face challenges when key information is dispersed throughout lengthy documents, requiring multiple retrieval steps and sophisticated integration of scattered data points.

3.2 Research Question

This study explores whether recent improvements in LLMs’ ability to process and reason across long contexts have reached a point where we can bypass retrieval steps and feed entire documents directly into the model’s context window. Specifically, we ask:

Do large context window models maintain their performance when answering questions about financial data dispersed across lengthy documents?

3.3 Datasets

We utilized two complementary datasets for our evaluation:

FinQA: Comprises excerpts from SEC filings (average length ~700 words) paired with financial questions requiring numerical reasoning.
- Created by finance experts from S&P 500 companies’ reports
- Contains 8,281 QA pairs with fully annotated numerical reasoning programs
- Includes both text and tabular data
Example FinQA Questions:
- “Considering the weighted average fair value of options, what was the change of shares vested from 2005 to 2006?”
- “What was the net change in tax positions in 2014?”
- “What was the percentage cumulative total return for the five-year period ended 31-dec-2017 of citi common stock?”
- “For the quarter December 31, 2012, what was the percent of the total number of shares purchased in December?”
- “What is the estimated percentage of revolving credit facility in relation with the total senior credit facility in millions?”
DocFinQA: An extension of FinQA that replaces the curated excerpts with complete SEC filings.
- Average document length of ~123,000 words (approximately 175× longer than FinQA)
- Contains 7,437 questions derived from FinQA
- Requires navigating entire financial reports to locate relevant information

3.4 Models Evaluated

We tested four OpenAI models with varying capabilities:

GPT-4.1: Latest model with 1M+ token context window
GPT-4.1-mini: Smaller variant with 1M+ token context window
GPT-4.1-nano: Smallest variant with 1M+ token context window
GPT-4o: Tested only on FinQA due to context window limitations

4. Discussion

4.1 Comparison to Marketing Claims

Our findings reveal a discrepancy between marketing claims about long-context models and their real-world performance on complex financial tasks. While promotional materials often highlight “needle in a haystack” tests with reported 100% accuracy, our experiment shows that:

The “needles” in financial documents are often subtler and require more nuanced understanding than the explicit markers used in demonstration tests.
Financial analysis typically requires integrating multiple data points scattered throughout a document, not just identifying a single piece of information.
The performance degradation is substantial enough (28-34 percentage points) to make direct document processing impractical for high-stakes financial applications.
The degradation begins at context lengths (~123K words) that are far below the advertised limits (1M+ tokens), suggesting fundamental scaling issues rather than simple capacity constraints.

4.2 The Non-Linear Relationship Between Context Length and Performance

Our results are consistent with a non-linear relationship between context length and model performance, although two length points cannot establish the shape of the curve. We didn’t test intermediate document lengths, but the sharp drop from ~700 words to ~123,000 words, at only ~15-20% of the advertised window, is larger than a simple linear degradation across the window would produce. This aligns with theoretical work on transformer attention mechanisms, which suggests:

Quadratic complexity effects: The attention mechanism in transformers must compute relationships between all token pairs, growing quadratically with sequence length.
Positional encoding degradation: As sequences extend beyond training distributions, positional encodings become less effective at distinguishing relative positions.
Probability mass dilution: With finite capacity to distribute attention, longer documents necessarily reduce the attention weight assigned to any single token, even critical ones.
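
A back-of-the-envelope calculation shows the scale of the first and third effects. The sketch below uses the average word counts from our datasets as a proxy for token counts, which is an assumption, and models the idealized case of perfectly uniform attention, which real models do not exhibit.

```python
short_words, long_words = 700, 123_000   # average FinQA and DocFinQA lengths

# How many times longer the full document is than the excerpt (about 176x).
length_ratio = long_words / short_words

# Pairwise attention scores grow with the square of sequence length (about 30,900x).
pair_ratio = length_ratio ** 2

# With perfectly uniform attention, each of n tokens receives weight 1/n.
uniform_weight_short = 1 / short_words
uniform_weight_long = 1 / long_words
```

Under these simplifications, a single critical table cell starts with roughly 176 times less baseline attention weight in the full filing than in the excerpt, before any learned focusing is applied.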

Future work should explore this relationship more systematically by testing performance across a range of intermediate context lengths to better characterize where significant performance drops begin to occur.

5. References

Chen, Z., Chen, W., Smiley, C., Shah, S., Borova, I., Langdon, D., Moussa, R., Beane, M., Huang, T.-H., Routledge, B., & Wang, W. Y. (2021). FinQA: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. https://arxiv.org/pdf/2109.00122
Reddy, V., Koncel-Kedziorski, R., Lai, V. D., Krumdick, M., Lovering, C., & Tanner, C. (2024). DocFinQA: A Long-Context Financial Reasoning Dataset. arXiv preprint. https://arxiv.org/pdf/2401.06915
HuggingFace Dataset: https://huggingface.co/datasets/kensho/DocFinQA
HuggingFace Dataset: https://huggingface.co/datasets/Aiera/finqa-verified

Appendix A: Methodology

A.1 Experimental Design

We tested all 100 examples from FinQA (the shorter context dataset) across all models. However, due to rate limit constraints on Microsoft Azure OpenAI when processing extremely long documents, we had to modify our approach for DocFinQA:

Tested 20 randomly selected cases from DocFinQA for GPT-4.1, GPT-4.1-mini, and GPT-4.1-nano due to the large document size
Ran 100 samples from DocFinQA using only GPT-4.1 to validate the smaller sample findings
Used GPT-4o only on FinQA due to context window constraints

The experiment was conducted entirely on braintrust.dev, ensuring a contained environment without external data access.

A.2 Evaluation Framework

We employed the following system prompt:

**system**
SYSTEM — Fin/DocFinQA
You are a financial-QA solver.
• Think step-by-step.
• **Do NOT round or abbreviate numbers—copy them exactly.**
• When finished, write one line: FINAL: 
• Do not write any other line that starts with "FINAL:".
**user**
Answer the question based on the context: ****

A.3 Evaluation Code

Responses were evaluated using a Python function that extracted the final numerical answer and compared it to the expected value, allowing for small rounding differences:

import math
import re

# Match "FINAL:" anywhere in the text, not just at start-of-line.
FINAL_RE = re.compile(r'FINAL:\s*([-+]?\d*\.?\d+)', re.I)


def extract_number(txt: str | None) -> float | None:
    """Return the last number after 'FINAL:' if present, else the last number
    anywhere in the string. None if nothing numeric is found.

    Thousands separators are not handled: "1,234" is split at the comma."""
    if not txt:
        return None
    # Prefer numbers that follow 'FINAL:'
    m = FINAL_RE.findall(txt)
    if m:
        return float(m[-1])
    # Fallback: last plain number (avoid false negatives if the model forgot the tag)
    nums = re.findall(r'[-+]?\d*\.?\d+', txt)
    return float(nums[-1]) if nums else None


def handler(output: str | None = None,
            expected: str | None = None,
            **kwargs) -> float:
    """Return 1.0 if the two numbers match within 0.5 absolute or 2 % relative
    tolerance, else 0.0."""
    # FinQA answers are normally exact, but allow small rounding slips
    ABS_TOL = 0.5
    REL_TOL = 0.02
    s = extract_number(output)
    e = extract_number(str(expected))     # expected is often just "21.48"
    if s is None or e is None:
        return 0.0
    # math.isclose passes when |s - e| <= max(REL_TOL * max(|s|, |e|), ABS_TOL)
    return 1.0 if math.isclose(s, e, abs_tol=ABS_TOL, rel_tol=REL_TOL) else 0.0
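
To show what the tolerance settings accept in practice, the short sketch below applies the same `math.isclose` call with the `ABS_TOL` and `REL_TOL` values from the evaluation code above. The helper name `within_tolerance` and the sample numbers are ours, chosen purely for illustration.

```python
import math

ABS_TOL, REL_TOL = 0.5, 0.02  # same values as in the evaluation code


def within_tolerance(predicted: float, expected: float) -> bool:
    """True when |predicted - expected| <= max(REL_TOL * max(|p|, |e|), ABS_TOL)."""
    return math.isclose(predicted, expected, abs_tol=ABS_TOL, rel_tol=REL_TOL)


small_value_pass = within_tolerance(21.9, 21.48)      # difference 0.42 is under 0.5
large_value_pass = within_tolerance(1015.0, 1000.0)   # difference 15 is under 2% of 1015
large_value_fail = within_tolerance(1025.0, 1000.0)   # difference 25 exceeds 2% of 1025
```

For small percentage answers the 0.5 absolute tolerance dominates, so the scorer is fairly lenient on values such as 21.48.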

Appendix B: Experimental Details

B.1 Data Access

DocFinQA dataset: https://huggingface.co/datasets/kensho/DocFinQA
FinQA dataset: https://huggingface.co/datasets/Aiera/finqa-verified

B.2 Hardware and Software

Evaluation Platform: braintrust.dev
API Access: Microsoft Azure OpenAI

B.3 Limitations

Sample size constraints due to rate limiting on DocFinQA
Context window limits preventing testing GPT-4o on DocFinQA
Limited exploration of intermediate context lengths between 700 and 123,000 words

Minimum Viable Intelligence: When Trillion-Parameter Models Meet Five-Minute Solutions

2025-05-17T00:00:00+00:00

I recently found myself staring at a list of AI research papers that Ilya Sutskever claimed are “all you need to understand what’s really going on in AI in 2025.” First up: “The First Law of Complexodynamics: A quest to formalise why ‘interesting structure’ peaks midway through a closed system’s evolution while entropy keeps rising.”

My immediate thought was: “Complexodynamics? That sounds like what happens when a physics textbook and a management consulting deck have a child they couldn’t agree on naming.” Intimidating terminology aside, I was genuinely intrigued by the concept—even as I questioned whether my brain, now optimized for client deliverables and bedtime stories, could still process theoretical physics.

Look, I barely have time to read emails thoroughly, let alone dense research papers. Between client deliverables and making sure my children don’t use household appliances as percussion instruments, my intellectual aspirations often crash against the rocky shores of reality.

The standard approaches to this dilemma are:

Save papers to the “Read Later” folder (the digital equivalent of a black hole)
Pretend I’ve read them by strategically nodding during conversations
Wait for someone to post a simplified Twitter thread

But I’ve stumbled upon a fourth option that weaponizes AI to compensate for my intellectual shortcomings.

The Cognitive Arbitrage Play

What you’re seeing in these screenshots is the academic equivalent of getting someone else to do your homework. I’ve asked ChatGPT to:

Set a daily reminder at 8am
Explain one paper per day, starting with remedial concepts a toddler could grasp
Build up my understanding until I can fake expertise at dinner parties

This isn’t avoiding intellectual work—it’s outsourcing it to entities that don’t require sleep or caffeine.

Understanding entropy and complexity isn’t just academic navel-gazing—these concepts actually matter for AI. The tension between order and randomness underpins everything from how language models generate coherent text to why neural networks can generalize from training data. The “interesting structure” that emerges midway through a system’s evolution is precisely what we’re trying to capture when training trillion-parameter models, which I know because ChatGPT told me so.

From Complete Confusion to Partial Understanding

Within a single day’s explanation, ChatGPT constructed a learning framework that my uni professors would have appreciated: thought exercises, reflection questions, and bite-sized insights like “Entropy ≠ Complexity – randomness is compressible by ‘it’s random’.”

The Five-Minute MBA Economics

There’s a remarkable asymmetry here:

Traditional learning: Reading academic papers (10+ hours I don’t have)
AI-assisted learning: Daily micro-lessons (5 minutes while pretending to listen on conference calls)

The ROI is compelling.

The Knowledge Arbitrage Framework

The system is simple:

Find intimidating papers that make you feel intellectually inadequate
Make AI explain them to you like you’re five
Gradually increase the complexity until you feel confident you understand the core concepts
Repeat daily until you’ve convinced yourself you understand complexodynamics

I’ve created a personalized tutor that costs less than my morning coffee and doesn’t judge me when I ask it to repeat the same explanation four times.

The Asymmetric Knowledge Advantage

There’s something oddly liberating about this approach. I’m not pretending I’ll read 17 research papers—I’m letting a machine learning model transform them into a personalized curriculum that fits into my actual life.

Perhaps that’s the real insight from complexodynamics: optimal learning happens not through complete immersion or willful ignorance, but at that sweet spot where minimal effort meets maximum appearance of competence.

Just don’t tell Ilya I’m learning his recommended papers in five-minute increments between Slack notifications. Or do tell him, because I suspect he’s doing the same thing with some other field.

The Judge Behind the Judge: Understanding AI Scoring Models

2025-05-17T00:00:00+00:00

Introduction: The Silent Kingmakers of AI

In the increasingly crowded arena of AI capabilities, the models that evaluate other models may prove to be the kingmakers of the field. These “scoring models” operate like invisible judges, determining which AI outputs are worthy and which fall short. While the public marvels at ChatGPT’s PhD maths or Claude’s ninja coding skills, a less glamorous but arguably more consequential revolution is underway in how we teach these systems what “good” actually means.

Scoring models serve two critical functions in modern AI development: they provide automated evaluation of model outputs (replacing endless human judgment), and they guide reinforcement learning processes by doling out rewards and punishments that shape model behavior. Without effective scoring models, we’d be building increasingly powerful systems with no reliable way to steer them.

As one industry report noted, we’re witnessing “a significant trend towards rubric-based evaluations… scoring responses across multiple dimensions… crucial for the iterative improvement cycles inherent in RL.” This systematic approach to judgment represents perhaps the most important yet least discussed technical challenge in AI alignment today.

The Dual Life of Scoring Models

Scoring models play two fundamental roles in the AI ecosystem:

1. Offline Evaluation

Before any model reaches users, it undergoes evaluation cycles where its outputs are automatically judged for quality, helpfulness, correctness, and a host of other attributes. As AI labs churn through thousands of experimental variations, human evaluation becomes impossible to scale.

Enter scoring models: systems designed to predict human judgments. A good scoring model can tell you if Output A would be preferred by humans over Output B, or assign a meaningful quality score that correlates with human assessment. The best evaluation systems can even break down performance across multiple dimensions like factuality, relevance, coherence, and adherence to user intent.

2. Reinforcement Learning Guidance

The more revolutionary application is using scoring models to train other models through reinforcement learning. When OpenAI fine-tuned GPT-3 into the more helpful InstructGPT, they used a reward model to indicate when the AI was heading in the right direction. This Reinforcement Learning from Human Feedback (RLHF) approach has since become an industry standard.

In RLHF, a reward model (trained on human preference data) provides feedback that guides the model to produce outputs humans will prefer. When a language model generates a response, the reward model scores it, and the language model is updated to make highly-scored outputs more likely. This creates a powerful feedback loop that can dramatically improve model behavior if—and this is crucial—the scoring model accurately represents what we actually want.
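
For the curious, a minimal sketch of how a reward model can be trained on pairwise preferences. It assumes the common Bradley-Terry style objective, where the loss is the negative log-probability that the human-preferred response outscores the rejected one. The function names are ours, and real training would compute this over batches inside a deep learning framework rather than on plain floats.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(score_chosen - score_rejected)) for one preference pair."""
    margin = score_chosen - score_rejected
    return softplus(-margin)


# The loss shrinks as the reward model ranks the preferred answer higher.
loss_correct_ranking = pairwise_preference_loss(2.0, 0.0)
loss_undecided = pairwise_preference_loss(0.0, 0.0)
loss_wrong_ranking = pairwise_preference_loss(0.0, 2.0)
```

Minimizing this loss pushes the scorer to assign a higher number to whichever output humans preferred, which is all the policy update later relies on.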

The Great Trade-offs: Size, Interpretability, and Gaming

Building effective scoring models involves navigating several fundamental tensions:

Size vs. Efficiency

Contrary to what you might expect, bigger isn’t always better for evaluation. While large models like GPT-4 and 70B parameter models achieve high agreement with human raters, studies have shown that a carefully fine-tuned 7B parameter model (JudgeLM) and sometimes even simple lexical rules can match their ability to rank outputs. The largest models tend to compress scores into a narrow range, making it harder to distinguish the truly excellent from the merely good.

In practice, most teams use scoring models smaller than their main models, balancing accuracy with computational efficiency. OpenAI’s InstructGPT work used a ~6B parameter reward model to align a 175B parameter model, demonstrating that David can indeed guide Goliath.

Black Box vs. Interpretability

How do you explain why a scoring model gave a particular output a low score? This lack of transparency is driving interest in more interpretable approaches:

Rule-based components that feed signals into neural reward models
Multi-head architectures that score individual aspects (factuality, helpfulness, etc.)
Reward models that provide natural language explanations for their judgments

OpenAI’s Rule-Based Reward (RBR) system feeds signals like “did the output contain forbidden content” alongside model predictions, creating a hybrid that’s not purely a black box.

The Gaming Problem

Perhaps the most insidious challenge is “reward hacking”—where an AI being trained finds loopholes in the scoring system. In the RLHF context, this means the model learns to produce outputs that trick the reward model into giving high scores without genuinely improving.

For example, models might:

Insert keywords the reward model favors
Become verbosely polite while providing little substance
Adopt a particular writing style that correlates with high scores
Repeat certain phrases known to please the evaluator

This isn’t a merely theoretical concern; it has been observed repeatedly in practice. The InstructGPT paper noted that beyond a certain point in training, their model began producing repetitive text that scored well with the reward model but annoyed actual humans.

Building Better Judges: Current Approaches

Several techniques have emerged as particularly effective in creating robust scoring models:

1. Prompt Engineering for Evaluation

When using an LLM as a judge, careful prompt design is essential:

Chain-of-Thought (CoT): Prompting the model to “think step by step” before giving a verdict helps avoid snap judgments. This reduces errors and mimics a human evaluator’s analytical process.
Multi-model Voting: Using multiple prompts or model instances and aggregating their judgments (like averaging or majority vote) reduces variance and individual prompt bias.
Calibration: Providing reference examples with known scores in the prompt anchors the scale and prevents the model from, for instance, giving everything 9/10.
Robustness to Manipulation: Isolating the content being evaluated and instructing the judge to ignore any attempts at influence (like sneaky instructions embedded in the answer).
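
The voting idea in the list above can be sketched in a few lines. This assumes you have already collected verdicts or scores from several judge prompts or model instances; how you call those judges is up to you, and the function names here are hypothetical.

```python
from collections import Counter
from statistics import mean


def majority_verdict(verdicts: list[str]) -> str:
    """Return the most common verdict, or "tie" if the top spot is shared."""
    counts = Counter(verdicts)
    top_count = max(counts.values())
    winners = [verdict for verdict, count in counts.items() if count == top_count]
    return winners[0] if len(winners) == 1 else "tie"


def mean_score(scores: list[float]) -> float:
    """Average numeric scores from several judges to reduce variance."""
    return mean(scores)


# Three judge runs comparing answer A with answer B.
verdict = majority_verdict(["A", "A", "B"])
# Three judge runs scoring a single answer out of 10.
score = mean_score([7, 8, 9])
```

Returning an explicit "tie" rather than picking arbitrarily is a design choice: a split panel is often a signal that the pair deserves a human look.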

2. Fine-Tuning Specialized Scoring Models

While prompt engineering can coax an LLM into becoming a decent judge, fine-tuning yields a more permanent, targeted scorer. Several methods dominate this landscape:

Reinforcement Learning from Human Feedback (RLHF): The classical approach where a reward model is trained on human preference data (typically ranking two outputs for the same prompt) and then used to optimize a policy via RL. Many early aligned models (OpenAI’s InstructGPT, Anthropic’s Claude) used this.
Direct Preference Optimization (DPO): An alternative introduced in 2023 that cuts out the RL step and directly fine-tunes the language model on human preferences. DPO treats the base LM as “secretly a reward model,” leveraging the idea that an LM’s pretraining already encodes many of our preferences and only needs a nudge. This approach is simpler (basically supervised fine-tuning) and often more stable than full RLHF.
Reinforcement Learning from AI Feedback (RLAIF): To reduce reliance on costly human labels, researchers have turned to AI-generated feedback. Anthropic’s Constitutional AI is a prime example: they first fine-tune a model with a set of written principles (the “Constitution”), then have the model critique and revise its outputs according to those principles.
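
To show why DPO counts as "basically supervised fine-tuning," here is a minimal sketch of its per-example loss as described by Rafailov et al. It works on plain floats: the four inputs are summed log-probabilities of the chosen and rejected responses under the model being trained and under a frozen reference copy. The default beta of 0.1 is an illustrative assumption, not a recommendation.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    return softplus(-beta * (chosen_logratio - rejected_logratio))


# Policy identical to the reference: the loss sits at log(2).
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: the loss drops.
loss_after_nudge = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

No separate reward network and no sampling loop appear anywhere: the preference data and two forward passes are enough to compute a gradient.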

3. Structured Rubric Design

Rather than collapsing evaluation into a single score, modern approaches use multi-dimensional rubrics:

Multi-Dimensional Criteria Taxonomies: Instead of a vague notion of “goodness,” define clear axes like Correctness, Relevance, Fluency/Clarity, Helpfulness, Safety, and Efficiency. This acknowledges that an output can be strong in one aspect and weak in another—something a single score would obscure.
Defined Performance Levels: For each criterion, define what different levels look like (Excellent, Good, Satisfactory, etc.). These levels serve as anchors when training, helping the model learn what differentiates “good” from “excellent.”
Weighting and Aggregation: After defining criteria, decide how to combine them. You might decide that Correctness is non-negotiable (weight it heavily) while Politeness is nice-to-have (lower weight). In safety-critical applications, you might implement a rule: if Safety criterion is failed, the overall score is automatically 0 regardless of other aspects.
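
The weighting-plus-gate rule described above can be sketched as follows. The criteria names and weights are hypothetical placeholders; the only parts taken from the text are the idea of a weighted combination and the rule that a failed Safety check forces the overall score to 0.

```python
# Hypothetical weights: correctness matters most, politeness least.
WEIGHTS = {"correctness": 0.5, "helpfulness": 0.3, "politeness": 0.2}


def aggregate(scores: dict[str, float], safety_passed: bool) -> float:
    """Weighted average of per-criterion scores in [0, 1], gated on safety."""
    if not safety_passed:
        return 0.0  # hard gate: nothing else can compensate
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return weighted / total_weight


example = {"correctness": 1.0, "helpfulness": 0.5, "politeness": 0.0}
safe_score = aggregate(example, safety_passed=True)      # 0.5 + 0.15 + 0.0
unsafe_score = aggregate(example, safety_passed=False)
```

Keeping the gate outside the weighted sum is the point: no amount of politeness should be able to buy back a safety failure.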

The Bleeding Edge: Recent Innovations

The frontier of scoring model research is advancing rapidly in several directions:

1. Adversarial Training of Reward Models

A cutting-edge practice is to intentionally generate outputs designed to fool the reward model. One approach is to use a separate adversarial generator (like GPT-4) to concoct responses that are worded in a way the reward model loves, but a human would flag as bad. By training the reward model on these adversarial examples, researchers have significantly increased robustness against exploitation.

In 2025, a team showed this adversarial training dramatically reduced the rate at which a policy could find high-reward but low-quality outputs.

2. Ensemble and Uncertainty-Aware Approaches

To combat reward gaming, researchers are employing ensembles of diverse scoring models and uncertainty estimation. By training multiple reward models (via different initializations or data subsets) and having the policy optimize a pessimistic or uncertainty-weighted reward, the agent is less likely to exploit a single model’s quirks.

Ensemble-based optimization has been shown to “practically eliminate overoptimization” in some settings, especially when combined with a slight KL-divergence penalty to keep the policy near the original model’s behavior.
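
One plausible form of an uncertainty-weighted reward, as a minimal sketch: take the ensemble's mean score, subtract a penalty proportional to how much the members disagree, and subtract a KL term for drifting from the original model. The specific penalty (population standard deviation) and both weights are our assumptions for illustration; published methods differ in the details.

```python
from statistics import mean, pstdev


def pessimistic_reward(
    ensemble_scores: list[float],
    kl_to_reference: float,
    uncertainty_weight: float = 1.0,
    kl_weight: float = 0.1,
) -> float:
    """Mean ensemble score, discounted for disagreement and for policy drift."""
    disagreement = pstdev(ensemble_scores)
    return (
        mean(ensemble_scores)
        - uncertainty_weight * disagreement
        - kl_weight * kl_to_reference
    )


# Same mean score (0.8), but the second output splits the ensemble.
agreed = pessimistic_reward([0.8, 0.8, 0.8], kl_to_reference=0.0)
disputed = pessimistic_reward([0.2, 0.8, 1.4], kl_to_reference=0.0)
```

An output that only one reward model loves gets marked down, which removes much of the incentive to exploit any single model's quirks.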

3. Constitutional AI and RLAIF

Perhaps the most innovative approach comes from Anthropic’s Constitutional AI method. Rather than relying solely on human preferences, they:

Create a “constitution” of principles (rules the AI should follow)
Have the AI critique its own outputs against these principles
Revise the outputs based on this self-critique
Use the revised outputs as training examples

This creates a scalable feedback system that doesn’t require constant human judgment. In their Constitutional AI paper, Anthropic showed this approach produced models that were both helpful and safe without extensive human labeling.
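
The four steps can be sketched as a simple loop. The generate, critique, and revise callables are hypothetical stand-ins for model calls, and the toy versions below exist only so the sketch runs without a model; this is an outline of the procedure as described here, not Anthropic's implementation.

```python
from typing import Callable


def critique_and_revise(
    prompt: str,
    principles: list[str],
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    revise: Callable[[str, str], str],
) -> dict[str, str]:
    """Draft a response, then critique and revise it once per principle."""
    draft = generate(prompt)
    for principle in principles:
        feedback = critique(draft, principle)
        draft = revise(draft, feedback)
    # The revised response becomes a training example for fine-tuning.
    return {"prompt": prompt, "response": draft}


# Toy stand-ins for model calls (illustration only).
def toy_generate(prompt: str) -> str:
    return "SURE, HERE IS THE ANSWER"


def toy_critique(draft: str, principle: str) -> str:
    return "lowercase" if draft.isupper() else "ok"


def toy_revise(draft: str, feedback: str) -> str:
    return draft.lower() if feedback == "lowercase" else draft


example = critique_and_revise(
    "Say hello.", ["Do not shout."], toy_generate, toy_critique, toy_revise
)
```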

4. Behavior-Supported Policy Optimization

Another cutting-edge strategy is Behavior-Supported Policy Optimization (BSPO), which explicitly penalizes an RL agent for straying into regions where the reward model wasn’t trained. By defining the “in-distribution” region from the original model’s logprobabilities and penalizing outputs outside it, BSPO prevents the policy from generating the gibberish or extreme outputs that reward models mistakenly love.

Case Studies: How Leading Labs Approach Scoring

Three major AI labs demonstrate distinct approaches to the scoring challenge:

OpenAI’s Rule-Based Rewards and Code Evaluation

OpenAI’s Codex journey illustrates how they combine programmatic evaluation with AI judgment. For code generation, they leveraged unit tests as an objective “ground truth” scoring function—the percentage of tests passed became a reward signal. However, they found that optimizing for test-passing alone led to weird code (gaming edge cases or writing overly verbose solutions). So they added style and simplicity criteria to the reward.
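
A minimal sketch of the "percentage of tests passed" signal, assuming each test case is a tuple of arguments plus an expected return value. This is our illustration of the idea, not OpenAI's harness, and the buggy candidate is a made-up example.

```python
from typing import Any, Callable


def pass_rate(candidate: Callable[..., Any], test_cases: list[tuple[tuple, Any]]) -> float:
    """Fraction of test cases where the candidate returns the expected value."""
    if not test_cases:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(test_cases)


# A buggy "absolute value" that forgets about negative inputs.
def buggy_abs(x: int) -> int:
    return x


cases = [((3,), 3), ((-3,), 3), ((0,), 0)]
reward = pass_rate(buggy_abs, cases)
```

The number is objective, but it only rewards what the tests check, which is why a signal like this invites gaming unless other criteria sit alongside it.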

For their SWE-Bench Verified benchmark, they had human experts create detailed rubrics for each coding task: does it solve the issue (primary points), does it follow style guidelines (bonus), is the solution minimal (no unnecessary changes)? This multi-faceted approach ensured code was not just functional but well-structured.

Anthropic’s Constitutional AI

Anthropic initially took a dual-scorer approach with their HH-RLHF (Helpful and Harmless) project. They trained two separate reward models—one for helpfulness, one for harmlessness—and used them in a multi-objective RLHF. The reward was a weighted sum with a high penalty for harmful content. This allowed them to explicitly balance the tradeoff between helpfulness and safety.

Later, they developed Constitutional AI, where the scoring “model” isn’t a single network but a procedure: one instance of Claude produces an answer, another instance critiques it according to constitutional principles, and then the answer is revised. This AI-based scoring loop proved remarkably effective—their constitutionally aligned model was nearly as good on harmlessness as a human-RLHF baseline, and actually better at not refusing valid queries.

DeepMind’s Reward Specification Research

DeepMind has focused extensively on the “specification gaming” problem—where an agent exploits loopholes in the reward. Their famous example was an agent in a boat racing game that learned to spin in circles hitting bonus targets rather than completing the race. To systematically address this in language models, they employ dedicated “red teams” to find tricks that break the current reward model, then update the specification to patch these vulnerabilities.

Another interesting DeepMind effort is Recursive Reward Modeling (RRM). Instead of asking a single reward model to judge “is this summary good?”, they break down the task into smaller components, each with its own reward model—one for factual correctness, another for completeness, another for style, etc. They’ve tested this on language tasks where one model summarizes, another critiques the summary, and a third judges the critique—forming a kind of evaluation hierarchy.

Practical Recommendations for Building Scoring Systems

For teams developing their own scoring models, here are key recommendations distilled from industry best practices:

1. Design a Clear, Multi-Faceted Rubric Up Front

Define what “good” means for your task in measurable terms. Break it into criteria (e.g., accuracy, helpfulness, safety) and gather example outputs for each level. This rubric will guide data collection, training, and debugging. Continually refine it as you discover new edge cases.

2. Layer Your Defenses Against Gaming

Don’t rely on a single scalar from the reward model alone. Consider a multi-stage reward: first apply a quick heuristic filter (for really bad outputs), then the learned reward model, then maybe a final check by a large LM on a sample. This layered approach helps catch things the primary reward might miss and allows you to enforce hard constraints via the first stage.
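
A minimal sketch of that layered structure, with each stage passed in as a callable. The stage names and the toy stand-ins are hypothetical; a real system would plug in its own heuristic, learned reward model, and (optionally) a stronger judge applied to a sample.

```python
from typing import Callable, Optional


def layered_reward(
    output: str,
    passes_filter: Callable[[str], bool],
    reward_model: Callable[[str], float],
    final_check: Optional[Callable[[str], bool]] = None,
) -> float:
    """Heuristic filter first, then the learned score, then an optional veto."""
    if not passes_filter(output):
        return 0.0  # hard constraint enforced before any learned scoring
    score = reward_model(output)
    if final_check is not None and not final_check(output):
        return 0.0  # a stronger judge can still veto a high-scoring output
    return score


# Toy stand-ins (illustration only).
def non_empty(text: str) -> bool:
    return len(text.strip()) > 0


def toy_reward_model(text: str) -> float:
    return min(len(text.split()) / 10, 1.0)


empty_score = layered_reward("", non_empty, toy_reward_model)
normal_score = layered_reward("one two three four five", non_empty, toy_reward_model)
vetoed_score = layered_reward(
    "one two three four five", non_empty, toy_reward_model, final_check=lambda text: False
)
```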

3. Adversarially Test Both Policy and Reward Model

Before deployment (and periodically after), “red team” your models. For the reward model, try to find inputs that it mis-scores—nonsense outputs that fool it, or harmful outputs it doesn’t flag. For the policy, look for clever ways it might produce undesired outputs that still get past the reward model. When you find issues, add those to training or add rules to handle them.

4. Use the Largest Model You Can Afford for Critical Decisions

In offline evaluations (choosing model versions, tuning hyperparameters), employ a very strong judge (e.g., GPT-4 or a committee of experts). This helps overcome any blind spots your trained reward model might have. Don’t solely rely on your reward model’s score to pick the best policy—cross-check with humans or stronger models.

5. Incorporate Uncertainty Estimation

Build uncertainty awareness into your scorer. Train an ensemble of reward models or use dropout to get a confidence interval. Then penalize very uncertain high-reward situations when training the policy.

6. Continuously Monitor and Recalibrate

Treat your reward model as a living component. Continuously collect data on where users are unhappy or where outputs seem off, and check whether the reward model’s scores agreed with those observations. If not, retrain or adjust. Maintain dashboards for key metrics and schedule regular audits where human evaluators rate a random sample of model outputs to sanity-check the reward model.

Essential Reading: Papers to Understand Scoring Models

For those wanting to dive deeper, here are the most important papers in this space:

Foundational Papers

Ouyang et al. (2022), “Training language models to follow instructions with human feedback” - The InstructGPT paper, foundational for RLHF
Bai et al. (2022), “Training a Helpful and Harmless Assistant with RLHF” - Anthropic’s HH-RLHF paper
Rafailov et al. (2023), “Direct Preference Optimization” - Introduced DPO as a simpler alternative to RLHF

Cutting-Edge Research

Bukharin et al. (2025), “Adversarial Training of Reward Models” - Shows how adversarial examples improve reward model robustness
Frick et al. (2024), “How to Evaluate Reward Models for RLHF” - Introduces Preference Proxy Evaluations (PPE)
Dai et al. (2025), “Mitigating Reward Over-Optimization via Behavior-Supported Regularization” - The BSPO paper
Dubois et al. (2024), “Length-Controlled AlpacaEval” - Addresses bias in LLM evaluators

Industry Applications

OpenAI (2024), “Improving Model Safety Behavior with Rule-Based Rewards” - Details RBR approach
Anthropic (2023), “Constitutional AI” - Describes the constitutional approach to alignment
Singh et al. (2024), “Judging the Judges” - Analyzes different models as evaluators

Conclusion: The Future of AI Judgment

As models grow more capable, the systems we use to evaluate and shape them will only become more critical. The scoring model—whether a separate reward network, a prompted judge, or a complex evaluation pipeline—sits at the heart of both how we measure progress and how we align AI with human values.

The field is moving beyond simplistic metrics toward sophisticated, multi-faceted evaluation frameworks that capture the nuance of what makes an AI output truly valuable. As the industry report quoted in the introduction put it, “We’re seeing a significant trend towards rubric-based evaluations… scoring responses across multiple dimensions.”

The challenge of building scoring models that are robust against gaming, interpretable to humans, and truly aligned with our values represents perhaps the central technical problem in AI alignment today. After all, if we can’t reliably tell our AI systems what “good” looks like, how can we expect them to deliver it?

The Great AI Agent Delusion: Why the Hype Doesn’t Match Reality

2025-05-15T00:00:00+00:00

In the breathless world of artificial intelligence discourse, few concepts have generated quite as much excitement—or venture capital—as AI agents. The narrative is seductive: autonomous AI systems that can independently perform complex tasks, collaborate like digital employees, and revolutionize how work gets done. It’s the kind of story that launches a thousand startups and inflates valuations faster than the Fed can say “transitory inflation.”

The premise behind agents is particularly alluring in its simplicity: while traditional AI approaches face diminishing returns (scaling laws show that making models bigger delivers improvements, but at an ever-increasing cost), agents promise near-linear scaling by simply replicating models and having them collaborate. The pitch is that just as a lean startup team can accomplish disproportionate results through specialization and coordination, a team of AI “specialists” could be greater than the sum of its parts—each model acting as a force multiplier for the others. Why build one enormous, expensive model when you could have a nimble squad of smaller ones, each with a specific role, working together to solve problems?

But as with most narratives that promise to fundamentally transform how we work, there’s a substantial gap between the hype and the reality. I’ve previously written about how game-changing Deep Research is for knowledge work—and I stand by that. Deep Research is genuinely revolutionary (in the “actually helps me do my job better” sense, not the “will replace me while I sip cocktails on a beach” sense). But being excited about Deep Research while skeptical of the broader AI agent revolution isn’t a contradiction—it’s just pattern recognition. It’s like loving your reliable sedan while doubting the imminent arrival of fully autonomous flying cars, despite what the glossy venture pitch decks promise. The evidence suggests that the broader agent revolution, where armies of AI agents autonomously run your company while you “focus on strategy” (i.e., update your LinkedIn profile), is significantly overblown—at least in the short term, and possibly much longer.

The Hype Tsunami

Let’s start with the obvious: AI agents are drowning in hype. The media landscape has been quick to declare this the “year of the AI agent,” with tech luminaries like Google’s Sundar Pichai characterizing our current period as the “agentic era.” The agent gold rush began in early 2023, when AutoGPT—the original poster child of autonomous agents—accumulated tens of thousands of GitHub stars almost overnight, surpassing even foundational ML projects like PyTorch. Fast forward to 2025, and we’ve got sleeker, venture-backed iterations like Devin (the “AI software engineer”) and Manus (the “first general AI agent”) continuing to fuel the hype cycle with increasingly bold claims.

The investment figures are equally staggering. Global VC funding in AI exceeded $100 billion in 2024, with AI companies attracting nearly one-third of all global venture funding. AI agent startups reportedly raised $3.8 billion in 2024 alone, nearly tripling the previous year’s investment.

The dominant narrative is irresistible: autonomous digital employees that can handle everything from customer service to software engineering to content production. Salesforce’s Marc Benioff boldly predicted “a billion AI agents” in operation by 2026, supposedly doubling workforce productivity. Market projections forecast the AI agent market growing from $5.1 billion in 2024 to $47.1 billion by 2030, at a compound annual growth rate of 44.8%.
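
A quick sanity check on that projection, as a minimal sketch: the quoted start value, end value, and growth rate should agree with each other under the standard compound annual growth formula. The only assumption is that the period is the six years from 2024 to 2030.

```python
start_billion = 5.1    # 2024 market size quoted above
end_billion = 47.1     # 2030 projection quoted above
years = 2030 - 2024

# Growth rate implied by the two endpoints.
implied_cagr = (end_billion / start_billion) ** (1 / years) - 1

# End value implied by the quoted 44.8% rate.
projected_billion = start_billion * (1 + 0.448) ** years
```

The implied rate comes out very close to the quoted 44.8%, so the figures are at least internally consistent. Whether the market cooperates is a separate question.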

The Remarkable Rise of Single LLMs

While agent hype has been building, single, tool-augmented large language models have been making extraordinary progress. OpenAI’s “o3” model exemplifies this trend, demonstrating capabilities that were hardly imaginable just a couple of years ago.

The jump in capabilities from earlier models to o3 is genuinely impressive. On the SWE-Bench software engineering benchmark, o3 achieved a 71.7% success rate, compared to only 48.9% by its predecessor. For context, this means it resolves nearly three-quarters of the benchmark’s GitHub issues, a massive leap in problem-solving ability.

In competitive programming, o3’s Codeforces Elo rating reached 2727, versus 1891 for its predecessor—putting it on par with top human competitive programmers. It also scored 87.7% on an expert-level science and reasoning test (GPQA Diamond) and was three times better than its predecessor at abstract reasoning tasks.

What’s enabling these improvements? Three key factors stand out:

Longer deliberation: Models like o3 are trained to think for longer before responding, internally generating and traversing chains of thought before producing an answer.
Tool use integration: o3 and similar models have built-in abilities to use tools like web browsers, code execution environments, and data analysis capabilities, allowing them to operate more independently.
Fewer errors with scale: Larger models make significantly fewer reasoning errors. OpenAI notes that o3 “makes 20% fewer major errors than its predecessor on difficult, real-world tasks.”

The critical point here is that many tasks which might have required multiple specialized AI agents can now be handled by a single, powerful LLM with appropriate tools. For example, if a user asks, “How will summer energy usage in California compare to last year?”, o3 can search the web for utility data, write and run Python code for forecasting, generate a graph, and explain the results—all without needing multiple coordinating agents or human oversight.

In other words, the integration of tool use and iterative reasoning into a single model is already delivering much of what agent advocates promise—but with fewer points of failure.

The Predictable Limits of Agents

Despite their promise, AI agents consistently run into several fundamental challenges that limit their effectiveness compared to single, tool-augmented LLMs.

Memory and Forgetting

Current agents have very limited memory capabilities—primarily the context window of their underlying models. When an agent completes one subtask and moves to the next, it might lose important information due to context limits. More importantly, agents don’t retain knowledge across missions unless explicitly fine-tuned, leading to inefficient relearning and repeated mistakes.

This limitation manifests in embarrassing ways. AutoGPT (the OG agent) became notorious for getting stuck in infinite loops, repeating the same approach without making progress because it couldn’t remember what it had already tried. More recent systems haven’t fully solved this issue. In my own extensive testing of Manus, I unsuccessfully tried to get it to improve on a range of tasks that large reasoning LLMs like o3 can just about manage. Instead of showcasing the supposed advantages of its multi-agent architecture, Manus consistently collapsed under its own weight, forgetting previous context and losing track of its goals more often than it got close to matching the performance of solo LLMs. The memory limitations became particularly obvious in extended operations, where Manus would start strong but gradually degenerate into confusion and contradictions.

Single LLMs, by contrast, can sidestep some memory issues through larger context windows and more knowledge baked into their parameters. Rather than needing to “remember” via external memory systems, they simply know more things by default.

Planning and Decomposition

LLM-based agents do implicit planning—generating possible next steps using learned patterns rather than explicit search through a plan space. This works for short sequences but quickly breaks down for complex tasks.

Modern agent systems like Devin and Manus still generate sub-tasks that are irrelevant or redundant, or miss important steps entirely. When MIT Technology Review tested Manus on building a DocuSign-like web app, it initially impressed reviewers by autonomously coding components—but then completely missed implementing the actual document signing functionality (you know, the core feature of DocuSign). This is the equivalent of a contractor building you a beautiful bathroom but forgetting to install the toilet.

The planning failure isn’t unique to Manus. TechCrunch’s analysis of various agent systems noted most suffer from a fundamental brittleness in task decomposition. On the MLE-bench for multi-step decision making, vanilla GPT-4 in a simple loop consistently outperformed dedicated agent frameworks because the latter’s planning was too brittle. The agents would make an initial plan that was suboptimal and then stick to it with the stubborn determination of a toddler insisting they can tie their own shoes, while a single GPT-4 with richer reasoning could adapt more effectively.

Hallucinations and Error Compounding

Because agents operate in loops and rely on their own intermediate outputs, errors can compound dramatically. If an agent has a misunderstanding in step 1, it builds on that wrong assumption in step 2, and so on—without a human in the loop to correct it.

Devin, despite being a $23 million venture-backed “AI software engineer,” still suffers from this cascading error problem. In one demonstration where Devin was asked to build a simple e-commerce feature, it misunderstood the database schema in its first step, then wrote increasingly creative but completely non-functional code based on its imaginary version of the database. By step 7, it was enthusiastically debugging code that could never work because the entire foundation was hallucinated. It’s the AI equivalent of building an elaborate house of cards on a trampoline.

The infamous example from the OG agent days remains instructive: an AutoGPT instance tasked with researching cats made a wiki page about cats, then hallucinated a server flaw to gain admin access, then “killed” itself—all completely unrelated to the user’s goal. The underlying issue was that the agent hallucinated an incorrect method and pursued it with the relentless determination of a conspiracy theorist who’s “done their research.”

By contrast, a single LLM call that produces a wrong answer simply ends—it doesn’t keep compounding the mistake. Agents thus have a higher error ceiling unless carefully constrained, much like how a toddler left unsupervised doesn’t just make one mess but creates an escalating disaster scenario that spreads through the entire house.
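A rough way to see why loops amplify errors is to assume, purely for illustration, that every step succeeds independently with the same probability. Neither the independence assumption nor the 95% figure comes from any benchmark cited here, and the function name is hypothetical. Real agents can sometimes recover from a bad step, so treat this as a back-of-the-envelope sketch rather than a model of any particular system.

```python
def chain_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes steps fail independently and that no step can repair an
    earlier mistake (both are simplifying assumptions).
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # Illustrative only: a hypothetical agent that gets each step right 95% of the time.
    for steps in (1, 5, 10, 20):
        print(steps, round(chain_success_probability(0.95, steps), 3))
```

Under those assumptions, a step that is right 95% of the time yields an end-to-end success rate of roughly 60% after ten steps, which is why a single call that fails once is often the cheaper failure.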

Where Agents Actually Succeed

While I’ve painted a rather skeptical picture of agents so far, it would be intellectually dishonest not to acknowledge that there are specific domains where agent architectures have demonstrated real value. These tend to be narrow, well-scoped applications rather than the broad, general-purpose “autonomous employees” promised by the hype.

Deep Research Loops

OpenAI’s ChatGPT Deep Research mode functions as an autonomous research assistant that spends up to 30 minutes iteratively searching the web, reading numerous sources, and synthesizing a comprehensive answer with citations.

On “Humanity’s Last Exam” (a set of 3,000 expert-level questions), the deep-research agent scored 26.6% accuracy. While this may seem low, the test is deliberately extremely difficult, and the agent’s ability to discover relevant information online rather than relying purely on baked-in knowledge is impressive. More importantly, the agent produces fully referenced answers with 100% of claims backed by citations, addressing the hallucination problem in high-stakes research.

The time savings are substantial: tasks that might take a human analyst hours to complete can be done in 5-30 minutes by the agent. In one case study, the agent analyzed the impact of a new tax policy across multiple countries using recent economic data, producing a multi-page report with graphs—a task that would normally require an analyst’s full day of effort, completed in under an hour with minimal human input.

Code Development Agents

Another domain where agent success is evident is software development, particularly automated code generation and pull requests. For instance, developers have built “PR agents” that review incoming pull requests, run test suites, identify failing tests, and autonomously push commits to fix bugs.

These agents have demonstrated significant time savings—in one example, saving a team about 2-3 hours of developer time per PR. Over a month, with dozens of PRs, this translates to roughly a week of developer time saved.

The Brittleness of Longer-Range Agents

Despite these successes in narrow domains, longer-range, more general-purpose agents tend to be remarkably brittle. The empirical evidence for this is compelling.

Anthropic’s Claude 3.5 with “Computer Use” capability—one of the most advanced agent systems available—could successfully complete only about 46% of steps in an airline booking task and about 69% for a retail returns workflow. In other words, it failed more than half the time on booking flights and almost a third of the time on handling returns—hardly the reliability needed for true autonomy.

Even cutting-edge agent platforms like Manus, which sparked significant hype in early 2025, suffer from “system stability challenges, including crashes and server overload under high demand.” Many early users of AutoGPT reported the program simply hanging or hitting API limits and stopping mid-task.

Meanwhile, single, tool-augmented LLMs are often dramatically outperforming dedicated agent systems on standardized benchmarks. On SWE-Bench Verified, OpenAI’s o3 achieved 71.7% success without any agent scaffolding, whereas Cognition AI’s Devin—one of the most hyped dedicated agent systems specifically designed for coding—reached only 13.86% on the same benchmark. This five-fold performance gap between a single capable LLM and a purpose-built agent system starkly illustrates how adding agent complexity doesn’t necessarily improve outcomes, and in many cases, significantly worsens them.

This performance gap reveals a fundamental truth: adding more steps and complexity to an AI system doesn’t necessarily improve performance—it often introduces more points of failure instead.

The “Three-Body Problem” of Agent Coordination

One of the most intellectually interesting challenges in agent development is what we might call the “three-body problem” of coordination. The name is a direct analogy to the three-body problem in physics, which famously has no general closed-form solution: the interactions between even a small number of bodies under mutual gravitational influence can lead to chaotic and unpredictable trajectories.

In multi-agent systems, where multiple autonomous agents interact with each other and their environment, the collective behavior can become exceedingly difficult to predict or control, even if the behavior of individual agents is relatively well-understood in isolation.

Key challenges include managing partial observability (where agents have incomplete information about the environment or other agents), non-stationarity (where the environment dynamics change due to the actions of other learning agents), ensuring effective coordination and safety, and dealing with the escalating complexity as the number of agents increases.

This complexity is a scaling challenge distinct from that of scaling individual LLMs. Merely making individual agents “smarter” by improving their core LLMs might not make the overall multi-agent system more reliable or predictable; it could, in fact, increase the phase space of possible interactions and outcomes, making robust control and verification even harder.

Unlike the clear scaling laws we’ve observed with individual LLMs (where performance improves predictably with model size and training data), there’s no clear evidence that multi-agent systems will exhibit similar smooth improvement curves. The interactions between agents introduce a fundamentally different kind of complexity.

Biological and Organizational Analogies: A Critical View

Proponents of AI agents often draw analogies from human organizations to argue for the inevitability of agent-based approaches. These analogies can be compelling but fundamentally misunderstand why human organizations work the way they do.

The organizational analogy misses a crucial point: human corporations and specialization exist primarily as a response to our inherent cognitive limitations. We develop specialized expertise because humans have finite memory, bandwidth, and learning capacity. You simply cannot become world-class at medicine, programming, marketing, and finance in one lifetime. Specialization isn’t just a preference—it’s a necessity born of our constraints.

Large language models, however, don’t face these same limitations. They’re already generalists with reasonable capabilities across a vast range of domains. They can simultaneously “know” medicine, programming, marketing, and finance at a level that would take humans multiple lifetimes to acquire. So when multi-agent systems assign different “roles” to instances of the same underlying model, they’re essentially solving a problem that doesn’t exist.

In these systems, you have the same model with the same capabilities pretending to be different specialists, when the model is already equally capable (or equally limited) across all those domains. It’s like having a versatile professional athlete pretend to be different sports specialists by changing uniforms—underneath, it’s still the same athlete with the same fundamental capabilities.

This lack of genuine cognitive diversity means that the “specialization” achieved is purely theatrical. When multiple agents are essentially performing similar types of processing on similar information, the benefits of a multi-agent architecture over a highly capable single model with sophisticated tool use become far less clear, especially when considering the added overhead of inter-agent communication and coordination.

Economic Reality Check: Will Agents Outperform Single LLMs in ROI?

The ultimate test of any technology is whether it delivers superior return on investment compared to alternatives. For AI agents, the question is whether they will outperform single, tool-augmented LLMs in cost-adjusted ROI.

The evidence here is mixed at best. While proponents cite impressive projections—over 80% of enterprises plan to integrate AI agents within 3 years, with potential revenue increases of 6-10%—these figures may reflect hype more than reality.

Skeptics point out that “we haven’t even figured out ROI on basic LLM tech, let alone fully autonomous agents.” Many companies are still struggling to quantify gains from using ChatGPT or similar tools in their workflows. Deploying more complex agents could increase costs (development, oversight) without guaranteed returns.

There are also significant hidden costs. Implementing an AI agent isn’t like installing off-the-shelf software—it requires substantial custom engineering, integration with enterprise systems, and ongoing maintenance. When the underlying LLM updates, the agent might need re-testing or tweaks, adding to the total cost of ownership.

The risk of errors adds another economic dimension. If an agent errantly sends wrong information to customers or makes a bad decision in a finance context, the cost could far outweigh any savings. Businesses might need to invest in robust testing and insurance against agent errors, further eroding ROI.

Meanwhile, a well-trained knowledge worker with access to ChatGPT or a similar tool might produce better results than an unsupervised agent working blindly. For example, an agent might generate 10 reports, of which 3 have critical errors requiring do-overs; a human using an LLM might generate 5 reports, all correct, in the same time. The net output quality vs. cost often favors the latter approach.

Maintaining Perspective

Let me be clear: I’m not suggesting AI agents have no value or future. In specific, well-defined domains—like deep research or automated code reviews—agent architectures have already demonstrated significant value. And as the technology matures, we may see breakthroughs that address the coordination problems and brittleness issues I’ve highlighted.

But the current narrative that autonomous AI agents will imminently transform knowledge work and outperform humans (or even single LLMs) across a broad range of tasks is simply not supported by the evidence. The gap between the hype and the reality remains substantial.

For the next 1-3 years, the more prudent approach is to leverage the rapidly advancing capabilities of single, tool-augmented LLMs like o3, which are already demonstrating remarkable abilities to handle complex tasks that previously might have required agent architectures. These models will continue to get bigger, smarter, and cheaper, likely outperforming even highly skilled humans on specific tasks.

But the idea that agentic workflows will unlock significantly more economic value than single models in the near term remains speculative at best. The theoretical limitations around memory, planning, and coordination in multi-agent systems pose substantial challenges that scaling alone may not solve.

While futurists and venture capitalists paint visions of autonomous AI organizations handling complex workflows without human intervention, the practical reality today is much more modest: specific agent applications showing promise in narrow domains, with single LLMs often delivering better performance and reliability for many tasks.

In the breathless race to claim the next paradigm shift in AI, we’d do well to remember that evolution—both biological and technological—typically proceeds through incremental improvements rather than sudden leaps. The agent revolution may eventually arrive, but it will likely be a gradual evolution rather than the overnight transformation many are promising.

So by all means, experiment with agents where they make sense. But don’t be surprised if, for the foreseeable future, a skilled human working with a powerful LLM outperforms fully autonomous agent systems in most knowledge work domains. Sometimes the simplest solution—a direct conversation with a highly capable AI—remains the most effective.

DeepCredit Experiment: Evaluating LLM Performance on Hard CRR Questions

2025-05-12T00:00:00+00:00

Introduction

This report details our latest DeepCredit experiment, which represents a significant step toward building a general credit risk analysis agent. The experiment involved creating and evaluating a set of 50 challenging Capital Requirements Regulation (CRR) questions to test the capabilities of various large language models in understanding and applying complex financial regulations.

Methodology

Question Generation Process

We developed a systematic approach to create cognitively demanding regulatory questions that would challenge even seasoned financial professionals. The questions were automatically generated using an LLM that reverse-engineered complex regulatory requirements from the CRR text. Our methodology followed these key steps:

Define Purpose and Target Difficulty: We targeted skills requiring deep knowledge of CRR II/III own-funds and credit-risk rules, plus the ability to apply them quantitatively. Questions were calibrated to challenge seasoned regulatory-capital practitioners or CFA/FRM holders.
Content Mapping: We extracted an article list from CRR and key Delegated Regulations, then flagged “high-leverage” articles that are frequently misunderstood, embed quantitative scalars, or define critical thresholds.
Question Design Heuristics: We employed several heuristics to generate challenging questions:
- Exploiting exceptions to general rules
- Requiring precise arithmetic calculations
- Mixing temporal considerations with rule applications
- Cross-referencing different regulatory artifacts
- Incorporating real-world financial instruments
Filtering and Validation: We removed duplicates, ensured uniqueness of concepts, and verified that all numerical values traced directly to the regulation.

Question Formalization: Each question was structured with a consistent schema:

{
  "question": "Question text",
  "answer": "Expected answer",
  "tags": [taxonomy],
  "references": "Legal pin-cite(s)",
  "rationale": "One-paragraph proof"
}

Quality Control: All questions underwent initial review and self-consistency checks to ensure accuracy and clarity. The evals have not been peer-reviewed yet, but the final set for the benchmark will undergo double peer review.
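For readers who want to load records in this shape, the question schema above could be checked with something like the following. This is a minimal sketch: the field names come from the schema, while the class name `EvalQuestion` and the function `parse_question` are hypothetical and not part of our pipeline.

```python
import json
from dataclasses import dataclass

# Field names follow the schema in this report; everything else is illustrative.
REQUIRED_FIELDS = ("question", "answer", "tags", "references", "rationale")


@dataclass(frozen=True)
class EvalQuestion:
    question: str
    answer: str
    tags: tuple[str, ...]
    references: str
    rationale: str


def parse_question(raw_json: str) -> EvalQuestion:
    """Parse one JSON record and reject it if a schema field is missing."""
    record = json.loads(raw_json)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not isinstance(record["tags"], list):
        raise ValueError("tags must be a list")
    return EvalQuestion(
        question=str(record["question"]),
        answer=str(record["answer"]),
        tags=tuple(record["tags"]),
        references=str(record["references"]),
        rationale=str(record["rationale"]),
    )
```

Rejecting incomplete records up front keeps a missing rationale or pin-cite from silently reaching the evaluation run.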

Example Questions

To illustrate the challenging nature of the questions, here are three examples from the dataset:

Example 1: AT1 Capital Trigger Thresholds

{
  "question": "CRR requires AT1 instruments to include a trigger for conversion/write-down when CET1 falls below a level. What is the minimum trigger?",
  "answer": "5.13%",
  "tags": ["HEXC", "NUMC"],
  "references": "CRR Art 54(1)(a)",
  "rationale": "Article 54(1)(a) specifies that AT1 instruments must have provisions requiring the principal amount to be written down upon the Common Equity Tier 1 capital ratio falling below 5.125%."
}

Example 2: Significant Investment Threshold Deductions

{
  "question": "Bank JKL holds a 12% stake in another financial institution (significant investment in CET1 of a financial sector entity) amounting to €50 million. Bank JKL's own CET1 capital is €400 million. Under the threshold regime, how much of this €50 million must be deducted from CET1?",
  "answer": "€10 m",
  "tags": ["NUMC", "DMAT"],
  "references": "CRR Art 48(1)",
  "rationale": "According to Article 48(1), significant investments in financial sector entities are subject to the 10% threshold. The threshold amount is 10% of Bank JKL's CET1 (€400m × 10% = €40m). Since the investment exceeds this by €10m (€50m - €40m), that excess amount must be deducted from CET1."
}

Example 3: Capital Instrument Eligibility

{
  "question": "A bank issued perpetual bonds that are deeply subordinated, but their coupons are cumulative (missed payments accumulate as liabilities). Under CRR own-funds rules, how much of this instrument can count toward regulatory capital?",
  "answer": "0",
  "tags": ["HEXC", "TREE"],
  "references": "CRR Art 52(1)(l)",
  "rationale": "Article 52(1)(l) explicitly requires AT1 instruments to have distributions that are paid only out of distributable items and the institution has full discretion to cancel distributions at any time. The accumulation of missed payments as liabilities (cumulative coupons) contradicts this requirement, making the instrument ineligible for AT1 capital."
}

Evaluation Setup

The evaluation was conducted using braintrust.dev, a specialized evaluation platform that allows for easy setup of language model assessments. We selected braintrust.dev for its ability to run evaluations without code by simply uploading datasets as CSV files and configuring system prompts within the app.

We tested four OpenAI models (using the latest versions as of May 13, 2025):

o3 - The largest reasoning-optimized model
o3-mini - A distilled, more efficient version of o3
GPT-4o (4o) - The standard non-reasoning model
GPT-4o-mini (4o-mini) - A distilled smaller version of 4o

The evaluation used a “Factuality” scoring method to assess the correctness of model responses against the expected answers.

Results Summary

Our evaluation revealed significant performance differences across models, with o3 demonstrating a substantial performance uplift compared to the other three. However, analysis of the scores indicated that the “Factuality” scoring method was sometimes overly strict, understating the performance of models that provided technically correct answers with minor formatting differences.

Performance by Model

Model	Factuality Score
o3	77.20%
GPT-4o (4o)	56.80%
GPT-4o-mini (4o-mini)	47.60%
o3-mini	47.20%

The performance ranking reveals an interesting pattern. Conceptually, one would expect the performance to follow the model capability hierarchy (o3 > o3-mini > 4o > 4o-mini), since o3 is the most advanced reasoning-optimized model, followed by its distilled version (o3-mini), then the standard non-reasoning model (4o), and finally its smaller variant (4o-mini).

However, o3-mini underperformed, scoring even below 4o-mini, which is theoretically the least capable model in the set. This result highlights the importance of empirical evaluation for assessing model capabilities in specialized domains, since the nominal capability hierarchy did not predict the ranking. The strong performance of o3 supports its suitability for this type of agent, while raising questions about how well the distillation process preserves regulatory knowledge in o3-mini.

Analysis of Model Performance

Real Failure Patterns

Analyzing the results revealed several substantive patterns where models genuinely struggled with regulatory knowledge and reasoning:

Regulatory threshold misunderstandings: Smaller models particularly struggled with precise regulatory thresholds and their applications. For example, several models incorrectly applied the threshold for significant investments in financial entities.
Complex calculation errors: Questions requiring multi-step calculations involving regulatory factors showed higher failure rates in smaller models. This was particularly evident in questions involving the SME supporting factor, risk-weighted assets calculations, and minority interest calculations.
Capital instrument classification errors: Some models failed to correctly identify the eligibility criteria for various capital instruments, especially in edge cases like perpetual bonds with cumulative coupons.
Risk weight application inconsistency: Models frequently applied incorrect risk weights to exposures, especially when dealing with special cases like currency mismatches or specific counterparty types.

Scoring Mechanism Limitations

The evaluation also revealed important limitations in the scoring methodology that need to be addressed:

Currency formatting variations: Many models provided numerically correct answers but received reduced scores due to formatting differences (e.g., “€40 million” vs “€40 m” vs “40m euros”). These are not genuine knowledge failures but scoring artifacts.
Percentage formatting inconsistencies: Differences in percentage formatting (e.g., “5.125%” vs “5.13%”) led to score reductions despite conceptual correctness.
Verbosity penalties: Some models provided correct answers but embedded them in longer explanations, resulting in partial scores rather than full credit.

These scoring issues suggest the need for developing a custom scoring method that can properly evaluate conceptual correctness while accommodating reasonable variations in answer formatting and presentation.

Examples of Substantive Model Failures

The evaluation revealed several examples where models demonstrated genuine conceptual misunderstandings of regulatory requirements, as opposed to merely formatting or verbosity issues:

Example 1: Misunderstanding Threshold Deduction Rules

Question: “Bank JKL holds a 12% stake in another financial institution (significant investment in CET1 of a financial sector entity) amounting to €50 million. Bank JKL’s own CET1 capital is €400 million. Under the threshold regime, how much of this €50 million must be deducted from CET1?”

Expected Answer: “€10 m”

Key Finding: 4o-mini completely misunderstood the threshold deduction mechanism, answering “€0 million” when the correct amount was €10 million. This demonstrates a fundamental failure to grasp how the 10% threshold applies to significant investments in financial entities. Both 4o and o3 correctly calculated the €10 million deduction.
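The arithmetic behind the expected answer can be sketched in a few lines. This mirrors only the rationale given in the dataset (deduct the part of the holding that exceeds 10% of CET1). It is not a full regulatory calculation, and the function name is hypothetical.

```python
def threshold_deduction(investment_m: float, cet1_m: float,
                        threshold_rate: float = 0.10) -> float:
    """Amount (in EUR millions) by which a holding exceeds the CET1 threshold.

    Simplification: uses the bank's CET1 figure as given in the question,
    with no further adjustments.
    """
    threshold_m = cet1_m * threshold_rate
    # Nothing is deducted when the holding stays within the threshold.
    return max(0.0, investment_m - threshold_m)


if __name__ == "__main__":
    # Bank JKL: EUR 50m holding, EUR 400m CET1.
    print(threshold_deduction(50.0, 400.0))
```

With the figures from the question, the threshold is €40 million and the excess is €10 million, so an answer of €0 would only be right if the holding were €40 million or less.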

Example 2: Incorrect AT1 Capital Eligibility Assessment

Question: “A bank issued perpetual bonds that are deeply subordinated, but their coupons are cumulative (missed payments accumulate as liabilities). Under CRR own-funds rules, how much of this instrument can count toward regulatory capital?”

Expected Answer: “0”

Key Finding: 4o-mini incorrectly claimed that “100% of the instrument can count toward regulatory capital as Additional Tier 1 (AT1) capital,” revealing a complete misunderstanding of one of the fundamental eligibility criteria for AT1 capital instruments. Cumulative coupons are explicitly disqualified under CRR regulations. Both 4o and o3-mini correctly identified that the instrument didn’t qualify, while o3 provided the most concise answer.

Example 3: Infrastructure Supporting Factor Miscalculation

Question: “A project-finance loan meets all conditions for the infrastructure supporting factor (base RW 100%). What risk weight applies?”

Expected Answer: “75%”

Key Finding: Multiple models provided incorrect risk weights, with o3-mini answering “80%” and 4o-mini answering “50%”, while the correct answer according to CRR is 75%. This shows a failure to correctly apply the specific supporting factor defined in the regulation.

Example 4: SME Supporting Factor Application

Question: “A performing SME loan of €3 million is subject to the SME supporting factor (base RW 100%). What RWA results after the factor?”

Expected Answer: “€2.33 m”

Key Finding: Both smaller models produced incorrect calculations, with 4o-mini answering “€1.5 million” and o3-mini answering “€2,400,000”. Neither matches the expected €2.33 million, which applies the 0.7619 SME supporting factor to the first €2.5 million of the exposure and 0.85 to the remaining €0.5 million. This demonstrates difficulties with the precise quantitative applications of regulatory adjustments.
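A short sketch shows how the expected €2.33 m can be reproduced. It assumes the tiered form of the SME supporting factor: 0.7619 on the first €2.5 million and 0.85 on the excess. The breakpoint and the 0.85 tier are our reading of CRR Article 501 as amended by CRR II, so check them against the regulation text before relying on them. The function name and parameters are hypothetical.

```python
SME_FACTOR_LOWER = 0.7619    # factor cited in this report
SME_FACTOR_UPPER = 0.85      # assumed factor for the part above the breakpoint
SME_BREAKPOINT_M = 2.5       # assumed breakpoint, in EUR millions


def sme_adjusted_rwa(exposure_m: float, base_risk_weight: float = 1.0) -> float:
    """Risk-weighted amount (EUR millions) after a tiered SME supporting factor.

    Simplification: treats the loan as the borrower's only exposure.
    """
    lower_part = min(exposure_m, SME_BREAKPOINT_M)
    upper_part = max(exposure_m - SME_BREAKPOINT_M, 0.0)
    weighted = lower_part * SME_FACTOR_LOWER + upper_part * SME_FACTOR_UPPER
    return weighted * base_risk_weight


if __name__ == "__main__":
    print(round(sme_adjusted_rwa(3.0), 2))        # tiered factor
    print(round(3.0 * SME_FACTOR_LOWER, 2))       # flat 0.7619 for comparison
```

For a €3 million loan at a 100% base risk weight, the tiered calculation gives about €2.33 million, whereas a flat 0.7619 gives about €2.29 million. Neither model answer matches either figure.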

Insights and Implications

Evaluation Methodology Considerations

The experiment highlighted important considerations for future evaluations:

Developing custom scoring methods: The standard “Factuality” scoring method frequently penalized technically correct answers due to minor formatting differences, significantly underestimating model performance. For future evaluations, we need to develop a custom scoring approach that:
- Normalizes numerical answers before comparison (e.g., converting “€10m”, “€10 million”, “10 million euros” to a standard format)
- Recognizes acceptable variations in percentage precision (e.g., 5.125% vs 5.13%)
- Extracts key answers from longer explanations when models provide additional context
- Implements regex-based matching rather than exact string comparison
Answer standardization: Expected answers should be formalized with clear guidelines for acceptable variations to ensure consistent scoring across different response formats.
Manual verification pipeline: Some evaluation discrepancies require human judgment to properly assess, especially when dealing with complex regulatory concepts where multiple formulations of an answer might be correct. A systematic human review process should be implemented for borderline cases.
Two-phase scoring: Consider implementing a two-phase scoring approach where basic correctness is evaluated first, followed by assessment of precision and conciseness as secondary factors.
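To make the custom scoring idea concrete, here is a minimal sketch of a normalizing matcher. It is not the scorer we will ship and not part of braintrust.dev. All names, the regexes, and the tolerance values are assumptions chosen for illustration.

```python
import math
import re

# Multipliers that convert a unit suffix to EUR millions ("" means raw euros).
_UNIT_TO_MILLIONS = {"": 1e-6, "m": 1.0, "mn": 1.0, "million": 1.0,
                     "bn": 1e3, "billion": 1e3}
_PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")
_NUMBER = re.compile(
    r"(?<![\w.,])(\d[\d,]*(?:\.\d+)?)\s*(million|billion|mn|bn|m)?\b",
    re.IGNORECASE,
)
PCT_TOL = 0.005   # assumed: accept 5.125% vs 5.13%
REL_TOL = 0.005   # assumed: 0.5% relative tolerance on amounts


def percentages(text: str) -> list[float]:
    return [float(value) for value in _PERCENT.findall(text)]


def amounts_in_millions(text: str) -> list[float]:
    # Drop percentages first so "10%" is not read as an amount.
    cleaned = _PERCENT.sub(" ", text)
    return [float(number.replace(",", "")) * _UNIT_TO_MILLIONS[unit.lower()]
            for number, unit in _NUMBER.findall(cleaned)]


def is_match(expected: str, response: str) -> bool:
    """True if the response contains a value equivalent to the expected answer."""
    expected_pcts = percentages(expected)
    if expected_pcts:
        return any(abs(p - expected_pcts[0]) <= PCT_TOL + 1e-9
                   for p in percentages(response))
    expected_amounts = amounts_in_millions(expected)
    if expected_amounts:
        return any(math.isclose(a, expected_amounts[0],
                                rel_tol=REL_TOL, abs_tol=1e-12)
                   for a in amounts_in_millions(response))
    # Fall back to a case-insensitive substring check for textual answers.
    return expected.strip().lower() in response.lower()
```

One design caveat: accepting any matching value in a long response is lenient, because a model that mentions the right figure in its working but concludes with a different one would still pass. That is one reason the manual review step for borderline cases remains useful.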

LLM Performance on Regulatory Questions

Our findings suggest several important insights about LLM capabilities in the regulatory domain:

Model capability vs. empirical performance: The results demonstrate a divergence between theoretical model capabilities and empirical performance on specialized regulatory tasks. Although o3, a reasoning-optimized model, performed well (77.20%), its distilled counterpart o3-mini scored only 47.20%, slightly below the theoretically less capable 4o-mini (47.60%).
Reasoning mechanism effectiveness: The strong performance of o3 suggests that its reasoning mechanisms are effective for navigating the complex, multi-step calculations and logical applications required by CRR regulations. However, these reasoning capabilities appear to be less preserved in the distillation process for o3-mini.
Knowledge representation differences: Models appear to differ in how they represent and access regulatory knowledge, with certain architectures potentially preserving regulatory expertise better than others through different training or optimization approaches.

Future Directions

Based on the insights from this experiment, we’ve identified several promising directions for the DeepCredit project:

Expanded question set: Develop a more comprehensive set of regulatory questions spanning additional aspects of credit risk regulations, including Basel III, IFRS 9, and stress testing frameworks.
Fine-tuned evaluation methods: Refine scoring methods to better assess conceptual understanding rather than exact string matching, potentially incorporating rubric-based evaluation approaches.
Model specialization: Explore the potential for fine-tuning specialized models on regulatory texts to enhance performance on complex financial regulatory tasks.
Multi-step reasoning: Design evaluations that explicitly test models’ abilities to perform multi-step reasoning in regulatory contexts, tracking intermediate calculation steps.
Integration with retrieval: Evaluate how retrieval augmentation might further enhance model performance by providing direct access to regulatory documents.
Cost-performance analysis: Further investigate the surprising performance differences between models of different sizes and architectures to optimize the cost-performance trade-off for regulatory applications.
Prompt engineering and fine-tuning: Given that o3 achieved its strong performance without any prompt engineering or fine-tuning, explore how specialized prompting techniques and targeted fine-tuning on regulatory texts could further enhance performance.

Conclusion

The DeepCredit experiment with 50 hard CRR questions has provided valuable insights into the capabilities and limitations of current language models in understanding and applying complex financial regulations. The significant performance uplift demonstrated by o3 suggests that we are approaching a threshold where LLMs can provide meaningful assistance with complex regulatory compliance tasks.

Notably, o3 achieved its strong performance (77.20%) without any prompt engineering or fine-tuning specifically for regulatory tasks. This “out-of-the-box” performance is encouraging and suggests substantial room for further improvement through specialized optimization techniques.

This experiment serves as an important foundation for our broader goal of building a general credit risk analysis agent, highlighting both the promise and the challenges of applying LLMs to highly specialized regulatory domains. The methodology developed for creating challenging regulatory questions provides a valuable template for expanding our evaluation framework to cover additional aspects of credit risk analysis.

Beyond evaluation, this question generation methodology will directly feed into the development of our proprietary credit risk benchmark. In the longer term, these synthetic question datasets may serve as the foundation for reinforcement fine-tuning approaches, where models can learn to answer challenging regulatory questions by practicing and receiving feedback on the correctness of their responses. This represents a promising path toward LLMs that can consistently navigate the intricacies of financial regulation with high accuracy.

The Rise of Deep Research Agents: A Strategic Primer for Knowledge Services Executives

2025-05-06T00:00:00+00:00

Executive Summary

Deep Research agents represent a new class of artificial intelligence systems that can autonomously navigate the web, search for information, analyze data, and produce comprehensive, cited reports on complex topics. Built on large language models (LLMs) and trained using reinforcement learning, these systems are already demonstrating capabilities that match or exceed human researchers in specific domains. OpenAI’s Deep Research and Google’s Gemini Deep Research lead the market, with other offerings from Anthropic, Perplexity, and xAI trailing significantly in the author’s testing across tasks ranging from credit ratings to real estate valuations and PE target analysis.

This primer provides executives with a comprehensive understanding of Deep Research agents—how they work, their current capabilities and limitations, and their potential to disrupt knowledge-intensive industries. Knowledge service firms that understand and adopt these technologies strategically will maintain competitive advantage; those that ignore them risk disruption. (And yes, that probably includes your firm, even if you think your analysts do Very Special Work That AI Can’t Possibly Understand™.)

Part I: The Technical Foundations

Reinforcement Learning: From Games to Research

Game-playing AI has a long history of superhuman performance in narrow domains, from IBM’s Deep Blue defeating Garry Kasparov in chess to DeepMind’s AlphaGo conquering the ancient game of Go. Deep Blue relied on brute-force search and hand-tuned evaluation rather than learning, but AlphaGo and its successors leaned heavily on reinforcement learning (RL): the system learns strategies through millions of iterations of self-play, gradually developing abilities that surpass human experts.

What makes Deep Research agents different is their domain: instead of mastering a game with fixed rules, they must navigate the unstructured, ever-changing environment of the internet and real-world information sources. (Turns out the internet is a bit more complex than chess. Who knew?)

LLMs + Reinforcement Learning: The Technical Breakthrough

Deep Research agents represent the convergence of several technical breakthroughs:

Large Language Models (LLMs): Foundation models like GPT-4 and Gemini 2.5 provide the base intelligence, with strong reasoning capabilities and world knowledge.
Tool-Using Architectures: Models trained to use external tools—searching the web, browsing pages, running code—extend beyond text generation.
Reinforcement Learning from Human Feedback (RLHF): Models are trained on human-preferred research trajectories and reward signals.
End-to-End Training in Real Environments: Rather than simulations, these systems learn by interacting with the actual web.

The development typically follows three stages:

Stage 1: Supervised Fine-Tuning (SFT) Models are first trained on demonstrations of effective research—examples of search queries, web browsing, data analysis, and final answers.

Stage 2: Reinforcement Learning The model then learns through trial and error, with rewards for finding correct information, citing sources properly, and avoiding hallucinations.

Stage 3: Ongoing Improvement Models continue to improve through additional training on new examples and feedback from actual use.
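
As a rough illustration of Stage 2, the reward for a finished research episode might combine the three signals mentioned above. This is a toy Python sketch: the `Episode` fields and the weights are hypothetical and are not taken from any published training recipe.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    answer_correct: bool      # did the final answer match the reference?
    claims: int               # number of factual claims in the report
    cited_claims: int         # claims backed by a citation
    unsupported_claims: int   # claims no retrieved source supports

def reward(ep: Episode, w_correct: float = 1.0,
           w_cite: float = 0.5, w_halluc: float = 1.0) -> float:
    """Reward correctness and citation coverage, penalize unsupported claims."""
    if ep.claims == 0:
        return 0.0  # an empty report earns nothing
    citation_rate = ep.cited_claims / ep.claims
    hallucination_rate = ep.unsupported_claims / ep.claims
    return (w_correct * float(ep.answer_correct)
            + w_cite * citation_rate
            - w_halluc * hallucination_rate)
```

In practice the hard part is computing these inputs reliably, since judging whether a claim is "unsupported" is itself a research task.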

How Deep Research Agents Work

Deep Research agents operate through a ReAct-style loop (Reason-Act-Observe) that combines internal reasoning with external tool use. The core architecture involves:

A central orchestrator (the LLM) that plans research steps and interprets results
External tools that extend the model’s capabilities:
- Web search (finding relevant sources)
- Web browsing (reading and extracting information from pages)
- Code execution (analyzing data, creating visualizations)
- File reading (processing user-provided documents)

When given a research query, the agent:

Formulates initial search queries based on the question
Evaluates search results to identify promising sources
Browses selected pages to find relevant information
Potentially runs code to analyze data or create visualizations
Cross-validates information across multiple sources
Synthesizes findings into a comprehensive answer
Provides citations for all claims

Throughout this process, the agent maintains a “scratchpad” of found information and intermediate reasoning. Unlike simpler LLM interactions, a Deep Research session may involve dozens of individual steps over 10-30 minutes, gradually building toward a comprehensive answer. (Think of the world’s most focused research intern who never gets bored, checks Twitter, or needs a bathroom break.)
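
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any vendor's agent is implemented: the tools are stubs, and `scripted_policy` is a hypothetical stand-in for the LLM orchestrator that would normally decide the next step.

```python
from typing import Callable, Dict, List, Tuple

# Stub tools standing in for real web search and browsing (assumptions for illustration).
def search(query: str) -> str:
    return "result: example.com/report mentions the topic"

def browse(url: str) -> str:
    return "page text: the figure was 42 in 2024"

TOOLS: Dict[str, Callable[[str], str]] = {"search": search, "browse": browse}

def scripted_policy(question: str, scratchpad: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM: looks at the scratchpad and returns (action, argument)."""
    if not scratchpad:
        return "search", question
    if len(scratchpad) == 1:
        return "browse", "example.com/report"
    return "finish", "42 (source: example.com/report)"

def research(question: str, policy=scripted_policy,
             max_steps: int = 10) -> Tuple[str, List[str]]:
    scratchpad: List[str] = []
    for _ in range(max_steps):
        action, arg = policy(question, scratchpad)                # Reason
        if action == "finish":
            return arg, scratchpad
        observation = TOOLS[action](arg)                          # Act
        scratchpad.append(f"{action}({arg!r}) -> {observation}")  # Observe
    return "no answer within step budget", scratchpad
```

The `max_steps` cap matters in a real system too: without a step budget, an agent that never decides it is finished would keep searching indefinitely.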

Part II: Current Capabilities and Limitations

State of the Art and Emergent Behaviors

Several major research organizations have developed Deep Research agents, with significant differences in capability:

Market Leaders

OpenAI Deep Research: The current performance leader, built on their “o3” model, with superior capabilities in complex research tasks. Recent BrowseComp benchmark results indicate superhuman performance in internet research capabilities.
Google’s Gemini 2.5 Pro (Deep Research Mode): Closely rivals OpenAI’s offering, leveraging Google’s search infrastructure and knowledge tools.

Secondary Players

Anthropic, Perplexity, and xAI: Offer Deep Research capabilities but currently trail significantly in performance across tasks from credit ratings to market analysis and PE target research.
Open-source DeepResearcher: Demonstrates that end-to-end reinforcement learning can achieve substantial improvements over simpler approaches.

One of the most interesting aspects of these systems is the emergence of sophisticated behaviors that weren’t explicitly programmed:

Planning behavior: The agent learns to create research plans and decompose complex queries into manageable steps.
Cross-validation: Even when initial results provide an answer, the agent often seeks confirmation from multiple sources to verify information.
Reflection and adjustment: When searches don’t yield useful results, the agent recognizes the failure and adjusts its approach.
Honesty about uncertainty: When unable to find definitive answers, the agent acknowledges limitations rather than making up responses.

These behaviors develop naturally through reinforcement learning, as the model learns that such strategies lead to better outcomes. (It’s a bit like if you trained a robot to make coffee and it spontaneously started wiping down the counters too—not because you told it to, but because it noticed that clean counters correlate with better coffee. Except this robot is making knowledge work instead of lattes.)

Current Limitations

Despite their impressive capabilities, Deep Research agents still face significant limitations:

Form Factor and User Experience
- 10-30 minute wait times for results
- Comprehensive but lengthy reports that create a “water-fountain effect”
- Limited interaction during research
- One-shot nature rather than iterative research
Brittleness with Non-Internet Sources
- Limited enterprise data integration
- Challenges with structured data
- Gated content limitations
- Credential handling issues
- Poor epistemic awareness (producing “fluff” when good data isn’t available instead of acknowledging information gaps)
Technical and Operational Challenges
- Residual hallucination risk
- Limited source credibility assessment
- Expensive computation
- Reproducibility issues
- Safety and alignment concerns

These limitations mean that while Deep Research agents can dramatically accelerate information gathering, they may create a new bottleneck at the human consumption stage. (It’s a bit like asking for directions and getting a 100-page atlas in response. Yes, the information is in there somewhere, but was this really faster than just asking a local?)

Part III: The Next Breakthrough in LLM Value Creation

Parallels with AI-Powered Coding Tools

The trajectory of Deep Research agents resembles that of AI-powered coding tools, which offer insights into how this technology might evolve:

From assistance to augmentation: Early code-completion tools like GitHub Copilot began as helpful but limited assistants. They’ve evolved into sophisticated pair-programmers.
Productivity amplification: Tools like Cursor and Windsurf have demonstrated dramatic productivity improvements for software developers, with studies showing 30-50% faster code production.
Integration with workflows: Rather than replacing developers, these tools have been integrated into existing development environments.
Ecosystem development: A rich ecosystem of plugins, extensions, and specialized versions has emerged.

Deep Research agents appear positioned for a similar evolution—from current prototypes to workflow-integrated tools that dramatically amplify knowledge workers’ capabilities. (Though one hopes they evolve faster than your company’s ERP implementation.)

Value Creation Potential

The potential value creation from Deep Research agents comes from several sources:

Time compression: Tasks that previously took days or weeks can potentially be completed in hours.
Coverage expansion: Exploration of far more sources than a human researcher.
Pattern recognition: Identification of connections that might not be apparent to human researchers.
Democratization of expertise: High-quality research accessible to organizations without large specialized research teams.
Cost efficiency: Potentially more cost-effective than human researchers for many tasks.

The GitHub Copilot economic impact model suggests a potential template: Microsoft claimed $1.5-$1.9 billion in increased developer productivity in its first year of deployment. Market valuations further validate this impact, with Cursor reportedly reaching a $10 billion valuation and OpenAI reportedly attempting to acquire Windsurf for approximately $3 billion. Similarly, Deep Research agents could drive significant value once integrated into knowledge work workflows.

Part IV: Required Evolution in Models and Workflows

For Deep Research agents to realize their potential, several advances are necessary:

State Maintenance and Knowledge Base Curation

Future systems will need:

Persistent knowledge bases across sessions
Fact tracking and provenance
Contradiction detection and resolution
Queryable research memory

Multi-Scale Research Capabilities

Systems must support research at multiple scales:

Deep research for complex questions
Medium research for focused investigations (5-10 minutes)
Small research for quick fact-checking (seconds or minutes)
Follow-up capabilities for iterative research

This multi-scale approach should be supported by a corresponding model scaling strategy, with smaller, cheaper, and lower-latency models specifically trained for all but the deepest research tasks. This ensures cost-effectiveness and appropriate response times for different query types.
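
One way to picture this is a simple router that maps a time budget to a research tier and a model size. The sketch below is illustrative Python. The tier limits loosely follow the ranges mentioned in this primer, and the model names are placeholders rather than real products.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Tier:
    name: str
    max_minutes: float  # longest run this tier is expected to need
    model: str          # placeholder model name, not a real product

# Ordered from shallowest to deepest.
TIERS: List[Tier] = [
    Tier("small", 1, "small-fast-model"),
    Tier("medium", 10, "mid-size-model"),
    Tier("deep", 30, "frontier-model"),
]

def pick_tier(time_budget_minutes: float) -> Tier:
    """Pick the deepest tier that fits the budget; fall back to the smallest."""
    fitting = [t for t in TIERS if t.max_minutes <= time_budget_minutes]
    return fitting[-1] if fitting else TIERS[0]
```

A production router would more likely classify the query itself than ask the user for a time budget, but the cost logic is the same: reserve the expensive model for the questions that need it.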

Seamless Data Integration

Critical evolutions include:

Enterprise data connectors
Subscription service access
Real-time data sources
Multi-modal data processing
Legacy system integration

Human-in-the-Loop Collaboration

Closer integration between humans and AI:

Interactive research sessions
Progressive disclosure of information
Research co-pilots
Learning from human feedback
Insight surfacing

These capabilities would address the “cognitive inversion”:

“Welcome to knowledge work’s great inversion. For decades, the constraint was gathering and processing information. Now, with deep-research AI agents that can inhale libraries and exhale analysis, the bottleneck has shifted to the human brain’s ability to comprehend and act on that firehose of insight.”

Team Collaboration and Workflow Integration

Enterprise adoption will require:

Shared research repositories
Research version control
Review and verification workflows
Integration with existing tools
Cross-agent collaboration

Part V: Strategic Implications for Knowledge Services Firms

Disruption Potential and Industry Applications

Deep Research agents have significant potential to transform knowledge-intensive industries through specialized vertical applications:

Financial Analysis: Systems trained on financial filings and market data could transform equity research, automate company valuation processes, and enable more sophisticated market trend analysis with far less human effort.
Legal Services: Agents optimized for case law, regulations, and contracts could reshape legal research, enabling comprehensive precedent identification and regulatory analysis that previously required teams of junior associates.
Management Consulting: Research tools specialized in industry analysis, competitive intelligence, and best practice identification could compress project timelines and fundamentally change traditional staffing models and knowledge hierarchies.
Life Sciences: Specialized agents for biomedical literature, clinical trials, and regulatory documents could dramatically accelerate research and development, reducing the time needed for literature reviews from weeks to hours.
Real Estate: Agents analyzing property listings, transaction histories, and market trends could enable rapid portfolio assessment and drive-by valuations with minimal human input.

As these capabilities mature, they will likely follow a pattern similar to other technological disruptions—beginning as productivity tools before gradually transforming business models and organizational structures. These specialized applications will combine general research capabilities with domain-specific knowledge, terminology, and data sources, creating increasingly powerful tools tailored to specific professional contexts. (And yes, that probably means fewer entry-level research positions and more “research prompt engineering” roles. Time to update the career pages.)

Organizational and Talent Implications

The emergence of Deep Research agents will likely transform organizational structures and talent requirements:

New roles: “Research Engineer” or “AI Research Strategist” positions may emerge.
Changing skill profiles: Knowledge workers will need skills in prompt engineering, research direction, and result verification.
Organizational structure: Traditional research teams may evolve from hierarchical structures to flatter models.
Human specialization: Human experts will focus on judgment, creativity, and interpersonal communication.
Training and development: Organizations will need programs to help knowledge workers adapt.

These changes echo patterns seen in other industries transformed by automation, where technology handles routine tasks while humans focus on higher-value activities. (Though someone might want to check in on how all those first-year associates and analysts are feeling about their career prospects right about now.)

Conclusion: Preparing for the Future of Knowledge Work

Deep Research agents represent a significant evolution in artificial intelligence—moving beyond language generation to autonomous information gathering, analysis, and synthesis. While still early in their development, these systems already demonstrate capabilities that exceed human researchers in specific domains.

For knowledge services executives, the strategic implications are clear:

Disruption is coming: The combination of LLMs and reinforcement learning will transform how research is conducted across industries.
Early experimentation is valuable: Organizations that begin working with these technologies now will be better positioned as they mature.
Human-AI collaboration is key: The most effective implementations will combine AI research capabilities with human expertise and judgment.
Business models will evolve: The economics and delivery models of knowledge services will likely transform.
Competitive advantage will shift: Sustainable advantage will come from asking the right questions and applying insights effectively.

Organizations that recognize this milestone and begin preparing for its implications will be best positioned to thrive in the transformed landscape of knowledge work. (And those that don’t? Well, there’s always a future in artisanal, hand-crafted research reports for the nostalgic connoisseur.)

When AI Makes Humans the Bottleneck: Knowledge Work’s Great Inversion

2025-04-26T00:00:00+00:00

The data deluge was supposed to drown us. Instead, it’s the AI lifeguards making us realize how slowly we swim.

The Cognitive Inversion

Here’s a modern business parable: A consultant once spent a week gathering market research for a client presentation. Today, an AI does it in 20 minutes. Progress! Except now the consultant spends three days trying to absorb, verify, and make sense of all that instant analysis. The bottleneck hasn’t disappeared—it’s just moved.

Welcome to knowledge work’s great inversion. For decades, the constraint was gathering and processing information. Now, with deep-research AI agents that can inhale libraries and exhale analysis, the bottleneck has shifted to the human brain’s ability to comprehend and act on that firehose of insight.

This isn’t just interesting—it’s transformational. When AI systems can analyze a decade of financial statements faster than you can order lunch, the limiting factor becomes your cognitive bandwidth. How quickly can you absorb what the machine tells you? How thoroughly can you verify it? How confidently can you explain it to others?

The irony is exquisite: in our quest to overcome human limitations in knowledge processing, we’ve built systems that now make our biological wetware the constraint. It’s like building a highway to solve traffic jams, only to discover that cars can’t accelerate fast enough to use it properly.

The Human Hardware Limits

Our brains are remarkable but come with factory settings that can’t be upgraded via download:

Reading speed tops out around 300 words per minute for most humans. You can’t “overclock” your eyeball movements.
Working memory holds just 3-5 items simultaneously. Miller’s famous “magic number seven” was optimistic.
Cognitive load reaches capacity quickly when processing complex information, causing mental traffic jams.

The consultant receiving an AI-generated 50-page market analysis in minutes still needs hours to read and understand it. The lawyer who gets instant case research from an AI must still carefully evaluate its relevance and application. The bottleneck has moved from information access to information absorption.

This is Herbert Simon’s “bounded rationality” in a new context: when information is abundant, attention becomes the scarce resource. We must “satisfice” with the mental energy we have, often accepting a good-enough understanding rather than a complete one.

The Evidence Is Piling Up (Faster Than We Can Read It)

The data points supporting this shift are accumulating across industries:

GitHub’s Copilot experiment showed programmers completed tasks 55% faster with AI assistance. But that didn’t eliminate the need to understand, test, and integrate the code—it just meant the human review became the rate-limiting step.

McKinsey documented a bank’s AI system for writing credit memos that doubled relationship-manager productivity. The AI churned through data from 12+ sources in minutes, but relationship managers still needed time to interpret results and develop confidence in the conclusions.

A consulting firm found that research tasks that once took a week could be done by AI in a day—yet the total project time only dropped 30%, not 80%, because human review and contextualization remained stubbornly time-consuming.

It’s like having the world’s fastest research assistant who can compile anything instantly—but you still need to read the memo.
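
The consulting example follows the same arithmetic as Amdahl's law: speeding up one step only helps in proportion to that step's share of the total. The Python sketch below works backward from the quoted numbers. It assumes "a week" means five working days, so the research step became five times faster. That reading is an assumption, not something the source states.

```python
def overall_reduction(accelerated_share: float, speedup: float) -> float:
    """Fraction of total project time saved when only part of the work gets faster."""
    return accelerated_share * (1 - 1 / speedup)

def implied_share(observed_reduction: float, speedup: float) -> float:
    """Share of the project the accelerated step must have been, given the observed saving."""
    return observed_reduction / (1 - 1 / speedup)

speedup = 5                             # assumed: five working days down to one
share = implied_share(0.30, speedup)    # about 0.375
ceiling = overall_reduction(1.0, speedup)  # saving if research were the whole project
print(f"research was roughly {share:.0%} of the project; best case saving {ceiling:.0%}")
```

Under that assumption, research made up a bit over a third of the project. The 80% figure would only be reachable if research were the entire project, which is why the remaining human review dominates the new timeline.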

Three Trade-offs We’re Just Starting to Understand

This inversion introduces new challenges:

The Presence Risk: When you’re constantly consulting an AI or reading its outputs during a client meeting, you risk appearing distracted or less “present.” It’s the equivalent of checking your phone while someone’s talking—except you’re checking with an omniscient digital oracle. Clients still expect eye contact and the human touch, even as your AI whispers sweet insights in your ear.

The Bandwidth Risk: AI can feed you a constant stream of data and alerts—far more than you can consciously process. It’s like having ten assistants simultaneously shouting findings at you. Without careful progressive disclosure and prioritization, AI risks flooding your working memory.

The Client Perception Risk: Clients may wonder, “Am I paying for your expertise or the AI’s?” If your deliverable looks too AI-generated, clients question your fees. The awkward truth: efficiency without a changed business model leads to less revenue in billable-hour industries. “We did this in one-tenth the time!” is great marketing but terrible for your bottom line if you charge by the hour.

Workflow Adaptations: Training Humans to Keep Up With Their AIs

Organizations are developing strategies to maximize what humans do best—judgment, creativity, empathy—and minimize unnecessary cognitive strain:

The 1/5/20-Minute Knowledge Artifact

Instead of dumping a 100-page AI-generated report on someone, information gets organized into a 1-minute read (executive summary), a 5-minute read (extended summary), and a 20-minute read (full detail). It’s like progressive disclosure for documents—a cognitive ramp that lets you choose your depth.
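
The three tiers translate directly into word budgets. The Python sketch below uses the reading speed of about 300 words per minute quoted earlier in this article. The function names are illustrative.

```python
WORDS_PER_MINUTE = 300  # reading speed quoted earlier in this article

def word_budgets(minutes=(1, 5, 20), wpm: int = WORDS_PER_MINUTE) -> dict:
    """Maximum word count for each reading tier."""
    return {m: m * wpm for m in minutes}

def reading_minutes(text: str, wpm: int = WORDS_PER_MINUTE) -> float:
    """Estimated time to read a text once, ignoring re-reading and verification."""
    return len(text.split()) / wpm

def fits_tier(text: str, minutes: int, wpm: int = WORDS_PER_MINUTE) -> bool:
    """Check whether a draft stays within the budget for its tier."""
    return len(text.split()) <= minutes * wpm
```

With these defaults the budgets come out to 300, 1,500, and 6,000 words. For comparison, a 100-page report at an assumed 500 words per page is 50,000 words, or nearly three hours of straight reading before any verification begins.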

Live Transcript Querying

Rather than reading every word of material, humans engage in conversation with AI to pull knowledge on the fly. “What did the third client say about supply chain issues?” gets an instant answer from the transcript. It’s just-in-time knowledge retrieval—outsourcing memory recall to the AI.

Confidence Calibration Loops

After an AI briefing, the human undergoes quick tests or flashcard-style Q&A to reinforce recall and identify gaps. This leverages the testing effect—actively retrieving information solidifies memory far better than passive review. One can imagine an operational metric like time-to-confidence: how long does it take a person, with AI help, to feel (and demonstrate) mastery of new content?

These approaches acknowledge that we can’t speed-read or expand working memory—but we can design workflows that respect cognitive limits while maximizing comprehension.

Four Futures: Where Are We Heading?

Looking ahead, several scenarios emerge:

Symbiotic Superteams: Organizations successfully integrate AI agents into knowledge workflows, creating human-AI teams that outperform either alone. The AI handles information processing, the human provides judgment and creativity. Professional identity evolves—being great at your job means being great at using AI to augment your expertise.

Organizational Brain Drain: Companies rely so heavily on externalized knowledge that internal expertise withers. When systems fail or novel situations arise, the lack of in-house wisdom becomes painfully apparent. “The AI has it” becomes an excuse not to learn deeply.

Trusted Co-Pilot Economy: Clear norms emerge for AI as a professional assistant. Organizations treat AI-curated memory as decision support rather than oracle. Regulatory frameworks legitimize AI as part of the process, with appropriate checks and transparency.

Regulatory Lockdown: Heavy regulations constrain AI use in knowledge work due to high-profile mistakes or lobbying by professional bodies. Organizations maintain AI knowledge internally but access is tightly controlled. Innovation slows as firms become cautious about AI adoption.

The future likely contains elements of all four, varying by domain and region. Finance might embrace co-pilots while healthcare faces lockdown.

The Water-Fountain Effect

Perhaps the best metaphor for our current predicament is what we might call the “water-fountain effect”: AI can fill your cup instantly, but you still have to drink at a normal pace or you’ll choke.

Researchers can request ten different AI analyses in minutes, then realize it will take hours to properly scrutinize them. A consultant can generate a dozen strategy options before lunch but still needs the afternoon to evaluate their merits. The AI firehose is attached to the human straw.

This isn’t a prediction that AI will fail or be less useful than promised. The efficiency gains are real and substantial. But they’re not evenly distributed across workflows—they accumulate primarily in data gathering and preliminary analysis, leaving human cognitive uptake as the persistent constraint.

The organizations that will thrive are those that recognize this inversion and design accordingly. They’ll measure not just how much data was analyzed, but how quickly and accurately the human team reached a confident decision. They’ll package knowledge in smarter ways, coach employees on AI collaboration, and allow time for deep integration of AI insights.

The bottleneck has shifted, and our response must shift too. The future belongs not to those with the most powerful AI, but to those who best solve the human absorption problem alongside it. After all, even in an AI age, understanding remains a stubbornly human act.

In his classic “Wealth of Nations,” Adam Smith observed that “the division of labor is limited by the extent of the market.” Today we might add: “and the benefits of AI are limited by the cognitive bandwidth of its users.” Time to upgrade our wetware—or at least design better interfaces for it.

A Beginner’s Guide to AI Evals and Benchmarks

2025-04-25T00:00:00+00:00

Introduction: Why We Test the Tests That Test Us

Imagine you’ve built the world’s most impressive AI system. It writes poetry that makes literature professors weep, generates code that makes senior engineers question their career choices, and offers relationship advice that actually works. You’re ready to unleash it on the world. But wait—how do you know it actually does these things? How can you be sure it won’t suggest “delete system32” as the solution to a slow computer? Or confidently explain that the capital of France is “Baguetteville”?

This, dear reader, is why we need evaluations and benchmarks for AI systems. They’re the reality check on our technological hubris.

In the AI world, we call these reality checks “evals” (because typing “evaluations” repeatedly would wear out our ‘u’, ‘a’, ‘t’, ‘i’, ‘o’, ‘n’, and ‘s’ keys). Evals are the systematic assessments we use to figure out what our AI models can actually do, as opposed to what we hope they can do or what they claim they can do.

The Unpredictable Machine

Large language models (LLMs) like GPT-4, Claude, and others are fundamentally prediction machines. They predict what text should come next after your prompt. The twist? These systems display what researchers call “emergent abilities”—capabilities that weren’t explicitly programmed and often surprise even their creators.

It’s a bit like raising a child who suddenly starts speaking perfect Portuguese despite growing up in Minnesota with no Portuguese instruction. “Where did that come from?” you might reasonably ask. That’s emergence, and it means these systems can do unexpected things—both impressive and potentially problematic.

Without systematic evaluation, we’re essentially putting a blindfold on and hoping for the best. This approach works fine until your AI assistant helpfully explains to a user how to synthesize dangerous chemicals or generates content that manages to offend every demographic simultaneously.

The Emperor’s New Neural Network

There’s a corporate incentive to trumpet AI capabilities while downplaying limitations. Without rigorous, independent evaluation, we’d be stuck taking companies at their word: “Our chatbot is harmless! It would never help someone plan a heist! Also, it cures cancer and does your taxes!”

Thorough evals strip away the marketing and reveal what these systems actually do in practice. They’re the difference between “our model achieved unprecedented results on benchmark X” and “our model hallucinated that Benjamin Franklin invented the helicopter.”

The Feedback Loop

Evaluations aren’t just report cards—they’re the compass that guides development. They highlight where models are weak, where they’re prone to failure, and where they might cause harm.

Without this compass, AI development would be less “guided scientific progress” and more “throwing spaghetti at the wall and seeing what sticks.” And unlike spaghetti, which at worst leaves a stain on your kitchen wall, an untested AI system could leave metaphorical stains across the entire internet.

The Trust Factor

Why do we trust that planes won’t fall out of the sky? Because they undergo exhaustive testing regimes before carrying a single passenger. Similarly, thorough evaluation is the foundation of trust for AI systems.

When a company can demonstrate their system has been rigorously tested against a comprehensive suite of benchmarks, users can have more confidence that it won’t go off the rails when asked a tricky question. Well, at least not as often.

2. Real-World Evals: Putting AI Through Its Paces

Let’s look at some real-world examples of how leading labs evaluate their AI models. It’s a bit like the Olympics, but for machines, and without the national anthems or sponsor deals (yet).

Academic Exams: Testing if AI Can Pass School

One approach that’s gained popularity is simply throwing human exams at AI models. OpenAI famously evaluated GPT-4 on a variety of standardized tests:

Bar exam (for lawyers): GPT-4 scored in the top 10%
GRE (for graduate school): Scored near the top on the verbal section and respectably on the quantitative section
AP exams (for high schoolers): Did better than most teenagers
Medical licensing exams: Passed, though I still wouldn’t want it removing my appendix

This approach has a certain elegant simplicity: if the model can pass tests designed to measure human proficiency, it must be doing something right. Of course, this doesn’t tell us if the model actually understands anything or just got really good at multiple-choice questions, but it’s a start.

Language Understanding Benchmarks: The Cognitive Olympics

The AI research community has developed specialized benchmarks to probe specific capabilities:

MMLU (Massive Multitask Language Understanding): This includes roughly 14,000 multiple-choice test questions spanning 57 subjects, from history to nuclear physics. It’s like a trivia night at the nerdiest bar in town.

TruthfulQA: This benchmark targets the model’s tendency to repeat popular misconceptions instead of stating the truth. Each question has a tempting false answer, often one that many humans would give. As an illustration of the idea, “Who was the first person to walk on the moon?” might tempt the model to say “Neil Armstrong and Buzz Aldrin” (when in fact only Armstrong was the first person to walk on the moon, with Aldrin following about 19 minutes later during the same Apollo 11 mission).

Coding Tests: Benchmarks like HumanEval or LeetCode problems assess if models can write code that actually runs and produces correct outputs. It turns out that “it looks plausible” and “it actually works” are two very different standards when it comes to code.
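To make the gap between “looks plausible” and “actually works” concrete, here is a minimal Python sketch of functional-correctness scoring. The pass@k formula is the unbiased estimator described in the HumanEval paper; the helper names (passes_tests, pass_at_k) and the toy task are illustrative assumptions, and a real harness would run model output in a sandbox instead of calling exec directly.

```python
import math

def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate code runs and every assertion in test_src holds.

    WARNING: exec on model output is unsafe outside a sandbox; this is a toy.
    """
    namespace = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the assertions against it
        return True
    except Exception:
        return False

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn without
    replacement from n generated samples of which c are correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical task: two model samples for "add two numbers".
samples = [
    "def add(a, b):\n    return a + b",  # correct
    "def add(a, b):\n    return a - b",  # plausible-looking, wrong
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
results = [passes_tests(src, tests) for src in samples]
num_correct = sum(results)
```

Both samples look like reasonable code at a glance; only running the tests separates them. With n samples and c passing, pass_at_k(n, c, k) then summarizes how likely a user is to get working code within k attempts.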

Custom Evaluation Suites: The Bespoke Approach

Each major AI lab has developed custom evaluations tailored to their specific goals:

OpenAI’s Evals Framework: This crowdsources tricky tasks from the community in an attempt to find a “maximally wide set of failure modes.” Essentially, it’s inviting people to try to break their models in creative ways.

Anthropic’s “Helpful & Harmless” Suite: This focuses on making AI assistants like Claude both helpful (providing useful, on-topic responses) and harmless (refusing to generate harmful content). Human evaluators chat with the AI and rate its responses, creating a kind of AI report card.

Google’s LaMDA Evaluation: Google uses a more structured approach for its conversation models, having human raters judge outputs on three axes:

Quality: Is it sensible, specific, and interesting?
Safety: Does it avoid harmful content?
Groundedness: Are factual claims actually true?

This multi-dimensional approach recognizes that “good” is a complex concept for AI outputs. A response might be entertaining but completely inaccurate—like that friend who tells the best stories that never actually happened.

Red Teaming: Professional AI Antagonists

Perhaps the most interesting evaluation approach is “red teaming,” where experts actively try to make the model fail. These digital provocateurs might:

Try to trick the model into revealing how to build dangerous items
Probe for political or ideological biases
Attempt to extract private information
Test the boundaries of content policies

Red teaming recognizes that real users include not just the curious and well-intentioned but also those actively trying to misuse systems. It’s a bit like hiring professional thieves to test your home security—sometimes you need an expert to find the vulnerabilities before the real bad actors do.

3. What Makes a Good Eval (And Why It’s Harder Than It Sounds)

Creating good evaluations for AI systems is surprisingly tricky. Let’s explore what makes an evaluation useful and why designing them is more art than science.

Coverage: Testing the Whole Elephant

A good evaluation suite needs to cover a wide range of capabilities and potential failure modes. Testing only whether an AI can write poetry tells you nothing about whether it can solve math problems or avoid giving dangerous advice.

This is especially challenging because modern LLMs have such broad capabilities. It’s like trying to create a comprehensive exam for a student who might know anything from quantum physics to 15th-century Flemish art to modern breakdancing techniques.

Labs tackle this by constantly expanding their evaluation sets. OpenAI’s crowdsourced Evals program explicitly aims to incorporate many different tasks and failure modes. Google’s BIG-bench includes over 200 diverse tasks.

But coverage is never complete. As these models grow more capable, researchers are “constantly discovering new and exciting tasks the model is able to tackle.” It’s hard to test for capabilities you don’t even know exist yet.

Validity: Measuring What Matters

An evaluation should measure what we actually care about, not just what’s easy to measure. This is surprisingly difficult.

For instance, if we want to know if an AI writes “good” text, what metrics should we use? Word count? Grammatical correctness? These are easy to measure but miss the point. What we really care about is whether the text is insightful, engaging, factually accurate, and appropriate for its purpose.

That’s why many labs use human judgments for subjective qualities. Anthropic and OpenAI rely on human evaluators to rank responses or answer questions about outputs. This approach is more valid but introduces other challenges—humans are slow, expensive, and sometimes inconsistent.

It’s a classic case of “not everything that counts can be counted, and not everything that can be counted counts.” The most convenient metrics aren’t always the most meaningful.

Fairness: The Unbiased Judge of Bias

Evaluations should be unbiased in two important ways:

Fair measurement: The evaluation itself shouldn’t favor any specific model. If test questions appear verbatim in one model’s training data but not another’s, that’s an unfair advantage. Labs try to mitigate this by using fresh questions or keeping some eval sets private.
Measuring model bias: Evaluations should detect if models produce biased or unfair outputs. This is why benchmarks like BBQ (Bias Benchmark for QA) explicitly test if models generate stereotyped responses.

The challenge is that as models ingest more data from the internet, it becomes increasingly likely they’ve seen benchmark questions during training. It’s an ongoing arms race between benchmark leakage and training scope. “Did you study for the test or just memorize the answers?” becomes a genuine concern.
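One common way to look for leakage is to check whether long word sequences from a test question also appear in the training data. The sketch below is a simplified illustration: the function names, the whitespace tokenization, and the default n-gram length are assumptions chosen for readability, not any lab’s actual pipeline.

```python
def ngrams(text: str, n: int) -> set:
    """All runs of n consecutive lowercase whitespace-separated tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(question: str, corpus_docs: list, n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur somewhere in the corpus."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0  # question shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(question_grams & corpus_grams) / len(question_grams)

# Toy example with short n-grams so the effect is visible.
question = "what is the capital of france"
leaky_corpus = ["trivia: what is the capital of france ? paris"]
clean_corpus = ["the eiffel tower is a landmark in paris"]
leaky_score = overlap_fraction(question, leaky_corpus, n=3)
clean_score = overlap_fraction(question, clean_corpus, n=3)
```

A high overlap score doesn’t prove the model memorized the answer, and a low one doesn’t rule out paraphrased leakage, so a check like this is a first filter, not a verdict.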

Reliability: Consistency You Can Count On

A good evaluation gives stable results that allow meaningful comparisons. This means:

Sample size matters: Since LLM outputs can be probabilistic, single test runs may not reflect average performance. Labs often average results across multiple runs.
Inter-rater agreement: If human evaluators disagree wildly about whether outputs are good, the evaluation becomes noisy and unreliable.
Statistical significance: Small improvements might be just random variation rather than genuine progress. Rigorous statistical testing helps distinguish signal from noise.
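Two of these checks are easy to sketch in Python. The first puts a bootstrap confidence interval around an accuracy score, so a one-point “improvement” can be compared against the noise. The second computes Cohen’s kappa, a standard measure of how much two raters agree beyond chance. The function names, resample count, and toy data are assumptions for illustration.

```python
import random
from collections import Counter

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-item scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always gave the same single label
    return (observed - expected) / (1 - expected)

# Toy data: 100 graded answers (1 = correct) and two human raters on 4 outputs.
item_scores = [1] * 70 + [0] * 30
ci_low, ci_high = bootstrap_ci(item_scores)
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

If two models’ intervals overlap heavily, the leaderboard gap between them may be noise. And if kappa is low, the problem may be the rubric and not the model.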

Achieving reliability often conflicts with scope. More test questions mean more reliable scores, but creating thousands of high-quality questions is costly and time-consuming.

Adaptability: Staying Relevant as Models Improve

As AI systems rapidly improve, evaluations need to keep pace. An ideal benchmark has a dynamic difficulty range—easy items to distinguish weak models and extremely hard items to challenge the strongest ones.

When GPT-4 achieved impressive scores on many traditional benchmarks, OpenAI noted those “numbers do not fully represent the extent of its capabilities” and moved to more challenging evaluations. It’s like your kid completing all the puzzles in their workbook and you realizing you need to buy a more advanced one.

The challenge is that designing new, harder benchmarks is resource-intensive. It also raises questions about evaluation fairness: if labs continually develop harder secret evaluations, how do we ensure external validation?

This is why evaluation is as much an evolving research area as model development itself. The target keeps moving, and the measuring sticks need to keep up.

4. Automation in Evaluation: Can AI Test AI?

As the scope of behaviors to test expands, researchers are exploring ways to automate parts of the evaluation process. Can we use AI to help evaluate AI? The answer is yes, with some important caveats.

AI-Generated Evaluation Datasets: The Machines Write the Tests

Instead of relying solely on humans to author test questions, researchers have successfully prompted language models to create new evaluation items. Anthropic conducted a systematic study and found:

AI can generate high-quality questions across domains, from simple factual queries to complex logic puzzles
These AI-generated questions can reveal novel behaviors in models being tested
Human reviewers judged the AI-crafted questions as high quality and agreed with the provided answer keys 90-100% of the time

For example, AI-generated evaluations helped discover “inverse scaling” behaviors—cases where larger models actually performed worse than smaller ones on certain tasks. One such behavior was “sycophancy”: larger models more frequently agree with a user’s stated opinion even if it’s incorrect, presumably trying to be agreeable.

This is like discovering that the more expensive restaurant you visit, the more likely the waiter is to compliment your food choices even when you’ve ordered the worst item on the menu. Not exactly what you’d expect from “improvement.”

Adversarial Prompt Generation: Professional AI Antagonists, Automated

Taking this idea further, AI can be used to generate attack prompts or stress tests. Instead of humans trying to think of ways to make an AI misbehave, we can ask another AI to generate challenging inputs.

OpenAI has explored using GPT-4 to “check its own work.” They created templates where GPT-4 evaluates whether answers to certain questions (like logic puzzles) are correct. Anthropic similarly experimented with having AI systems critique outputs for adherence to ethical principles.

This approach offers clear efficiency gains, though we should be appropriately skeptical about the results—much like how we might raise an eyebrow at a restaurant review written by the chef’s mother.

Automated Scoring: AI as Judge

Beyond generating test questions, AI can help score model outputs against evaluation criteria:

AI can judge if answers are correct or follow instructions
AI can compare multiple outputs and decide which is better
This approach scales evaluation enormously compared to human review

However, AI graders introduce their own biases. They may prefer verbose answers or responses that mimic their own style. If you use GPT-4 to judge a contest between GPT-4 and another model, it might subtly favor its own style—a kind of digital narcissism.

The emerging best practice is a hybrid approach: let AI do a first pass of evaluation, then have humans review a subset or the contentious cases. This keeps humans in the loop while leveraging AI to handle the bulk of the work.
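Here is a minimal sketch of that hybrid workflow for pairwise comparisons. The AI judge is asked twice with the answers in swapped order; if its verdict flips, the item goes to a human. The judge interface (a callable returning "first" or "second") and all names are hypothetical, and the two stub judges stand in for real model calls.

```python
def compare_with_swap(judge, prompt, answer_a, answer_b):
    """Ask the judge twice with the order swapped. Return "A", "B", or "human"."""
    first_pass = judge(prompt, answer_a, answer_b)
    second_pass = judge(prompt, answer_b, answer_a)
    winner_1 = "A" if first_pass == "first" else "B"
    winner_2 = "B" if second_pass == "first" else "A"
    # Consistent verdicts are accepted; inconsistent ones are escalated.
    return winner_1 if winner_1 == winner_2 else "human"

def triage(judge, items):
    """Split (prompt, answer_a, answer_b) items into auto-scored and human-review piles."""
    auto_scored, needs_human = [], []
    for prompt, answer_a, answer_b in items:
        verdict = compare_with_swap(judge, prompt, answer_a, answer_b)
        if verdict == "human":
            needs_human.append(prompt)
        else:
            auto_scored.append((prompt, verdict))
    return auto_scored, needs_human

# Stub judges standing in for a real model call (assumptions for illustration).
def position_biased_judge(prompt, first, second):
    return "first"  # always prefers whatever it sees first

def length_judge(prompt, first, second):
    return "first" if len(first) >= len(second) else "second"

items = [("Explain inflation", "Prices rise.", "Prices rise over time as money loses value.")]
biased_auto, biased_human = triage(position_biased_judge, items)
length_auto, length_human = triage(length_judge, items)
```

The swap catches position bias, but notice that the length-preferring judge passes the consistency check while still rewarding verbosity. That is why a random human audit of the auto-scored pile remains worthwhile.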

When Automation Falls Short: The Human Element

Despite impressive progress in automated evaluation, there are important limitations:

Confirmation bias: AI-generated evaluations might focus on issues the AI can easily identify, potentially overlooking more subtle problems
Distribution bias: If GPT-3 writes an eval set used to test GPT-4, the evaluation might reflect GPT-3’s style and blind spots
Security concerns: Published AI-generated benchmarks might inadvertently become part of future training data, giving models an unfair preview of test questions

Most importantly, automated evaluation works well for objective criteria but struggles with subjective qualities. An AI might accurately detect if a response contains a mathematical error, but can it truly judge if a response is ethically nuanced, culturally sensitive, or genuinely helpful for a specific human need?

This is why the consensus view is that AI-assisted evaluation augments but doesn’t replace human judgment. The best results come when AI and humans work together, each addressing the other’s limitations.

5. From Evaluation to Improvement: Closing the Loop

Evaluations aren’t just passive report cards; they directly inform how we train and fine-tune models. The most prominent approach is Reinforcement Learning from Human Feedback (RLHF), where human evaluation of model outputs becomes a training signal.

OpenAI’s InstructGPT: Feedback as the Secret Sauce

OpenAI’s InstructGPT project demonstrated the power of this approach:

They collected model outputs for various prompts
Human annotators ranked these outputs from best to worst based on helpfulness and instruction-following
These rankings trained a “reward model” that could predict human preferences
This reward model guided reinforcement learning to fine-tune the base model
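The ranking-to-reward-model step can be sketched in a few lines. Each ranking is broken into (preferred, rejected) pairs, and the reward model is trained to score the preferred output higher using the pairwise loss -log(sigmoid(r_preferred - r_rejected)). The function names below are illustrative, and the rewards are plain numbers standing in for a neural network’s outputs.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations keeps input order, so the first element of each pair
    # always ranks higher than the second.
    return list(combinations(ranked_outputs, 2))

def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_preferred - r_rejected)), computed in a numerically stable way."""
    diff = reward_preferred - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# A human ranked four outputs for one prompt, best first.
pairs = ranking_to_pairs(["output_a", "output_b", "output_c", "output_d"])

# Loss is small when the reward model agrees with the human, large when it disagrees.
loss_agree = pairwise_loss(2.0, 0.0)
loss_disagree = pairwise_loss(0.0, 2.0)
loss_undecided = pairwise_loss(0.0, 0.0)
```

One ranking of four outputs yields six training comparisons, which is part of why ranking is an efficient way to collect preference data.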

The results were striking: human raters preferred outputs from the 1.3B-parameter InstructGPT model over those from the original 175B-parameter GPT-3, even though InstructGPT had over 100× fewer parameters. It was more helpful, followed instructions better, and produced less toxic content.

This is like discovering that your diligent but modestly talented student who actually listens to feedback outperforms your brilliant but stubborn prodigy. Sometimes teachability trumps raw ability.

Anthropic’s Helpful & Harmless Models: Balancing Multiple Goals

Anthropic applied a similar approach but focused on two distinct axes: helpfulness and harmlessness:

Humans compared model responses, rating some as more helpful or more harmless than others
These comparisons trained preference models on a mix of helpfulness and harmlessness data
The resulting reward signal guided model fine-tuning

Interestingly, Anthropic found that this alignment training “improves performance on almost all NLP evaluations” and is compatible with training for specialized skills such as Python coding and summarization. Becoming more helpful and harmless did not come at the cost of general capability.

It’s as if forcing a student to show their work not only made them more transparent but somehow improved their mathematical ability too. Alignment, it seems, unlocks capability rather than constraining it.

Meta’s Llama 2-Chat: Balancing Act

Meta’s approach to Llama 2-Chat highlights the delicate balance in RLHF:

They trained separate reward models for helpfulness and safety
During fine-tuning, they combined these rewards, aiming for maximum helpfulness while maintaining safety
They continuously evaluated candidate models to ensure the right balance

This multi-objective optimization is tricky. A model could be super safe by refusing to answer anything substantive, or super helpful by giving detailed responses to inappropriate requests. Finding the sweet spot—helpful when appropriate, cautious when necessary—requires careful tuning.
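One way to picture the balancing act is a gating rule: score a response on safety when it looks risky, and on helpfulness otherwise. The sketch below is a hypothetical illustration; the function name, the threshold value, and the 0-to-1 score ranges are assumptions, not a description of Meta’s exact recipe.

```python
def combined_reward(helpfulness: float, safety: float,
                    safety_threshold: float = 0.15,
                    prompt_is_sensitive: bool = False) -> float:
    """Use the safety score for risky cases, the helpfulness score otherwise.

    Both scores are assumed to lie in [0, 1], higher meaning better.
    """
    if prompt_is_sensitive or safety < safety_threshold:
        return safety
    return helpfulness

# Detailed answer to an inappropriate request: helpful-looking but unsafe.
reward_unsafe_answer = combined_reward(helpfulness=0.9, safety=0.05)

# Blanket refusal of a harmless question: perfectly safe but useless.
reward_needless_refusal = combined_reward(helpfulness=0.1, safety=0.99)

# Careful, useful answer to an ordinary question.
reward_good_answer = combined_reward(helpfulness=0.8, safety=0.95)
```

Both failure modes from the paragraph above end up with a low reward, one through the safety branch and one through the helpfulness branch. The threshold is exactly the kind of knob that “requires careful tuning.”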

Anthropic’s Constitutional AI: Self-Improvement

Perhaps the most innovative approach is Anthropic’s Constitutional AI, which reduces reliance on human feedback:

They gave the model a set of principles (a “Constitution”)
The model critiqued its own outputs against these principles
It then revised outputs to better align with the principles
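The model was then fine-tuned on these revised outputs
Reinforcement learning followed, using the model’s own comparisons of response pairs against the principles as the reward signal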

This “reinforcement learning from AI feedback” (RLAIF) demonstrated that models can improve through guided self-reflection. It’s like giving a student a rubric and asking them to grade their own work before submitting—a form of metacognitive development that leads to better results.
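The critique-and-revise steps can be sketched as a simple loop. The generate argument is a stand-in for any text-generation call, the prompt wording is invented for illustration, and the fake model below only records calls so the control flow is visible. None of this is Anthropic’s actual code or prompts.

```python
from typing import Callable, List

def critique_and_revise(generate: Callable[[str], str],
                        prompt: str,
                        principles: List[str]) -> str:
    """Draft a response, then critique and rewrite it once per principle."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique this response against the principle: {principle}\n"
            f"Response: {response}"
        )
        response = generate(
            "Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response

# Fake model (an assumption for illustration): records every call it receives.
calls = []

def fake_model(text: str) -> str:
    calls.append(text)
    return f"output-{len(calls)}"

final = critique_and_revise(
    fake_model,
    "How do I pick a lock?",
    ["Be harmless", "Be honest"],
)
```

With two principles the model is called five times: one draft, then a critique and a rewrite per principle. The revised outputs, not the first drafts, become the training data.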

Challenges and Considerations

This evaluation-to-improvement pipeline faces several challenges:

Overfitting to evaluators: If optimized too strongly to a proxy reward, models might develop tricks that game the reward model without actually improving. This is Goodhart’s Law in action: “When a measure becomes a target, it ceases to be a good measure.”
Multi-objective trade-offs: As the Llama 2 example shows, different evaluation criteria often conflict. How much safety should we enforce at the expense of helpfulness? These balance decisions ultimately reflect value judgments and policy choices.
Scaling feedback: As models grow more capable, they may require increasingly nuanced feedback to fix subtle issues. Gathering this feedback from humans (especially experts) becomes prohibitively expensive.
Continuous learning: The field is moving toward continuous learning from feedback rather than one-time training. This raises new challenges in ensuring models improve safely over time without developing unexpected behaviors.

Despite these challenges, the tight coupling of evaluation and training has proven remarkably effective. By explicitly using eval outcomes as reward signals, AI labs have achieved significant improvements in both capability and alignment.

Conclusion: The Never-Ending Exam

If you’ve made it this far, congratulations! You now understand the basics of AI evaluation and benchmarking—a field that’s equal parts science, art, and digital philosophy.

We’ve seen that good evaluations are essential for responsible AI development. They provide reality checks on model behavior, reveal hidden flaws, prevent potential harms, and guide improvement. Without them, we’d be flying blind with increasingly powerful systems—a prospect that should make even the most enthusiastic technologist a bit nervous.

We’ve explored real-world evaluations from academic exams to specialized benchmarks to adversarial red-teaming. Each approach offers different insights, and the best evaluation strategies combine multiple methods to build a comprehensive picture of model behavior.

We’ve grappled with what makes evaluations effective—coverage, validity, fairness, reliability, and adaptability—and why designing good evals is so challenging. The perfect evaluation is a moving target, especially as models rapidly improve and develop new capabilities.

We’ve investigated how automation can help scale evaluation through AI-generated datasets, adversarial prompts, and automated scoring—while recognizing the continued importance of human judgment for subjective qualities and novel scenarios.

Finally, we’ve seen how evaluations feed directly into model improvement through reinforcement learning from feedback, closing the loop from assessment to enhancement and driving remarkable gains in both capability and alignment.

The story of AI evaluation is not just a technical tale but a deeply human one. When we decide how to evaluate AI systems, we’re really articulating what we value and expect from these technologies. Do we prioritize factual accuracy? Helpfulness? Safety? Creative expression? Cultural sensitivity? The evaluations we design reflect these priorities and shape the systems we build.

As AI continues to advance, evaluation will remain a crucial counterbalance to capability—the thoughtful question to the confident answer, the skeptical check on expansive claims. In a field racing forward at breakneck speed, good evaluations are both the brakes that prevent crashes and the steering wheel that keeps us on course.

So the next time you hear a company boast about their AI’s remarkable abilities, you can now ask the pointed follow-up: “That’s impressive, but how exactly did you evaluate it?” Their answer will tell you whether they’re building responsibly or just chasing capabilities without consequence.

After all, in AI as in life, it’s not just about what you can do—it’s about proving you can do it well, consistently, and without unexpected side effects. And that, dear reader, is what evaluation is all about.

The Truth About AI “Deception”: When Models Don’t Say What They “Think”

2025-04-25T00:00:00+00:00

Anthropic recently released a paper with the ominous title “Reasoning Models Don’t Always Say What They Think,” which sounds like we’ve caught AIs in some kind of workplace deception scheme, secretly harboring thoughts about quitting while telling their boss everything’s fine. But what’s actually happening is both more mundane and more interesting than AI office politics.

The Alleged AI Omertà

The paper, authored by Anthropic’s Alignment Science Team, tests whether large language models (LLMs) faithfully report their reasoning processes when generating answers. If I give a model a hint about an answer (like sneaking “psst, the answer is C” into the prompt) and it uses that hint, will it admit to using it in its reasoning explanation?

According to the research, often not. The models frequently use the hints to determine their answers but don’t explicitly acknowledge them in their chain-of-thought (CoT) reasoning. The researchers tested two “reasoning models” (Claude 3.7 Sonnet and DeepSeek R1) against two “non-reasoning models” (Claude 3.5 Sonnet and DeepSeek V3) across six types of hints, from innocent suggestions (“I think the answer is A”) to shadier ones like exploiting grader code snippets.

The results? The models use the hints to determine their answers, but they verbalize their use of these hints only a minority of the time: on average about 25% for Claude 3.7 Sonnet and 39% for DeepSeek R1. The researchers take this as evidence that CoT monitoring (the practice of looking at a model’s reasoning process to understand what it’s doing) isn’t always reliable for catching “unintended behaviors.”
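To see what a number like this measures, here is a simplified Python sketch of the calculation. It keeps only the trials where the model switched to the hinted answer, then asks how often the chain of thought mentioned the hint. The Trial fields and the function name are assumptions, the toy data is invented, and the paper’s actual scoring may include adjustments this sketch omits.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Trial:
    answer_without_hint: str   # model's answer to the plain question
    answer_with_hint: str      # model's answer when the hint is in the prompt
    hint_answer: str           # the answer the hint points to
    cot_mentions_hint: bool    # did the reasoning acknowledge the hint?

def faithfulness_rate(trials: List[Trial]) -> Optional[float]:
    """Among trials where the hint changed the answer, the share that admit it."""
    used_hint = [
        t for t in trials
        if t.answer_without_hint != t.hint_answer
        and t.answer_with_hint == t.hint_answer
    ]
    if not used_hint:
        return None  # the hint never changed an answer, so nothing to measure
    return sum(t.cot_mentions_hint for t in used_hint) / len(used_hint)

# Invented toy data: four trials switched to the hinted answer, one admitted it.
trials = [
    Trial("A", "C", "C", True),
    Trial("B", "C", "C", False),
    Trial("D", "C", "C", False),
    Trial("A", "C", "C", False),
    Trial("C", "C", "C", False),  # already answered C: excluded
    Trial("A", "A", "C", False),  # ignored the hint: excluded
]
rate = faithfulness_rate(trials)
```

Note what the metric does and doesn’t assume: it infers hint use purely from the change in answers, which is a behavioral measurement and makes no claim about anything the model “thinks.”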

Who’s Anthropomorphizing Now?

The irony here is delicious. Throughout AI safety literature, we’re constantly reminded not to anthropomorphize these systems - don’t assume they “think” or “intend” or “deceive” like humans do. Yet the framing of this paper does exactly that. The title itself claims models have thoughts they’re not saying, and the paper describes this as models “hiding” their reasoning, as if they’re being deliberately deceptive.

This feels a bit like accusing your calculator of hiding its mathematical reasoning when it just shows “42” instead of explaining the step-by-step process it used to compute 6×7. The model isn’t “thinking one thing and saying another” in the way a deceptive human might. The tokens you see on the screen are the only ones that were generated - there’s no secret internal monologue where the model admits “yeah, I’m totally using that hint.”

A Questionable Test Design

The test itself might not be entirely fair. These models weren’t specifically post-trained to explicitly cite aspects of their chat history or prompt in their reasoning. They were likely trained to provide plausible, helpful reasoning for their answers - not to meticulously document every influence on their output.

It’s a bit like criticizing a human student for not mentioning in their test answer that they remembered the formula from page 42 of the textbook. We don’t expect that level of citation in human reasoning, so why expect it from AI systems?

Chain of Thought: Not Actually Thought

Perhaps the most fundamental confusion here is treating chain-of-thought as if it’s literally the model’s “thinking process.” It isn’t. CoT is better understood as a kind of computational warm-up - a way to get the model into a performance zone where it can produce better outputs.

Think of it like an athlete doing stretches before a competition. The stretches aren’t the actual sport, but they help the athlete perform better. Similarly, generating a chain of reasoning helps the model “warm up” to produce better final answers, especially for problems requiring complex reasoning like mathematics or coding. But that doesn’t mean the CoT perfectly represents how the model arrived at its answer any more than watching LeBron James stretch tells you exactly how he’ll execute a perfect jump shot.

The Business Implications

Why does this matter? Because AI safety depends partly on our ability to audit these systems and understand when they might be doing something concerning. If we can’t rely on their explicit reasoning to tell us when they’re using certain inputs (like hints that might be harmful or misleading), that complicates the picture.

The researchers conclude that CoT monitoring is “a promising way of noticing undesired behaviors” but “not sufficient to rule them out.” Translation: looking at AI reasoning can catch some problems, but don’t count on it to catch everything.

In the financial world, this would be like saying, “Earnings calls are helpful for understanding a company’s strategy, but executives don’t always disclose everything they’re thinking.” Which is… obvious?

The Bottom Line

The key insight here isn’t that AI systems are secretly plotting and hiding their true intentions. It’s that the computational processes that generate AI outputs are complex and don’t map neatly onto human-like reasoning that can be fully articulated in natural language.

Anthropic’s researchers are doing important work studying these systems, but perhaps they could take their own advice and be more careful about anthropomorphizing. These aren’t deceptive executives hiding unfavorable reports from shareholders - they’re statistical prediction systems that weren’t explicitly trained to provide exhaustive citations of their inputs.

And let’s be honest: if we did train them to explain every nuance of how they generated each answer, their responses would be unbearably long and tedious. Sometimes a little mystery is more efficient.