When Claude changed, everything changed: Managing the blast radius of AI in production

Our system did one thing, and it did it well: It turned natural language queries into API calls.

Users were analysts, account managers, and performance trackers. They knew what data they needed, but assembling it manually meant pulling from four dashboards, two BI tools, and a Salesforce report builder. With our system, they wrote the request in plain English. A request like "Compile a report on sales volume for January to March 2026 for the Northeast region, broken down by city" was translated into an API call that the program could run:

json

{
  "description": "The user requested the sales volume for the given date range, here is the API call to get the response",
  "api_call": "/api/sales_volume",
  "post_body": {
    "start_date": "2026-01-01",
    "end_date": "2026-03-31",
    "region": "northeast"
  }
}

The rest of the pipeline was conventional engineering. The system routed the call to the right backend (we had integrations with internal reporting portals, Salesforce, and several in-house services), used the JSON query generated by the large language model (LLM) to filter and shape the response, and delivered it by email, as a Drive document, or rendered as a chart in the browser.

By mid-2025, the system was producing several hundred reports per month. These reports were consumed by leaders and analysts and circulated to external stakeholders. It had become the default way for many teams to pull ad-hoc data.

The contract between the LLM and the rest of the system was a structured JSON object, as shown in the example above.

We started building on Claude Sonnet 3.5 in early 2025. We upgraded to 3.7 without incident, and went to 4.0 without incident. By the time Sonnet 4.5 was released, we had grown complacent about the stability and predictability of LLMs in solving what we believed to be a simple problem. Model upgrades had become routine, like bumping a minor version of a well-behaved library.

Then we rolled out 4.5. For a meaningful percentage of requests, the model started folding the contents of post_body into the description field. Two failure modes followed.

First, the filter parameters never reached the API. Our program read post_body as the source of truth for the request payload, and that field came back empty. The API call went out without a date range or region filter. Depending on which API was called, the backend either returned sales volume for all periods and all regions or returned a 500 error.

Second, the model began asking clarifying questions in its response. This was new. Previous versions always took a best-effort approach to an imprecise request and returned a structured object. Sonnet 4.5, being more conscientious, sometimes answered with a question instead. Our system had no way of handling this. It was built on the assumption that every model response would lead to an API call. There was no human-in-the-loop component and no state to hold a partially completed request. This caused downstream systems to fail in multiple ways.
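Both failure modes could in principle be stopped before dispatch by checking the model's reply against the contract. The following is a minimal sketch of that idea, not the system described here: `ContractViolation`, `parse_model_response`, and `REQUIRED_FILTERS` are hypothetical names, and the required filter keys are assumed from the example request.

```python
import json


class ContractViolation(Exception):
    """Raised when a model reply cannot be safely turned into an API call."""


# Assumed for illustration: the filters the sales_volume call needs.
REQUIRED_FILTERS = {"start_date", "end_date", "region"}


def parse_model_response(raw: str) -> dict:
    """Return the parsed reply, or raise ContractViolation instead of guessing."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        # A clarifying question arrives as prose, not JSON.
        raise ContractViolation(f"reply is not JSON: {raw[:80]!r}") from exc

    if not isinstance(reply, dict):
        raise ContractViolation("reply is not a JSON object")

    for field in ("description", "api_call", "post_body"):
        if field not in reply:
            raise ContractViolation(f"missing field: {field}")

    if not isinstance(reply["description"], str):
        raise ContractViolation("description must be a string")

    body = reply["post_body"]
    if not isinstance(body, dict):
        raise ContractViolation("post_body must be an object")

    missing = REQUIRED_FILTERS - set(body)
    if missing:
        # An empty post_body would otherwise query every period and region.
        raise ContractViolation(f"post_body is missing filters: {sorted(missing)}")

    return reply
```

A guard like this turns a silent wrong answer into a loud failure. It does not make the request succeed, but it keeps an unfiltered report from reaching a stakeholder.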

We rolled back to 4.0. That was harder than it should have been: Between the 4.0 and 4.5 releases, our team had added new API integrations, all of which had been validated against 4.5. Rolling back the model meant re-validating them all against 4.0 under time pressure.

Why traditional engineering discipline fails here

Software engineering depends on the ability to bound the effect of a change. When you upgrade a driver or library, you read the release notes to see whether to expect breaking changes. Unit tests check what might have moved. All of this rests on one assumption: The system being changed is deterministic enough that its behavior can be predicted, or at least sampled densely enough to give you confidence. The blast radius is bounded by construction.

LLM-backed systems break this assumption. The component that produces the output is not under your control. You cannot diff the change from version 4.0 to 4.5. It is less an upgrade than a replacement of the function your system depends on.

This is what we mean by infinite blast radius: a change whose downstream effects cannot be calculated in advance, because the input space (natural language) and the failure modes (what the model might do differently) are both infinite.

Anatomy of failure

The post-mortem revealed that our prompt had been underspecified. We had told the model to return a JSON object with three fields. We had explained what each field was for. We did not explicitly state that the description must be a natural-language string and must not contain serialized representations of other fields.

Previous versions of the model inferred this constraint from context. Sonnet 4.5, apparently better at being "helpful" in its formatting choices, decided that asking for clarification or including the body of the request in the description made the response more useful. From the model's perspective, this was a reasonable interpretation of an underspecified instruction. It nonetheless violated the assumptions our system was built on.

The bug was not in the model. The bug was in our assumption that the model would continue to fill the gaps in our specification as it always had. Three successful upgrades had trained us to believe that those gaps were safe.

Structured output modes and tool-calling APIs would have caught this particular failure at the schema level. We were not using them, for engineering reasons outside the scope of this article. But schemas only constrain syntax, not semantics. A schema cannot specify that a clarifying question must not appear in a system that has no mechanism for clarification, nor that a date range must never silently default to all time. Schemas solve the easy part of the problem.
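To make the syntax-versus-semantics distinction concrete, here is a small sketch. It hand-rolls a structural check instead of using a schema library, and the semantic rules (the `MAX_RANGE_DAYS` limit and the question-mark heuristic) are illustrative assumptions, not rules taken from our system.

```python
from datetime import date

# Assumed limit for illustration: no ad-hoc report spans more than two years.
MAX_RANGE_DAYS = 730


def is_structurally_valid(reply: dict) -> bool:
    """Roughly what a schema can enforce: field presence and types."""
    body = reply.get("post_body")
    return (
        isinstance(reply.get("description"), str)
        and isinstance(reply.get("api_call"), str)
        and isinstance(body, dict)
        and all(isinstance(body.get(k), str) for k in ("start_date", "end_date", "region"))
    )


def semantic_problems(reply: dict) -> list[str]:
    """Checks a schema cannot express; assumes the reply is structurally valid."""
    problems = []
    if reply["description"].rstrip().endswith("?"):
        problems.append("description is a clarifying question")
    start = date.fromisoformat(reply["post_body"]["start_date"])
    end = date.fromisoformat(reply["post_body"]["end_date"])
    if (end - start).days > MAX_RANGE_DAYS:
        problems.append("date range looks like an all-time default")
    return problems


reply = {
    "description": "Which quarter did you mean?",
    "api_call": "/api/sales_volume",
    "post_body": {"start_date": "1970-01-01", "end_date": "2026-03-31", "region": "northeast"},
}
print(is_structurally_valid(reply))
print(semantic_problems(reply))
```

The sample reply passes the structural check, yet both semantic checks flag it. That is the part of the problem a schema leaves open.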

An evals-first architecture

The discipline that closes this gap is to treat the eval suite, not the prompt, as the formal specification of the system. The prompt is an implementation of the spec. The model is the translator. The evals are the spec itself, and any model or prompt change is valid if and only if it passes.

At its core, an eval has three parts: an input, a property the output must satisfy, and a scoring function. In our system, the eval that would have caught the 4.5 regression looks like this:

python

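def test_description_contains_no_serialized_payload(response):
    # Compare case-insensitively so "CURL" and "curl" are treated alike.
    desc = response["description"].lower()
    # Tokens that suggest request content leaked into the prose field.
    forbidden = ["curl", "post_body", "{", "https://"]
    assert not any(token in desc for token in forbidden), \
        f"description leaked serialized content: {response['description']}"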

A few hundred such assertions, some handwritten for known invariants, some generated as regression tests from real production traffic, some scored by an LLM-as-judge for fuzzier attributes like tone, become the gate. Model upgrades and prompt changes should be treated as pull requests that must turn the suite green before they merge.
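A suite like this can be modeled as data plus a gate function. The sketch below is one possible shape under stated assumptions: `EvalCase`, `run_suite`, `has_all_filters`, and the stand-in `fake_model` are hypothetical, and a real suite would call the candidate model and prompt instead of a stub.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str                     # the input
    check: Callable[[dict], bool]   # the property the output must satisfy


def no_serialized_payload(response: dict) -> bool:
    desc = response["description"].lower()
    return not any(tok in desc for tok in ("curl", "post_body", "{", "https://"))


def has_all_filters(response: dict) -> bool:
    return {"start_date", "end_date", "region"} <= set(response["post_body"])


def run_suite(model: Callable[[str], dict], cases: list[EvalCase]) -> list[str]:
    """Scoring function: return the names of failing cases (empty means green)."""
    return [case.name for case in cases if not case.check(model(case.prompt))]


def fake_model(prompt: str) -> dict:
    # Stand-in that reproduces the regression: payload folded into description.
    return {
        "description": 'Calling with post_body {"region": "northeast"}',
        "api_call": "/api/sales_volume",
        "post_body": {},
    }


cases = [
    EvalCase("description-clean", "Q1 2026 sales for the Northeast", no_serialized_payload),
    EvalCase("filters-present", "Q1 2026 sales for the Northeast", has_all_filters),
]
failures = run_suite(fake_model, cases)
print("merge blocked:" if failures else "green", failures)
```

Keeping cases as plain data makes it cheap to append a new one whenever production traffic surfaces a failure nobody anticipated.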

Evals are expensive to build and maintain. They drift as your product changes. LLM-as-judge scoring introduces its own variance into the results. And the suite can only cover the failure modes you thought to specify; you cannot test your way to safety against a class of failure you never imagined. We learned this the hard way: No one on our team had ever written an assertion saying "the description field should not contain a curl command," because no one thought the model would put one there.

Evals are not a silver bullet. But they let you bound the blast radius of a change in the only way available when the underlying function is a black box: by sampling the output behavior you actually care about, and refusing to ship when that behavior regresses.
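When a regression shows up in only a percentage of requests, as ours did, a sampling-based gate has to reason about pass rates, not single passes. The sketch below is an illustration, not our production gate: it uses the standard Wilson score interval to ask whether an observed pass rate is convincingly above a bar. The function names, the 0.95 bar, and the sample counts are assumptions chosen for the example.

```python
import math


def wilson_lower_bound(passes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a pass rate (z=1.96 is about 95%)."""
    if trials == 0:
        return 0.0
    p = passes / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom


def should_ship(passes: int, trials: int, required_rate: float = 0.95) -> bool:
    """Ship only if the lower bound on the true pass rate meets the bar."""
    return wilson_lower_bound(passes, trials) >= required_rate


# Same observed 97% pass rate, very different amounts of evidence.
print(should_ship(97, 100))     # False: too few samples to be confident
print(should_ship(1940, 2000))  # True: the lower bound clears 0.95
```

The design choice is to gate on the lower bound of the interval and not on the raw rate, so a small lucky sample cannot turn the suite green.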

Road map

The engineering community has yet to develop a body of knowledge for writing effective evals. There are no widely accepted standards for what "coverage" means over a natural-language input space. CI/CD systems were not designed to gate on probabilistic test results. As agents take on autonomous work (writing code, moving money, planning infrastructure changes), the gap between "the model passed our smoke test" and "we know what this program will do in production" becomes the central engineering problem of the next few years.

The teams that close that gap will be the ones that stop treating evals as a quality-assurance afterthought and start treating them as the real specification of what their system does.

Vijay Sagar Gullapalli is a founding AI Engineer at Adopt AI and a USPTO patent holder.

Sarat Mahavratayajula is a Senior Software Engineer at Sherwin-Williams.
