Dinesh DM | The Real Cost of AI: FinOps & TBM

The Anti-Value of AI: Costing the Work That Produces Nothing

Abstract

When ordinary software fails, it stops, and it fails cheaply. When an AI system fails, it can keep spending, and it can fail in a way that costs more than doing nothing. It can produce a confident wrong answer that escapes notice, gets acted on, and has to be undone at a price far above what it cost to generate.

Every cost framework assumes that technology spend buys value of zero or more. AI breaks that assumption. It can spend to produce a result worse than no result, a region this paper calls anti-value. Because no framework was built for spend below zero, the AI bill today blends productive work, wasted work, and harmful work into one number, and reports all of it as if it produced value.

This paper gives Technology Business Management a way to separate the three. It defines the categories of wasted and harmful AI spend, shows how to measure each within honest limits, and books the result without changing the taxonomy. The method adds no new dollar to the bill. It adds two attributes to dollars already booked: a disposition on each AI cost, productive, scrap, or anti-value, and a causal reference that links a downstream cleanup cost back to the AI solution that caused it. From those two attributes it reports one number, the anti-value ratio, the share of AI spend that bought nothing or bought harm.

The framework is honest about its limits. The line between a productive output and a harmful one takes judgment, and the constants are not yet public. The contribution is not a measured size. It is making a cost visible that the taxonomy was never built to see, because a cost no one can see is a cost no one can manage.

Introduction

An AI system can spend money to produce an answer that is wrong. When the answer is believed and acted on, the spend does not stop there. The wrong answer has to be found and undone, and the undoing costs more than producing it did.

In 2022 Air Canada's website chatbot told a customer he could claim the airline's lower bereavement fare after he had already flown. He booked at full price on that basis and applied for the refund. The policy allowed no such claim, the airline refused, and in 2024 the tribunal ordered it to pay. (Moffatt v. Air Canada, 2024 BCCRT 149.) The answer cost a few cents of inference to generate. It then cost the fare difference, the staff time to contest the claim, the legal time to lose, and the work to correct the bot. Money was spent, and what it bought was worse than nothing.

Every cost framework ever built assumes that cannot happen. Money spent on technology buys something, worth a lot or a little, but never less than nothing. An idle server wasted only what it cost. A failed deployment failed cheaply, because ordinary software stops when it breaks. The floor on value has always been zero, and the chatbot fell straight through it. This paper calls the spend below that floor anti-value.

Anti-value is the bottom of a single axis. Productive spend sits at the top, returning more than it cost. Zero-value spend sits in the middle, returning nothing, the failed run and the abandoned loop. Anti-value sits at the bottom, returning less than nothing. Ordinary technology lived at the top and touched the middle only briefly, because it failed cheaply.

AI lives across all three.

Technology Business Management exists to connect technology cost to business value. (TBM Council, TBM for AI Value Realization, October 2024.) That is its purpose and its strength. It was built for technology that behaves in an ordinary way, where every dollar of spend maps to work that helps, or at worst does nothing. AI introduced a behavior the framework was never built to meet. The framework is sound. The territory changed.

The consequence shows up on the bill. Today the AI line blends productive work, wasted work, and harmful work into one number and reports all of it as if it produced value. A meaningful share did not. Some of it produced nothing, and some produced harm that cost more to repair than the output cost to produce. The framework has no place to record the difference, because until AI, no one needed one.

A companion framework, the Cost Genome of Agentic AI, introduced cost per successful task, which counts spend per good result. (DM, Decoding the Cost Genome of Agentic AI, April 2026.) That metric already separates good outcomes from failures. This paper turns to the other side of that line. It puts a cost on the bad results, treats the worst of them as anti-value, and books that cost as its own thing rather than folding it into productive spend.

Return to the chatbot to see why the bill cannot show this on its own. The few cents of inference are booked to the AI. The fare difference, the staff hours, and the legal time are booked to the airline's customer service and legal functions, because that is where the money was actually spent. Nothing in the ledger connects the two. TBM records where a cost came from and who consumed it, but it does not record what caused it. For ordinary technology those are the same thing, because a server you pay for is a server you chose to run. AI is the first case where a system's failure produces a cost somewhere else, under a different owner, in a different account.

The taxonomy records the source and loses the cause, and closing that gap is the work of this paper. It adds no new spend to the bill. It redescribes spend already there, and connects spend booked elsewhere back to its cause.

The paper defines the categories of wasted and anti-productive AI spend, shows how to measure each within its limits, books the result as a distinct line, and maps that line into the TBM v5.0.1 taxonomy so it rides on rails the discipline already has. The unit it reports is the anti-value ratio, the share of AI spend that bought nothing or bought harm.

Method. Sources are public where public sources exist. Claims that cannot be sourced are stated as the author's argument, with assumptions noted in line. The clean line between a productive output and a wasteful one takes judgment, and some failure-rate data is not yet public. Those limits are marked where they fall.

How AI breaks the assumption

Ordinary software fails in one way: it stops. A function throws an error, a request times out, a job aborts, and the spend ends at the point of failure, having produced nothing. The loss is visible at the moment it happens, and it is bounded, because the system does not keep spending after it breaks. There is no state in which it runs on, billing all the while, and hands back something wrong that looks right.

AI fails in two ways instead of one, and both cost more than stopping.

Failure that produces nothing. An AI system can run at length and return no usable result. It loops on a reasoning step, retries a tool call that keeps failing, or abandons a task after burning compute on every attempt. The spend is real, the output is empty, and the meter ran the whole time.

This is not an edge case. In the tau-bench evaluation, the strongest model tested solved about 61 percent of retail tasks and about 35 percent of airline tasks on a single attempt, and combined across the two domains it solved under half. (Yao, S., Shinn, N., Razavi, P., Narasimhan, K., tau-bench, 2024, arXiv:2406.12045.) Reliability is worse than the single attempt number suggests, because the same model's chance of solving one retail task on eight consecutive tries fell to about 25 percent. (Sierra, Benchmarking AI agents, https://sierra.ai/blog/benchmarking-ai-agents.) Longer and repeated work fails more often, a pattern the METR time-horizon analysis also reports as task length grows. (METR, Time horizons, https://metr.org/time-horizons/.) Failure at this rate is a normal operating condition, not a rare one.
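The gap between single-attempt success and repeated success can be made concrete with a short sketch. It assumes the combinatorial form of a pass^k estimator, in which a task that succeeded on c of n recorded runs contributes comb(c, k) / comb(n, k), and the per-task counts are invented for illustration, not taken from the benchmark. The function name `pass_hat_k` is hypothetical.

```python
from math import comb


def pass_hat_k(successes_per_task, trials, k):
    """Estimate pass^k: the chance that k runs of one task ALL succeed,
    averaged over tasks.

    successes_per_task: successes observed for each task out of `trials` runs.
    Assumes the combinatorial estimator comb(c, k) / comb(trials, k).
    """
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    per_task = [comb(c, k) / comb(trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Invented data: five tasks, eight runs each.
observed = [8, 8, 6, 4, 0]

single_attempt = pass_hat_k(observed, trials=8, k=1)
eight_in_a_row = pass_hat_k(observed, trials=8, k=8)

print(f"pass^1 = {single_attempt:.2f}, pass^8 = {eight_in_a_row:.2f}")
```

On this invented data the single-attempt rate is well above the eight-in-a-row rate, because only the tasks that never fail survive the stricter test. The shape, not the values, is what carries over.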

The Cost Genome of Agentic AI already accounts for most of this spend through its reliability multiplier, which inflates total cost by the retry and failover rate. (DM, Decoding the Cost Genome of Agentic AI, April 2026.) The spend is wasteful, but it is at least visible on the bill, because the meter was running while the system tried. This is the zero-value region, the spend that is real but returns nothing.

Failure that produces something wrong. This is the failure ordinary software cannot make. A deterministic system that cannot return the right answer returns an error. An AI system that cannot return the right answer can still return an answer, and the answer is fluent, well formed, confident, and wrong. The wrongness is not caught at the point of production. The output looks like a correct one, so it passes downstream as if it were correct. A person reads it and acts on it, or a system ingests it and carries it forward. The cost of the error lands later, in the work it corrupted, not in the inference that produced it.

Now there are two costs where ordinary failure had one: the cost of producing the wrong output, and the larger cost of finding it and undoing it. Together they are anti-value, spend that produced worse than nothing. The rate at which production AI ships wrong output, and the size of the cleanup, are treated in Section 6, with the available evidence and its limits. The mechanism is the point here.

Why the cost lands elsewhere. Klarna is a documented arc. In early 2024 the company said its AI assistant handled two thirds of customer service chats and did the work of 700 agents. (Klarna, February 2024.) In May 2025 Bloomberg reported that the company was hiring human agents again, after its chief executive said the AI-first approach had produced lower quality service. (Bloomberg, May 2025; Entrepreneur, https://www.entrepreneur.com/business-news/klarna-ceo-reverses-course-by-hiringmore-humans-not-ai/491396.) The reversal came more than a year after the claim.

Reading that delay as the downstream cost of poor service surfacing late is this author's interpretation, not Klarna's stated reason, which was quality and brand. The point that holds either way is the timing. The cost of the failures did not appear on the same dashboard, or in the same quarter, as the spend that produced them.

This is the structural feature that breaks the value floor. The spend happens inside the AI system. The damage happens somewhere else. The two are recorded in different places, owned by different teams, and nothing connects them.

What the taxonomy records. TBM records two things about any spend: where it came from, and who consumed it. It does not record whether the spend produced value, which was never a gap, because value stayed above zero.

Both new regions land in cost pools without resistance. Zero-value inference is still inference, billed to Cloud Services, applied to AI Compute, and anti-value inference is the same. The taxonomy captures the dollars exactly and loses the outcome entirely. The downstream cost of cleanup lands in a different consumer's line, disconnected from the AI solution that caused it.

So the bill is accurate in dollars and wrong in meaning. Every dollar is in the right pool, and none of it is marked for what it actually achieved. Section 7 shows how the taxonomy can carry that mark without changing its structure.

Where this sits against existing work

The axis from Section 1 already has units on it, and two of them sit close enough to anti-value that the difference has to be made plain, or the new idea reads as a rename. The clearest way to see the difference is to run all three over the same event, the chatbot that sold the wrong bereavement fare.

Cost per successful task. This unit divides total AI spend by the number of good outcomes, so a leader sees what each good result cost. (DM, Decoding the Cost Genome of Agentic AI, April 2026.) Run it over the chatbot and it reports a low number, because most chats resolve cheaply and the one wrong answer cost the same cents to generate as a right one. The downstream cost, the fare difference and the legal time, never enters, because cost per successful task sums AI spend and only AI spend. It also folds the cost of failure into the price of success rather than isolating it. It measures the productive region well, and it cannot see anti-value, because the largest part of anti-value is not AI spend at all.

Yield loss. This unit counts the tokens a factory manufactures and then discards, the scrap that fails evaluation, and it makes the good tokens carry the cost of the bad ones. (DM, Decoding the Token BOM of Enterprise AI, June 2026.) Run it over the chatbot and the wrong bereavement answer is not scrap, because it passed whatever checks were in place and shipped. It looks like a good token, so yield loss counts it as one. The unit measures the zero region on the production side, the spend caught and thrown away before it leaves the factory, and it never follows a token downstream to where the harm appears.

The reliability multiplier in the Cost Genome covers the rest of the zero region from the consumption side. It inflates total cost by the retry and failover rate, so spend on runs that produced nothing is carried. Between them, the reliability multiplier and yield loss map the zero region from both directions, and neither reaches below it.

Anti-value. This unit counts what the others miss, the wrong answer that shipped and the cost of finding and undoing it. It is measured at the business outcome, not at the task and not at the factory, because that is where the chatbot's cost actually landed. Cost per successful task stops at the edge of AI spend. Yield loss stops at the factory gate. Both stop at zero. Anti-value starts below zero and follows the cost downstream.

Unit	What it counts	Where it is measured	Region of the axis
Cost per successful task	AI spend per good outcome	Consumption side, at the task	Productive
Yield loss	Tokens made then discarded as scrap	Production side, at the factory	Zero, scrap caught before it ships
Anti-value	Spend on wrong output that shipped, plus the cost to undo it	Consumption side, at the business outcome	Negative, harm that escaped

The existing work maps the value axis down to zero from both sides. The region below zero had no unit, because no framework needed one until a system could ship a confident wrong answer and send the cost somewhere else. That region is the contribution.

The categories of wasted and anti-productive spend

The categories sort into the two regions below the line, spend that produced nothing and spend that produced harm. The test that separates them is whether the bad output escaped detection. If it was caught inside the system, the spend is zero-value. If it escaped and was acted on, the spend is anti-value.

Zero-value spend. Three categories sit here.

Failed and abandoned runs. The system worked at a task, produced no usable result, and stopped.

Runaway loops. The system kept iterating past the point of progress, spending on each pass without moving closer to an answer.

Retries and failovers. The system repeated a step, or fell back to a more expensive path, after a transient failure.

These are the zero region. The reliability multiplier in the Cost Genome already books retries and failovers, and yield loss in the Token BOM already books the scrap that evaluation catches. They are listed here for completeness, because a full disposition of the AI bill needs all three states. They are not the new part of the picture, which begins below, with the spend that produced harm rather than nothing.

Negative-value spend. A wrong output becomes anti-value when it escapes detection and is acted on or carried forward. This boundary matters. A wrong output caught at the screen, where the user sees it and reruns, costs about as much as a retry, and it is closer to zero-value than to harm. The output crosses into anti-value only when it survives long enough to be trusted. The depth below zero is set by how long the error lived before someone caught it. Detection latency is the driver.

The system makes this likely rather than rare. The output is built to read as correct, and at the point of use it is indistinguishable from a right answer. Acting on it is the expected path, not a lapse.

Three event types sit in this region.

Acted-on error. A wrong fact, figure, or conclusion that a person trusted and used, the wrong number in a board deck or the fabricated citation in a filing. The cost is the production of the output, plus the cost of acting on it and putting it right.

Erroneous agent action. An agent that executed a real action on a wrong conclusion. It sent the message, updated the record, ran the transaction, merged the change. The undo here includes reversing a real change in the world, which is sometimes partial and sometimes impossible.

Propagated error. A wrong output consumed by another system or another agent, carried forward and built upon before anyone catches it. The cost compounds with each step the error travels from its source.

These are not hypothetical. The Air Canada chatbot in Section 1 was an acted-on error, a wrong answer a customer trusted and used. Fabricated AI output reaching real decisions is now a recurring, documented category, including a series of sanctioned court filings built on cases that an AI invented. (Mata v. Avianca, 678 F. Supp. 3d 443, S.D.N.Y. 2023.)

Every event in this region carries the same two-part cost: the spend that produced the wrong output, which is small and sits on the AI bill, and the spend that found it and undid it, which is larger and lands downstream. Section 6 sets out how to measure each part, and how far the public evidence reaches.

Where the cost lives

TBM organizes technology cost across four layers. Cost Pools name the source of the spend. Resource Towers name where it is applied. Solutions name what is delivered. Consumers name who used it. (TBM Council, TBM Taxonomy Version 5.0.1, July 2025, https://www.tbmcouncil.org/taxonomy/.) Cost flows one way through these layers, so each dollar starts in a pool, lands in a tower, rolls up to a solution, and is charged to a consumer. The flow attributes every dollar to the resource that produced it and the consumer that used it.

The production cost of a wrong AI output fits this flow without trouble. The tokens spent on a hallucinated answer are inference spend, so they start in Cloud Services, land in AI Compute, roll up to the Artificial Intelligence solution, and are charged to the business unit that ran the agent. This is the small half of the cost, and the taxonomy handles it exactly as it handles a correct answer, because at the point of production a wrong answer and a right one cost the same.

The undo cost does not fit. When the wrong output is acted on, the cost of finding and reversing it is incurred by the business process that trusted it. Consider an analyst who spends a day correcting a report built on a fabricated figure. That day is labor, so it starts in the Staffing cost pool, lands in the tower that carries the analyst's function, rolls up to the business process the analyst supports, and is charged to that business unit. None of it touches the Artificial Intelligence solution. TBM books it as the cost of running finance, or operations, or whatever process absorbed the error. The booking is correct, because the analyst's time was a cost of that process. It was also caused by the AI, and the taxonomy has no way to record both.

This is the gap. TBM attributes cost by its source, the resource that produced the spend, and by its consumer, the unit that used the resource. For ordinary technology, the source of a cost and its cause are the same thing, because the thing you paid for is the thing you chose to run. AI is the first case where a technology's failure produces a cost somewhere else, under a different owner, in a different pool. The source of the undo cost is human labor and other systems. The cause is the AI. The taxonomy records the source and loses the cause.

One boundary keeps this honest. The framework counts the cost spent to undo the error, the labor, the compute, the incident response. It does not count the value the error destroyed, the lost deal or the lost trust, because value has no object in the taxonomy and reaching for it would turn a cost question into a return question. Anti-value here is incurred cost, on the AI bill and downstream, and nothing more.

The taxonomy is not failing. It was built to follow cost to the resource that produced it, and it does that correctly here. It was built for technology that does not create costs outside itself, because until AI, none did. The structure was right for the territory it was drawn for. The territory now includes a cost that begins in one solution and surfaces in another. Section 7 shows how the existing structure can carry that link without being redrawn.

How to measure it, and how far the evidence reaches

Measurement splits the way the cost splits. For each wrong output there are two costs to find, the cost of producing it and the cost of undoing it. There are also two things to measure about the population of outputs, how often they are wrong and how long the wrong ones survive before someone catches them.

None of this is new in kind. The cost of bad output, and the way that cost grows the later it is caught, is settled ground in quality management. Manufacturing and software quality split it into internal failure, caught before the output ships, and external failure, caught after. (American Society for Quality, cost of quality.) Internal failure is contained and cheap, and external failure is expensive. That split is the same axis used here. Internal failure is the zero region, the scrap the Token BOM already counts as yield loss, and external failure is the negative region, the anti-value counted here. The idea of costing bad output is old. AI is new only in how much external failure it produces in knowledge work, where there was little before.

What is wrong, and how often. Prevalence is measurable, and parts of it are public. In a study of specific, verifiable questions about federal court cases, general-purpose models returned wrong answers between 58 percent of the time for the strongest model tested and 88 percent for the weakest. (Dahl, M., Magesh, V., Suzgun, M., Ho, D. E., Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models, Journal of Legal Analysis 16(1), 2024, https://law.stanford.edu/2024/01/11/hallucinating-law-legal-mistakes-with-large-language-models-are-pervasive/.) Purpose-built legal research tools, grounded in retrieval over real case law, did better and still returned wrong answers between roughly 17 and 33 percent of the time. (Magesh, V., et al., Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, Stanford RegLab, 2024, https://reglab.stanford.edu/.)

The point is not the exact rate. It is that careful, grounded, production-grade tools still ship wrong output at rates measured in tens of percent. These numbers are domain specific, and a rate from legal research does not carry to a coding agent or a support assistant. Each enterprise measures its own, against a labeled set of known-correct answers, which is the same evaluation layer the Cost Genome already prices as a cost.
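An enterprise measuring its own rate against a labeled set could report it with the uncertainty attached, along these lines. This is a minimal sketch, assuming a Wilson score interval as the uncertainty measure, which is a common choice and not one the framework prescribes. The function name `wrong_output_rate` and the sample counts are hypothetical.

```python
from math import sqrt


def wrong_output_rate(wrong, total, z=1.96):
    """Return (rate, low, high): the observed wrong-output rate on a labeled
    set, with a Wilson score interval (z=1.96 is roughly 95 percent)."""
    if total <= 0 or not 0 <= wrong <= total:
        raise ValueError("need total > 0 and 0 <= wrong <= total")
    p = wrong / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


# Invented sample: 17 wrong answers out of 100 labeled questions.
rate, low, high = wrong_output_rate(17, 100)
print(f"wrong-output rate {rate:.2f}, interval {low:.2f} to {high:.2f}")
```

A small labeled set gives a wide interval, which is a reason to report the range and not the point estimate alone.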

The two costs of a wrong output. The production cost is easy. It is the same token and task cost as a correct output, which the existing frameworks already compute. A wrong answer and a right answer cost the same to generate, so this half of anti-value is small and already on the bill.

The undo cost is the hard one. It is downstream labor to find and fix the error, downstream compute to reprocess, and incident response when the error reached a production system. It is measurable in principle, through time records and incident logs, but it is almost never attributed to the output that caused it. As Section 5 showed, it is booked as the cost of the business process that absorbed it. Measuring anti-value means linking the correction work back to the triggering output, and most organizations do not instrument that link today.

Detection latency. How deep an anti-value event runs below zero depends on how long the error survived. The cost-of-quality literature gives the shape. The common rule of thumb prices prevention at one, internal correction at ten, and external failure at a hundred, and while the exact multipliers are debated, the direction is not. (Cost-of-quality practice; the often-cited IBM Systems Sciences Institute ratios are directional and their precise values contested.) An error caught at the screen is cheap. The same error caught after it shaped a decision, or after it propagated, is not.
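The rule of thumb can be written down as a lookup, purely to show the direction of the effect. The multipliers below are the contested 1-10-100 values quoted above, not measured constants, and the unit cost is invented. The names `STAGE_MULTIPLIER` and `relative_cost` are hypothetical.

```python
# Directional multipliers only: the 1-10-100 values are a rule of thumb
# and their precise values are contested.
STAGE_MULTIPLIER = {"prevention": 1, "internal": 10, "external": 100}


def relative_cost(unit_cost: float, stage: str) -> float:
    """Scale a unit cost by the stage at which the error was dealt with."""
    if stage not in STAGE_MULTIPLIER:
        raise ValueError(f"unknown stage: {stage!r}")
    return unit_cost * STAGE_MULTIPLIER[stage]


# Invented unit cost of 5.0 for preventing one error.
caught_inside = relative_cost(5.0, "internal")   # contained before it ships
caught_outside = relative_cost(5.0, "external")  # found after it was acted on

print(caught_inside, caught_outside)
```

An enterprise would replace the table with multipliers calibrated from its own correction records.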

One field result shows the negative region is real and easy to miss. In a randomized controlled trial, experienced open-source developers took 19 percent longer to finish tasks when allowed to use early-2025 AI tools, and they believed the tools had sped them up. (Becker, J., Rush, N., Barnes, E., Rein, D., Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity, METR, July 2025, arXiv:2507.09089, https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/.) This is one setting and one snapshot of tools, and it does not size the effect anywhere else. It establishes two narrower things the framework needs. The net effect of AI on skilled work can be negative, and the operator can be the worst placed to see it, which is why anti-value has to be measured, not asked about.

The Air Canada case shows the measurement problem in miniature. The recorded cost was small and exact: 650.88 Canadian dollars in damages, and 812 all-in with interest and fees. (Moffatt v. Air Canada, 2024 BCCRT 149.) That figure is the visible floor. It leaves out the airline's legal time, the work to correct the bot, and the standing cost of becoming the case every later discussion of AI liability cites. The booked number understated the true anti-value, which is the pattern this framework exists to surface.

What is not public. Three constants decide the size of anti-value, and none is public for production AI: the rate at which a given system ships wrong output, by domain and task; the cost of undoing each kind of error; and the latency between production and detection. The studies that exist are benchmarks or single domains, and while they show the regions are real and material, they do not give an enterprise its own numbers. This is the same honesty the Cost Genome states about reliability constants and the Token BOM states about utilization and yield. The framework supplies the structure. Each enterprise calibrates the constants against its own ground truth and its own correction records, and reports them with the uncertainty kept in view.

How to book it

The gap in Section 5 does not need a new structure to close. It needs two attributes on the structure that already exists.

The first temptation is to give waste its own cost pool. That is the wrong move, and it is a change TBM does not need. A wasted inference is still inference. The dollars came from the same source, ran on the same tower, and belong to the same solution as a productive inference. What differs is the outcome, not the resource, and a separate pool would misdescribe the spend and would alter a taxonomy that is sound. Waste is not a new kind of cost. It is a known cost with a bad result.

So the result rides as an attribute, not a coordinate. Every dollar attributed to an AI solution carries a disposition with three values. Productive, the spend returned a good result. Scrap, the spend returned nothing and was contained, the failed run and the discarded token. Anti-value, the spend returned something wrong that escaped and had to be undone. The cost pool, the tower, and the solution are unchanged, and the disposition is added alongside them. The AI bill can then be cut three ways without moving a single dollar from where the taxonomy already puts it.
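The three-value disposition, and the escape test that separates scrap from anti-value, can be sketched as a small classifier. This is an illustration under stated assumptions, not a specified interface: the names `Disposition` and `classify`, and the two boolean inputs, are hypothetical, and in practice the first would come from evaluation and the second from detection.

```python
from enum import Enum


class Disposition(Enum):
    PRODUCTIVE = "productive"   # the spend returned a good result
    SCRAP = "scrap"             # returned nothing, or a bad result that was contained
    ANTI_VALUE = "anti-value"   # returned something wrong that escaped


def classify(output_good: bool, escaped: bool) -> Disposition:
    """Assign a disposition to one unit of AI spend.

    output_good: evaluation judged the result usable and correct.
    escaped: a bad result left the system and was acted on or carried forward.
    """
    if output_good:
        return Disposition.PRODUCTIVE
    return Disposition.ANTI_VALUE if escaped else Disposition.SCRAP


print(classify(output_good=True, escaped=False).value)
print(classify(output_good=False, escaped=False).value)
print(classify(output_good=False, escaped=True).value)
```

The disposition is a label on the dollar, so the pool, tower, and solution fields are left out of the sketch on purpose: they do not change.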

The disposition handles the on-bill cost, and the downstream undo cost needs the second attribute. That cost cannot be moved onto the AI solution, because Section 5 showed it belongs, by source, to the business process that absorbed it. The analyst's day really was a cost of finance, and re-homing it would break the same source-based attribution that makes TBM trustworthy. Instead the cost stays where it is and carries a back-reference, caused by this AI solution. The dollar does not move. The cause becomes visible. This is tagging, which the discipline already does, extended from what produced a cost to what caused it.

Put the two attributes on the Air Canada chatbot to see them work. The cents of inference that produced the wrong answer take a disposition of anti-value, sitting in the same Cloud Services pool and AI Compute tower as every other inference. The fare difference and the legal time stay booked to customer service and legal, where they were spent, and each carries a back-reference to the chatbot that caused them. Nothing moved, and the bill is unchanged in dollars. But the chatbot's anti-value can now be read in one place, the cents on the AI line plus the tagged costs sitting in the two other functions.

Both attributes have to be populated, and the data sits in places the prior frameworks already reach. Disposition comes from evaluation, which decides whether an output was good, and from detection, which records whether a bad one was caught and what the fix cost. The causal back-reference comes from linking a correction work item to the output that triggered it, through a shared identifier. The Cost Genome already requires each task to carry agent identity, user identity, task identity, and task outcome. Anti-value sharpens the outcome attribute from pass-or-fail into the three-value disposition, and adds one field, a reference from a correction back to the task that triggered it. The instrumentation is an extension of tagging that is already specified, not a new system.
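The two records and the shared identifier might look like the following. This is a minimal sketch: the field names are assumptions modeled on the attributes listed above, not a published schema, and the amounts are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskRecord:
    task_id: str
    agent_id: str
    user_id: str
    cost: float
    disposition: str  # "productive", "scrap", or "anti-value"


@dataclass(frozen=True)
class CorrectionRecord:
    work_item_id: str
    cost_pool: str      # stays where the money was actually spent
    consumer: str
    cost: float
    caused_by_task_id: Optional[str] = None  # the causal back-reference


def anti_value_for_task(task, corrections):
    """On-bill cost of a wrong output plus the corrections tagged back to it."""
    on_bill = task.cost if task.disposition == "anti-value" else 0.0
    downstream = sum(c.cost for c in corrections
                     if c.caused_by_task_id == task.task_id)
    return on_bill + downstream


# Invented example: one wrong answer, two tagged corrections, one unrelated item.
task = TaskRecord("t-1", "agent-a", "user-9", cost=0.05, disposition="anti-value")
corrections = [
    CorrectionRecord("w-1", "Staffing", "Customer Service", 400.0, "t-1"),
    CorrectionRecord("w-2", "Staffing", "Legal", 250.0, "t-1"),
    CorrectionRecord("w-3", "Staffing", "Finance", 90.0, None),
]
total = anti_value_for_task(task, corrections)
print(round(total, 2))
```

The correction records keep their own cost pool and consumer. Only the back-reference is new, which is what lets the cause be read without moving the dollar.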

With both in place, a leader can assemble anti-value across the whole estate as a single reporting line, the way it was just read off the chatbot. The dollars stay in their correct pools and consumers, so the line is a view, not a new object. That is how waste becomes its own line without becoming its own pool.
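The idea that the line is a view and not a new object can be shown in a few lines. The sketch below assumes each cost line is a plain record carrying the two attributes; the keys, the solution name, and the amounts are invented for illustration.

```python
from collections import defaultdict

# Invented cost lines. AI lines carry a disposition; downstream lines carry
# a back-reference ("caused_by") to the AI solution, or None.
cost_lines = [
    {"pool": "Cloud Services", "consumer": "Support", "solution": "chatbot",
     "amount": 900.0, "disposition": "productive", "caused_by": None},
    {"pool": "Cloud Services", "consumer": "Support", "solution": "chatbot",
     "amount": 100.0, "disposition": "anti-value", "caused_by": None},
    {"pool": "Staffing", "consumer": "Customer Service", "solution": "service desk",
     "amount": 400.0, "disposition": None, "caused_by": "chatbot"},
    {"pool": "Staffing", "consumer": "Legal", "solution": "legal operations",
     "amount": 250.0, "disposition": None, "caused_by": "chatbot"},
    {"pool": "Staffing", "consumer": "Legal", "solution": "legal operations",
     "amount": 300.0, "disposition": None, "caused_by": None},
]


def anti_value_view(lines):
    """Sum anti-value per AI solution without altering any cost line."""
    view = defaultdict(float)
    for line in lines:
        if line["disposition"] == "anti-value":
            view[line["solution"]] += line["amount"]
        elif line["caused_by"] is not None:
            view[line["caused_by"]] += line["amount"]
    return dict(view)


def totals_by_consumer(lines):
    """The ordinary roll-up, which the view must leave unchanged."""
    totals = defaultdict(float)
    for line in lines:
        totals[line["consumer"]] += line["amount"]
    return dict(totals)


before = totals_by_consumer(cost_lines)
view = anti_value_view(cost_lines)
after = totals_by_consumer(cost_lines)
print(view, before == after)
```

The roll-up by consumer is identical before and after the view is read, which is the property the text relies on: the reporting line is assembled from tags, and no dollar changes pool or owner.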

For this to flow without manual assembly, the disposition and the causal reference would have to live in the data formats that carry cost. FOCUS has no field for either, and the OpenTelemetry generative-AI conventions carry token counts but no outcome and no cost attribute. (OpenTelemetry, Semantic Conventions for Generative AI, https://opentelemetry.io/docs/specs/semconv/gen-ai/.) The Cost Genome already asked FOCUS for a reliability column, and a disposition field is its near neighbor. Until the formats carry them, the two attributes are populated per enterprise, the same position the earlier frameworks reach on their own bridges.

The structure is untouched. No new pool, no new tower, no broken roll-up. The same dollars travel to the same places, and two attributes ride alongside them, one for the outcome and one for the cause. Section 8 reads the metric off those two attributes.

The metric

The framework reports one number, built from the two attributes in Section 7. The bookable line is the anti-value cost. It is the production cost of the outputs that shipped wrong, taken from the AI solution's own spend, plus the cost of finding and undoing them, taken from the corrections tagged back to that solution. One part is on the AI bill and small. The other is downstream and large. Together they are the cost the AI incurred by being wrong.

The metric is the anti-value ratio. It is the anti-value cost divided by total AI spend, and it reads as the harm cost carried per dollar spent running the AI. A ratio of 0.2 means twenty cents of harm cost for every dollar of AI spend. The ratio can exceed one, and when it does it means undoing the system's errors cost more than running the system. The chatbot was that case in miniature, a few cents to produce the wrong answer and hundreds of dollars to undo it.

Beneath the ratio sits the full split. Every dollar of AI spend falls into one of three dispositions, productive, scrap, or anti-value, and the three shares sum to the bill. The split shows where the money went, while the ratio isolates the part that is new and hardest to see, and adds the downstream cost the split alone would miss.

Illustration, with illustrative numbers. Take a monthly AI bill of 100,000 dollars. Evaluation and detection mark 70,000 as productive, 10,000 as scrap, and 20,000 as spend that produced output which escaped wrong. The corrections tagged back to those escapes, the analyst hours, the reprocessing, the one incident, come to 60,000 dollars sitting in other business processes. The anti-value cost is the 20,000 on the bill plus the 60,000 downstream, or 80,000, so the anti-value ratio is 80,000 over 100,000, or 0.8. Three quarters of the harm cost never appeared on the AI bill. The numbers are illustrative and prove nothing about size. They show only the shape, that anti-value can sit mostly off the bill, where no one is looking for it. Whether it is large or small in any real system is exactly what the ratio exists to find out.
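
The arithmetic of the illustration can be sketched in a few lines of Python. The function name and argument names are assumptions made for this sketch; the figures are the illustrative ones from the paragraph above and carry no empirical weight.

```python
def anti_value_ratio(escaped_production, downstream_corrections, total_ai_spend):
    """Return (anti-value cost, anti-value ratio) for one reporting period."""
    if total_ai_spend <= 0:
        raise ValueError("total AI spend must be positive")
    anti_value_cost = escaped_production + downstream_corrections
    return anti_value_cost, anti_value_cost / total_ai_spend


# The split of the monthly AI bill by disposition (illustrative dollars).
bill = {"productive": 70_000, "scrap": 10_000, "anti_value": 20_000}
# Corrections tagged back to the escapes, booked in other business processes.
downstream = 60_000

total = sum(bill.values())
cost, ratio = anti_value_ratio(bill["anti_value"], downstream, total)
off_bill_share = downstream / cost  # part of the harm cost that never hit the AI bill

print(f"anti-value cost: {cost}, ratio: {ratio:.2f}, off-bill share: {off_bill_share:.2f}")
```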

This is not cost per successful task in another form. Cost per successful task measures the productive region, the price of each good result, while the anti-value ratio measures the negative region, the share of spend that did harm. The two move independently. A system can post a low cost per successful task, because its good results are cheap, while running a high anti-value ratio, because its failures escape and cost a great deal to undo.

One number describes the good work. The other describes the damage. A leader needs both, because neither implies the other. It is also not the retry rate or the yield-loss rate. Those count the zero region, the spend that produced nothing and was contained. The anti-value ratio counts only the negative region, the spend that produced something wrong and let it out.
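
A small sketch shows how the two numbers can diverge. The two systems and all their figures are hypothetical; the definition of cost per successful task follows the glossary (total AI spend divided by tasks completed correctly).

```python
from dataclasses import dataclass


@dataclass
class SystemMonth:
    name: str
    total_ai_spend: float          # gross AI bill for the month
    successful_tasks: int          # tasks completed correctly
    escaped_production: float      # bill share that produced escaped wrong output
    downstream_corrections: float  # cleanup cost tagged back to this system

    @property
    def cost_per_successful_task(self):
        return self.total_ai_spend / self.successful_tasks

    @property
    def anti_value_ratio(self):
        harm = self.escaped_production + self.downstream_corrections
        return harm / self.total_ai_spend


# Two made-up systems with the same bill and the same count of good results.
systems = [
    SystemMonth("system-a", 10_000, 5_000, 500, 500),
    SystemMonth("system-b", 10_000, 5_000, 1_000, 14_000),
]

for s in systems:
    print(f"{s.name}: cost per successful task {s.cost_per_successful_task:.2f}, "
          f"anti-value ratio {s.anti_value_ratio:.2f}")
```

Both systems post the same cost per successful task, yet the second carries a ratio above one, which is the divergence the efficiency number alone would not reveal.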

FinOps already isolates waste. It separates idle resources and over-provisioned capacity from spend that did work, and it allocates each to the team responsible. (FinOps Foundation, FinOps Framework, https://www.finops.org/framework/.) The anti-value ratio names a new category of waste for AI workloads, the spend that produced harm, and gives it the same treatment: isolate it, size it, attribute it to the solution that caused it.

The discipline already has the habit, and this extends it to a waste it did not have before. The ratio is a number a leader can track over time and compare across systems. A rising ratio means more harm per dollar of AI, a signal that production quality is slipping faster than spend. It also corrects a reading the gross bill gets wrong. Gross AI spend looks like the cost of the work the AI did, but it is not. The productive share is the cost that bought good results, and the rest bought nothing, or bought harm.

The ratio is a cost measure, not a value or quality measure. It says how much was spent on harm, not how much value the harm destroyed, which stays outside the taxonomy. It is also only as honest as the disposition behind it. Set the bar for an escaped error too low, and the ratio flatters the system. Set it too high, and the ratio condemns it. The discipline is in defining the escaped error rigorously, the same discipline cost per successful task needs in defining success. And because the constants in Section 6 are not yet public, the ratio should be reported with its assumptions stated, and as a range rather than a point where the data allows.
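
One way to report the ratio as a range with its assumptions stated is sketched below. The idea of computing it twice, under a narrow and a broad definition of an escaped error, is an assumption of this sketch rather than a method the framework prescribes, and all figures are made up.

```python
def ratio_range(total_ai_spend, narrow, broad):
    """Report the anti-value ratio as a range.

    `narrow` and `broad` are (escaped_production, downstream_corrections)
    pairs measured under two stated definitions of an escaped error.
    """
    if total_ai_spend <= 0:
        raise ValueError("total AI spend must be positive")
    ratios = [sum(narrow) / total_ai_spend, sum(broad) / total_ai_spend]
    return {
        "low": min(ratios),
        "high": max(ratios),
        "assumptions": {
            "narrow": "only errors traced to a specific AI output",
            "broad": "also errors where the AI was one of several inputs",
        },
    }


# Illustrative figures only.
report = ratio_range(100_000, narrow=(15_000, 40_000), broad=(25_000, 75_000))
print(f"anti-value ratio: {report['low']:.2f} to {report['high']:.2f}")
for label, text in report["assumptions"].items():
    print(f"  {label}: {text}")
```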

What leaders should do

These are moves to build toward, not switches to flip, because most enterprises do not yet hold the data they need. The disposition and the causal link are not captured today, and the first two moves exist to create them. The six moves group in three pairs: the first two build the data, the next two turn it into a decision, and the last two keep it honest over time.

Instrument disposition from the first day of production. Mark every AI task as productive, scrap, or anti-value. Productive and scrap come from evaluation, which you are likely already running. Anti-value comes from detection, which records when a bad output was caught, how long after it was produced, and what the fix cost. This is one extra field on the task outcome the Cost Genome already asks you to tag. Without it, the AI bill stays a single number that reports waste as if it were work.

Tag corrections back to their cause. The expensive half of anti-value is downstream, and it is invisible until you link it. Require that correction work, incident records, and reprocessing jobs carry a reference to the AI output that triggered them. The cost stays booked to the process that did the work, where it belongs, and the reference is what lets you assemble it later against the AI solution that caused it.
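
A sketch of that assembly, assuming each correction record carries a hypothetical `ai_output_id` field and that a lookup from output to AI solution exists. Record shapes, identifiers, and amounts are illustrative. Corrections with no reference are left untagged rather than charged to any solution.

```python
from collections import defaultdict


def downstream_cost_by_solution(corrections, output_to_solution):
    """Sum correction cost per AI solution; keep unlinked cost separate."""
    by_solution = defaultdict(float)
    untagged = 0.0
    for record in corrections:
        solution = output_to_solution.get(record.get("ai_output_id"))
        if solution is None:
            untagged += record["cost"]  # no clear causal link: do not charge the AI
        else:
            by_solution[solution] += record["cost"]
    return dict(by_solution), untagged


# Hypothetical lookup from an AI output to the solution that produced it.
output_to_solution = {"out-17": "claims-triage", "out-42": "support-chatbot"}

# Each cost stays booked to the process that did the work.
corrections = [
    {"booked_to": "claims-operations", "cost": 1_200.0, "ai_output_id": "out-17"},
    {"booked_to": "data-platform", "cost": 800.0, "ai_output_id": "out-17"},
    {"booked_to": "customer-service", "cost": 5_000.0, "ai_output_id": "out-42"},
    {"booked_to": "customer-service", "cost": 300.0, "ai_output_id": None},
]

by_solution, untagged = downstream_cost_by_solution(corrections, output_to_solution)
print(by_solution)
print("untagged:", untagged)
```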

Book anti-value as its own line. Report the assembled anti-value cost next to productive spend, not inside it. A single AI line item, with the good work and the harm blended together, is a control switched off. Three lines, productive, scrap, and anti-value, turn the same total into something a leader can act on.

Replace the gross AI number with two honest ones. The gross bill is wrong in both directions. It overstates the cost that did useful work, because it includes scrap and the production of failed output, and it understates the true cost of the AI, because the downstream cost of harm is not on it. Report two numbers instead. Net productive AI cost, the productive share of the bill, is the cost that actually bought results, and it is the honest base for any value conversation. Fully loaded AI cost, the gross bill plus the downstream undo cost, is what the AI truly cost to run, including the harm it caused. The single gross figure is neither, and it is the one most reports use.
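
The two replacement numbers follow directly from the disposition split and the tagged downstream cost. The sketch below uses hypothetical function and key names and illustrative dollars.

```python
def honest_ai_costs(productive, scrap, escaped_production, downstream_corrections):
    """Derive the two reported numbers from the split of the gross bill."""
    gross = productive + scrap + escaped_production
    return {
        "gross": gross,                                   # what most reports show
        "net_productive": productive,                     # cost that bought good results
        "fully_loaded": gross + downstream_corrections,   # bill plus downstream undo cost
    }


# Illustrative month: 70,000 productive, 10,000 scrap, 20,000 escaped,
# and 60,000 of corrections booked in other business processes.
costs = honest_ai_costs(70_000, 10_000, 20_000, 60_000)
for name, value in costs.items():
    print(f"{name}: {value}")
```

In this made-up month the gross figure of 100,000 sits between the two honest ones: above the 70,000 that bought results and below the 160,000 the AI cost once the cleanup is counted.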

Spend on detection to shrink the deepest costs. The depth of an anti-value event is set by how long the error survived before someone caught it. Evaluation, guardrails, and review that catch errors earlier move spend out of the negative region and into the contained, cheaper scrap region. Seen this way, detection is not overhead on top of the AI. It is the lever that turns an expensive escaped error into a cheap caught one. Budget it as such.

Put the ratio in the reviews you already run. Track the anti-value ratio over time and across systems, in the same reviews that track unit cost and reliability. A system with a healthy cost per successful task and a rising anti-value ratio is one the efficiency number alone will not flag. The ratio is the number that catches a system getting cheaper at producing good answers and more expensive at producing bad ones.
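
A review check for that pattern might look like the sketch below. Comparing the first and last period and the `tolerance` parameter are simplifying assumptions of this sketch, and the histories are invented; a real review would choose its own trend test.

```python
def flags_hidden_decline(history, tolerance=0.0):
    """Flag a system whose efficiency looks healthy while its harm grows.

    `history` is a list of (cost_per_successful_task, anti_value_ratio)
    pairs, oldest first.
    """
    if len(history) < 2:
        return False
    first_cost, first_ratio = history[0]
    last_cost, last_ratio = history[-1]
    cost_healthy = last_cost <= first_cost
    ratio_rising = last_ratio > first_ratio + tolerance
    return cost_healthy and ratio_rising


# Made-up quarterly histories for two hypothetical systems.
histories = {
    "system-a": [(2.10, 0.12), (2.00, 0.11), (1.95, 0.10)],
    "system-b": [(2.10, 0.12), (1.90, 0.30), (1.70, 0.55)],
}

flagged = [name for name, h in histories.items() if flags_hidden_decline(h)]
print("review these systems:", flagged)
```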

Honest limits

The framework has real limits, and naming them is part of using it well.

The boundary takes judgment. The line between an output that was merely wrong and one that did harm is not crisp. It depends on whether the output escaped and was acted on, and on how much of the downstream cost should be charged to it. Two careful practitioners will draw that line in slightly different places. The framework gives the structure of the judgment, but it does not remove the judgment.

Attribution is sometimes ambiguous. Linking a downstream correction to a specific AI output is clean when the error was flagged and traced. It is not clean when the error mixed with human error, or when a bad decision drew on several inputs and the AI was one of them. The causal tag is only as good as the link behind it. Where the link is genuinely unclear, the honest move is to leave the cost untagged rather than charge it to the AI to make the number look complete. Conservative tagging undercounts anti-value, and over-attribution discredits it, so the first error is the safer one.

The constants are not public. Hallucination rates by domain, undo costs, and detection latencies are not published for production AI. The numbers in this work are illustrative, and each enterprise calibrates its own, the same position the Cost Genome reaches on reliability and the Token BOM reaches on utilization and yield. The structure transfers, but the constants do not.

The instrumentation is not off the shelf. The disposition field and the causal back reference are not in FOCUS or the OpenTelemetry conventions. Until they are, the data is assembled per enterprise, which is real work and a real barrier to adoption. The framework assumes standards work will follow the path the cost standards did, and does not depend on it having arrived.

Detection has its own cost and its own optimum. Shortening detection latency reduces anti-value, but detection is not free, and past some point more detection costs more than the harm it prevents. The framework frames the trade, but it does not compute the optimal level of detection spend, which is specific to each system and each cost of error.

The ratio counts cost, not destroyed value. The framework stops at incurred cost on purpose. The full damage of a wrong output that shipped, the lost trust, the abandoned deal, is real and often larger, and it sits outside the taxonomy because value does. The anti-value ratio is a floor on the harm, not the whole of it. It measures what was spent cleaning up, not what was lost.

Conclusion

Every cost framework assumed that technology spend buys value of zero or more. AI broke the assumption. It can spend to produce a wrong result, then spend more to undo it, which is value below zero. This work calls that anti-value and gives TBM a way to record it.

The framework adds nothing to the structure of the taxonomy. It adds two attributes that ride on the structure that exists. The first is a disposition on each AI dollar, productive, scrap, or anti-value. The second is a causal reference that links a downstream cleanup cost back to the AI solution that caused it, without moving the cost from where it belongs. From those two attributes it reports one number, the anti-value ratio, the harm cost carried per dollar of AI spend.

It works alongside two earlier frameworks. The Cost Genome of Agentic AI priced the cost of using AI, and the Token BOM of Enterprise AI priced the cost of making it. Both counted good spend and bad spend together. This framework separates the bad from the good across both, and puts a price on the harm, the part the first two measured but did not name.

TBM was right for the technology it was built for. That technology did not produce costs below zero, so the taxonomy never needed a place for them. AI does, and now it has one, built from the taxonomy's own materials.

The contribution is not a number, and it does not depend on the harm being large. Anti-value is invisible by construction and has no upper bound, so it has to be measured whatever its size turns out to be in any given system. The gross AI bill holds three different things inside one figure, the work that helped, the work that produced nothing, and the work that did damage. A leader who sees only the sum cannot manage any of the three. The chatbot's AI line showed cents of inference and nothing else, while the cost it truly caused sat in three places across the business. Separating those is the first step, and a cost that can be seen is a cost that can be managed.

The structure is built to last. The axis, the two attributes, and the ratio hold as the technology and its prices change. The constants will need calibration as production data becomes available, and the framework is designed to take that calibration without redesign.

Glossary

Anti-value. Spend that produces a result worse than no result at all. The cost of producing a wrong output that escaped detection, plus the cost of finding and undoing it.
Anti-value ratio. The anti-value cost divided by total AI spend. The harm cost carried per dollar of AI spend, which can exceed one.
Disposition. An attribute carried on each AI cost line, with three values, productive, scrap, or anti-value. Added alongside the existing taxonomy coordinates, not in place of them.
Scrap. Zero-value spend that produced nothing and was contained inside the system, the failed run and the discarded token.
Detection latency. The time between when a wrong output was produced and when it was caught. The driver of how far an anti-value event runs below zero.
Net productive AI cost. The productive share of the AI bill. The cost that actually bought good results.
Fully loaded AI cost. The gross AI bill plus the downstream cost of undoing its errors. What the AI truly cost to run.
Cost per successful task. Total AI spend across the cost layers divided by the number of tasks completed correctly. The consumption-side unit of the Cost Genome of Agentic AI.
Yield loss. The share of manufactured tokens that are scrapped before they ship, whose cost is borne by the good tokens. The production-side measure of the Token BOM of Enterprise AI.

References

American Society for Quality. Cost of Quality. https://asq.org/quality-resources/cost-of-quality
Becker, J., Rush, N., Barnes, E., Rein, D. Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. METR, July 2025. arXiv:2507.09089. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
Dahl, M., Magesh, V., Suzgun, M., Ho, D. E. Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models. Journal of Legal Analysis 16(1), 2024. https://law.stanford.edu/2024/01/11/hallucinating-law-legal-mistakes-with-large-language-models-are-pervasive/
DM, Dinesh. Decoding the Cost Genome of Agentic AI. April 2026. (https://dineshdm.blog/blog/cost-genome-agentic-ai)
DM, Dinesh. Decoding the Token BOM of Enterprise AI. June 2026. (https://dineshdm.blog/blog/token-bom-enterprise-ai)
Entrepreneur. Klarna CEO Reverses Course by Hiring More Humans, Not AI. May 2025. https://www.entrepreneur.com/business-news/klarna-ceo-reverses-course-by-hiring-more-humans-not-ai/491396
FinOps Foundation. FinOps Framework. https://www.finops.org/framework/
FOCUS Specification. https://focus.finops.org
Klarna. Klarna AI assistant handles two-thirds of customer service chats in its first month. February 2024.
Magesh, V., et al. Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. Stanford RegLab, 2024. https://reglab.stanford.edu/
Mata v. Avianca, Inc., 678 F. Supp. 3d 443 (S.D.N.Y. 2023). https://law.justia.com/cases/federal/district-courts/new-york/nysdce/1:2022cv01461/575368/54/
METR. Time horizons. https://metr.org/time-horizons/
Moffatt v. Air Canada, 2024 BCCRT 149. British Columbia Civil Resolution Tribunal, February 2024.
OpenTelemetry. Semantic Conventions for Generative AI. https://opentelemetry.io/docs/specs/semconv/gen-ai/
Sierra. Benchmarking AI agents. https://sierra.ai/blog/benchmarking-ai-agents
TBM Council. TBM Taxonomy Version 5.0.1. July 2025. https://www.tbmcouncil.org/taxonomy/
TBM Council. TBM for AI Value Realization. October 2024.
Yao, S., Shinn, N., Razavi, P., Narasimhan, K. tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. 2024. arXiv:2406.12045