Dinesh DM | The Real Cost of AI: FinOps & TBM

Decoding the Cost Genome of Agentic AI

1. Introduction

Cloud cost is a function of provisioning. Enterprises pay for capacity allocated in advance, with usage variance bounded by architectural decisions made before the workload runs. Agent cost is a function of runtime behavior. Each user request triggers a chain of metered actions: model inference, context retrieval, tool calls, output review. The cost of a single agent transaction is determined at execution time, not at design time.

This shift breaks several assumptions that current cost frameworks rely on. Traditional FinOps treats spend as a function of provisioned resources. Existing TBM mappings allocate cloud spend to applications and services with stable boundaries. Both approaches assume the unit of work is bounded and predictable. Agent workloads are neither.

The 2026 FinOps for AI Overview identifies this gap directly, citing a "lack of generally accepted frameworks for cost allocation across multi-agent workloads." The TBM Council's July 2025 Taxonomy 5.0.1 introduces an Artificial Intelligence Solutions category with five sub-categories, including Agentic AI, but does not yet specify how cost decomposes within an agent transaction. Standards work is in progress at OpenTelemetry, the FinOps Foundation, and the TBM Council, but the request-flow cost shape of agent workloads remains underspecified.

This paper proposes a framework for that shape: seven cost layers per agent transaction, a reliability multiplier applied across all seven, a mapping to TBM 5.0 cost pools and resource towers, and cost per successful task as the unit metric. The framework is grounded in public pricing data, independent research, and published vendor documentation as of April 2026.

Method

Sources are limited to public material: vendor pricing pages, research papers, standards documentation, and published case studies. Pricing references are dated and will require refresh. No proprietary or internal data is used. Claims that could not be sourced are stated as the author's argument, with assumptions noted in line.

2. The shift from provisioned to runtime cost

Enterprise IT cost has gone through two visible transitions in the past two decades. The first was virtualization and cloud, which moved infrastructure cost from capital purchases to operational subscriptions. The second was SaaS, which moved application cost from licensing to consumption. Both transitions changed the timing and granularity of spend without changing the underlying assumption that cost is a function of provisioned resources.

Agent workloads represent a third transition. Cost is now a function of how a non-deterministic system chooses to execute a task. The same user query, processed by the same agent, can produce materially different costs on different runs depending on the model's reasoning path, the number of tool calls invoked, and the number of retries triggered by transient failures. FinOps practices are already absorbing this transition: the 2026 State of FinOps report finds that 98% of FinOps practices now manage AI spend, compared to 31% two years earlier. (FinOps Foundation, State of FinOps 2026, https://data.finops.org/.)

Request-flow cost accumulation

In a provisioned model, cost is determined by the architect at design time. A virtual machine of a given size costs a known amount per hour. A SaaS subscription with a known seat count costs a known amount per month. The variance is operational, not architectural.

In an agent model, cost is determined by the agent at runtime. A simple query may require a single inference call. A complex query may trigger seven tool calls, three reasoning steps, two failed attempts, and a final fallback to a more expensive model. The cost of each transaction is the sum of metered actions across multiple billing surfaces, each with its own pricing model.

This is the request-flow shape of agent cost. Spend accumulates along the path of a single transaction, across up to seven distinct cost layers described in Section 3, each with its own optimization levers. Reliability, addressed in Section 4, multiplies the total.
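The accumulation described above can be sketched as a sum over metered actions, grouped by billing surface. This is a minimal illustration rather than part of any cited framework: the MeteredAction type, the surface labels, and every unit price are hypothetical placeholders, not vendor figures.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MeteredAction:
    surface: str       # billing surface label (hypothetical), e.g. "inference"
    unit_price: float  # USD per billed unit (placeholder, not a vendor price)
    units: float       # units consumed by this action


def cost_by_surface(actions):
    """Sum the cost of one transaction's actions per billing surface."""
    totals = defaultdict(float)
    for action in actions:
        totals[action.surface] += action.unit_price * action.units
    return dict(totals)


def transaction_cost(actions):
    """Total cost of one transaction across all billing surfaces."""
    return sum(cost_by_surface(actions).values())


# A simple query: a single inference call.
simple_query = [MeteredAction("inference", 0.002, 1)]

# A complex query: three reasoning steps, seven tool calls, one fallback call.
complex_query = [
    MeteredAction("inference", 0.002, 3),
    MeteredAction("tool_api", 0.01, 7),
    MeteredAction("fallback_inference", 0.02, 1),
]

print(transaction_cost(simple_query))
print(cost_by_surface(complex_query))
```

The same agent produces both transactions. Only the path taken at runtime differs, which is why the total cannot be fixed at design time.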

The Klarna case

The Klarna AI assistant is the most widely documented public case study of agent economics through this transition. The February 2024 announcement reported 2.3 million conversations in the assistant's first month, work equivalent to 700 full-time human agents, average resolution time reduced from eleven minutes to two, and approximately $40 million in projected annual profit impact. (Klarna, Klarna AI assistant handles two-thirds of customer service chats in its first month, February 2024.)

By 2025, Klarna had begun rebuilding its human customer service workforce.

The original 2024 metrics were not incorrect. They were incomplete. Conversation volume measured usage. Resolution time measured productivity. Neither captured cost per successful resolution, the share of conversations where the agent provided a poor response, or the engineering investment required to maintain reliability as the product surface expanded. The metrics that drove the 2024 narrative did not include the reliability dimension that drove the 2025 reversal.

The framework proposed in this paper is intended to surface both dimensions explicitly. Section 3 decomposes the cost numerator. Section 4 introduces the reliability multiplier. Section 6 defines the unit metric (cost per successful task) that the Klarna 2024 analysis lacked.

Implications for cost taxonomy

Agent cost requires a taxonomy that does four things. It must decompose spend at the request level rather than the provisioning level. It must attribute spend to specific agents and tasks, not just to applications. It must accommodate non-deterministic execution, where identical inputs can produce different costs. It must connect to existing financial categories so data flows from observability traces to the financial ledger without manual translation.

The remainder of this paper addresses each of these requirements.

3. The seven-layer cost genome

A single agent transaction can incur cost across seven distinct layers. The layers are peers, not a hierarchy. Reliability, addressed separately in Section 4, multiplies all seven.

Each layer is described below using the same structure: definition, scaling logic, reference pricing as of April 2026, optimization levers with sources, and TBM 5.0 mapping. Pricing references will require periodic refresh. The structural mapping should remain stable.

Layer 1: Inference

Definition. The model API call that performs the agent's reasoning step.

Scaling. Per token consumed. Input tokens, output tokens, and cached tokens are billed at different rates. Reasoning models bill chain-of-thought tokens as output tokens, which can make a query routed to a reasoning model significantly more expensive than the same query routed to a comparable non-reasoning model.

Reference pricing, April 2026. Claude Opus 4.7 is priced at $5 per million input tokens and $25 per million output tokens. GPT-5.4 is $2.50 input and $15 output. Gemini 3 Pro Preview is $2 input and $12 output. Claude Haiku 4.5 is $1 and $5. GPT-4o-mini is $0.15 and $0.60. The price differential between flagship reasoning models and lightweight production models exceeds thirty times for equivalent token volumes. (Anthropic pricing page, https://claude.com/pricing. OpenAI pricing page, https://developers.openai.com/api/docs/pricing. Google Vertex AI pricing page, https://cloud.google.com/vertex-ai/generative-ai/pricing. All fetched April 27, 2026.)
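The per-token arithmetic can be made concrete with a short sketch. The prices are the per-million-token figures quoted above. The dictionary keys are informal labels rather than API model identifiers, and the token counts in the example are hypothetical.

```python
# (input USD, output USD) per million tokens, as quoted in the reference pricing.
PRICES_PER_MTOK = {
    "claude-opus-4.7": (5.00, 25.00),
    "gpt-5.4": (2.50, 15.00),
    "gemini-3-pro-preview": (2.00, 12.00),
    "claude-haiku-4.5": (1.00, 5.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def call_cost(model, input_tokens, output_tokens):
    """Cost in USD of one call, ignoring caching and batch discounts."""
    input_price, output_price = PRICES_PER_MTOK[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical call: 10,000 input tokens and 2,000 output tokens.
flagship = call_cost("claude-opus-4.7", 10_000, 2_000)
lightweight = call_cost("gpt-4o-mini", 10_000, 2_000)

print(f"flagship:    ${flagship:.4f}")
print(f"lightweight: ${lightweight:.4f}")
print(f"ratio:       {flagship / lightweight:.1f}x")
```

The ratio depends on the input-to-output mix, because the two models' input and output prices differ by different factors.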

Optimization levers. Two levers are well-evidenced in published research.

Model cascading routes simpler queries to lower-cost models and reserves higher-cost models for queries that require them. The FrugalGPT paper reports cost reductions of 50% to 98% at GPT-4-equivalent quality on standard benchmarks. (Chen, L., Zaharia, M., Zou, J., FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, TMLR 2024, arXiv:2305.05176.)

Cost-aware routing extends cascading by training a router model to predict which queries require the higher-cost path. The RouteLLM paper reports up to 85% cost reduction at 95% of GPT-4 quality on MT-Bench. (Ong, I., Almahairi, A., Wu, V., Chiang, W., Wu, T., Gonzalez, J. E., Kadous, M. W., Stoica, I., RouteLLM: Learning to Route LLMs with Preference Data, 2024, arXiv:2406.18665.)

Both levers require evaluation infrastructure (Layer 6) to verify that quality has not degraded post-routing.
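A cascade can be sketched in a few lines. This is a simplified illustration of the general idea, not the FrugalGPT or RouteLLM implementation: both papers use learned scoring or routing models, whereas the scorer, the stub models, and the per-call costs below are hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

# (tier name, model function, USD per call); all values are placeholders.
Tier = Tuple[str, Callable[[str], str], float]


def cascade(query: str, tiers: List[Tier],
            scorer: Callable[[str, str], float], threshold: float = 0.8):
    """Try tiers cheapest-first and stop at the first answer the scorer accepts.

    Returns (answer, tier_name, total_spend). If no tier clears the threshold,
    the answer from the last (most expensive) tier is returned.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    spend = 0.0
    for name, model, cost in tiers:
        answer = model(query)
        spend += cost
        if scorer(query, answer) >= threshold:
            break
    return answer, name, spend


# Stub models and scorer, for illustration only.
def small_model(query: str) -> str:
    return "short answer" if len(query) < 40 else "unsure"


def large_model(query: str) -> str:
    return "detailed answer"


def stub_scorer(query: str, answer: str) -> float:
    return 0.0 if answer == "unsure" else 1.0


TIERS = [("small", small_model, 0.001), ("large", large_model, 0.030)]

easy = cascade("What is my order status?", TIERS, stub_scorer)
hard = cascade("Compare the refund terms across my last three orders and summarize.",
               TIERS, stub_scorer)
print(easy)
print(hard)
```

Escalated queries pay for both tiers, so the saving depends on how often the cheaper tier's answer is accepted. That acceptance decision is what the evaluation infrastructure has to verify.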

TBM 5.0 mapping. Cost Pool: Cloud Services / Cloud Service Provider for API consumption. Resource Tower: AI Compute when self-hosted on GPU infrastructure, or Application / AI Models when consumed as an API. Solution: Agentic AI.

Layer 2: Context and Memory

Definition. The infrastructure that supplies the model with information not contained in its weights. Includes vector search for retrieval-augmented generation, embedding generation, conversation history within a session, persistent memory across sessions, and prompt caching.

The framework treats per-turn context (RAG) and across-turn memory (state, cache, history) as a single layer. Both share infrastructure (vector databases, storage, embedding services) and typically arrive on the same vendor invoice.

Scaling. Per embedding generated. Per gigabyte stored. Per vector queried. Per cached token read or written.

Reference pricing, April 2026. OpenAI text-embedding-3-large is priced at $0.13 per million tokens. Pinecone Serverless is priced at $0.33 per gigabyte-month for storage, $8.25 per million read units, and $2.00 per million write units. Anthropic prompt cache reads are billed at 0.1 times the input price, a 90% discount on cached content. (OpenAI pricing page. Pinecone pricing page, https://www.pinecone.io/pricing/. Anthropic prompt caching documentation, https://platform.claude.com/docs/en/build-with-claude/prompt-caching.)

The vector database pricing model has shifted since 2023. Earlier cost analyses citing Pinecone at "$0.10 per gigabyte per day" reflect a pricing model that no longer applies. Current pricing is per gigabyte per month with read and write units billed separately, and the structural difference materially affects total cost calculations.

Optimization levers.

Prompt caching reuses the prefix of a request across multiple calls. Anthropic charges 0.1× the input price on cache reads. OpenAI charges between 0.1× and 0.5× depending on the model. Google Gemini charges 0.1× on Gemini 2.5 and later models. The break-even point on Anthropic's caching is approximately two cached reads after one cache write, after which each additional read is at a 90% discount. (Anthropic, OpenAI, and Google Gemini caching documentation.)
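The break-even count depends on the premium charged for the cache write, which is not stated above. The sketch below treats that premium as a parameter. The write multipliers passed in are illustrative assumptions and should be checked against the vendor's current documentation. Costs are expressed in units of one uncached read of the prefix.

```python
import math


def cached_cost(reads, write_mult, read_mult=0.1):
    """One cache write followed by `reads` cache hits."""
    return write_mult + read_mult * reads


def uncached_cost(reads):
    """The same 1 + reads requests, each paying the full input price."""
    return 1 + reads


def break_even_reads(write_mult, read_mult=0.1):
    """Smallest whole number of cache reads at which caching is strictly cheaper."""
    threshold = (write_mult - 1.0) / (1.0 - read_mult)
    return max(1, math.floor(threshold) + 1)


# Assumed write premiums (hypothetical inputs to the sketch).
for write_mult in (1.25, 2.0):
    n = break_even_reads(write_mult)
    print(write_mult, n, cached_cost(n, write_mult), uncached_cost(n))
```

With an assumed write multiplier of 2.0 the sketch returns two reads, in line with the figure above. A smaller write premium breaks even sooner.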

Retrieval-augmented generation reduces inference cost relative to long-context approaches that include all potentially relevant information in the prompt. Li et al. compared the two approaches at EMNLP 2024 Industry Track and reported that long-context approaches consistently outperform RAG on quality when sufficiently resourced, while RAG costs materially less. The paper proposes a hybrid approach called Self-Route, which routes each query to RAG or to long context based on the model's own self-reflection, and reports performance comparable to long context at reduced cost. (Li, Z., Li, C., Zhang, M., Mei, Q., Bendersky, M., Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach, EMNLP 2024 Industry Track, arXiv:2407.16833.)

A related finding from Liu et al. is that long contexts exhibit recall degradation in the middle of the context window. Information placed at the beginning or end is recalled reliably; information placed in the middle is not. This affects both quality and cost: prompts that fill the context window may produce worse outputs than smaller, targeted prompts. (Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P., Lost in the Middle: How Language Models Use Long Contexts, TACL 2024, arXiv:2307.03172.)

TBM 5.0 mapping. Cost Pool: Cloud Services for cloud-hosted vector databases and embedding APIs. Software & SaaS / SaaS for managed vector databases billed as subscriptions. Resource Tower: AI Storage and Data / Data Operations. Solution: Agentic AI.

Layer 3: Orchestration

Definition. The control plane that determines the agent's next action. Includes tool selection, step sequencing, retry logic, and the framework code that integrates models, tools, and memory.

Scaling. Per orchestration step. Per tool selection decision. Per agent loop iteration.

Cost dynamics. Orchestration spend is small per step but accumulates significantly because orchestration drives every other layer. Inefficient orchestration produces unnecessary loops and redundant tool calls, inflating spend in Layers 1 and 4. Efficient orchestration terminates reasoning when confidence thresholds are met, reducing both inference and tool spend.

Anthropic's December 2024 publication Building Effective AI Agents draws an explicit distinction between workflows (predictable, deterministic, lower cost) and agents (flexible, non-deterministic, higher cost). The recommendation is that workflows should be the default architectural pattern, with agent patterns reserved for tasks that require runtime flexibility. (Schluntz, E., Zhang, B., Building Effective AI Agents, Anthropic, December 2024.)

Empirical evidence on tool count. The Shopify engineering team published technical detail on Sidekick, their production agent, in 2025. The team reports tool-count thresholds: agents with zero to twenty tools select tools clearly, twenty to fifty tools produce ambiguity, and fifty or more tools produce tool collisions. The team's stated remedy is just-in-time tool instructions, supplied only when the relevant tool is in play. The same publication reports Cohen's kappa between its LLM judge and human reviewers improving from 0.02 to 0.61 after calibration; that figure measures evaluator agreement rather than tool selection. The number of tools an agent has access to is nonetheless a cost driver, because it inflates orchestration token spend on tool selection reasoning. (Shopify Engineering, Building production-ready agentic systems: Lessons from Shopify Sidekick, 2025.)

Optimization levers.

Adaptive step management caps the number of reasoning loops per query. Multiple 2024 and 2025 preprints report 35% to 77% token reduction without accuracy loss. (Dai et al., Early Stopping Chain-of-thoughts in Large Language Models, 2025, arXiv:2509.14004. Jiang et al., FlashThink: An Early Exit Method For Efficient Reasoning, 2025, arXiv:2505.13949.)
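A loop-level cap is the coarsest form of this lever and can be sketched as follows. The cited preprints exit early inside a chain of thought. This sketch only caps whole agent-loop iterations and stops once a confidence target is met. The step callable, the confidence values, and the token counts are hypothetical.

```python
from typing import Callable, Tuple


def run_with_step_cap(step: Callable[[int], Tuple[float, int]],
                      max_steps: int = 5,
                      confidence_target: float = 0.9):
    """Run reasoning steps until confident or until the cap is reached.

    `step(i)` is a hypothetical callable returning (confidence, tokens_used)
    for loop iteration i. Returns (steps_taken, total_tokens, last_confidence).
    """
    total_tokens = 0
    confidence = 0.0
    steps_taken = 0
    for i in range(max_steps):
        confidence, tokens = step(i)
        total_tokens += tokens
        steps_taken = i + 1
        if confidence >= confidence_target:
            break
    return steps_taken, total_tokens, confidence


CONFIDENCES = [0.40, 0.60, 0.95, 0.99]  # illustrative trajectory


def converging_step(i):
    return CONFIDENCES[min(i, len(CONFIDENCES) - 1)], 1_000


def stuck_step(i):
    return 0.10, 1_000  # never becomes confident


converged = run_with_step_cap(converging_step)
capped = run_with_step_cap(stuck_step, max_steps=5)
print(converged)
print(capped)
```

The cap bounds worst-case spend for queries that never converge. The confidence exit trims spend on queries that converge early.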

Workflow patterns are preferred over agent patterns where deterministic logic is sufficient, as documented in the Anthropic guidance referenced above.

TBM 5.0 mapping. Cost Pool: Cloud Services for serverless or container runtime. Resource Tower: Application / Container Orchestration or Application / AI Models, depending on the framework. Solution: Agentic AI.

Layer 4: External APIs

Definition. Calls from the agent to systems outside its own context. Includes CRM lookups, search APIs, code execution environments, payment systems, internal microservices, and document processing services.

Scaling. Per call. Per character processed. Per session-hour. Pricing varies by tool and by vendor.

Reference pricing, April 2026. Anthropic web search is priced at $10 per 1,000 calls. OpenAI web search is $10 per 1,000 calls for GPT-4o and $25 per 1,000 calls for reasoning models. Google Maps grounding is $14 per 1,000 calls. Anthropic code execution is $0.05 per container hour after a 50-hour daily free tier. Anthropic managed agent runtime is $0.08 per session-hour. (Anthropic, OpenAI, and Vertex AI pricing pages, fetched April 27, 2026.)

Indirect cost. Each external API call triggers additional inference cost in Layer 1 because the agent must read and interpret the response. A search call priced at $0.01 in tool fees may produce $0.05 in inference fees on the response interpretation. External API spend therefore inflates inference spend by a multiplier that depends on response size and reasoning complexity.

This indirect coupling is one reason the seven layers cannot be analyzed in isolation. Optimization levers in Layer 4 (fewer tool calls) reduce Layer 1 cost. Optimization levers in Layer 3 (lower tool counts) reduce both Layer 4 and Layer 1.
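The coupling between Layer 4 and Layer 1 can be illustrated with the search example above. The token counts are hypothetical and chosen so that the inference portion comes to $0.05. The per-million-token prices are the Claude Opus 4.7 figures from the Layer 1 reference pricing.

```python
def tool_call_cost(tool_fee_usd, response_tokens, input_price_per_mtok,
                   reasoning_tokens=0, output_price_per_mtok=0.0):
    """Return (tool_fee, induced_inference_cost, total) for one tool call.

    The induced inference cost covers reading the tool response as input
    tokens plus any output tokens spent reasoning over it.
    """
    read_cost = response_tokens * input_price_per_mtok / 1_000_000
    reasoning_cost = reasoning_tokens * output_price_per_mtok / 1_000_000
    inference = read_cost + reasoning_cost
    return tool_fee_usd, inference, tool_fee_usd + inference


# $10 per 1,000 search calls is $0.01 per call; token counts are assumptions.
fee, inference, total = tool_call_cost(
    tool_fee_usd=0.01,
    response_tokens=5_000, input_price_per_mtok=5.00,
    reasoning_tokens=1_000, output_price_per_mtok=25.00,
)
print(fee, inference, total)
```

Under these assumptions the tool fee is a minority of the call's true cost, which is why capping or caching tool calls also reduces Layer 1 spend.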

Optimization levers. Cap tool calls per session. Cache tool results when underlying data is stable. Prefer batch calls where the API supports them. Pre-resolve high-frequency tool calls with synthetic context.

TBM 5.0 mapping. Cost Pool: Outside Services for vendor APIs, or Software & SaaS for SaaS-billed APIs. Resource Tower: domain-specific. A CRM API maps to the Application tower for the CRM. An identity API maps to Risk & Compliance / Identity & Access Governance. This layer is the primary connection point between agent cost and the rest of the IT estate, making accurate domain mapping essential for chargeback.

Layer 5: Fine-tuning and Training

Definition. Customization of the underlying model for the agent's domain. Includes LoRA adapters, full fine-tuning, retrieval-tuned embeddings, RAG index construction, and reinforcement learning on agent traces.

Rationale for peer-layer status. Most cost frameworks treat fine-tuning as a one-time project cost. In production agent deployments, fine-tuning is recurring. Models are retuned on new agent traces, expanded tool surfaces, observed failure modes, and updated safety findings. The TBM Council's 2024 publication TBM for AI Value Realization identifies "uncontrolled model retraining costs" as one of six AI-specific financial risks. (TBM Council, TBM for AI Value Realization, October 2024.)

Scaling. Per token of training data. Per GPU-hour. Per provisioned-throughput unit per minute or per month.

Reference pricing, April 2026. AWS Bedrock Custom Model Import is approximately $0.057 per Custom Model Unit per minute in us-east-1. Cohere Command provisioned throughput is $49.50 per hour with no commitment, decreasing to $23.77 per hour with a six-month commitment. AWS p5.48xlarge instances for self-managed training have a published list rate above $90 per hour, with capacity-block reservations near $31 per hour and various third-party trackers reporting rates in between. (AWS Bedrock pricing page, https://aws.amazon.com/bedrock/pricing/. AWS p5 instance pricing page.)

The variance in p5 instance pricing across reservation models is wide enough that single-number citations are misleading. Cost models for self-managed training must account for reservation strategy, not only on-demand rates.

Optimization levers.

Right-sizing between full fine-tuning, LoRA adapters, and prompt engineering. Most production gains can be captured with LoRA adapters. Full fine-tuning is appropriate when domain-specific behavior cannot be reached through smaller interventions. Prompt engineering is appropriate when the desired behavior can be elicited through examples.

Provisioned throughput is cost-effective only above approximately 80% utilization. The MetLife contribution to the FinOps for AI Working Group documents the formula: spend equals PTU rate multiplied by (2 minus utilization rate) multiplied by token count divided by token unit. Below 80% utilization, on-demand pricing is preferable. (Barney, J., How to Build a Generative AI Cost and Usage Tracker, FinOps Foundation, https://www.finops.org/wg/how-to-build-a-generative-ai-cost-and-usage-tracker/.)
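The formula as stated above translates directly into code. The sketch below transcribes that wording and should not be read as the working group's reference implementation. The units of the PTU rate and token unit, and every number in the example, are assumptions for illustration.

```python
def ptu_spend(ptu_rate, utilization, token_count, token_unit):
    """Spend = PTU rate x (2 - utilization) x token count / token unit."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return ptu_rate * (2.0 - utilization) * token_count / token_unit


def prefer_provisioned(utilization, threshold=0.80):
    """Rule of thumb from the text: provisioned pays off above about 80%."""
    return utilization > threshold


# Hypothetical inputs: $10 per 1M-token unit, 50M tokens in the period.
for utilization in (0.5, 0.8, 1.0):
    spend = ptu_spend(ptu_rate=10.0, utilization=utilization,
                      token_count=50_000_000, token_unit=1_000_000)
    print(utilization, round(spend, 2), prefer_provisioned(utilization))
```

The (2 minus utilization) term acts as a penalty for idle capacity. At full utilization it equals one, and it grows as utilization falls.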

TBM 5.0 mapping. Cost Pool: Cloud Services for training compute, plus Staffing for fine-tuning labor and data preparation. Resource Tower: AI Compute, AI Storage, Data / Data Operations. Solution: Agentic AI.

Layer 6: Evaluation

Definition. Execution of the agent against a test set in response to changes. Triggers include prompt changes, tool changes, model changes, and release events. Mature implementations run evaluations on every commit.

Rationale for peer-layer status. Evaluation is now a continuous cost rather than a project cost. By 2026, mature teams treat evaluation as a deployment gate, and the associated spend is recurring and material.

Scaling. Each evaluated task consumes the same inference and tool spend as a production task. Spend per evaluation run is that per-task cost multiplied by the size of the evaluation set, and spend per period is the per-run cost multiplied by the change frequency.

Reference cost. A 1,000-task evaluation set running against an agent with $0.30 per-task cost incurs $300 per evaluation run. Running the evaluation on every pull request can produce evaluation spend comparable to production spend. The τ-bench paper reports approximately $200 per full benchmark trial across the agent and user simulator components. (Yao, S., Shinn, N., Razavi, P., Narasimhan, K., τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains, 2024, arXiv:2406.12045.)
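The arithmetic behind the reference cost is short enough to write out. The task count and per-task cost are the figures from the paragraph above. The number of pull requests per month is a hypothetical value added to show how change frequency scales the total.

```python
def eval_run_cost(task_count, cost_per_task):
    """Cost of one full evaluation run."""
    return task_count * cost_per_task


def eval_spend_per_period(task_count, cost_per_task, runs_per_period):
    """Evaluation spend when a full run is triggered `runs_per_period` times."""
    return eval_run_cost(task_count, cost_per_task) * runs_per_period


per_run = eval_run_cost(1_000, 0.30)
# Assumption for illustration: 200 pull requests per month, each triggering a full run.
per_month = eval_spend_per_period(1_000, 0.30, runs_per_period=200)
print(per_run, per_month)
```

At that assumed change frequency, evaluation spend equals the cost of 200,000 production tasks per month, which is how it can rival production spend.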

Empirical anchor. Anthropic's Demystifying evals for AI agents documents pass^k as the appropriate evaluation metric for agent reliability. The Sonnet 4.5 release notes report that internal code-edit error rate decreased from 9% to 0% specifically because the team ran higher-k evaluations during training. Evaluation is the mechanism through which reliability improvements are measured and verified. (Anthropic Engineering, Demystifying evals for AI agents. Anthropic, Introducing Claude Sonnet 4.5.)

Optimization levers.

Stratified sampling reduces evaluation spend by running smaller test subsets weighted toward higher-risk task categories on each change, with full evaluations triggered on a less frequent schedule.

Cheap evaluations (LLM-as-judge) gating expensive evaluations (full multi-turn agent runs) reduce average per-change cost.

Aggressive caching of evaluation traces reduces redundant computation when prompt changes do not invalidate prior results.

TBM 5.0 mapping. Cost Pool: Cloud Services for evaluation compute. Software & SaaS for managed evaluation platforms (LangSmith, Langfuse, Braintrust, and similar). Resource Tower: Application / AI Models. Solution: Agentic AI.

Layer 7: Governance

Definition. The cost of operating agents responsibly. Includes logging, content filtering, human-in-the-loop review, audit trails, compliance tooling, and incident response.

Scaling. Per output reviewed. Per log byte stored. Per compliance check executed. Plus labor cost for human review.

Cost magnitude. A regulated agent deployment may run a compliance check on every output, and each check is itself an inference call. An AI output detection service may charge approximately $0.001 per message. At one million messages per month, that service costs $1,000; at ten million, $10,000. Human review at fifty hours per month, fully loaded at $50 per hour, adds $2,500.
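The same figures can be combined into a small monthly model. The function and parameter names are hypothetical. The defaults are the illustrative values from the paragraph above, not measured rates.

```python
def governance_monthly_cost(messages, check_price=0.001,
                            review_hours=50, hourly_rate=50.0):
    """Return (automated_checks, human_review, total) in USD per month."""
    automated = messages * check_price
    human = review_hours * hourly_rate
    return automated, human, automated + human


for volume in (1_000_000, 10_000_000):
    automated, human, total = governance_monthly_cost(volume)
    print(volume, round(automated, 2), human, round(total, 2))
```

The automated component scales with message volume, while the human component in this sketch is fixed. In practice, review hours would likely grow with volume unless review is risk-tiered.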

The TBM Council's 2024 framework characterizes governance directly. Of the six AI-specific financial risks identified, five are governance-shaped: runaway operating expenses, underestimated data costs, data privacy penalties, uncontrolled retraining costs, and talent-related costs. Only "overprovisioned infrastructure" is purely operational. Governance spend is therefore not auxiliary to AI deployment but a central component of it. (TBM Council, TBM for AI Value Realization, 2024.)

Optimization levers.

Risk-tiered review applies human review only to outputs in high-risk categories, with automated review for lower-risk surfaces.

Automated compliance checks are deployed where rules are deterministic, with human review reserved for surfaces requiring judgment.

Sampling and structured logging reduce storage costs relative to full prompt-and-response logging at scale, while preserving audit and debugging capability.

TBM 5.0 mapping. Cost Pool: Staffing / Internal Labor for human review. Software & SaaS for monitoring and compliance tools. Resource Tower: Risk & Compliance / Regulatory & Audit, Risk & Compliance / Identity & Access Governance, Security / Digital Security. Solution: Agentic AI.

Layer interaction

A single user query can incur cost in all seven layers. An orchestrator (Layer 3) triggers an inference call (Layer 1) using cached context and a freshly retrieved document (Layer 2). The model decides to call a CRM API (Layer 4). The response is logged for audit (Layer 7), the trace is sampled into the evaluation set (Layer 6), and a portion of recent traces will inform the next adapter retraining cycle (Layer 5). The total cost of the query is the sum across all seven layers, multiplied by the reliability tax described in Section 4.

The relative weight of each layer varies by workload. A question-answering agent is typically dominated by Layer 1. An agent integrating with multiple enterprise systems may be dominated by Layer 4. A regulated-industry agent will see Layer 7 grow in absolute terms. The framework does not assume fixed proportions. It makes each layer visible so the team can identify which layer drove a change in total spend.

4. The reliability multiplier

Reliability is not a layer in the cost genome. It is a multiplier across all seven layers.

Conventional FinOps cost models assume that each unit of work incurs a stable cost. One API call corresponds to one billable event. One transaction corresponds to one cost. This assumption is correct for deterministic systems. It does not hold for agent systems.

Agent systems fail at higher rates than deterministic software. They retry. They fall back to alternative models. They escalate to human operators. Each failure path incurs cost without delivering additional value. Total cost across the seven layers is therefore not the simple sum of layer costs, but the sum multiplied by a reliability tax.

total_cost = sum(layer_costs) × (1 + retry_rate + failover_rate)

A 10% retry rate inflates total cost by 10% with no additional value produced. The inflated spend does not appear in any single layer because it is a property of the system, not of any individual component.
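The formula can be exercised with a short sketch. The per-layer amounts are hypothetical values chosen to sum to $0.30 per task. Only the formula itself comes from the text.

```python
def total_cost(layer_costs, retry_rate=0.0, failover_rate=0.0):
    """total_cost = sum(layer_costs) x (1 + retry_rate + failover_rate)."""
    return sum(layer_costs.values()) * (1 + retry_rate + failover_rate)


# Hypothetical per-task cost by layer, in USD.
LAYER_COSTS = {
    "inference": 0.12,
    "context_and_memory": 0.03,
    "orchestration": 0.02,
    "external_apis": 0.06,
    "fine_tuning": 0.02,
    "evaluation": 0.03,
    "governance": 0.02,
}

baseline = total_cost(LAYER_COSTS)
with_retries = total_cost(LAYER_COSTS, retry_rate=0.10)
with_both = total_cost(LAYER_COSTS, retry_rate=0.10, failover_rate=0.05)
print(round(baseline, 4), round(with_retries, 4), round(with_both, 4))
```

The additive form treats a retry as a full re-run of the average transaction. A retry that repeats only part of the path would cost less, and a failover to a more expensive model would cost more, so the rates are best read as cost-weighted.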

Empirical evidence

The strongest published evidence for the reliability tax is the τ-bench paper, released by researchers at Sierra and Princeton in 2024. State-of-the-art function-calling agents at the time of publication succeeded on under 50% of tasks on a single attempt. Consistency was lower still: in the retail domain, the probability of succeeding on all eight of eight attempts at the same task fell below 25%. (Yao, S., Shinn, N., Razavi, P., Narasimhan, K., τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains, 2024, arXiv:2406.12045.)

The metric introduced for this measurement is pass^k. It represents the probability that the agent succeeds on every one of k consecutive attempts at the same task. Pass^1 is the success rate on a single attempt. Pass^8 is the success rate on eight consecutive attempts. The decline from pass^1 to pass^8 is the retry tax made empirically visible.

Anthropic's engineering documentation reports pass^k as the appropriate metric for agent design and notes that a 75% per-step success rate compounds to a 42% three-step success rate. The arithmetic is elementary. The implication for cost is that any agent operating in a multi-step workflow produces non-trivial retry cost as a function of step count and per-step reliability. (Anthropic Engineering, Demystifying evals for AI agents.)
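The compounding can be checked directly. The first function assumes independent attempts with a fixed success rate. The second estimates pass^k from repeated trials using the combinatorial form C(c, k) / C(n, k) averaged over tasks, which is offered here as an assumption about how the metric is typically computed. The trial counts are invented.

```python
from math import comb


def pass_hat_k_from_rate(success_rate, k):
    """Probability that k independent attempts all succeed."""
    return success_rate ** k


def pass_hat_k_estimate(successes_per_task, trials, k):
    """Average over tasks of C(c, k) / C(n, k), with c successes in n trials."""
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    denominator = comb(trials, k)
    per_task = [comb(c, k) / denominator for c in successes_per_task]
    return sum(per_task) / len(per_task)


print(pass_hat_k_from_rate(0.75, 3))

# Hypothetical results: three tasks, eight trials each.
SUCCESSES = [8, 6, 0]
print(pass_hat_k_estimate(SUCCESSES, trials=8, k=1))
print(pass_hat_k_estimate(SUCCESSES, trials=8, k=8))
```

In the invented data, only the task that succeeded on all eight trials contributes to pass^8, which is why the metric falls so sharply as k grows.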

The METR time-horizon analysis provides a complementary framing. Reliability degrades with task length: GPT-5.1-Codex-Max has a 50%-success-rate horizon of approximately 2 hours 40 minutes and an 80%-success-rate horizon of approximately 30 minutes. Longer tasks fail at higher rates than shorter tasks. (METR, Time horizons, https://metr.org/time-horizons/.)

Implications for chargeback

A FinOps practice that allocates agent spend without adjusting for reliability will systematically misallocate cost between teams. A team running a 90% reliable agent imposes a higher per-task cost on the shared infrastructure than a team running a 99% reliable agent for the same workload. The 99% team has invested engineering effort in reliability that the 90% team has not. A chargeback model that accounts only for token spend or compute consumption fails to reflect this distinction.

The corresponding pattern in cloud FinOps is the allocation of overprovisioning costs to the teams that overprovisioned. The reliability tax is the equivalent pattern for agent workloads. Reliability investment reduces shared cost; reliability neglect increases it. The chargeback model should reflect the difference.
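A reliability-adjusted chargeback can be sketched by dividing cost per attempt by the success rate. This assumes every failed attempt is retried until it succeeds and that attempts are independent, so expected attempts per success are 1 / success rate. The $0.30 per-attempt cost is a hypothetical value. Real retry policies will differ.

```python
def cost_per_successful_task(cost_per_attempt, success_rate):
    """Expected spend per success when failures are retried until success."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


team_a = cost_per_successful_task(0.30, 0.90)  # 90% reliable agent
team_b = cost_per_successful_task(0.30, 0.99)  # 99% reliable agent
premium = team_a / team_b - 1

print(round(team_a, 4), round(team_b, 4), f"{premium:.1%}")
```

Under these assumptions the less reliable team consumes about 10% more shared spend for the same delivered work. A chargeback based on raw token counts would bill that as ordinary usage.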

The Klarna reference

The Klarna case study introduced in Section 2 illustrates the cost consequence of measuring without the reliability dimension. The 2024 metrics measured conversation volume and resolution time. They did not measure cost per successful resolution or the share of conversations where the agent's response was poor enough to drive escalation, churn, or repeat contact. The retry tax was paid by customers and by downstream business processes, not by the cost dashboard. The 2025 reversal reflected, among other factors, the accumulated cost of unreliable resolutions that the original metrics did not surface.

Current limitation

The reliability tax for typical production agents is not currently a public number.

The data exists on internal dashboards inside enterprises operating agents at scale. It is not in published research. The major agent observability platforms (Datadog, Langfuse, Helicone, Braintrust, Honeycomb) do not currently publish aggregate retry rates or failover rates from production agent traffic. The closest peer-reviewed proxy is the τ-bench pass^k data, which is a benchmark measurement on curated task sets with controlled tool surfaces. Production traffic is more heterogeneous, and failure rates are likely higher.

This is a current limitation of the field rather than a flaw of any specific party. The data is sensitive, the platforms holding it are still maturing, and the discipline is establishing which measurements are most useful. Cloud cost benchmarks became publicly available over the past decade, and agent reliability benchmarks are likely to follow a similar trajectory. Until they do, claims about the reliability tax in this paper are estimates built on incomplete data. The framework structure holds; specific reliability constants do not.

5. TBM 5.0 mapping

The mapping table below connects the seven cost layers to the financial categories defined in TBM Taxonomy version 5.0.1.

Reference framework

TBM 5.0.1, published by the TBM Council in July 2025, organizes technology cost across four layers: Cost Pools (the source of spend), Resource Towers (where the spend is applied), Solutions (what is delivered), and Consumers (who consumes the delivered services). (TBM Council, TBM Taxonomy Version 5.0.1, July 2025, https://www.tbmcouncil.org/taxonomy/.)

Two properties of TBM 5.0.1 directly affect agent cost mapping.

First, Artificial Intelligence is a Solution category, not a Resource Tower. The category contains five sub-categories: Agentic, Generative, Interpretive, Predictive, and Prescriptive. Agentic AI is one sub-category, with three further sub-types: Autonomous Navigation, Autonomous Workflow Agent, and Intelligent Process Automation. Cost associated with operating an agent therefore distributes across multiple Resource Towers and rolls up to the Agentic AI Solution category. Mapping all agent cost to an "AI Tower" misrepresents the structure of TBM 5.0.

Second, the cost pool for cloud spend in TBM 5.0.1 is "Cloud Services," not "Public Cloud." An earlier draft of the taxonomy proposed a dedicated Public Cloud cost pool; the published version did not adopt this proposal. Cloud Services covers public, private, and hybrid IaaS and PaaS. SaaS is categorized separately under Software & SaaS / SaaS. For agent cost mapping, this means most API-billed inference spend lands in Cloud Services / Cloud Service Provider, while subscription-billed AI platforms land in Software & SaaS / SaaS.

Mapping table

Layer	Cost Pool	Resource Tower	Solution
1. Inference	Cloud Services / Cloud Service Provider	AI Compute (self-hosted) or Application / AI Models (API)	Agentic
2. Context and Memory	Cloud Services or Software & SaaS / SaaS	AI Storage; Data / Data Operations	Agentic
3. Orchestration	Cloud Services	Application / Container Orchestration or Application / AI Models	Agentic
4. External APIs	Outside Services or Software & SaaS	Domain-specific (varies by tool)	Agentic
5. Fine-tuning and Training	Cloud Services + Staffing	AI Compute; AI Storage; Data / Data Operations	Agentic
6. Evaluation	Cloud Services + Software & SaaS	Application / AI Models	Agentic
7. Governance	Staffing + Software & SaaS	Risk & Compliance; Security	Agentic

Mapping observations

Five distinct Resource Towers carry agent cost in this mapping. This is not a deficiency in TBM 5.0; it reflects the architectural reality that an agent is a system spanning multiple infrastructure and application categories. A finance organization that allocates all agent spend to a single tower obscures information that the taxonomy is designed to make visible.

The mapping also functions as a chargeback design. A team operating an agent integrated with a CRM should see Layer 4 (External APIs) cost flow into the Application tower for the CRM, rather than into a generic AI bucket. A team operating an agent with intensive human review should see Layer 7 (Governance) cost flow into the Risk & Compliance tower. Properly applied, this approach reduces the size of the AI line item by allocating spend to the systems that drove demand.
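
The chargeback routing described above can be sketched as a lookup with per-team overrides. This is a minimal illustration, not a prescribed implementation: the layer keys, the choice of a single default tower per layer, the "Application / CRM" label, and the dollar figures are all assumptions made for the example.

```python
from collections import defaultdict

# Default layer -> Resource Tower routing, simplified from the mapping table
# to one tower per layer. Picking a single tower per layer is an assumption.
DEFAULT_TOWER = {
    "inference": "Application / AI Models",
    "context_memory": "AI Storage",
    "orchestration": "Application / Container Orchestration",
    "external_apis": "Domain-specific",
    "fine_tuning": "AI Compute",
    "evaluation": "Application / AI Models",
    "governance": "Risk & Compliance",
}


def allocate_to_towers(layer_costs, overrides=None):
    """Roll layer costs up to towers, applying team-specific overrides."""
    routing = {**DEFAULT_TOWER, **(overrides or {})}
    totals = defaultdict(float)
    for layer, cost in layer_costs.items():
        totals[routing[layer]] += cost
    return dict(totals)


# Hypothetical monthly figures for an agent integrated with a CRM.
crm_agent_costs = {"inference": 120.0, "external_apis": 45.0, "governance": 30.0}
allocation = allocate_to_towers(
    crm_agent_costs, overrides={"external_apis": "Application / CRM"}
)
for tower, cost in allocation.items():
    print(f"{tower}: ${cost:.2f}")
```

With the override in place, Layer 4 spend is reported against the CRM's tower rather than a generic bucket, while total allocated spend is unchanged.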

Operational requirements

The mapping is structural. Operational implementation requires three additional capabilities.

Tagging. Every span in the agent's execution trace must carry attributes for agent identity, user identity, and task identity. Without these, layer-level decomposition cannot be computed.

Reconciliation. Observability platforms compute cost from vendor pricing tables they maintain internally. The cloud bill is the source of truth. Reconciliation between the two must occur on a recurring schedule.

Ownership. Each row in the mapping requires explicit ownership. Layer 1 is typically owned by the engineering team building the agent. Layer 7 is typically owned by security or compliance. Layer 5 may be split. Ownership ambiguity converts the mapping from a control mechanism into a reporting artifact.
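
The reconciliation requirement above can be illustrated with a small drift check. This is a sketch under stated assumptions: the 2% tolerance, the per-agent keys, and the figures are hypothetical, and a production process would reconcile against the actual billing export.

```python
def reconcile(platform_cost, billed_cost, tolerance=0.02):
    """Flag keys where the platform estimate drifts from the bill.

    Drift is (estimated - billed) / billed; the bill is the source of truth.
    """
    findings = []
    for key in sorted(set(platform_cost) | set(billed_cost)):
        estimated = platform_cost.get(key, 0.0)
        billed = billed_cost.get(key, 0.0)
        if billed == 0.0:
            # Estimated spend with no matching bill line is always flagged.
            drift = float("inf") if estimated else 0.0
        else:
            drift = (estimated - billed) / billed
        if abs(drift) > tolerance:
            findings.append(
                {"key": key, "estimated": estimated, "billed": billed, "drift": drift}
            )
    return findings


# Hypothetical monthly totals per agent.
platform = {"agent-support": 1040.0, "agent-coding": 500.0}
invoice = {"agent-support": 1000.0, "agent-coding": 505.0}
findings = reconcile(platform, invoice)
for finding in findings:
    print(f"{finding['key']}: drift {finding['drift']:+.1%}")
```

Running the check on a fixed schedule turns reconciliation into a list of exceptions to investigate rather than a full manual comparison.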

6. Cost per Successful Task

Cost per Successful Task (CST) is the unit metric the framework uses to summarize agent economics. It is defined as total cost across the seven layers, multiplied by the reliability tax, divided by the number of tasks the agent completed correctly.
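
A minimal sketch of this definition follows. It uses the multiplier form given in the glossary (1 + retry_rate + failover_rate) and assumes the layer costs are first-attempt costs, so that the multiplier is what accounts for retries and failovers. The function names and all figures are illustrative.

```python
def reliability_multiplier(retry_rate, failover_rate):
    """Multiplier form from the glossary: 1 + retry_rate + failover_rate."""
    return 1.0 + retry_rate + failover_rate


def cost_per_successful_task(layer_costs, retry_rate, failover_rate, successful_tasks):
    """Total layer cost, times the reliability multiplier, per successful task."""
    if successful_tasks <= 0:
        raise ValueError("CST is undefined without at least one successful task")
    base_cost = sum(layer_costs.values())
    return base_cost * reliability_multiplier(retry_rate, failover_rate) / successful_tasks


# Hypothetical monthly first-attempt cost per layer, in USD.
layer_costs = {
    "inference": 600.0,
    "context_memory": 100.0,
    "orchestration": 50.0,
    "external_apis": 150.0,
    "fine_tuning": 40.0,
    "evaluation": 40.0,
    "governance": 20.0,
}
cst = cost_per_successful_task(
    layer_costs, retry_rate=0.15, failover_rate=0.05, successful_tasks=800
)
print(f"CST: ${cst:.2f}")
```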

Prior art

The metric is not novel. Karpushin published the same name and equation in February 2026. Gartner's January 2026 Predicts analysis uses cost-per-resolution as its central measurement for customer service agents. Intercom Fin charges $0.99 per resolution. Salesforce Agentforce charges per successful action. Ada and Decagon use comparable outcome-based pricing. The shape of the metric is established in both analyst research and commercial pricing. (Karpushin, K., AI Agent Development Cost: Real Cost per Successful Task for 2026, Codebridge, February 2026. Gartner Press Release, January 26, 2026. Intercom Fin pricing page.)

Framework contributions

This paper extends the existing metric in two ways.

The numerator is decomposed into the seven cost layers from Section 3. A practitioner optimizing CST must identify which layer constrains the metric. Inference-dominated cost suggests cascading and routing. Tool-call-dominated cost suggests caching and call caps. Evaluation-dominated cost suggests stratified sampling. Governance-dominated cost suggests risk-tiered review. Without layer decomposition, optimization activity is undirected.
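
The layer-to-lever pairing above can be sketched as a lookup on the dominant layer. This is an illustration only: the layer keys are hypothetical, tool-call cost is assumed to sit in the external APIs layer, and the per-task figures are invented.

```python
# Levers named in the text, keyed by the layer that dominates cost.
# Tool-call cost is assumed here to be recorded under "external_apis".
LEVERS = {
    "inference": "cascading and routing",
    "external_apis": "caching and call caps",
    "evaluation": "stratified sampling",
    "governance": "risk-tiered review",
}


def dominant_layer(layer_costs):
    """Return the largest layer, its share of total cost, and a candidate lever."""
    if not layer_costs:
        raise ValueError("no layer costs supplied")
    total = sum(layer_costs.values())
    layer = max(layer_costs, key=layer_costs.get)
    share = layer_costs[layer] / total if total else 0.0
    return layer, share, LEVERS.get(layer, "no lever named in the text")


# Hypothetical per-task cost by layer, in USD.
per_task = {
    "inference": 0.012,
    "context_memory": 0.004,
    "external_apis": 0.030,
    "evaluation": 0.002,
    "governance": 0.002,
}
layer, share, lever = dominant_layer(per_task)
print(f"{layer} is {share:.0%} of per-task cost; candidate lever: {lever}")
```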

The denominator excludes failed attempts and retries. A team can reduce per-call cost by halving inference spend while tripling failure rates; the result is a lower cost-per-call number that can mask a higher cost per successful task. The reliability tax from Section 4 is built into the metric directly through the success-only denominator.
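
A worked example makes the masking effect concrete. The figures are hypothetical, and the model is deliberately simplified: spend is treated as a single number and each call as one task attempt.

```python
def unit_costs(spend, calls, failure_rate):
    """Return (cost per call, cost per successful task) for one period."""
    successes = calls * (1.0 - failure_rate)
    return spend / calls, spend / successes


# Before: $1,000 of inference spend, 25% of attempts fail.
before_call, before_success = unit_costs(spend=1000.0, calls=10_000, failure_rate=0.25)
# After: spend halved, failure rate tripled to 75%.
after_call, after_success = unit_costs(spend=500.0, calls=10_000, failure_rate=0.75)

print(f"cost per call:    ${before_call:.3f} -> ${after_call:.3f}")
print(f"cost per success: ${before_success:.3f} -> ${after_success:.3f}")
```

In this simplified model, cost per call halves while cost per successful task rises by half. The direction depends on the starting point: halving spend while tripling the failure rate raises cost per success only when the baseline failure rate is above 20%, which is why the success-only denominator has to be computed rather than assumed.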

Position within the FinOps Framework

The FinOps Framework defines Unit Economics as a capability under the Quantify Business Value domain. CST is one instance of this capability, applied to agent workloads. (FinOps Foundation, FinOps Framework, https://www.finops.org/framework/.)

This positioning is operationally significant. CST is not a new metric category requiring new tooling or new organizational structures. It is the agent-specific shape of an existing FinOps capability. A FinOps practice that already calculates unit economics for cloud workloads can extend the practice to agents using its existing methods.

Implementation steps

CST implementation requires four steps.

Span tagging. Each span in the agent's execution trace must include attributes for agent identity, user identity, task identity, and task outcome. Most observability platforms support these attributes through OpenTelemetry semantic conventions, addressed in Section 7.

Cost rollup. Cost across all seven layers is aggregated per task identity. Some layer costs are computed at trace ingestion from vendor pricing tables. Other costs require periodic reconciliation against cloud bills.

Outcome marking. Task success is determined at the task boundary, not the call boundary. Success criteria are defined per task type. For a customer service agent, success may be resolution without escalation. For a coding agent, success may be a passing test suite. The success criterion is part of the agent's specification.

Aggregation. CST is reported by agent, by user segment, by task type, and by time window. Aggregation enables chargeback, forecasting, and optimization decisions.
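
The four steps above can be sketched end to end on in-memory data. This is a minimal illustration, not a reference pipeline: the span field names, the `outcomes` lookup standing in for the outcome recorded at the task boundary, and all figures are assumptions. Span cost here is observed cost, so retried calls already appear as additional spans and no separate multiplier is applied.

```python
from collections import defaultdict

# Step 1: spans tagged with agent, user, and task identity (hypothetical fields).
spans = [
    {"agent_id": "support-bot", "user_id": "u1", "task_id": "t1", "task_type": "refund", "cost_usd": 0.04},
    {"agent_id": "support-bot", "user_id": "u1", "task_id": "t1", "task_type": "refund", "cost_usd": 0.01},
    {"agent_id": "support-bot", "user_id": "u2", "task_id": "t2", "task_type": "refund", "cost_usd": 0.03},
    {"agent_id": "code-bot", "user_id": "u3", "task_id": "t3", "task_type": "bugfix", "cost_usd": 0.10},
    {"agent_id": "code-bot", "user_id": "u3", "task_id": "t4", "task_type": "bugfix", "cost_usd": 0.06},
]

# Step 3: success is marked once per task, not per call.
outcomes = {"t1": True, "t2": False, "t3": True, "t4": False}


def cst_by(spans, outcomes, dimension):
    """Roll span cost up per task, then report CST per value of `dimension`."""
    # Step 2: cost rollup per task identity.
    task_cost = defaultdict(float)
    task_group = {}
    for span in spans:
        task_cost[span["task_id"]] += span["cost_usd"]
        task_group[span["task_id"]] = span[dimension]

    # Step 4: aggregate. All cost stays in the numerator, including the cost
    # of failed tasks; only successful tasks count in the denominator.
    cost = defaultdict(float)
    successes = defaultdict(int)
    for task_id, total in task_cost.items():
        group = task_group[task_id]
        cost[group] += total
        if outcomes.get(task_id, False):
            successes[group] += 1
    return {g: (cost[g] / successes[g] if successes[g] else None) for g in cost}


by_agent = cst_by(spans, outcomes, "agent_id")
by_task_type = cst_by(spans, outcomes, "task_type")
print(by_agent)
print(by_task_type)
```

A group with no successful tasks returns `None` rather than a number, since CST is undefined when the denominator is zero.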

Limitations of the metric

CST is a cost-efficiency metric. It is not a quality metric. An agent can produce a low CST while delivering responses of insufficient quality if the success criterion is set too liberally. CST is also not a complete measure of business value. It does not capture latency, customer satisfaction, escalation rate, or downstream business impact. CST is one number in a portfolio of agent metrics, not a substitute for the portfolio.

Used with discipline, CST provides the unit economics view that FinOps practices require for agent workloads. Used without discipline, it produces the same kind of metric distortion that "cost per click" produced in advertising in the 2010s. The discipline is in defining task success rigorously.

7. The observability bridge

TBM allocates spend to financial categories. FinOps governs spend through processes and reviews. Neither identifies which agent execution incurred which cost. That attribution requires observability data, and the seven-layer framework requires that observability data flow into financial systems without manual translation.

The current state of observability standards for agent workloads is partial.

Standards landscape

OpenTelemetry has published GenAI semantic conventions covering token counts, cache reads and writes, tool invocations, model identity, and message roles. The metric specification defines histograms for token usage and request duration. The semantic conventions package was at version 1.40.0 as of early 2026. (OpenTelemetry, GenAI semantic conventions, https://opentelemetry.io/docs/specs/semconv/registry/attributes/gen-ai/.)

All gen_ai.* attributes are currently at "Development" status. None are stable. Implementers can adopt them, but the contract may change without backward compatibility.

The major LLM observability platforms have shipped support for the conventions. Datadog announced native support in November 2025. OpenLLMetry provides reference instrumentation under Apache-2.0 license and feeds upstream into the OpenTelemetry SIG. Langfuse, LangSmith, Braintrust, Helicone, and OpenLIT consume the conventions in some form. (Datadog Engineering, LLM observability with OpenTelemetry semantic conventions, November 2025. OpenLLMetry, https://github.com/traceloop/openllmetry.)

The cost attribute gap

OpenTelemetry GenAI conventions do not define a standard cost attribute.

Each observability platform computes cost at trace ingestion using a vendor pricing table maintained internally. Datadog maintains a table. Langfuse maintains a table. LangSmith maintains a table. Each table updates on a different schedule, and each platform produces a slightly different cost number for the same trace. The OpenTelemetry metrics specification provides the closest existing primitive: "When systems report both used tokens and billable tokens, instrumentation MUST report billable tokens." This directs implementers to report vendor-billed tokens but does not standardize a cost attribute.
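
The divergence between platform-computed costs can be illustrated with one trace and two pricing tables. The token attribute names follow the OpenTelemetry GenAI conventions; the token counts and the second, stale table are hypothetical, and the first table uses the Claude Sonnet 4.6 rates from Appendix A.

```python
def trace_cost(usage, price_per_million):
    """Compute USD cost for one trace from token counts and a pricing table."""
    return (
        usage["gen_ai.usage.input_tokens"] * price_per_million["input"]
        + usage["gen_ai.usage.output_tokens"] * price_per_million["output"]
    ) / 1_000_000


# Token counts as they might appear on a span (hypothetical values).
usage = {"gen_ai.usage.input_tokens": 12_000, "gen_ai.usage.output_tokens": 1_500}

current_table = {"input": 3.00, "output": 15.00}  # Appendix A, Claude Sonnet 4.6
stale_table = {"input": 3.00, "output": 12.00}    # hypothetical out-of-date entry

cost_current = trace_cost(usage, current_table)
cost_stale = trace_cost(usage, stale_table)
print(f"current table: ${cost_current:.4f}")
print(f"stale table:   ${cost_stale:.4f}")
print(f"difference:    {(cost_stale - cost_current) / cost_current:+.1%}")
```

The token counts are identical in both calculations; only the pricing table differs. A standard cost attribute set at ingestion would remove this source of disagreement.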

The operational consequence is that enterprises running production agents on observability tooling maintain three sources of cost data: the vendor invoice, the observability platform's calculated cost, and the raw trace data from which the platform derives that cost. Reconciling them requires manual effort.

Required standards work

Three changes would close the gap between observability and financial systems.

OpenTelemetry stabilizes the GenAI attributes and adds a standard cost attribute, vendor-set at trace ingestion and authoritative for downstream cost analytics.

The FinOps Foundation extends the FOCUS specification with a token column type and a reliability column type. FOCUS 1.3, ratified in December 2025, supports virtual-currency billing patterns through existing PricingUnit and ConsumedQuantity columns. A native token column type would remove ambiguity. A reliability column would make pass-rate and retry-rate data flow into FinOps tooling without enrichment. (FOCUS Specification, https://focus.finops.org.)

The major observability platforms align cost tables to the same vendor sources and publish freshness windows. Current platform freshness practices are not externally visible, which limits enterprise ability to verify cost accuracy.

This standards work spans multiple bodies and vendors. The framework in this paper assumes the standards work will progress. Until it does, the bridge between agent observability and financial systems must be built per-enterprise.

8. Implications for technology and finance leaders

The seven-layer framework, the reliability multiplier, the TBM 5.0 mapping, the CST metric, and the observability bridge support six operational moves for organizations deploying agents at scale.

Tag at the agent and task level rather than the application level. The agent is the unit of cost; the task is the unit of value. Application-level tagging, the default in most cloud cost management implementations, conceals the dimensions required for agent cost analysis. Organizations rolling out agents should agree on a tagging schema before the first agent reaches production. The schema must include agent identity, user identity, task identity, and task outcome as span attributes. Without these attributes, the seven-layer decomposition cannot be computed, and CST cannot be measured.

Decompose the cost dashboard by layer. A single AI line item in financial reporting represents a control failure. The dashboard should display inference, context and memory, orchestration, external APIs, fine-tuning, evaluation, and governance as separate lines. Trends in each layer carry distinct operational signals. Inference growing faster than task volume indicates model selection drift. External APIs growing faster than task volume indicates that agents are calling more tools per task. Evaluation growing as a share of total spend indicates increased release frequency. The dashboard functions as an early-warning system when properly structured.

Adopt CST as the unit metric, alongside layer-level cost reporting. CST does not replace token spend or tool-call spend reporting. It rolls those metrics up into a measurement that can be related to business outcomes. Token spend describes how the agent reasons. Tool-call spend describes how the agent acts. CST describes what the agent achieves per dollar. All three belong in operational reporting.

Treat reliability as a cost driver in the chargeback model. A team operating a 90% reliable agent imposes a higher shared infrastructure cost than a team operating a 99% reliable agent for equivalent work. The reliable team has invested in engineering effort that the unreliable team has not. The chargeback model should reflect this investment. This pattern parallels the existing FinOps practice of allocating overprovisioning costs to the teams that overprovisioned, applied to a new domain.

Document Resource Tower mappings explicitly. Five distinct TBM Resource Towers will carry agent cost in most enterprises. The default mapping in Section 5 is a starting reference, not a prescription. A bank operating a regulated customer-facing agent will weight more cost toward Risk & Compliance towers. A developer-tools company operating an internal coding agent will weight more cost toward Application towers. The mapping must be documented and defended to finance, because it determines how spend is reported and chargeback is calculated.

Budget evaluation and fine-tuning as recurring costs from year one. Both layers are commonly underbudgeted in initial agent deployments. Evaluation spend grows as the agent surface expands, because larger tool surfaces and more task types require larger evaluation sets. Fine-tuning spend grows as the agent matures, because production traces accumulate as training material and the cost of training cycles becomes recurring. Initial budgets should account for this growth trajectory rather than treating it as an in-year adjustment.
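
The layer-trend signals described under the dashboard move above can be sketched as a comparison of layer cost growth against task volume growth. The 5-point margin, the layer keys, and the period figures are assumptions for illustration.

```python
def growth(previous, current):
    """Period-over-period growth as a fraction."""
    return (current - previous) / previous


def layers_outpacing_volume(prev_costs, curr_costs, prev_tasks, curr_tasks, margin=0.05):
    """Return layers whose cost grew faster than task volume by more than `margin`."""
    volume_growth = growth(prev_tasks, curr_tasks)
    flagged = {}
    for layer, prev in prev_costs.items():
        layer_growth = growth(prev, curr_costs[layer])
        if layer_growth > volume_growth + margin:
            flagged[layer] = layer_growth - volume_growth
    return flagged


# Hypothetical month-over-month figures: task volume grows 10%.
prev_costs = {"inference": 5000.0, "external_apis": 2000.0, "evaluation": 800.0}
curr_costs = {"inference": 6500.0, "external_apis": 2200.0, "evaluation": 900.0}
flagged = layers_outpacing_volume(prev_costs, curr_costs, prev_tasks=10_000, curr_tasks=11_000)
for layer, excess in flagged.items():
    print(f"{layer}: cost growth exceeds task growth by {excess:.0%}")
```

In this example only inference is flagged, which in the terms used above would prompt a check for model selection drift.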

9. Conclusion

The 2026 FinOps for AI Overview identifies a gap in cost allocation frameworks for multi-agent workloads. This paper proposes one framework for that gap.

The framework decomposes an agent transaction into seven cost layers (inference, context and memory, orchestration, external APIs, fine-tuning and training, evaluation, governance), with reliability operating as a multiplier across all seven. The seven layers map to TBM 5.0 cost pools, resource towers, and the Agentic AI Solution category. The unit metric is cost per successful task, anchored to the FinOps Framework's existing Unit Economics capability.

Reliability is treated as a cost driver, not a quality metric, with implications for chargeback and unit economics. Evaluation is included as a peer cost layer, reflecting the operational reality that evaluation is now a continuous cost in mature agent deployments. The TBM mapping reflects the published 5.0.1 taxonomy, in which AI is a Solution category rather than a Resource Tower and cloud spend is captured in Cloud Services rather than a Public Cloud pool.

Open requirements

The framework is incomplete in three areas, all of which depend on standards and benchmarking work outside the scope of this paper.

Reliability data for production agents is not currently available in public form. Benchmark proxies exist and provide useful direction but do not substitute for production measurements. As agent observability matures, aggregate reliability data is likely to become available, following the trajectory of cloud cost benchmarks over the past decade.

The observability bridge between trace data and financial systems is not standardized. OpenTelemetry GenAI attributes remain at Development status. FOCUS does not yet include native token or reliability column types. Each observability platform computes cost from internally maintained pricing tables, requiring per-enterprise reconciliation.

The TBM Council and the FinOps Foundation are converging on AI cost from different directions: the TBM Council from procurement and value realization, the FinOps Foundation from cloud cost management. Convergence is in progress. The remaining work is request-flow cost decomposition at the agent transaction level, which is the focus of this paper.

Validity over time

The framework's structural elements are designed to remain stable as the underlying technology and pricing evolve. Specific pricing references will require periodic refresh. Specific reliability constants will require update as production data becomes available. Specific TBM and FinOps mappings will require revision as those standards are revised. The framework structure is intended to accommodate these revisions without requiring redesign.

Appendix A: Pricing reference, April 2026

LLM API pricing (USD per million tokens, Standard tier)

Provider / Model	Input	Cached input	Output
OpenAI GPT-5.4	$2.50	$0.25	$15.00
OpenAI GPT-4.1	$2.00	$0.50	$8.00
OpenAI GPT-4o-mini	$0.15	$0.075	$0.60
OpenAI o1	$15.00	$7.50	$60.00
Anthropic Claude Opus 4.7	$5.00	$0.50	$25.00
Anthropic Claude Sonnet 4.6	$3.00	$0.30	$15.00
Anthropic Claude Haiku 4.5	$1.00	$0.10	$5.00
Google Gemini 3 Pro Preview	$2.00	$0.20	$12.00
Google Gemini 3 Flash Preview	$0.50	$0.05	$3.00
Google Gemini 2.5 Flash	$0.30	$0.03	$2.50

Source: Anthropic, OpenAI, and Google Vertex AI pricing pages, fetched April 27, 2026.

Embeddings pricing (USD per million tokens)

Model	Price (Standard tier)
OpenAI text-embedding-3-small	$0.02
OpenAI text-embedding-3-large	$0.13
Cohere Embed v4	$0.12
Voyage voyage-4	$0.06
Voyage voyage-4-large	$0.12

Vector database pricing models

Service	Pricing model
Pinecone Serverless	$0.33/GB-month + $8.25/M Read Units + $2.00/M Write Units
Weaviate Cloud	from $25/month Standard, $45/month HA minimum
Chroma Cloud	$2.50/GiB write + $0.33/GiB-month + $0.0075/TiB queried
Supabase pgvector	Free / $25 Pro / $599 Team monthly tiers

Appendix B: Glossary

Agent transaction. A complete unit of work performed by an agent in response to a user request, including all model calls, tool calls, and supporting infrastructure consumption required to produce the response.

Cost per Successful Task (CST). Total cost across the seven cost layers, multiplied by the reliability multiplier (the reliability tax), divided by the number of tasks completed correctly.

FOCUS. FinOps Open Cost and Usage Specification. A standardized billing data format maintained by the FinOps Foundation.

pass^k. The probability that an agent succeeds on every one of k consecutive attempts at the same task. A measurement of agent reliability under repeated execution.

Provisioned throughput. A pricing model for AI inference in which capacity is reserved in advance at a fixed rate, contrasted with on-demand pricing where capacity is billed per request.

Reliability multiplier. The factor by which retries and failovers inflate total cost above the sum of layer costs. Calculated as (1 + retry_rate + failover_rate).

Resource Tower. In TBM Taxonomy, a category of technology resources organized by function (Compute, Storage, Network, Application, Data, etc.).

Time-horizon. A METR measurement of the task duration at which an agent achieves a specified success rate. Used to characterize reliability degradation with task length.
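
The pass^k entry above can be made concrete with a short calculation. This sketch assumes attempts are independent with a fixed per-attempt success rate, which is a simplification; measured pass^k on real agents need not follow it. The function name and rates are illustrative.

```python
def pass_k(per_attempt_success, k):
    """Probability of succeeding on all k attempts, assuming independent attempts."""
    return per_attempt_success ** k


for k in (1, 2, 4, 8):
    print(f"k={k}: {pass_k(0.9, k):.3f}")
```

Under this assumption, an agent that succeeds 90% of the time on a single attempt succeeds on all of eight consecutive attempts well under half the time.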

Appendix C: Selected references

FinOps Foundation. FinOps for AI Overview. Last updated February 17, 2026. https://www.finops.org/wg/finops-for-ai-overview/

FinOps Foundation. State of FinOps 2026. https://data.finops.org/

TBM Council. TBM Taxonomy Version 5.0.1. July 17, 2025. https://www.tbmcouncil.org/taxonomy/

TBM Council. TBM for AI Value Realization. October 2024.

Chen, L., Zaharia, M., Zou, J. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. TMLR 2024. arXiv:2305.05176.

Ong, I., et al. RouteLLM: Learning to Route LLMs with Preference Data. 2024. arXiv:2406.18665.

Liu, N. F., et al. Lost in the Middle: How Language Models Use Long Contexts. TACL vol. 12, 2024, pp. 157–173. arXiv:2307.03172.

Li, Z., Li, C., Zhang, M., Mei, Q., Bendersky, M. Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. EMNLP 2024 Industry Track. arXiv:2407.16833.

Yao, S., Shinn, N., Razavi, P., Narasimhan, K. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. 2024. arXiv:2406.12045.

Schluntz, E., Zhang, B. Building Effective AI Agents. Anthropic, December 2024. https://www.anthropic.com/research/building-effective-agents

Anthropic Engineering. Demystifying evals for AI agents. https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents

Shopify Engineering. Building production-ready agentic systems: Lessons from Shopify Sidekick. 2025. https://shopify.engineering/building-production-ready-agentic-systems

METR. Time horizons. https://metr.org/time-horizons/

Karpushin, K. AI Agent Development Cost: Real Cost per Successful Task for 2026. Codebridge, February 2026.

OpenTelemetry. GenAI semantic conventions. https://opentelemetry.io/docs/specs/semconv/registry/attributes/gen-ai/

FOCUS Specification. https://focus.finops.org

Klarna. Klarna AI assistant handles two-thirds of customer service chats in its first month. February 2024.

Barney, J. How to Build a Generative AI Cost and Usage Tracker. FinOps Foundation, FinOps for AI Working Group. https://www.finops.org/wg/how-to-build-a-generative-ai-cost-and-usage-tracker/