Anthropic Claude API: Zero-Cost Refusals and a Smarter Advisor Tool for Agentic Pipelines

Two small but practical updates landed in the Claude API this week. One fixes a billing quirk that has been quietly burning budget in safety-heavy pipelines. The other gives developers running agentic workloads a proper lever to control how much their advisor model actually says.

Zero-Cost Billing for Empty Refusals

Starting with Claude 4 models, streaming responses can return stop_reason: "refusal" when Anthropic’s streaming classifiers detect a potential policy violation mid-request. Until now, those refusals were billed the same as any other request, even when Claude had generated zero output tokens.

That changes now. If a request returns stop_reason: "refusal" with no generated tokens, you are no longer charged for it.

The key word is no generated tokens. If Claude partially responded before the refusal triggered, that request is still billed normally, because real work happened. The free pass applies only to the case where the classifier intervened before any output was produced.

For most developers working on a single assistant, this is a minor line item. For teams running batch pipelines, evaluation harnesses, or safety-testing suites where a meaningful percentage of requests hit refusals, this is a real cost reduction with no code changes required.

One related improvement worth knowing about: the stop_details field on refusal responses is now publicly documented. It returns a category (cyber, bio, or null) alongside a human-readable explanation. No beta header is required to access it, and it means your application can actually route different classes of refusal to different handling logic, rather than treating every refusal as identical. On Claude Opus 4.7 and later models this field is included automatically on refusal responses.

The Advisor Tool Gets a `max_tokens` Cap

The advisor tool pairs a lighter executor model (typically Claude Sonnet or Haiku) with a higher-intelligence advisor model (Claude Opus) that provides strategic guidance when the executor escalates a decision. The bulk of token generation happens at executor rates, while Opus only weighs in when needed. It is a sensible pattern for long-horizon agentic tasks where you want Opus-level reasoning without paying Opus-level prices for every token in the conversation.

The problem until now was that there was no hard ceiling on how much the advisor could say per consultation. On lighter tasks it typically generates 400 to 700 tokens. On harder reasoning tasks that figure rises substantially, which adds both cost and latency in ways that were hard to predict.

The new max_tokens parameter on the advisor tool definition fixes that. You set it in tools[] on the advisor tool object, not at the top level of the request. That distinction matters: the top-level max_tokens applies to executor output only and does not bound advisor tokens at all.

Anthropic’s recommended starting point is 2048. Their internal testing on a hard reasoning benchmark found that this reduced mean advisor output by roughly 7x compared with leaving the cap unset, with near-zero truncation and no detectable quality degradation. Dropping to the minimum value of 1024 pushed the reduction to roughly 10x, but truncated around 10% of calls, which is probably too aggressive for most production workloads.

When the advisor does hit its cap, the result block carries stop_reason: "max_tokens". You can check output_tokens on the corresponding advisor_message entry in usage.iterations to see how close each call came to the ceiling, which gives you a data-driven basis for tuning the value over time.

A note on max_tokens versus prompt-based approaches: the parameter is a hard ceiling, not a soft instruction. If you tell the advisor in the prompt to be brief, it will try but may still run long when the task demands it. max_tokens guarantees the bound. You can use both together if you want to bias toward brevity and protect against the worst cases.

A Few Operational Notes Worth Keeping in Mind

The advisor tool remains in public beta. You need to include anthropic-beta: advisor-tool-2026-03-01 in your request headers.

Advisor tokens are billed at Opus rates and appear in usage.iterations entries with type: "advisor_message", not in the top-level usage summary. If you are tracking costs server-side, make sure you are reading from the right place.

There is no built-in conversation-level cap on how many times the advisor is consulted across a multi-turn exchange. The max_uses parameter caps calls within a single request, but across requests you need to track this client-side. When you hit your budget ceiling, remove the advisor tool from tools and strip all advisor_tool_result blocks from your message history, otherwise the API will return a 400 invalid_request_error.

What This Means in Practice

Neither of these updates requires a rethink of how you use the Claude API. They are cost-control improvements that work within existing patterns.

If you run pipelines where refusals are a regular occurrence, check whether your billing reflects the change and verify your stop_reason handling is correctly identifying the zero-token case. If you are already using the advisor tool, adding a max_tokens of 2048 to your advisor tool definition is a low-risk way to get more predictable per-request costs and lower latency without meaningfully degrading output quality.

Both changes are live now. The API release notes and advisor tool documentation have the full details.

Zero-Cost Billing for Empty Refusals

The Advisor Tool Gets a max_tokens Cap

A Few Operational Notes Worth Keeping in Mind

What This Means in Practice

The Advisor Tool Gets a `max_tokens` Cap