Claude API Rate Limits Explained: RPM, TPD & Concurrency (2026)

May 11, 2026

TL;DR

The Claude API enforces three independent rate limit layers: RPM (requests per minute), TPD (daily token budget, with input and output counted separately), and concurrency. When you hit any of them, you get HTTP 429 or 529 with a retry-after header. This guide covers each layer with working Python retry code.

Why Does Rate Limiting Happen?

Anthropic applies three independent limits:

RPM: requests per minute cap → 429 rate_limit_error

Input/Output TPD: daily token budget → 429 or 529 overloaded_error

Concurrency: max in-flight requests at once → 529 overloaded_error

These limits are independent — you can hit concurrency without touching RPM.

From our logs at CodeGateway: over 90% of rate limit errors come from bulk batch jobs firing 10+ simultaneous requests. Individual developers in normal usage rarely hit them.

Anthropic's Rate Limit Tiers (2026)

Anthropic assigns accounts to Tier 1–5 based on cumulative spend. The first four tiers are:

Tier 1 ($0, new accounts): 50 RPM / 5M input TPD / 100K output TPD

Tier 2 ($40 cumulative): 1,000 RPM / 40M input TPD / 400K output TPD

Tier 3 ($200 cumulative): 2,000 RPM / 200M input TPD / 2M output TPD

Tier 4 ($2,000 cumulative): 4,000 RPM / 1B input TPD / 10M output TPD

Source: Anthropic official Rate Limits documentation (May 2026). Always check the official page for current limits.

▶ Full official limits: Anthropic Rate Limits documentation
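As a rough illustration, the tier list above can be encoded as a small lookup table so a script can pick its own limits from cumulative spend. This is a sketch only: the `TierLimits` class and `tier_for_spend` helper are hypothetical names, and the figures are copied from the list above, so they should be re-checked against the official page before use.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierLimits:
    name: str
    min_spend_usd: int   # cumulative spend needed to reach the tier
    rpm: int             # requests per minute
    input_tpd: int       # daily input-token budget
    output_tpd: int      # daily output-token budget

# Figures copied from the tier list in this article; verify before relying on them.
TIERS = [
    TierLimits("Tier 1", 0, 50, 5_000_000, 100_000),
    TierLimits("Tier 2", 40, 1_000, 40_000_000, 400_000),
    TierLimits("Tier 3", 200, 2_000, 200_000_000, 2_000_000),
    TierLimits("Tier 4", 2_000, 4_000, 1_000_000_000, 10_000_000),
]

def tier_for_spend(cumulative_spend_usd: float) -> TierLimits:
    """Return the highest listed tier whose spend threshold has been met."""
    eligible = [t for t in TIERS if cumulative_spend_usd >= t.min_spend_usd]
    # Fall back to the lowest tier for unexpected (negative) input.
    return eligible[-1] if eligible else TIERS[0]
```

For example, `tier_for_spend(150).rpm` gives the Tier 2 request cap, which a batch job could feed into its own throttling logic.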

Reading the Error Response

HTTP Status Codes

429 rate_limit_error: RPM (or token budget) exceeded; wait retry-after seconds

529 overloaded_error: Service overload (includes concurrency cap) — use exponential backoff

500 api_error: Server error, not a rate limit
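One way to keep this mapping in a single place is a small dispatcher that turns a status code into a handling strategy. The sketch below is illustrative: `retry_strategy` and its return labels are hypothetical names, not part of any SDK.

```python
def retry_strategy(status_code: int) -> str:
    """Map an HTTP status code to the handling suggested by the list above."""
    if status_code == 429:
        return "wait_retry_after"      # rate_limit_error: honor the retry-after header
    if status_code == 529:
        return "exponential_backoff"   # overloaded_error: back off with growing delays
    if status_code == 500:
        return "not_rate_limit"        # api_error: a server fault, handle separately
    return "unknown"                   # anything else is outside this guide's scope
```

Centralizing the decision like this keeps retry code from scattering status-code checks across several call sites.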

The 429 Response Headers

http
retry-after: 12
x-ratelimit-limit-requests: 50
x-ratelimit-remaining-requests: 0
x-ratelimit-reset-requests: 2026-05-11T04:00:00Z

retry-after: exact seconds to wait — following this gives >95% success on the next attempt

x-ratelimit-reset-requests: when your RPM window resets (ISO 8601)
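The two headers above can be combined into a single wait time. The following is a minimal sketch, assuming the header names shown in the sample response and lowercase keys in a plain dict; `wait_seconds` is a hypothetical helper, and the 60-second fallback is an arbitrary choice.

```python
from datetime import datetime, timezone

def wait_seconds(headers: dict[str, str], now: datetime) -> float:
    """Return how long to wait, preferring retry-after over the reset timestamp."""
    retry_after = headers.get("retry-after")
    if retry_after is not None:
        return max(0.0, float(retry_after))
    reset = headers.get("x-ratelimit-reset-requests")
    if reset is not None:
        # Older Python versions reject a trailing "Z", so normalize it to an offset.
        reset_at = datetime.fromisoformat(reset.replace("Z", "+00:00"))
        return max(0.0, (reset_at - now).total_seconds())
    return 60.0  # assumed fallback when neither header is present

if __name__ == "__main__":
    sample = {"x-ratelimit-reset-requests": "2026-05-11T04:00:00Z"}
    print(wait_seconds(sample, datetime.now(timezone.utc)))
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the helper easy to test.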

Retry Logic That Actually Works

python
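import anthropic
import time

# Note: the SDK also retries some errors internally by default (max_retries=2);
# pass max_retries=0 to the client if this loop should own all retries.
client = anthropic.Anthropic(
    api_key="your-codegateway-key",
    base_url="https://api.codegateway.dev/v1"
)

def call_with_retry(prompt: str, max_retries: int = 5) -> str:
    """Call the API, retrying on 429 (retry-after) and 529 (exponential backoff)."""
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.content[0].text
        except anthropic.RateLimitError as e:
            # 429: honor the server's retry-after; float() also accepts fractional seconds.
            retry_after = float(e.response.headers.get("retry-after", 60))
            print(f"Rate limited, waiting {retry_after}s (attempt {attempt + 1})")
            time.sleep(retry_after)
        except anthropic.APIStatusError as e:
            if e.status_code == 529:
                # 529: exponential backoff (30s, 60s, 120s, ...).
                wait = 30 * (2 ** attempt)
                print(f"Overloaded, waiting {wait}s")
                time.sleep(wait)
            else:
                raise  # any other status is not a rate limit; surface it
    raise RuntimeError("Max retries exceeded")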

Measured from our logs: waits driven by the exact retry-after value average 8–15 seconds. A fixed 60-second wait is 4–7x longer than that.
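The 529 branch in the retry loop doubles its wait from a 30-second base with no upper bound. A common refinement is to cap the delay and add random jitter so that parallel workers do not all retry at the same instant. The sketch below shows one way to compute such delays; the `backoff_delay` name, the 300-second cap, and the choice of "full jitter" are assumptions for illustration, not something the retry code above already does.

```python
import random

def backoff_delay(attempt: int, base: float = 30.0, cap: float = 300.0,
                  jitter: bool = True) -> float:
    """Exponential backoff delay in seconds for a zero-based attempt number."""
    delay = min(cap, base * (2 ** attempt))
    if jitter:
        # "Full jitter": pick uniformly in [0, delay] so parallel workers desynchronize.
        return random.uniform(0, delay)
    return delay

if __name__ == "__main__":
    # Deterministic schedule without jitter, for inspection.
    print([backoff_delay(a, jitter=False) for a in range(5)])
```

Without jitter the schedule is 30, 60, 120, 240 and then 300 seconds, since the cap cuts off the fifth doubling.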

Concurrency: The Limit Developers Miss Most

RPM counts requests per minute. Concurrency counts requests in flight at the same time. They're independent.

Classic footgun: your RPM limit is 50, but you fire asyncio.gather() with 20 tasks at once → concurrency exceeded → 529, even though your RPM counter is fine.

python
import asyncio
import anthropic

async def batch_process(prompts: list[str], max_concurrent: int = 5) -> list[str]:
    # The semaphore caps how many requests are in flight at the same time.
    semaphore = asyncio.Semaphore(max_concurrent)
    client = anthropic.AsyncAnthropic(
        api_key="your-codegateway-key",
        base_url="https://api.codegateway.dev/v1"
    )

    async def single_call(prompt: str) -> str:
        # Each task waits here until one of the max_concurrent slots frees up.
        async with semaphore:
            response = await client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.content[0].text

    # gather() returns results in the same order as prompts.
    return await asyncio.gather(*[single_call(p) for p in prompts])

Recommended concurrency limits: 3 for Tier 1, 8–10 for Tier 2+.
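To check that a semaphore really holds the in-flight count under a chosen limit, the pattern can be exercised without touching the API. The sketch below swaps the network call for `asyncio.sleep` and records the peak number of tasks inside the semaphore; `measure_peak_in_flight` is a hypothetical helper written only for this check.

```python
import asyncio

async def measure_peak_in_flight(num_tasks: int, max_concurrent: int) -> int:
    """Run stand-in tasks behind a semaphore and return the peak number in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fake_call() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for the network round trip
            in_flight -= 1

    await asyncio.gather(*(fake_call() for _ in range(num_tasks)))
    return peak

if __name__ == "__main__":
    print(asyncio.run(measure_peak_in_flight(num_tasks=20, max_concurrent=3)))
```

With 20 tasks and a limit of 3, the recorded peak never exceeds 3, which is the behavior the Tier 1 recommendation relies on.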

Engineering Techniques to Stay Under Limits

Rate Spreading

python
import time

def rate_limited_batch(prompts: list[str], rpm_limit: int = 40) -> list[str]:
    # rpm_limit defaults to 40: 20% headroom below a 50 RPM limit.
    interval = 60.0 / rpm_limit  # minimum seconds between request starts
    results = []
    for prompt in prompts:
        start = time.monotonic()  # monotonic clock is unaffected by system clock changes
        results.append(call_with_retry(prompt))
        elapsed = time.monotonic() - start
        if elapsed < interval:
            time.sleep(interval - elapsed)
    return results

Token Budget Awareness

Don't set max_tokens=4096 as a catch-all. Anthropic counts the max_tokens value you request against your daily token budget, not just the tokens you actually receive, so set it to what the task needs.
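A quick back-of-the-envelope calculation shows how much this matters. The sketch below assumes, as described above, that the full max_tokens value is charged against the output budget; `requests_per_day` is a hypothetical helper, and the 100K figure is the Tier 1 output budget from the tier list.

```python
def requests_per_day(output_tpd: int, max_tokens: int) -> int:
    """Requests that fit in a daily output budget if each one reserves max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return output_tpd // max_tokens

if __name__ == "__main__":
    tier1_output_tpd = 100_000  # Tier 1 output budget from the tier list
    print(requests_per_day(tier1_output_tpd, 4096))  # 24
    print(requests_per_day(tier1_output_tpd, 1024))  # 97
```

Under that assumption, dropping max_tokens from 4096 to 1024 roughly quadruples the number of requests a Tier 1 account can make in a day.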

Using Claude API via CodeGateway

When you use CodeGateway as your API endpoint:

Quota is separate: your CodeGateway credit balance and effective throughput are separate from any direct Anthropic account. CodeGateway routes across multiple providers, giving higher aggregate capacity.

Error codes pass through: CodeGateway transparently forwards Anthropic's 429 and 529 status codes and all rate limit headers. The retry code above works without modification.

Distinguishing gateway vs upstream limits: in the rare case CodeGateway imposes a limit, the response includes "source": "gateway" in the error body.
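Client code can use that marker to tell gateway-imposed limits from upstream ones. The sketch below is a guess at how such a check might look: the exact position of the "source" field in the error body is not specified here, so the helper (a hypothetical `is_gateway_limit`) looks both at the top level and inside an "error" object.

```python
import json

def is_gateway_limit(error_body: str) -> bool:
    """Return True if the error body marks the limit as imposed by the gateway."""
    try:
        payload = json.loads(error_body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    # Assumption: "source" sits either at the top level or inside an "error" object.
    error = payload.get("error")
    nested = error.get("source") if isinstance(error, dict) else None
    return payload.get("source") == "gateway" or nested == "gateway"
```

A retry loop could log gateway-sourced limits separately, since they point at the CodeGateway account rather than at Anthropic's quota.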

Common Questions

Q: I'm only sending one request at a time. Why am I still getting 429?

A: Most likely multiple processes or services sharing the same API key. Check your CodeGateway request logs to find which service is consuming quota.

Q: Is 529 worse than 429?

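A: Neither is worse; both are temporary. A 429 means you exhausted your own quota, so wait retry-after seconds. A 529 usually means Anthropic-side load, which clears in 30–60 seconds and is unrelated to your RPM or token quota, although too many in-flight requests on your side can trigger it as well.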

Q: What's the difference between rate_limit_error and overloaded_error?

A: rate_limit_error (429) means you sent too many requests, so your quota is the issue. overloaded_error (529) means Anthropic's infrastructure is under load, and it can happen even when you're well under your quota. The fix for 529 is always exponential backoff.

Summary

Rate limits are manageable once you understand the three independent dimensions (RPM, TPD, concurrency) and respect the retry-after signal. The retry pattern above handles 95%+ of real-world rate limit scenarios without manual intervention.

Tested in May 2026 using CodeGateway production request logs and the Anthropic Python SDK.

Author: CodeGateway Team. Reviewed on: 2026-05-16.