TL;DR
The Claude API enforces three independent rate limit layers: RPM (requests per minute), TPD (daily token budget, with input and output counted separately), and concurrency. When you hit any of them, you get HTTP 429 or 529 with a retry-after header. This guide covers each layer with working Python retry code.
Why Does Rate Limiting Happen?
Anthropic applies three independent limits:
RPM: requests per minute cap → 429 rate_limit_error
Input/Output TPD: daily token budget → 429 or 529 overloaded_error
Concurrency: max in-flight requests at once → 529 overloaded_error
These limits are independent — you can hit concurrency without touching RPM.
From our logs at CodeGateway: over 90% of rate limit errors come from bulk batch jobs firing 10+ simultaneous requests. Individual developers in normal usage rarely hit them.
Anthropic's Rate Limit Tiers (2026)
Anthropic assigns accounts to Tier 1–5 based on cumulative spend. Tiers 1–4 are:
Tier 1 ($0, new accounts): 50 RPM / 5M input TPD / 100K output TPD
Tier 2 ($40 cumulative): 1,000 RPM / 40M input TPD / 400K output TPD
Tier 3 ($200 cumulative): 2,000 RPM / 200M input TPD / 2M output TPD
Tier 4 ($2,000 cumulative): 4,000 RPM / 1B input TPD / 10M output TPD
Source: Anthropic official Rate Limits documentation (May 2026). Always check the official page for current limits.
▶ Full official limits: Anthropic Rate Limits documentation
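To make the tier numbers concrete, here is a minimal sketch that encodes the list above and does the arithmetic for a batch job. The names TIER_LIMITS, min_batch_minutes, and max_requests_per_day are hypothetical helpers for illustration, and the values are copied from this article, so they may be out of date.

```python
import math

# Values copied from the tier list above; check the official page before relying on them.
TIER_LIMITS = {
    1: {"rpm": 50, "input_tpd": 5_000_000, "output_tpd": 100_000},
    2: {"rpm": 1_000, "input_tpd": 40_000_000, "output_tpd": 400_000},
    3: {"rpm": 2_000, "input_tpd": 200_000_000, "output_tpd": 2_000_000},
    4: {"rpm": 4_000, "input_tpd": 1_000_000_000, "output_tpd": 10_000_000},
}


def min_batch_minutes(tier: int, n_requests: int) -> int:
    """Lower bound on the minutes needed to send n_requests under the RPM cap alone."""
    return math.ceil(n_requests / TIER_LIMITS[tier]["rpm"])


def max_requests_per_day(tier: int, output_tokens_per_request: int) -> int:
    """How many requests the daily output budget allows at a given output size."""
    return TIER_LIMITS[tier]["output_tpd"] // output_tokens_per_request
```

With these numbers, a 500-request job at Tier 1 needs at least 10 minutes under the RPM cap, but a 1,024-token output allowance per request permits only 97 requests per day, so the output token budget is the tighter bound.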
Reading the Error Response
HTTP Status Codes
429 rate_limit_error: RPM or token budget exceeded; wait retry-after seconds
529 overloaded_error: service overload (includes the concurrency cap); use exponential backoff
500 api_error: server error, not a rate limit
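The mapping above can be sketched as a small lookup that a retry loop consults before deciding how to wait. The function name retry_strategy and the returned labels are hypothetical, chosen only for this illustration.

```python
def retry_strategy(status_code: int) -> str:
    """Map an HTTP status code to the handling described in the list above."""
    if status_code == 429:
        # Quota exceeded: the server tells us how long to wait.
        return "wait_retry_after"
    if status_code == 529:
        # Overload (including the concurrency cap): back off exponentially.
        return "exponential_backoff"
    # Anything else, including 500 api_error, is not a rate limit.
    return "not_a_rate_limit"
```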
The 429 Response Headers
retry-after: 12
x-ratelimit-limit-requests: 50
x-ratelimit-remaining-requests: 0
x-ratelimit-reset-requests: 2026-05-11T04:00:00Z

retry-after: exact seconds to wait; following this gives >95% success on the next attempt
x-ratelimit-reset-requests: when your RPM window resets (ISO 8601)
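As a rough sketch of how these headers could be turned into a wait time: prefer retry-after, fall back to the reset timestamp, and only then use a fixed default. The helper name seconds_to_wait is hypothetical, and it assumes the headers arrive as a plain dict with lowercase keys named as in this article.

```python
from datetime import datetime, timezone
from typing import Optional


def seconds_to_wait(
    headers: dict,
    now: Optional[datetime] = None,
    default: float = 60.0,
) -> float:
    """Derive a wait time from rate limit headers (lowercase keys assumed)."""
    if now is None:
        now = datetime.now(timezone.utc)
    if "retry-after" in headers:
        return max(0.0, float(headers["retry-after"]))
    reset = headers.get("x-ratelimit-reset-requests")
    if reset:
        # fromisoformat() accepts a trailing "Z" only on Python 3.11+, so normalize it.
        reset_at = datetime.fromisoformat(reset.replace("Z", "+00:00"))
        return max(0.0, (reset_at - now).total_seconds())
    return default
```

Passing now explicitly keeps the function deterministic in tests; in production code it can be left out.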
Retry Logic That Actually Works
import time

import anthropic

client = anthropic.Anthropic(
    api_key="your-codegateway-key",
    base_url="https://api.codegateway.dev/v1",
)


def call_with_retry(prompt: str, max_retries: int = 5) -> str:
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.content[0].text
        except anthropic.RateLimitError as e:
            # 429: honor the server's retry-after value; fall back to 60s if it is absent.
            # float() rather than int() so a fractional value does not raise.
            retry_after = float(e.response.headers.get("retry-after", 60))
            print(f"Rate limited, waiting {retry_after}s (attempt {attempt + 1})")
            time.sleep(retry_after)
        except anthropic.APIStatusError as e:
            if e.status_code == 529:
                # 529: exponential backoff (30s, 60s, 120s, ...)
                wait = 30 * (2 ** attempt)
                print(f"Overloaded, waiting {wait}s")
                time.sleep(wait)
            else:
                raise
    raise RuntimeError("Max retries exceeded")

Measured from our logs: using the exact retry-after value averages 8–15 seconds. A fixed 60-second wait is 4–7x longer than that.
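One possible refinement of the 529 branch is to cap the backoff and add random jitter, so that parallel workers which were rejected together do not all retry at the same instant. This is a sketch, not something measured in our logs: the helper name backoff_delay, the 300-second cap, and the half-to-full jitter range are assumptions.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 30.0,
    cap: float = 300.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Exponential backoff with a cap and jitter in [ceiling / 2, ceiling]."""
    if rng is None:
        rng = random.Random()
    # Doubles each attempt (30s, 60s, 120s, ...) until it reaches the cap.
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(ceiling / 2, ceiling)
```

In the retry loop, wait = backoff_delay(attempt) would replace the fixed 30 * (2 ** attempt) expression.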
Concurrency: The Limit Developers Miss Most
RPM counts requests per minute. Concurrency counts requests in flight at the same time. They're independent.
Classic footgun: your RPM is 50/minute, but you fire asyncio.gather() with 20 tasks at once → concurrency exceeded → 529, even though your RPM counter is fine.
import asyncio

import anthropic


async def batch_process(prompts: list[str], max_concurrent: int = 5) -> list[str]:
    # The semaphore caps how many requests are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrent)
    client = anthropic.AsyncAnthropic(
        api_key="your-codegateway-key",
        base_url="https://api.codegateway.dev/v1",
    )

    async def single_call(prompt: str) -> str:
        async with semaphore:
            response = await client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.content[0].text

    return await asyncio.gather(*[single_call(p) for p in prompts])

Recommended concurrency limits: 3 for Tier 1, 8–10 for Tier 2+.
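To see that a semaphore really bounds the number of in-flight calls, the pattern can be exercised without touching the API. The sketch below swaps the real request for asyncio.sleep and records the peak concurrency; run_bounded and fake_call are hypothetical names for this demonstration only.

```python
import asyncio


async def run_bounded(n_tasks: int, max_concurrent: int) -> int:
    """Run n_tasks fake calls and return the peak number in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fake_call() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for the API round trip
            in_flight -= 1

    # All tasks are scheduled at once, as with asyncio.gather() over real calls.
    await asyncio.gather(*(fake_call() for _ in range(n_tasks)))
    return peak
```

Calling asyncio.run(run_bounded(20, 5)) schedules 20 tasks at once, yet the peak never exceeds the semaphore size of 5.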
Engineering Techniques to Stay Under Limits
Rate Spreading
import time


def rate_limited_batch(prompts: list[str], rpm_limit: int = 40) -> list[str]:
    # Minimum spacing between request starts. The default of 40 leaves
    # 20% headroom under a 50 RPM cap.
    interval = 60.0 / rpm_limit
    results = []
    for prompt in prompts:
        start = time.monotonic()  # monotonic clock is safe against system clock changes
        results.append(call_with_retry(prompt))  # call_with_retry is defined above
        elapsed = time.monotonic() - start
        if elapsed < interval:
            time.sleep(interval - elapsed)
    return results

Token Budget Awareness
Don't set max_tokens=4096 as a catch-all. Anthropic counts your max_tokens request against your daily token budget, not just what you actually receive. Set it to what you actually need.
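One way to pick a tighter max_tokens is to size it from output lengths you have already observed. The sketch below takes a nearest-rank percentile of past output token counts and adds headroom; the helper name suggest_max_tokens, the 95th percentile, and the 20% headroom are illustrative assumptions, not recommendations from Anthropic.

```python
import math


def suggest_max_tokens(
    observed_output_tokens: list,
    percentile_pct: int = 95,
    headroom_pct: int = 20,
) -> int:
    """Suggest max_tokens from past output sizes: percentile plus headroom."""
    if not observed_output_tokens:
        raise ValueError("need at least one observation")
    ordered = sorted(observed_output_tokens)
    # Nearest-rank percentile: the smallest value covering percentile_pct of samples.
    rank = max(1, math.ceil(percentile_pct * len(ordered) / 100))
    return math.ceil(ordered[rank - 1] * (100 + headroom_pct) / 100)
```

Responses that run longer than the suggested value would be truncated, so the percentile and headroom should be chosen with that trade-off in mind.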
Using Claude API via CodeGateway
When you use CodeGateway as your API endpoint:
Quota is separate: your CodeGateway credit balance and effective throughput are separate from any direct Anthropic account. CodeGateway routes across multiple providers, giving higher aggregate capacity.
Error codes pass through: CodeGateway transparently forwards Anthropic's 429 and 529 status codes and all rate limit headers. The retry code above works without modification.
Distinguishing gateway vs upstream limits: in the rare case CodeGateway imposes a limit, the response includes "source": "gateway" in the error body.
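A minimal sketch of checking for that marker when handling an error response. The exact position of the "source" field in the body is not specified here, so the hypothetical limit_source helper below looks both at the top level and inside the nested "error" object; treat that placement as an assumption.

```python
import json


def limit_source(body: str) -> str:
    """Return "gateway", "upstream", or "unknown" for a raw error response body."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unknown"
    if not isinstance(payload, dict):
        return "unknown"
    error = payload.get("error")
    # Assumption: the marker may sit at the top level or inside "error".
    nested = error.get("source") if isinstance(error, dict) else None
    source = payload.get("source") or nested
    return "gateway" if source == "gateway" else "upstream"
```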
Common Questions
Q: I'm only sending one request at a time. Why am I still getting 429?
A: Most likely multiple processes or services sharing the same API key. Check your CodeGateway request logs to find which service is consuming quota.
Q: Is 529 worse than 429?
A: Neither is worse; both are temporary. 429 means you exceeded your quota, so wait retry-after seconds. 529 is usually Anthropic-side load: it typically clears in 30–60 seconds and can occur even when you are under your RPM quota.
Q: What's the difference between rate_limit_error and overloaded_error?
A: rate_limit_error (429) means you sent too many requests, so your quota is the issue. overloaded_error (529) means Anthropic's infrastructure is under load, which can happen even when you're well under your quota. The fix for 529 is always exponential backoff.
Summary
Rate limits are manageable once you understand the three independent dimensions (RPM, TPD, concurrency) and respect the retry-after signal. The retry pattern above handles 95%+ of real-world rate limit scenarios without manual intervention.
Tested in May 2026 using CodeGateway production request logs and the Anthropic Python SDK.
