14. Rate Limiting and Abuse Prevention: Protecting Against the Attacks That Cost You Money
Part 14 of the LangGraph Agent Security series
Rate limiting is probably the most underappreciated control in LangGraph agent deployments. In conventional web applications, it’s straightforward: prevent brute-force attacks, limit API abuse, protect against credential stuffing. In LangGraph agents, the scope is broader and the consequences more severe.
An agent without effective rate limiting isn’t just vulnerable to service degradation. It’s vulnerable to economic exhaustion. A single manipulated agent running in a loop can generate hundreds of dollars in API costs in minutes. And because agents are designed to multiply a single user request into many downstream operations, the amplification potential is extraordinary.
Section 4.5 covered the threat landscape: infinite loop induction, context window flooding, tool call amplification, sponge attacks. These threats share a property — they exploit the agent’s autonomous, multi-step execution model to convert a bounded user action into unbounded resource consumption. The defenses here are built specifically around that property.
Why Single-Level Rate Limiting Isn’t Enough
In a conventional application, a rate limit at the API gateway is usually sufficient. In a LangGraph agent, abuse can originate at any level of execution, and a limit at one level provides no protection against abuse that stays within that limit while exhausting another.
You need independent controls at five levels:
Level 1: Request Rate Limiting
- Controls how often a user can initiate new agent sessions
- Prevents session-level brute force and statistical injection attempts
Level 2: Session Execution Limits
- Controls how many steps, tool calls, and tokens a single session consumes
- Prevents loop induction and context flooding within a session
Level 3: Tool Call Rate Limiting
- Controls how frequently specific tools can be called
- Prevents tool amplification attacks and external API exhaustion
Level 4: Token Budget Enforcement
- Controls cumulative LLM token consumption per user/session/time period
- Prevents economic DoS through cost amplification
Level 5: Cost Circuit Breakers
- System-wide limits that trigger agent shutdown under abnormal cost
- Last line of defense against runaway agents
Level 1: Request Rate Limiting
Request rate limiting is the first per-user control in the pipeline, applied at the API gateway before any agent logic runs. Redis-backed distributed rate limiting handles multi-instance deployments correctly:
RATE_LIMIT_POLICIES: dict[str, RateLimitConfig] = {
    "viewer": RateLimitConfig(
        sessions_per_minute=2,
        sessions_per_hour=20,
        sessions_per_day=100,
        max_concurrent_sessions=1,
        lockout_duration_seconds=600,
    ),
    "analyst": RateLimitConfig(
        sessions_per_minute=5,
        sessions_per_hour=60,
        sessions_per_day=300,
        max_concurrent_sessions=3,
        lockout_duration_seconds=300,
    ),
    "admin": RateLimitConfig(
        sessions_per_minute=20,
        sessions_per_hour=300,
        sessions_per_day=1500,
        max_concurrent_sessions=10,
        lockout_duration_seconds=60,
    ),
}
The lockout mechanism is important: rapid rate limit violations trigger a temporary lockout, raising the cost of automated attack tooling that submits many variations trying to find one that works. The sliding window algorithm (using Redis sorted sets) handles the multi-window check correctly:
import time

async def check_and_consume(self, user_id, user_role, action="session_start"):
    # Unknown roles fall back to the most restrictive policy
    policy = RATE_LIMIT_POLICIES.get(user_role, RATE_LIMIT_POLICIES["viewer"])
    # Check lockout first
    if await self.redis.exists(f"rl:lockout:{user_id}"):
        ttl = await self.redis.ttl(f"rl:lockout:{user_id}")
        # TTL is negative if the key expired between the two calls
        return False, {"Retry-After": str(max(ttl, 1))}
    now = time.time()
    windows = [
        ("minute", 60, policy.sessions_per_minute),
        ("hour", 3600, policy.sessions_per_hour),
        ("day", 86400, policy.sessions_per_day),
    ]
    # Check all windows via pipeline
    # ...
    # Apply lockout if minute window exceeded
    if violations and violations[0][0] == "minute":
        await self.redis.setex(f"rl:lockout:{user_id}",
                               policy.lockout_duration_seconds, "1")
    # ...
Return Retry-After headers on rate limit responses — this gives legitimate clients the information they need to back off gracefully rather than hammering the limit.
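To make the multi-window check concrete, here is a minimal in-memory sketch of the same sliding-window idea. It is an illustration under assumptions, not the Redis implementation: the `SlidingWindowLimiter` class and its `clock` parameter are hypothetical, timestamps live in a per-user deque instead of a sorted set, and the lockout step is left out.

```python
import math
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Single-process stand-in for a Redis sorted-set sliding window."""

    def __init__(self, windows, clock=time.time):
        # windows: (name, length_seconds, max_events) tuples, each max_events >= 1
        self.windows = windows
        self.clock = clock
        self._events = defaultdict(deque)  # user_id -> timestamps, oldest first

    def check_and_consume(self, user_id):
        now = self.clock()
        events = self._events[user_id]
        longest = max(length for _, length, _ in self.windows)
        # Discard timestamps that have aged out of every window.
        while events and events[0] <= now - longest:
            events.popleft()

        retry_after = 0.0
        for _, length, limit in self.windows:
            in_window = [t for t in events if t > now - length]
            if len(in_window) >= limit:
                # The window has room again once this timestamp ages out.
                retry_after = max(retry_after, in_window[-limit] + length - now)

        if retry_after > 0:
            # Rejected requests are not recorded, so they do not extend the wait.
            return False, {"Retry-After": str(math.ceil(retry_after))}
        events.append(now)
        return True, {}
```

Every window is checked before anything is recorded, so a request that passes the minute limit but fails the hour limit is still rejected, and the reported wait is the longest one any violated window requires.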
Level 2: Session Execution Limits
Once a session is established, a second tier governs its execution. The limits are designed specifically to prevent loop induction and context flooding:
SESSION_LIMITS: dict[str, SessionExecutionLimits] = {
    "viewer": SessionExecutionLimits(
        max_steps=10,
        max_consecutive_llm_calls=3,
        max_tool_calls_total=20,
        max_tool_calls_per_type=5,
        max_identical_tool_calls=2,  # Prevent exact duplicate loops
        max_context_tokens=50_000,
        max_state_size_bytes=1_000_000,
        max_session_duration_seconds=120,
        max_node_execution_seconds=15.0,
    ),
    "analyst": SessionExecutionLimits(
        max_steps=25,
        max_tool_calls_total=50,
        max_identical_tool_calls=3,
        max_context_tokens=100_000,
        max_session_duration_seconds=600,
        max_node_execution_seconds=30.0,
    ),
    # ...
}
The max_identical_tool_calls limit deserves special attention. It catches exact loop execution where the agent calls the same tool with the same arguments repeatedly — a clear signal of either a loop induction attack or a stuck agent:
import hashlib
import json

def check_tool_call_limits(self, state, tool_name, tool_args, limits):
    # ...
    # Duplicate call detection: sort_keys makes the hash independent of key order
    args_hash = hashlib.sha256(
        json.dumps(tool_args, sort_keys=True).encode()
    ).hexdigest()[:16]
    duplicate_calls = [
        c for c in metadata.get("tool_call_hashes", [])
        if c["tool"] == tool_name and c["hash"] == args_hash
    ]
    if len(duplicate_calls) >= limits.max_identical_tool_calls:
        raise ToolCallLimitError(
            f"Tool '{tool_name}' called with identical arguments "
            f"{len(duplicate_calls)} times. Possible loop detected."
        )
Warn at 80% of limits to give operators time to investigate before hard stops are hit.
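One way to express the 80% warning is a small helper that classifies any counter against its limit. This is a sketch: the `classify_usage` name, the `warn_percent` parameter, and the three status strings are assumptions made for illustration, and it treats reaching the limit as a stop because the check runs before the next unit of work.

```python
def classify_usage(current: int, limit: int, warn_percent: int = 80) -> str:
    """Return "ok", "warn", or "stop" for a counter measured against its limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    if current >= limit:
        return "stop"
    # Integer arithmetic avoids floating-point surprises at the boundary.
    if current * 100 >= limit * warn_percent:
        return "warn"
    return "ok"


# With the viewer's max_steps=10, the warning fires at step 8.
status = classify_usage(current=8, limit=10)
```

The same helper can be applied to step counts, tool call totals, and context tokens, so operators see one consistent warning signal across all session limits.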
Loop Detection
Step count limits catch runaway loops eventually. Loop detection catches them earlier by recognizing the pattern rather than just the count.
The key insight: effective loop detection needs to work at the semantic level, not just the exact state level. An adversarially induced loop may produce slightly different state on each iteration while making no genuine progress. Detecting that requires fingerprinting what matters — the agent’s apparent intent and context — rather than the raw state object:
class LoopDetector:
    def record_and_check(self, session_id, state, node_name):
        history = self._state_history[session_id]
        exact_fingerprint = self._exact_fingerprint(state)
        semantic_fingerprint = self._semantic_fingerprint(state)
        history.append({
            "exact": exact_fingerprint,
            "semantic": semantic_fingerprint,
            "node": node_name,
            "step": state.get("metadata", {}).get("step_count", 0),
        })
        # Exact loop: identical state reached N times
        exact_occurrences = sum(1 for e in history if e["exact"] == exact_fingerprint)
        if exact_occurrences >= self.exact_threshold:
            return {"loop_type": "exact", "severity": "high",
                    "detail": f"Identical state reached {exact_occurrences} times."}
        # Semantic loop: similar state, node sequence repeating
        node_sequence = [e["node"] for e in list(history)[-10:]]
        if self._detect_repeating_subsequence(node_sequence):
            semantic_occurrences = sum(1 for e in history
                                       if e["semantic"] == semantic_fingerprint)
            if semantic_occurrences >= self.semantic_threshold:
                return {"loop_type": "semantic", "severity": "medium",
                        "detail": "Similar state reached repeatedly. No real progress."}
        # Oscillation: A→B→A→B pattern
        if len(history) >= 4:
            recent = [e["node"] for e in list(history)[-4:]]
            if recent[0] == recent[2] and recent[1] == recent[3]:
                return {"loop_type": "oscillation", "severity": "high",
                        "detail": f"Agent oscillating: {recent[0]} ↔ {recent[1]}"}
        return None

    def _semantic_fingerprint(self, state: dict) -> str:
        # Fingerprint based on intent/context shape, not exact values
        external = state.get("external", {})
        semantic_features = {
            "has_retrieved_docs": bool(external.get("retrieved_documents")),
            "tool_types_used": sorted(set(
                state.get("metadata", {}).get("tool_calls_made", [])
            )),
            "task_status": state.get("task_status"),
            "draft_length_bucket": len(state.get("draft_response", "") or "") // 500,
        }
        return hashlib.sha256(
            json.dumps(semantic_features, sort_keys=True).encode()
        ).hexdigest()[:16]
Three loop types to catch: exact (identical state), semantic (similar state without progress), and oscillation (A↔B cycling). High-severity loops get a hard stop. Medium-severity loops generate a warning and log for investigation.
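The detector calls `_detect_repeating_subsequence` without showing it. One plausible version, written here as a standalone function, checks whether the most recent nodes form a block that repeats back to back. The `min_period` and `min_repeats` parameters and their defaults are assumptions for this sketch, not values from the detector above.

```python
def detect_repeating_subsequence(nodes, min_period=2, min_repeats=2):
    """Return True if the tail of `nodes` is one block of at least `min_period`
    nodes repeated back to back at least `min_repeats` times."""
    n = len(nodes)
    for period in range(min_period, n // min_repeats + 1):
        block = nodes[n - period:]
        repeats = 1
        end = n - period
        # Walk backwards while the preceding slice matches the trailing block.
        while end - period >= 0 and nodes[end - period:end] == block:
            repeats += 1
            end -= period
        if repeats >= min_repeats:
            return True
    return False


looping = detect_repeating_subsequence(
    ["plan", "retrieve", "draft", "plan", "retrieve", "draft"]
)
progressing = detect_repeating_subsequence(["plan", "retrieve", "draft", "review"])
```

Only the tail of the sequence is examined, so an agent that looped earlier and then recovered is not flagged again on later steps.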
Level 4: Token Budget Enforcement
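Token budgets translate rate limiting into the actual currency of LLM API cost. Budgets are enforced at three scopes (session, user-daily, and tenant), and each covers a different abuse vector. The per-role configuration defines the first two: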
TOKEN_BUDGETS: dict[str, dict[BudgetScope, TokenBudget]] = {
    "viewer": {
        BudgetScope.SESSION: TokenBudget(
            soft_limit=25_000,   # Warn at 25K
            hard_limit=50_000,   # Terminate at 50K
            warning_action="warn_user",
            hard_limit_action="terminate",
        ),
        BudgetScope.USER_DAILY: TokenBudget(
            soft_limit=200_000,
            hard_limit=500_000,
            warning_action="alert_security",
            hard_limit_action="reject_new_sessions",
        ),
    },
    # ...
}
Soft limits trigger warnings at the session level (telling the user they’re using a lot) and security alerts at the daily level (telling the security team something anomalous may be happening). Hard limits terminate the session or reject new sessions.
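The soft/hard split can be sketched as a pure function over a budget and a usage counter. The field names below match the keyword arguments in the configuration above, but the dataclass definition, the `evaluate_budget` and `actions_for` helpers, and the plain string scope keys are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TokenBudget:
    soft_limit: int
    hard_limit: int
    warning_action: str
    hard_limit_action: str


def evaluate_budget(budget: TokenBudget, used: int, requested: int = 0) -> Optional[str]:
    """Return the action to take, or None while usage is below the soft limit."""
    projected = used + requested  # count the call about to be made
    if projected >= budget.hard_limit:
        return budget.hard_limit_action
    if projected >= budget.soft_limit:
        return budget.warning_action
    return None


def actions_for(budgets: dict, usage: dict) -> dict:
    """Evaluate every scope independently; one scope can trip while another is fine."""
    return {scope: evaluate_budget(b, usage.get(scope, 0)) for scope, b in budgets.items()}


viewer = {
    "session": TokenBudget(25_000, 50_000, "warn_user", "terminate"),
    "user_daily": TokenBudget(200_000, 500_000, "alert_security", "reject_new_sessions"),
}
# Many short sessions: this session is small, but the daily total is exhausted.
result = actions_for(viewer, {"session": 5_000, "user_daily": 510_000})
```

Evaluating the scopes separately is what lets the daily budget catch an attacker who keeps each individual session well under its own limit.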
The tenant-level budget is the one that protects against distributed abuse — many coordinated accounts from the same organization all running up costs simultaneously:
async def _check_tenant_budget(self, tenant_id, daily_total):
    tenant_hard_limit = 100_000_000  # 100M tokens/day per tenant
    if daily_total > tenant_hard_limit:
        await self.alerter.fire(alert_type="tenant_token_budget_exceeded", ...)
        raise TokenBudgetExceededError("Tenant daily token budget exceeded.")
Level 5: Cost Circuit Breakers
Token budgets operate per session, per user, and per tenant. Cost circuit breakers operate system-wide. They monitor aggregate spending and shut down agent operations when cost patterns suggest a coordinated attack or runaway system behavior:
CIRCUIT_BREAKER_THRESHOLDS = {
    "max_cost_minute_usd": 50.0,    # $50 in a minute → trip
    "max_cost_hour_usd": 500.0,     # $500 in an hour → trip
    "max_cost_day_usd": 2000.0,     # $2000 in a day → trip
    "recovery_window_seconds": 300,
}
The classic circuit breaker pattern applies: CLOSED (normal) → OPEN (tripped) → HALF_OPEN (testing recovery) → CLOSED (recovered). In the OPEN state, all agent operations are rejected. After the recovery window, a small number of test requests are allowed to verify the system has stabilized before fully reopening.
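The state transitions can be sketched as a small in-memory state machine. This is a single-process illustration only: the class and method names, the injectable `clock`, and the `probe_quota` of test requests are assumptions, and a real deployment would need the state shared across instances.

```python
import time
from enum import Enum


class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CostCircuitBreaker:
    def __init__(self, recovery_window_seconds=300, probe_quota=3, clock=time.monotonic):
        self.recovery_window_seconds = recovery_window_seconds
        self.probe_quota = probe_quota  # test requests allowed while HALF_OPEN (assumed)
        self.clock = clock
        self.state = BreakerState.CLOSED
        self._opened_at = 0.0
        self._probes_left = 0
        self._probes_ok = 0

    def trip(self) -> None:
        self.state = BreakerState.OPEN
        self._opened_at = self.clock()

    def allow_request(self) -> bool:
        if self.state is BreakerState.OPEN:
            if self.clock() - self._opened_at < self.recovery_window_seconds:
                return False
            # Recovery window elapsed: start letting test requests through.
            self.state = BreakerState.HALF_OPEN
            self._probes_left = self.probe_quota
            self._probes_ok = 0
        if self.state is BreakerState.HALF_OPEN:
            if self._probes_left == 0:
                return False
            self._probes_left -= 1
        return True

    def record_probe_result(self, healthy: bool) -> None:
        if self.state is not BreakerState.HALF_OPEN:
            return
        if not healthy:
            self.trip()  # any bad probe reopens the breaker
            return
        self._probes_ok += 1
        if self._probes_ok >= self.probe_quota:
            self.state = BreakerState.CLOSED
```

Requiring every probe to report healthy before closing is a conservative choice; a looser policy would reopen traffic sooner at the risk of tripping again.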
Circuit breaker state lives in Redis for distributed consistency — all agent instances see the same state:
async def record_cost_event(self, cost_usd, time_window="minute", context=None):
    # Accumulate cost in rolling window using Redis sorted sets
    # ...
    # Trip if threshold exceeded
    if total_cost > threshold:
        await self._trip(
            reason=f"Cost threshold exceeded: ${total_cost:.2f}/{time_window}",
            context=context,
        )
Circuit breaker trips must trigger immediate alerts. This is the most aggressive automated response in the system — it stops all agent operations. It should be rare. If it’s tripping regularly, either the thresholds are too aggressive or there’s a recurring abuse problem that needs investigation.
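The rolling-window accumulation that `record_cost_event` leaves out can be sketched in memory as follows. The `RollingCostWindow` class is hypothetical and keeps events in a local deque, so it illustrates the arithmetic only, not the distributed Redis version.

```python
import time
from collections import deque


class RollingCostWindow:
    """Tracks spend over the trailing `window_seconds` for one process."""

    def __init__(self, window_seconds, threshold_usd, clock=time.time):
        self.window_seconds = window_seconds
        self.threshold_usd = threshold_usd
        self.clock = clock
        self._events = deque()  # (timestamp, cost_usd), oldest first

    def record(self, cost_usd: float) -> bool:
        """Add a cost event; return True if the window total exceeds the threshold."""
        now = self.clock()
        # Expire events that have left the window before adding the new one.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()
        self._events.append((now, cost_usd))
        total = sum(cost for _, cost in self._events)
        return total > self.threshold_usd
```

One instance per threshold (minute, hour, day) mirrors the configuration above; a `True` return is the point at which the breaker would trip and alert.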
Level 3: Tool-Level Rate Limiting
Some tools warrant their own rate limits independent of session limits:
TOOL_RATE_LIMITS: dict[str, ToolRateLimit] = {
    # Communication tools: very conservative
    "send_email": ToolRateLimit(
        calls_per_minute=2,
        calls_per_hour=20,
        concurrent_calls=1,
        cooldown_after_error_seconds=60,
    ),
    # Code execution: expensive
    "execute_python": ToolRateLimit(
        calls_per_minute=10,
        calls_per_hour=100,
        concurrent_calls=2,
        cooldown_after_error_seconds=30,
    ),
    # External search: respect the API's own limits
    "search_web": ToolRateLimit(
        calls_per_minute=20,
        calls_per_hour=200,
        concurrent_calls=3,
        cooldown_after_error_seconds=10,
    ),
}
The error cooldown with exponential backoff is important for operational stability: if a tool keeps failing, don’t let the agent (or an attacker) hammer it repeatedly. Each failure doubles the cooldown, starting from the tool’s cooldown_after_error_seconds, up to a maximum of one hour.
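The doubling rule works out to base × 2^(failures − 1), capped at 3600 seconds. A minimal sketch, assuming the base is the tool's `cooldown_after_error_seconds` and that failures are counted consecutively (the function name and the reset-on-success behaviour are assumptions):

```python
def error_cooldown_seconds(base_seconds: float, consecutive_failures: int,
                           max_seconds: float = 3600.0) -> float:
    """Cooldown after the Nth consecutive failure: doubles each time, capped."""
    if consecutive_failures <= 0:
        return 0.0
    # Clamp the exponent so very long failure streaks cannot overflow.
    exponent = min(consecutive_failures - 1, 32)
    return float(min(base_seconds * 2 ** exponent, max_seconds))


# send_email has a 60-second base: 60, 120, 240, ... then the one-hour cap.
schedule = [error_cooldown_seconds(60, n) for n in range(1, 8)]
```

With a 60-second base the cap is reached on the seventh failure, since 60 × 2^6 = 3840 exceeds 3600.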
Detecting Coordinated Abuse
Beyond individual rate limits, systematic abuse involves patterns that span multiple sessions and users. An attacker with many accounts can stay within per-session limits while causing significant aggregate damage:
class AbusePatternDetector:
    async def check_patterns(self, user_id, session_id, event_type, context):
        patterns = []
        # Pattern 1: Many session failures from one user
        # More than 20 failures per hour suggests automated attack tooling
        failure_count = await self._check_failure_rate(user_id)
        if failure_count > 20:
            patterns.append({
                "type": "high_failure_rate",
                "detail": f"{failure_count} session failures in last hour. Possible automated attack.",
                "severity": "high",
            })
        # Pattern 2: Concentrated injection attempts
        # More than 10 injection detections per hour from one user → targeted attack
        injection_count = await self._check_injection_rate(user_id)
        if injection_count > 10:
            patterns.append({
                "type": "concentrated_injection_attempts",
                "severity": "critical",
                "detail": f"{injection_count} injection detections in last hour.",
            })
        # Pattern 3: Cross-session abuse of high-risk tools
        # send_email called 50+ times across sessions in 24h
        if event_type == "tool_call":
            tool_pattern = await self._check_tool_abuse(
                user_id, context.get("tool_name", "")
            )
            if tool_pattern:
                patterns.append(tool_pattern)
        # Pattern 4: Coordinated tenant-wide abuse
        # 1000+ rate-limited events from one tenant in an hour
        if context.get("tenant_id"):
            tenant_pattern = await self._check_tenant_coordination(
                context["tenant_id"]
            )
            if tenant_pattern:
                patterns.append(tenant_pattern)
        return patterns
The concentrated injection attempt detection is the one that’s caught the most in my testing. A user who’s genuinely having their messages misclassified as injection might trigger a few flags. A user running automated tooling trying to find an injection payload that works triggers ten or more in an hour.
Putting It All Together
The five levels of rate limiting integrate as a pipeline in the request processing flow:
class AbusePreventionPipeline:
    # Lookups of role, policy, limits, budgets, session_id, and start_time are elided.
    async def check_session_start(self, user):
        # 1. System circuit breaker: stops everything if tripped
        await self.circuit_breaker.check()
        # 2. Request rate limits
        allowed, headers = await self.rate_limiter.check_and_consume(
            user_id=user.user_id, user_role=role, action="session_start"
        )
        if not allowed:
            raise RateLimitExceededError(...)
        # 3. Concurrent session check
        concurrent = await self.rate_limiter.get_concurrent_session_count(user.user_id)
        if concurrent >= policy.max_concurrent_sessions:
            raise RateLimitExceededError(...)
        # 4. Remaining token budget
        if budgets.get("user_daily", float("inf")) <= 0:
            raise TokenBudgetExceededError("Daily budget exhausted.")

    async def check_execution_step(self, state, node_name, user):
        # 1. Step count limit
        self.execution_guard.check_step_limit(state, limits)
        # 2. Session duration
        self.execution_guard.check_session_duration(start_time, limits, session_id)
        # 3. State size
        self.execution_guard.check_state_size(state, limits)
        # 4. Loop detection (the detector reports its message under "detail")
        loop_result = self.loop_detector.record_and_check(session_id, state, node_name)
        if loop_result and loop_result["severity"] == "high":
            raise LoopDetectedError(loop_result["detail"])

    async def check_tool_call(self, tool_name, tool_args, state, user):
        # 1. Session-level tool limits
        self.execution_guard.check_tool_call_limits(state, tool_name, tool_args, limits)
        # 2. Tool-level rate limits
        allowed, retry_after = await self.tool_rate_limiter.check_and_acquire(
            tool_name=tool_name, session_id=session_id, user_id=user.user_id
        )
        if not allowed:
            raise ToolRateLimitError(...)
What I’ve Learned
The most important insight about agent rate limiting: it’s not the same as web application rate limiting, and treating it as such leaves large gaps.
Web rate limiting is primarily about frequency of incoming requests. Agent rate limiting has to cover that plus the autonomous execution within each session, plus the token economics that don’t exist in conventional applications. An agent that receives one request per minute can still cause enormous damage if each session runs for 200 steps consuming 500K tokens. The conventional rate limit at the gateway doesn’t see any of that.
The second thing I’ve learned: token budgets need to be both per-session and cumulative. A per-session limit without a daily cumulative limit lets an attacker run many short sessions. A daily cumulative limit without a per-session limit lets one runaway session exhaust the entire day’s budget. You need both.
And circuit breakers, while the most aggressive control, are the one I’ve come to appreciate most. They’re the safety net under everything else. If all the other controls fail simultaneously — unlikely but possible — the circuit breaker trips before the bill becomes catastrophic. That’s worth having.
Rate Limiting Checklist
Request rate limiting:
- Per-user limits defined at minute, hour, and day granularity for all roles
- Concurrent session limits defined per role
- Distributed rate limiter (Redis) for multi-instance deployments
- Lockout periods applied after rapid violations
- Retry-After headers returned in rate limit responses
Session execution limits:
- Maximum step count enforced per session per role
- Maximum total tool calls and per-tool-type limits enforced
- Duplicate tool call detection active
- Maximum session duration with hard wall-clock cutoff
- State size limits prevent unbounded accumulation
Loop detection:
- Exact state fingerprint loop detection active
- Semantic loop detection for approximate repetition
- Node oscillation detection for A↔B cycling
- Rolling window covers sufficient recent history
Token budgets:
- Per-session budgets defined for all roles
- Per-user-daily budgets tracked across sessions
- Tenant-level budgets protect against distributed abuse
- Soft limits trigger warnings before hard limits terminate
Circuit breakers:
- Cost thresholds configured for minute, hour, and day
- Circuit breaker state in distributed store
- Recovery window defined and tested
- Trips trigger immediate alerts to operations team
Tool-level limits:
- High-impact tools have per-minute and per-hour limits
- Concurrent call limits prevent thundering herd
- Error cooldowns with exponential backoff on failures
Abuse detection:
- Cross-session failure rate monitoring active
- Injection attempt concentration detection active
- Cross-session high-risk tool abuse tracked
- Tenant-level coordination detection active
This is Part 14 of an ongoing series on LangGraph agent security. Previous posts: Part 1: Introduction · Part 2: Architecture Primer · Part 3: Attack Surface Analysis · Part 4: Core Threat Categories · Part 5: Threat Modeling · Part 6: Input Validation · Part 7: Tool Security · Part 8: State and Memory Security · Part 9: Multi-Agent Trust Boundaries · Part 10: Output Guardrails · Part 11: Authentication and Authorization · Part 12: Observability and Monitoring · Part 13: Human-in-the-Loop.