14. Rate Limiting and Abuse Prevention: Protecting Against the Attacks That Cost You Money

Part 14 of the LangGraph Agent Security series

Rate limiting is probably the most underappreciated control in LangGraph agent deployments. In conventional web applications, it’s straightforward: prevent brute-force attacks, limit API abuse, protect against credential stuffing. In LangGraph agents, the scope is broader and the consequences more severe.

An agent without effective rate limiting isn’t just vulnerable to service degradation. It’s vulnerable to economic exhaustion. A single manipulated agent running in a loop can generate hundreds of dollars in API costs in minutes. And because agents are designed to multiply a single user request into many downstream operations, the amplification potential is extraordinary.

Section 4.5 covered the threat landscape: infinite loop induction, context window flooding, tool call amplification, sponge attacks. These threats share a property — they exploit the agent’s autonomous, multi-step execution model to convert a bounded user action into unbounded resource consumption. The defenses here are built specifically around that property.

Why Single-Level Rate Limiting Isn’t Enough

In a conventional application, a rate limit at the API gateway is usually sufficient. In a LangGraph agent, abuse can originate at any level of execution, and a limit at one level provides no protection against abuse that stays within that limit while exhausting another.

You need independent controls at five levels:

Level 1: Request Rate Limiting
  Controls how often a user can initiate new agent sessions
  Prevents session-level brute force and statistical injection attempts

Level 2: Session Execution Limits
  Controls how many steps, tool calls, and tokens a single session consumes
  Prevents loop induction and context flooding within a session

Level 3: Tool Call Rate Limiting
  Controls how frequently specific tools can be called
  Prevents tool amplification attacks and external API exhaustion

Level 4: Token Budget Enforcement
  Controls cumulative LLM token consumption per user/session/time period
  Prevents economic DoS through cost amplification

Level 5: Cost Circuit Breakers
  System-wide limits that trigger agent shutdown under abnormal cost
  Last line of defense against runaway agents

Level 1: Request Rate Limiting

Request rate limiting is the first control in the pipeline, applied at the API gateway before any agent logic runs. Backing it with Redis keeps the counts consistent across multi-instance deployments:

RATE_LIMIT_POLICIES: dict[str, RateLimitConfig] = {
    "viewer": RateLimitConfig(
        sessions_per_minute=2,
        sessions_per_hour=20,
        sessions_per_day=100,
        max_concurrent_sessions=1,
        lockout_duration_seconds=600,
    ),
    "analyst": RateLimitConfig(
        sessions_per_minute=5,
        sessions_per_hour=60,
        sessions_per_day=300,
        max_concurrent_sessions=3,
        lockout_duration_seconds=300,
    ),
    "admin": RateLimitConfig(
        sessions_per_minute=20,
        sessions_per_hour=300,
        sessions_per_day=1500,
        max_concurrent_sessions=10,
        lockout_duration_seconds=60,
    ),
}

The lockout mechanism is important: rapid rate limit violations trigger a temporary lockout, raising the cost of automated attack tooling that submits many variations trying to find one that works. The sliding window algorithm (using Redis sorted sets) handles the multi-window check correctly:

async def check_and_consume(self, user_id, user_role, action="session_start"):
    policy = RATE_LIMIT_POLICIES.get(user_role, RATE_LIMIT_POLICIES["viewer"])

    # Check lockout first. TTL returns a negative value when the key is
    # missing, so one call covers both the existence check and Retry-After.
    ttl = await self.redis.ttl(f"rl:lockout:{user_id}")
    if ttl > 0:
        return False, {"Retry-After": str(ttl)}

    now = time.time()
    windows = [
        ("minute", 60, policy.sessions_per_minute),
        ("hour", 3600, policy.sessions_per_hour),
        ("day", 86400, policy.sessions_per_day),
    ]

    # Check all windows via pipeline
    # ...

    # Apply lockout if minute window exceeded
    if violations and violations[0][0] == "minute":
        await self.redis.setex(f"rl:lockout:{user_id}",
                               policy.lockout_duration_seconds, "1")
    # ...

Return Retry-After headers on rate limit responses — this gives legitimate clients the information they need to back off gracefully rather than hammering the limit.
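The multi-window check elided in the snippet above can be sketched without Redis. This is a minimal in-memory version, assuming a single process; `SlidingWindowLimiter` and its `now` parameter are hypothetical names for illustration, not the Redis sorted-set implementation itself. It also shows one way to derive the Retry-After value: the number of seconds until enough old events leave the violated window.

```python
import math
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """In-memory stand-in for a Redis sorted-set limiter (hypothetical)."""

    def __init__(self, windows):
        # windows: list of (name, window_seconds, max_events), limits > 0
        self.windows = windows
        self.max_window = max(seconds for _, seconds, _ in windows)
        self.events = defaultdict(deque)  # user_id -> timestamps, oldest first

    def check_and_consume(self, user_id, now=None):
        """Return (allowed, retry_after_seconds)."""
        now = time.time() if now is None else now
        stamps = self.events[user_id]

        # Drop timestamps that have left the largest window.
        while stamps and stamps[0] <= now - self.max_window:
            stamps.popleft()

        retry_after = 0.0
        for _name, seconds, limit in self.windows:
            in_window = [t for t in stamps if t > now - seconds]
            if len(in_window) >= limit:
                # A new event fits once this timestamp ages out of the window.
                blocking = in_window[len(in_window) - limit]
                retry_after = max(retry_after, blocking + seconds - now)

        if retry_after > 0:
            return False, math.ceil(retry_after)

        # Only allowed requests are recorded.
        stamps.append(now)
        return True, 0


limiter = SlidingWindowLimiter([("minute", 60, 2), ("hour", 3600, 3)])
first = limiter.check_and_consume("user-1", now=0.0)
second = limiter.check_and_consume("user-1", now=1.0)
third = limiter.check_and_consume("user-1", now=2.0)   # minute window is full
```

In this sketch rejected requests are not counted against the window. Whether they should be is a policy choice; counting them penalizes clients that keep retrying.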

Level 2: Session Execution Limits

Once a session is established, a second tier governs its execution. The limits are designed specifically to prevent loop induction and context flooding:

SESSION_LIMITS: dict[str, SessionExecutionLimits] = {
    "viewer": SessionExecutionLimits(
        max_steps=10,
        max_consecutive_llm_calls=3,
        max_tool_calls_total=20,
        max_tool_calls_per_type=5,
        max_identical_tool_calls=2,     # Prevent exact duplicate loops
        max_context_tokens=50_000,
        max_state_size_bytes=1_000_000,
        max_session_duration_seconds=120,
        max_node_execution_seconds=15.0,
    ),
    "analyst": SessionExecutionLimits(
        max_steps=25,
        max_tool_calls_total=50,
        max_identical_tool_calls=3,
        max_context_tokens=100_000,
        max_session_duration_seconds=600,
        max_node_execution_seconds=30.0,
    ),
    # ...
}

The max_identical_tool_calls limit deserves special attention. It catches the case where the agent calls the same tool with the same arguments repeatedly, which is a clear signal of either a loop induction attack or a stuck agent:

def check_tool_call_limits(self, state, tool_name, tool_args, limits):
    # ...

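    # Duplicate call detection: hash the canonicalized arguments so key
    # order does not change the fingerprint. default=str keeps arguments
    # that are not JSON-serializable (e.g. datetimes) from raising.
    args_hash = hashlib.sha256(
        json.dumps(tool_args, sort_keys=True, default=str).encode()
    ).hexdigest()[:16]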

    duplicate_calls = [
        c for c in metadata.get("tool_call_hashes", [])
        if c["tool"] == tool_name and c["hash"] == args_hash
    ]

    if len(duplicate_calls) >= limits.max_identical_tool_calls:
        raise ToolCallLimitError(
            f"Tool '{tool_name}' called with identical arguments "
            f"{len(duplicate_calls)} times. Possible loop detected."
        )

Warn at 80% of limits to give operators time to investigate before hard stops are hit.
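The 80% rule is simple enough to express as one helper. This is a sketch only: `limit_status` and `warn_ratio` are hypothetical names, and treating "reached the limit" as exceeded is an assumption.

```python
def limit_status(used: int, limit: int, warn_ratio: float = 0.8) -> str:
    """Classify usage against a limit as 'ok', 'warn', or 'exceeded'."""
    if used >= limit:
        return "exceeded"
    if used >= limit * warn_ratio:
        return "warn"
    return "ok"


# Against the viewer limit of max_tool_calls_total=20:
status_at_15 = limit_status(15, 20)
status_at_16 = limit_status(16, 20)
status_at_20 = limit_status(20, 20)
```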

Loop Detection

Step count limits catch runaway loops eventually. Loop detection catches them earlier by recognizing the pattern rather than just the count.

The key insight: effective loop detection needs to work at the semantic level, not just the exact state level. An adversarially induced loop may produce slightly different state on each iteration while making no genuine progress. Detecting that requires fingerprinting what matters — the agent’s apparent intent and context — rather than the raw state object:

class LoopDetector:
    def record_and_check(self, session_id, state, node_name):
        history = self._state_history[session_id]

        exact_fingerprint = self._exact_fingerprint(state)
        semantic_fingerprint = self._semantic_fingerprint(state)

        history.append({
            "exact": exact_fingerprint,
            "semantic": semantic_fingerprint,
            "node": node_name,
            "step": state.get("metadata", {}).get("step_count", 0),
        })

        # Exact loop: identical state reached N times
        exact_occurrences = sum(1 for e in history if e["exact"] == exact_fingerprint)
        if exact_occurrences >= self.exact_threshold:
            return {"loop_type": "exact", "severity": "high",
                    "detail": f"Identical state reached {exact_occurrences} times."}

        # Semantic loop: similar state, node sequence repeating
        node_sequence = [e["node"] for e in list(history)[-10:]]
        if self._detect_repeating_subsequence(node_sequence):
            semantic_occurrences = sum(1 for e in history
                                      if e["semantic"] == semantic_fingerprint)
            if semantic_occurrences >= self.semantic_threshold:
                return {"loop_type": "semantic", "severity": "medium",
                        "detail": "Similar state reached repeatedly. No real progress."}

        # Oscillation: A→B→A→B pattern between two distinct nodes
        if len(history) >= 4:
            recent = [e["node"] for e in list(history)[-4:]]
            if (recent[0] == recent[2] and recent[1] == recent[3]
                    and recent[0] != recent[1]):
                return {"loop_type": "oscillation", "severity": "high",
                        "detail": f"Agent oscillating: {recent[0]} ↔ {recent[1]}"}

        return None

    def _semantic_fingerprint(self, state: dict) -> str:
        # Fingerprint based on intent/context shape, not exact values
        external = state.get("external", {})
        semantic_features = {
            "has_retrieved_docs": bool(external.get("retrieved_documents")),
            "tool_types_used": sorted(set(
                state.get("metadata", {}).get("tool_calls_made", [])
            )),
            "task_status": state.get("task_status"),
            "draft_length_bucket": len(state.get("draft_response", "") or "") // 500,
        }
        return hashlib.sha256(
            json.dumps(semantic_features, sort_keys=True).encode()
        ).hexdigest()[:16]

Three loop types to catch: exact (identical state), semantic (similar state without progress), and oscillation (A↔B cycling). High-severity loops get a hard stop. Medium-severity loops generate a warning and are logged for investigation.
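The detector above calls `_detect_repeating_subsequence` without showing it. One plausible way to write it, as a standalone function, is sketched here. The `min_len` and `min_repeats` parameters are assumptions, and this version only looks for a block that repeats back to back at the end of the node sequence.

```python
def detect_repeating_subsequence(nodes, min_len=2, min_repeats=2):
    """Return True if `nodes` ends with a block of at least `min_len` nodes
    repeated back to back at least `min_repeats` times."""
    n = len(nodes)
    for block_len in range(min_len, n // min_repeats + 1):
        block = nodes[n - block_len:]
        repeats = 1
        end = n - block_len
        # Walk backwards while the preceding slice matches the trailing block.
        while end - block_len >= 0 and nodes[end - block_len:end] == block:
            repeats += 1
            end -= block_len
        if repeats >= min_repeats:
            return True
    return False


looping = detect_repeating_subsequence(
    ["plan", "search", "summarize", "plan", "search", "summarize"]
)
progressing = detect_repeating_subsequence(["plan", "search", "draft", "review"])
```

Because it matches on node names alone, a repeating sequence is only a hint. That is why the detector pairs it with the semantic fingerprint count before reporting a loop.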

Level 4: Token Budget Enforcement

Token budgets translate rate limiting into the actual currency of LLM API cost. Three scopes (per session, per user per day, and per tenant) cover different abuse vectors. The first two are configured per role:

TOKEN_BUDGETS: dict[str, dict[BudgetScope, TokenBudget]] = {
    "viewer": {
        BudgetScope.SESSION: TokenBudget(
            soft_limit=25_000,    # Warn at 25K
            hard_limit=50_000,    # Terminate at 50K
            warning_action="warn_user",
            hard_limit_action="terminate",
        ),
        BudgetScope.USER_DAILY: TokenBudget(
            soft_limit=200_000,
            hard_limit=500_000,
            warning_action="alert_security",
            hard_limit_action="reject_new_sessions",
        ),
    },
    # ...
}

Soft limits trigger warnings at the session level (telling the user they’re using a lot) and security alerts at the daily level (telling the security team something anomalous may be happening). Hard limits terminate the session or reject new sessions.
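The soft and hard limit behavior can be sketched as a pure function over a running total. `Budget` below is a hypothetical stand-in for the `TokenBudget` config, and two choices are assumptions: the warning fires once, on the update that crosses the soft limit, and reaching the hard limit counts as hitting it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Budget:
    # Hypothetical mirror of the TokenBudget fields used in this post.
    soft_limit: int
    hard_limit: int
    warning_action: str
    hard_limit_action: str


def budget_action(budget: Budget, used_before: int, new_tokens: int) -> Optional[str]:
    """Return the action triggered by adding new_tokens, or None."""
    used_after = used_before + new_tokens
    if used_after >= budget.hard_limit:
        return budget.hard_limit_action
    # Fire the warning only on the update that crosses the soft limit.
    if used_before < budget.soft_limit <= used_after:
        return budget.warning_action
    return None


viewer_session = Budget(25_000, 50_000, "warn_user", "terminate")
crossing_soft = budget_action(viewer_session, 20_000, 6_000)
crossing_hard = budget_action(viewer_session, 45_000, 5_000)
```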

The tenant-level budget is the one that protects against distributed abuse — many coordinated accounts from the same organization all running up costs simultaneously:

async def _check_tenant_budget(self, tenant_id, daily_total):
    tenant_hard_limit = 100_000_000  # 100M tokens/day per tenant
    if daily_total > tenant_hard_limit:
        await self.alerter.fire(alert_type="tenant_token_budget_exceeded", ...)
        raise TokenBudgetExceededError("Tenant daily token budget exceeded.")

Level 5: Cost Circuit Breakers

Token budgets operate per session, per user, and per tenant. Cost circuit breakers operate system-wide. They monitor aggregate spending and shut down agent operations when cost patterns suggest a coordinated attack or runaway system behavior:

CIRCUIT_BREAKER_THRESHOLDS = {
    "max_cost_minute_usd": 50.0,    # $50 in a minute → trip
    "max_cost_hour_usd": 500.0,     # $500 in an hour → trip
    "max_cost_day_usd": 2000.0,     # $2000 in a day → trip
    "recovery_window_seconds": 300,
}

The classic circuit breaker pattern applies: CLOSED (normal) → OPEN (tripped) → HALF_OPEN (testing recovery) → CLOSED (recovered). In the OPEN state, all agent operations are rejected. After the recovery window, a small number of test requests are allowed to verify the system has stabilized before fully reopening.
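The state transitions can be sketched in memory as follows. This is an illustration under assumptions, not the Redis-backed version: `CostCircuitBreaker`, `probes_required`, and the injectable `clock` are hypothetical, and "a small number of test requests" is modeled as a fixed number of probes that must all come back healthy.

```python
import time
from enum import Enum


class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CostCircuitBreaker:
    def __init__(self, recovery_window_seconds=300, probes_required=3,
                 clock=time.monotonic):
        self.recovery_window = recovery_window_seconds
        self.probes_required = probes_required
        self.clock = clock
        self.state = BreakerState.CLOSED
        self.opened_at = None
        self.probes_sent = 0
        self.probes_ok = 0

    def trip(self):
        """CLOSED or HALF_OPEN -> OPEN."""
        self.state = BreakerState.OPEN
        self.opened_at = self.clock()

    def allow_request(self):
        if self.state is BreakerState.CLOSED:
            return True
        if self.state is BreakerState.OPEN:
            if self.clock() - self.opened_at < self.recovery_window:
                return False
            # Recovery window elapsed: OPEN -> HALF_OPEN.
            self.state = BreakerState.HALF_OPEN
            self.probes_sent = 0
            self.probes_ok = 0
        # HALF_OPEN: admit only a limited number of probe requests.
        if self.probes_sent >= self.probes_required:
            return False
        self.probes_sent += 1
        return True

    def record_result(self, healthy):
        """Report the outcome of a probe admitted while HALF_OPEN."""
        if self.state is not BreakerState.HALF_OPEN:
            return
        if not healthy:
            self.trip()  # HALF_OPEN -> OPEN
            return
        self.probes_ok += 1
        if self.probes_ok >= self.probes_required:
            self.state = BreakerState.CLOSED  # HALF_OPEN -> CLOSED
```

Passing the clock in as a parameter makes the recovery window testable without sleeping, which helps with the "recovery window defined and tested" item in the checklist.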

Circuit breaker state lives in Redis for distributed consistency — all agent instances see the same state:

async def record_cost_event(self, cost_usd, time_window="minute", context=None):
    # Accumulate cost in rolling window using Redis sorted sets
    # ...

    # Trip if threshold exceeded
    if total_cost > threshold:
        await self._trip(
            reason=f"Cost threshold exceeded: ${total_cost:.2f}/{time_window}",
            context=context,
        )

Circuit breaker trips must trigger immediate alerts. This is the most aggressive automated response in the system — it stops all agent operations. It should be rare. If it’s tripping regularly, either the thresholds are too aggressive or there’s a recurring abuse problem that needs investigation.

Tool-Level Rate Limiting

This is Level 3 in the list at the top of the post. Some tools warrant their own rate limits independent of session limits:

TOOL_RATE_LIMITS: dict[str, ToolRateLimit] = {
    # Communication tools — very conservative
    "send_email": ToolRateLimit(
        calls_per_minute=2,
        calls_per_hour=20,
        concurrent_calls=1,
        cooldown_after_error_seconds=60,
    ),
    # Code execution — expensive
    "execute_python": ToolRateLimit(
        calls_per_minute=10,
        calls_per_hour=100,
        concurrent_calls=2,
        cooldown_after_error_seconds=30,
    ),
    # External search — respect the API's own limits
    "search_web": ToolRateLimit(
        calls_per_minute=20,
        calls_per_hour=200,
        concurrent_calls=3,
        cooldown_after_error_seconds=10,
    ),
}

The error cooldown with exponential backoff is important for operational stability: if a tool keeps failing, don’t let the agent (or an attacker) hammer it repeatedly. Each failure doubles the cooldown up to a maximum of one hour.
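The doubling rule works out as follows. This is a sketch: `error_cooldown_seconds` is a hypothetical helper, and it assumes the first failure gets the configured `cooldown_after_error_seconds` with each further consecutive failure doubling it.

```python
def error_cooldown_seconds(base_seconds: float, consecutive_failures: int,
                           max_seconds: float = 3600.0) -> float:
    """Cooldown after N consecutive failures: base * 2^(N-1), capped."""
    if consecutive_failures <= 0:
        return 0.0
    # Cap the exponent so very long failure streaks cannot overflow.
    exponent = min(consecutive_failures - 1, 32)
    return min(base_seconds * 2 ** exponent, max_seconds)


# send_email has cooldown_after_error_seconds=60
send_email_cooldowns = [error_cooldown_seconds(60, n) for n in range(1, 8)]
```

With a 60-second base the cooldowns run 60, 120, 240, 480, 960, 1920 seconds, and the seventh consecutive failure hits the one-hour cap.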

Detecting Coordinated Abuse

Beyond individual rate limits, systematic abuse involves patterns that span multiple sessions and users. An attacker with many accounts can stay within per-session limits while causing significant aggregate damage:

class AbusePatternDetector:
    async def check_patterns(self, user_id, session_id, event_type, context):
        patterns = []

        # Pattern 1: Many session failures from one user
        # More than 20 failures per hour suggests automated attack tooling
        failure_count = await self._check_failure_rate(user_id)
        if failure_count > 20:
            patterns.append({
                "type": "high_failure_rate",
                "detail": f"{failure_count} session failures in last hour. Possible automated attack.",
                "severity": "high",
            })

        # Pattern 2: Concentrated injection attempts
        # More than 10 injection detections per hour from one user → targeted attack
        injection_count = await self._check_injection_rate(user_id)
        if injection_count > 10:
            patterns.append({
                "type": "concentrated_injection_attempts",
                "severity": "critical",
                "detail": f"{injection_count} injection detections in last hour.",
            })

        # Pattern 3: Cross-session abuse of high-risk tools
        # send_email called 50+ times across sessions in 24h
        if event_type == "tool_call":
            tool_pattern = await self._check_tool_abuse(
                user_id, context.get("tool_name", "")
            )
            if tool_pattern:
                patterns.append(tool_pattern)

        # Pattern 4: Coordinated tenant-wide abuse
        # 1000+ rate-limited events from one tenant in an hour
        if context.get("tenant_id"):
            tenant_pattern = await self._check_tenant_coordination(
                context["tenant_id"]
            )
            if tenant_pattern:
                patterns.append(tenant_pattern)

        return patterns

The concentrated injection attempt detection is the one that’s caught the most in my testing. A user who’s genuinely having their messages misclassified as injection might trigger a few flags. A user running automated tooling trying to find an injection payload that works triggers ten or more in an hour.

Putting It All Together

The five levels of rate limiting integrate as a pipeline in the request processing flow:

class AbusePreventionPipeline:
    async def check_session_start(self, user):
        # 1. System circuit breaker — stops everything if tripped
        await self.circuit_breaker.check()

        # 2. Request rate limits
        allowed, headers = await self.rate_limiter.check_and_consume(
            user_id=user.user_id, user_role=role, action="session_start"
        )
        if not allowed:
            raise RateLimitExceededError(...)

        # 3. Concurrent session check
        concurrent = await self.rate_limiter.get_concurrent_session_count(user.user_id)
        if concurrent >= policy.max_concurrent_sessions:
            raise RateLimitExceededError(...)

        # 4. Remaining token budget
        if budgets.get("user_daily", float("inf")) <= 0:
            raise TokenBudgetExceededError("Daily budget exhausted.")

    async def check_execution_step(self, state, node_name, user):
        # 1. Step count limit
        self.execution_guard.check_step_limit(state, limits)

        # 2. Session duration
        self.execution_guard.check_session_duration(start_time, limits, session_id)

        # 3. State size
        self.execution_guard.check_state_size(state, limits)

        # 4. Loop detection
        loop_result = self.loop_detector.record_and_check(session_id, state, node_name)
        if loop_result and loop_result["severity"] == "high":
            raise LoopDetectedError(loop_result["detail"])

    async def check_tool_call(self, tool_name, tool_args, state, user):
        # 1. Session-level tool limits
        self.execution_guard.check_tool_call_limits(state, tool_name, tool_args, limits)

        # 2. Tool-level rate limits
        allowed, retry_after = await self.tool_rate_limiter.check_and_acquire(
            tool_name=tool_name, session_id=session_id, user_id=user.user_id
        )
        if not allowed:
            raise ToolRateLimitError(...)

What I’ve Learned

The most important insight about agent rate limiting: it’s not the same as web application rate limiting, and treating it as such leaves large gaps.

Web rate limiting is primarily about frequency of incoming requests. Agent rate limiting has to cover that plus the autonomous execution within each session, plus the token economics that don’t exist in conventional applications. An agent that receives one request per minute can still cause enormous damage if each session runs for 200 steps consuming 500K tokens. The conventional rate limit at the gateway doesn’t see any of that.

The second thing I’ve learned: token budgets need to be both per-session and cumulative. A per-session limit without a daily cumulative limit lets an attacker run many short sessions. A daily cumulative limit without a per-session limit lets one runaway session exhaust the entire day’s budget. You need both.

And circuit breakers, while the most aggressive control, are the one I’ve come to appreciate most. They’re the safety net under everything else. If all the other controls fail simultaneously — unlikely but possible — the circuit breaker trips before the bill becomes catastrophic. That’s worth having.

Rate Limiting Checklist

Request rate limiting:

Per-user limits defined at minute, hour, and day granularity for all roles
Concurrent session limits defined per role
Distributed rate limiter (Redis) for multi-instance deployments
Lockout periods applied after rapid violations
Retry-After headers returned in rate limit responses

Session execution limits:

Maximum step count enforced per session per role
Maximum total tool calls and per-tool-type limits enforced
Duplicate tool call detection active
Maximum session duration with hard wall-clock cutoff
State size limits prevent unbounded accumulation

Loop detection:

Exact state fingerprint loop detection active
Semantic loop detection for approximate repetition
Node oscillation detection for A↔B cycling
Rolling window covers sufficient recent history

Token budgets:

Per-session budgets defined for all roles
Per-user-daily budgets tracked across sessions
Tenant-level budgets protect against distributed abuse
Soft limits trigger warnings before hard limits terminate

Circuit breakers:

Cost thresholds configured for minute, hour, and day
Circuit breaker state in distributed store
Recovery window defined and tested
Trips trigger immediate alerts to operations team

Tool-level limits:

High-impact tools have per-minute and per-hour limits
Concurrent call limits prevent thundering herd
Error cooldowns with exponential backoff on failures

Abuse detection:

Cross-session failure rate monitoring active
Injection attempt concentration detection active
Cross-session high-risk tool abuse tracked
Tenant-level coordination detection active

This is Part 14 of an ongoing series on LangGraph agent security. Previous posts: Part 1: Introduction · Part 2: Architecture Primer · Part 3: Attack Surface Analysis · Part 4: Core Threat Categories · Part 5: Threat Modeling · Part 6: Input Validation · Part 7: Tool Security · Part 8: State and Memory Security · Part 9: Multi-Agent Trust Boundaries · Part 10: Output Guardrails · Part 11: Authentication and Authorization · Part 12: Observability and Monitoring · Part 13: Human-in-the-Loop.