In distributed systems, failures are inevitable. What matters is ensuring that the entire system does not go down when a failure occurs. In this part, we cover failure response patterns using Resilience4j.
Topics covered in Part 4:

- Preventing failure propagation with Circuit Breaker
- Protecting the server from request floods with Rate Limiter
- Isolating resources with Bulkhead
- Recovering from transient failures with Retry
Microservice environment:

```
[Client] ──▶ [API Gateway] ──▶ [OrderService]
                                     │
                ┌────────────────────┼────────────────────┐
                ▼                    ▼                    ▼
        [PaymentService]    [InventoryService]     [EmailService]
                │                    │                    │
                ▼                    ▼                    ▼
         [External PG]             [DB]             [SMTP Server]
```

→ What happens if any one of these slows down or dies?
Failures are guaranteed to happen:
| Failure Type | Example | Frequency |
|---|---|---|
| Network latency | Timeouts, packet loss | Very common |
| Service down | OOM, deployment failure | Common |
| Dependency failure | DB connection pool exhaustion, external API down | Common |
| Resource exhaustion | CPU 100%, disk full | Occasional |
1.2 Cascading Failure
1. EmailService becomes slow (5s response time)
2. OrderService waits when calling EmailService

```
┌─────────────────────────────────────────┐
│ OrderService Thread Pool (20 threads)   │
│                                         │
│ [Wait] [Wait] [Wait] [Wait] [Wait]      │
│ [Wait] [Wait] [Wait] [Wait] [Wait]      │
│ [Wait] [Wait] [Wait] [Wait] [Wait]      │
│ [Wait] [Wait] [Wait] [Wait] [Wait]      │
│                                         │
│ → All threads waiting for EmailService  │
└─────────────────────────────────────────┘
```

3. Cannot process new order requests → OrderService goes down too
4. Other services depending on OrderService are also affected

→ A single slow service brings down the entire system
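The thread-pool exhaustion above can be reproduced in miniature. The following is a hedged sketch in plain Kotlin (no Resilience4j): a 2-thread pool with no queue stands in for OrderService's 20 threads, and a latch stands in for the slow EmailService call that never returns.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.RejectedExecutionException
import java.util.concurrent.SynchronousQueue
import java.util.concurrent.ThreadPoolExecutor
import java.util.concurrent.TimeUnit

fun main() {
    // Tiny pool standing in for OrderService: 2 threads, no task queue,
    // so exhaustion becomes visible immediately.
    val pool = ThreadPoolExecutor(2, 2, 0L, TimeUnit.SECONDS, SynchronousQueue())
    val release = CountDownLatch(1)

    // Two "calls to EmailService" that never return: both threads now wait.
    repeat(2) { pool.submit { release.await() } }

    // A new order request arrives; no thread is free to take it.
    try {
        pool.submit { println("processing order") }
        println("accepted")
    } catch (e: RejectedExecutionException) {
        println("rejected: pool exhausted")
    }

    release.countDown()   // "EmailService" finally responds
    pool.shutdown()
}
```

In a real service the caller would block or time out rather than be rejected, which is exactly why every thread ends up stuck waiting.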
1.3 Goals of Resilience Patterns
| Goal | Description |
|---|---|
| Fault isolation | A failure in one service does not propagate to others |
| Fast failure | A quick error is better than a slow response |
| Graceful degradation | Core functionality keeps working even if some features are unavailable |
| Automatic recovery | Calls resume automatically when the failed service recovers |
2. Circuit Breaker Pattern
2.1 Named After an Electrical Circuit Breaker
A real electrical circuit breaker:

    Overcurrent detected → Breaker trips → Prevents fire

A software circuit breaker:

    Failure detected → Calls blocked → System protected
2.2 Three States
```
           Failure rate < threshold
         ┌───────────────────┐
         │                   │
         ▼                   │
    ┌─────────┐              │
    │ CLOSED  │──────────────┘
    │(Normal) │
    └────┬────┘
         │ Failure rate >= threshold
         ▼
    ┌─────────┐
    │  OPEN   │ ← All requests fail immediately
    │(Blocked)│
    └────┬────┘
         │ Wait duration elapsed
         ▼
    ┌─────────┐
    │HALF-OPEN│ ← Only some requests allowed
    │ (Test)  │
    └────┬────┘
         │
 ┌───────┴───────┐
 │               │
High success  Continued
  rate         failures
 │               │
 ▼               ▼
CLOSED          OPEN
```
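The three-state transition above can be sketched as a toy state machine. This is illustration only, not Resilience4j's implementation: `ToyCircuitBreaker` and its method names are made up, the sliding window is simplified to the last N results, and the wait duration is triggered manually instead of by a clock.

```kotlin
enum class State { CLOSED, OPEN, HALF_OPEN }

class ToyCircuitBreaker(
    private val failureThreshold: Int = 5,  // like failure-rate-threshold: 50 over a window of 10
    private val windowSize: Int = 10,       // like sliding-window-size: 10
    private val halfOpenTrials: Int = 3     // like permitted-number-of-calls-in-half-open-state: 3
) {
    var state = State.CLOSED
        private set
    private val window = ArrayDeque<Boolean>()
    private var trials = 0

    // OPEN rejects everything; CLOSED and HALF-OPEN let calls through
    fun allowRequest(): Boolean = state != State.OPEN

    // Stands in for wait-duration-in-open-state elapsing
    fun onWaitDurationElapsed() {
        if (state == State.OPEN) {
            state = State.HALF_OPEN
            trials = 0
        }
    }

    fun record(success: Boolean) {
        when (state) {
            State.CLOSED -> {
                window.addLast(success)
                if (window.size > windowSize) window.removeFirst()
                if (window.count { !it } >= failureThreshold) {
                    state = State.OPEN
                    window.clear()
                }
            }
            State.HALF_OPEN -> {
                trials++
                when {
                    !success -> state = State.OPEN            // any trial failure reopens
                    trials >= halfOpenTrials -> state = State.CLOSED
                }
            }
            State.OPEN -> Unit  // rejected calls record nothing
        }
    }
}

fun main() {
    val cb = ToyCircuitBreaker()
    repeat(5) { cb.record(false) }   // 5 of the last 10 fail → trip
    println(cb.state)                // OPEN
    println(cb.allowRequest())       // false: fail fast
    cb.onWaitDurationElapsed()
    println(cb.state)                // HALF_OPEN
    repeat(3) { cb.record(true) }    // 3 successful trial calls
    println(cb.state)                // CLOSED
}
```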
2.3 Project Configuration
```yaml
# application.yml
resilience4j:
  circuitbreaker:
    instances:
      orderService:
        sliding-window-size: 10                          # Based on the last 10 requests
        failure-rate-threshold: 50                       # OPEN when 50% or more fail
        wait-duration-in-open-state: 10s                 # HALF-OPEN after 10 seconds
        permitted-number-of-calls-in-half-open-state: 3  # 3 test requests
        slow-call-duration-threshold: 2s                 # Calls over 2s are considered slow
        slow-call-rate-threshold: 50                     # OPEN when 50% or more are slow
        ignore-exceptions:
          - com.example.marketplace.common.BusinessException  # Ignore business exceptions
```
Configuration explained:
| Setting | Meaning |
|---|---|
| sliding-window-size: 10 | Tracks success/failure of the last 10 requests |
| failure-rate-threshold: 50 | Trips when 5 or more out of 10 fail |
| wait-duration-in-open-state: 10s | Tests recovery after 10 seconds of being tripped |
| slow-call-duration-threshold: 2s | Calls exceeding 2s are considered "slow" |
| ignore-exceptions | BusinessException is not counted as a failure |
2.4 Code Implementation
```kotlin
// OrderService.kt
@CircuitBreaker(name = "orderService", fallbackMethod = "createOrderFallback")
fun createOrder(buyerId: Long, req: CreateOrderRequest): OrderResponse {
    // Normal logic
    return OrderResponse.from(savedOrder)
}

// Fallback called when the circuit is open; must match the original
// signature plus a trailing Throwable parameter
private fun createOrderFallback(
    buyerId: Long,
    req: CreateOrderRequest,
    ex: Throwable
): OrderResponse {
    log.error("Circuit breaker fallback: ${ex.message}")
    throw BusinessException(ErrorCode.SERVICE_UNAVAILABLE)
}
```
3. Rate Limiter Pattern

Problem scenario:

```
┌─────────────────────────────────────────┐
│ Malicious user or buggy client          │
│                                         │
│ Generating 10,000 requests per second   │
│            │                            │
│            ▼                            │
│   ┌─────────────────┐                   │
│   │ Server overload │ → Normal users    │
│   │ Response delay  │   affected too    │
│   │ Out of memory   │                   │
│   └─────────────────┘                   │
└─────────────────────────────────────────┘
```
3.2 Resilience4j RateLimiter Options
Resilience4j provides a RateLimiter based on the token bucket algorithm.
Core Configuration Options
| Option | Description | Default |
|---|---|---|
| limitForPeriod | Number of requests allowed per refresh period | 50 |
| limitRefreshPeriod | Period at which permissions (tokens) are refreshed | 500ns |
| timeoutDuration | How long a call waits to acquire a permission (0 means immediate rejection) | 5s |
Detailed Configuration
```yaml
resilience4j:
  ratelimiter:
    instances:
      orderCreation:
        limit-for-period: 10      # Allow 10 requests per period
        limit-refresh-period: 1s  # Refill tokens every 1 second
        timeout-duration: 0s      # Reject immediately without waiting
```
```
timeout-duration: 0s (immediate rejection)
─────────────────────────────────
Request 11 arrives → No token → Immediate RequestNotPermitted exception

timeout-duration: 5s (wait up to 5 seconds)
─────────────────────────────────
Request 11 arrives → No token → Wait up to 5 seconds
  ├── If a token is refilled within 5s → Request processed
  └── If still no token after 5s → RequestNotPermitted exception
```
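The refill behavior can be sketched as a toy bucket with time injected for determinism. `ToyTokenBucket` is an illustrative name, not a Resilience4j class, and it models only the `timeout-duration: 0s` (fail-fast) case.

```kotlin
// Toy bucket matching limit-for-period: 10 / limit-refresh-period: 1s
class ToyTokenBucket(private val limitForPeriod: Int, private val refreshPeriodMs: Long) {
    private var tokens = limitForPeriod
    private var lastRefill = 0L

    // timeout-duration: 0s behavior: no token → reject immediately
    fun tryAcquire(nowMs: Long): Boolean {
        if (nowMs - lastRefill >= refreshPeriodMs) {
            tokens = limitForPeriod   // refill a full period's worth of tokens
            lastRefill = nowMs
        }
        return if (tokens > 0) { tokens--; true } else false
    }
}

fun main() {
    val bucket = ToyTokenBucket(limitForPeriod = 10, refreshPeriodMs = 1000)
    val firstWindow = (1..11).map { bucket.tryAcquire(nowMs = 100) }
    println(firstWindow.count { it })         // 10: requests 1~10 allowed
    println(firstWindow.last())               // false: request 11 rejected
    println(bucket.tryAcquire(nowMs = 1200))  // true: tokens refilled after 1s
}
```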
Per-User Rate Limiting (Advanced)
```kotlin
// Apply a different RateLimiter per IP or user ID
fun getRateLimiterForUser(userId: String): RateLimiter {
    return rateLimiterRegistry.rateLimiter(
        "user-$userId",
        RateLimiterConfig.custom()
            .limitForPeriod(10)
            .limitRefreshPeriod(Duration.ofSeconds(1))
            .timeoutDuration(Duration.ZERO)
            .build()
    )
}
```
```yaml
# application.yml
resilience4j:
  ratelimiter:
    instances:
      default:
        limit-for-period: 100     # Allow 100 per second
        limit-refresh-period: 1s  # Reset every second
        timeout-duration: 0s      # Reject immediately without waiting
      orderCreation:
        limit-for-period: 10      # Only 10 order creations per second
        limit-refresh-period: 1s
        timeout-duration: 0s
```
3.4 Using Resilience4j in a Spring Filter
```kotlin
// RateLimitingFilter.kt
@Component
class RateLimitingFilter(
    private val rateLimiterRegistry: RateLimiterRegistry
) : OncePerRequestFilter() {

    override fun doFilterInternal(
        request: HttpServletRequest,
        response: HttpServletResponse,
        filterChain: FilterChain
    ) {
        // Apply a different rate limiter based on the request path
        val rateLimiterName = determineRateLimiter(request)
        val rateLimiter = rateLimiterRegistry.rateLimiter(rateLimiterName)

        if (rateLimiter.acquirePermission()) {
            filterChain.doFilter(request, response)  // Allowed
        } else {
            handleRateLimitExceeded(response)        // 429 response
        }
    }

    private fun determineRateLimiter(request: HttpServletRequest): String {
        return when {
            // Stricter limit for the order creation API
            request.requestURI.startsWith("/api/v1/orders") &&
                request.method == "POST" -> "orderCreation"
            else -> "default"
        }
    }
}
```
3.5 Response Example
```
HTTP/1.1 429 Too Many Requests
Content-Type: application/json

{
  "success": false,
  "code": "RATE_LIMITED",
  "message": "Too many requests. Please try again later."
}
```
3.6 Rate Limiting Algorithm Details
1) Fixed Window
Counting by dividing time into fixed intervals (limit: 10 per second)

```
00:00:00 ~ 00:00:01 (Window 1)
├── Requests 1~10: ✅ Allowed
└── Request 11:    ❌ Rejected

00:00:01 ~ 00:00:02 (Window 2)
├── Counter reset
└── Requests 1~10: ✅ Allowed

Problem: burst at the window boundary
────────────────────────────────────
10 requests at 00:00:00.9 ✅
10 requests at 00:00:01.1 ✅
→ 20 requests pass in 0.2 seconds (2x the intended limit!)
```
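The boundary burst can be demonstrated with a toy counter. `FixedWindowCounter` is an illustrative name, and time is injected as a parameter so the result is deterministic.

```kotlin
class FixedWindowCounter(private val limit: Int, private val windowMs: Long) {
    private var currentWindow = -1L
    private var count = 0

    fun tryAcquire(nowMs: Long): Boolean {
        val window = nowMs / windowMs   // which fixed window this instant falls in
        if (window != currentWindow) {  // boundary crossed: counter resets
            currentWindow = window
            count = 0
        }
        return if (count < limit) { count++; true } else false
    }
}

fun main() {
    val limiter = FixedWindowCounter(limit = 10, windowMs = 1000)
    val late = (1..10).count { limiter.tryAcquire(nowMs = 900) }    // end of window 0
    val early = (1..10).count { limiter.tryAcquire(nowMs = 1100) }  // start of window 1
    println(late + early)  // 20: double the intended 10/s passes within 0.2s
}
```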
2) Token Bucket

Why the token bucket approach used by Resilience4j works well:

1. Burst allowance
   - Real traffic is uneven
   - Handles momentary request spikes naturally
2. Implementation efficiency
   - Only a token count needs to be managed (e.g. with an AtomicInteger)
   - No need to store request history
3. Intuitive configuration
   - "10 per second" = limit-for-period: 10, limit-refresh-period: 1s
   - Easy to understand
4. Bulkhead Pattern
4.1 Named After Ship Bulkheads
Ship structure:

```
┌─────┬─────┬─────┬─────┐
│     │     │     │     │
│Comp1│Comp2│Comp3│Comp4│
│     │     │     │     │
└─────┴─────┴─────┴─────┘
   │
   └── Even if one compartment floods, the others remain safe
```

Software bulkhead:

```
┌──────────────────────────────────────────┐
│ Thread Pool Isolation                    │
│                                          │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐   │
│ │  Order   │ │ Product  │ │ Payment  │   │
│ │Processing│ │  Query   │ │Processing│   │
│ │20 threads│ │30 threads│ │10 threads│   │
│ └──────────┘ └──────────┘ └──────────┘   │
│      │                                   │
│      └── Even if order processing slows  │
│          down, product queries are       │
│          unaffected                      │
└──────────────────────────────────────────┘
```
4.2 Project Configuration
```yaml
# application.yml
resilience4j:
  bulkhead:
    instances:
      orderService:
        max-concurrent-calls: 20  # Max 20 concurrent calls
        max-wait-duration: 0s     # Reject immediately without waiting
```
Config: max-concurrent-calls = 20

```
Current state:
┌─────────────────────────────────────────────┐
│ OrderService Bulkhead                       │
│                                             │
│ Processing: [1] [2] [3] ... [18] [19] [20]  │
│                                             │
│ Slots: 20/20 in use                         │
└─────────────────────────────────────────────┘

New request #21 arrives:
→ max-wait-duration: 0s, so it is rejected immediately
→ BulkheadFullException thrown
→ Fallback invoked, or 503 Service Unavailable returned
```
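A semaphore-based bulkhead (Resilience4j's default type) amounts to a non-blocking `tryAcquire` when `max-wait-duration` is 0s. The sketch below is illustrative: `ToyBulkhead` is a made-up name, and a plain IllegalStateException stands in for BulkheadFullException.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Semaphore

class ToyBulkhead(maxConcurrentCalls: Int) {
    private val permits = Semaphore(maxConcurrentCalls)

    fun <T> execute(call: () -> T): T {
        if (!permits.tryAcquire()) {                      // 0s wait: fail fast
            throw IllegalStateException("Bulkhead full")  // stands in for BulkheadFullException
        }
        try { return call() } finally { permits.release() }
    }
}

fun main() {
    val bulkhead = ToyBulkhead(maxConcurrentCalls = 2)  // tiny stand-in for 20
    val hold = CountDownLatch(1)
    val started = CountDownLatch(2)

    // Two in-flight calls occupy both slots
    repeat(2) {
        Thread { bulkhead.execute { started.countDown(); hold.await() } }.start()
    }
    started.await()  // wait until both slots are definitely taken

    // The "21st" request arrives while all slots are in use
    try {
        bulkhead.execute { "order" }
        println("accepted")
    } catch (e: IllegalStateException) {
        println("rejected: ${e.message}")
    }
    hold.countDown()  // let the in-flight calls finish
}
```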
5. Retry Pattern
5.1 Handling Transient Failures
Temporary network disconnection:

```
Request 1: ❌ Failure (momentary network outage)
Request 2: ✅ Success (recovered after 0.5s)
```

→ Retrying can lead to success
5.2 Project Configuration
```yaml
# application.yml
resilience4j:
  retry:
    instances:
      orderService:
        max-attempts: 3       # Max 3 attempts (including the first call)
        wait-duration: 500ms  # 500ms between retries
        retry-exceptions:     # Only retry on network errors
          - java.io.IOException
          - java.util.concurrent.TimeoutException
```
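In effect, the configuration above behaves like the plain loop sketched below: up to max-attempts tries, wait-duration between them, and only the listed exception types retried. This is a simplification; real Resilience4j also supports backoff multipliers, retry predicates, and events.

```kotlin
import java.io.IOException

fun <T> retry(maxAttempts: Int = 3, waitDurationMs: Long = 500, block: () -> T): T {
    var lastError: Exception? = null
    repeat(maxAttempts) { attempt ->
        try {
            return block()
        } catch (e: IOException) {  // retryable (the config also lists TimeoutException)
            lastError = e
            if (attempt < maxAttempts - 1) Thread.sleep(waitDurationMs)
        }
        // Any other exception type propagates immediately: no retry
    }
    throw lastError!!
}

fun main() {
    var calls = 0
    // Fails once with a transient IOException, then succeeds.
    val result = retry(waitDurationMs = 10) {
        calls++
        if (calls == 1) throw IOException("connection reset") else "order created"
    }
    println("$result after $calls attempts")  // order created after 2 attempts
}
```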
Problem scenario:

```
┌─────────────────────────────────────────┐
│ 1st attempt: Create order request       │
│              Saved to DB                │
│              Network drops during       │
│              response return            │
│                                         │
│ 2nd attempt: Same request retried       │
│              Saved to DB again →        │
│              Duplicate order!!          │
└─────────────────────────────────────────┘
```

Solution: use an Idempotency Key

```
POST /api/v1/orders
Idempotency-Key: abc-123-def
```

→ A request with the same key returns the previous result (no new creation)
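Server-side idempotency can be sketched as follows: the first request with a given key executes and caches its result, and a retry with the same key returns the cached result instead of creating a second order. `IdempotencyStore` is an illustrative name; in production the key-to-result mapping would live in a database or Redis with a TTL, not an in-memory map.

```kotlin
import java.util.concurrent.ConcurrentHashMap

class IdempotencyStore {
    private val results = ConcurrentHashMap<String, String>()

    // computeIfAbsent runs createOrder at most once per key
    fun execute(idempotencyKey: String, createOrder: () -> String): String =
        results.computeIfAbsent(idempotencyKey) { createOrder() }
}

fun main() {
    val store = IdempotencyStore()
    var ordersCreated = 0
    val create = { ordersCreated++; "order-$ordersCreated" }

    val first = store.execute("abc-123-def", create)
    val retried = store.execute("abc-123-def", create)  // network retry, same key
    println(first == retried)  // true: the same order is returned
    println(ordersCreated)     // 1: no duplicate order was created
}
```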
5.5 Retry vs Circuit Breaker
| Situation | Retry | Circuit Breaker |
|---|---|---|
| Transient failure | Can succeed with a retry | - |
| Persistent failure | Keeps failing, wastes resources | Fails fast, protects the system |
| Combined | Retry runs first → failures accumulate → the Circuit Breaker trips | |