Fault Tolerance is a Requirement, Not a Feature

APIs interacts with dozens of systems in service-oriented architecture, which makes the API inherently more vulnerable to any system failures or latencies underneath it in the stack

Principles of Resiliency

Here are some of the key principles that informed our thinking as we set out to make the API more resilient.

A failure in a service dependency should not break the user experience for members
The API should automatically take corrective action when one of its service dependencies fails
The API should be able to show us what’s happening right now, in addition to what was happening 15–30 minutes ago, yesterday, last week, etc

Fallback Triggers

goes a little further than the basic CircuitBreaker pattern in that fallbacks can be triggered in a few ways:

A request to the remote service times out
The thread pool and bounded task queue used to interact with a service dependency are at 100% capacity
The client library used to interact with a service dependency throws an exception

These buckets of failures factor into a service’s overall error rate and when the error rate exceeds a defined threshold then we “trip” the circuit for that service and immediately serve fallbacks without even attempting to communicate with the remote service

Fallback Types

Each service that’s wrapped by a circuit breaker implements a fallback using one of the following three approaches:

Custom fallback — in some cases a service’s client library provides a fallback method we can invoke, or in other cases we can use locally available data on an API server (eg, a cookie or local JVM cache) to generate a fallback response
Fail Silent — in this case the fallback method simply returns a null value, which is useful if the data provided by the service being invoked is optional for the response that will be sent back to the requesting client
Fail Fast — used in cases where the data is required or there’s no good fallback and results in a client getting a 5xx response. This can negatively affect the device UX, which is not ideal, but it keeps API servers healthy and allows the system to recover quickly when the failing service becomes available again.

Ideally, all service dependencies would have custom fallbacks as they provide the best possible user experience (given the circumstances). Although that is our goal, it’s also very challenging to maintain complete fallback coverage for many service dependencies. So the fail silent and fail fast approaches are reasonable alternatives

Fault Tolerance Approaches

Implement a solution that uses a combination of fault tolerance approaches:

network timeouts and retries
separate threads on per-dependency thread pools
semaphores (via a tryAcquire, not a blocking call)
circuit breakers

Each of these approaches to fault-tolerance has pros and cons but when combined together provide a comprehensive protective barrier between user requests and underlying dependencies

Frameworks

Resilience Frameworks

／var／log marcus chiu

Explorer

Fault／Latency Tolerance & Resilience (Service-Dependency Layer)

Principles of Resiliency

Fallback Triggers

Fallback Types

Fault Tolerance Approaches

Frameworks

Resources

／var／logmarcus chiu

Explorer

Fault／Latency Tolerance & Resilience (Service-Dependency Layer)

Principles of Resiliency

Fallback Triggers

Fallback Types

Fault Tolerance Approaches

Frameworks

Resources

／var／log marcus chiu