Aperture (Reliability Management System)
  • is an open source reliability management system that can add these capabilities
  • it offers a centralized load management system that collects reliability-related metrics from different systems and uses it to generate a global view

Aperture - 3 components

  • Observe - Aperture collects reliability-related metrics (latency, resource utilization, etc.) from each node using a sidecar and aggregates them in Prometheus. You can also feed in metrics from other sources like InfluxDB, Docker Stats, Kafka, etc.
  • Analyze - A controller will monitor the metrics in Prometheus and track any deviations from the service-level objectives you set. You set these in a YAML file and Aperture stores them in etcd, a popular distributed key-value store.
  • Actuate - If any of the policies are triggered, then Aperture will activate configured actions like load shedding or distributed rate limiting across the system.