Service-Level Objective-Aware Load-Adaptive Timeout: Balancing Failure Rate and Latency in Microservices Communication

Open Access

Author(s) - Hiroki Hanada, Kiesuke Ishibashi

Publication year - 2025

Publication title - IEEE Access

Language(s) - English

Resource type - Magazines

SCImago Journal Rank - 0.587

H-Index - 127

eISSN - 2169-3536

DOI - 10.1109/access.2025.3596118

Subject(s) - aerospace; bioengineering; communication, networking and broadcast technologies; components, circuits, devices and systems; computing and processing; engineered materials, dielectrics and plasmas; engineering profession; fields, waves and electromagnetics; general topics for engineers; geoscience; nuclear engineering; photonics and electrooptics; power, energy and industry applications; robotics and control systems; signal processing and analysis; transportation

Microservices architectures enable scalable and modular application design but introduce reliability challenges due to their distributed nature. Timeout configurations are critical for maintaining system reliability, as they directly impact latency and failure rate Service-Level Objective (SLO) compliance. However, current timeout settings are often based on best practices rather than systematic optimization, as determining the optimal timeout is challenging. The difficulty arises from the need to balance SLO constraints while adapting to dynamically changing load and system capacity, making static configurations inherently suboptimal. To address this, this study proposes a load-adaptive timeout mechanism that dynamically adjusts timeout values to optimize reliability across different load conditions. Under normal load, the method minimizes latency while maintaining failure rate SLO compliance. Under overload, where meeting both objectives becomes infeasible, it prioritizes failure rate reduction while ensuring latency SLO compliance. By allocating the initial portion of the timeout duration to transmission delay during downstream overload and failure, the method naturally exhibits load shedding and circuit-breaking behavior, preventing the bottleneck service from being overwhelmed. The proposed method was implemented as an open-source Go library and evaluated using the Online Boutique benchmark under various load conditions. Results show that it reduces average and tail latencies by 40% and 55%, respectively, under normal load and short-lived overload. Under prolonged overload, it minimizes failure rates, reducing deviations from the failure rate SLO by 18%. These findings demonstrate the effectiveness of adaptive timeout control in maintaining microservices reliability while dynamically responding to changing system conditions.
