Wednesday, July 10 • 12:20pm - 12:40pm
IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services


We address the problem of “fail-slow” faults, in which a hardware or software component still functions (it does not fail-stop) but delivers much lower performance than expected. To address this, we built IASO, a peer-based, non-intrusive fail-slow detection framework that has been deployed for more than 1.5 years across 39,000 nodes at our customer sites and has helped our customers reduce major outages caused by fail-slow incidents. IASO works primarily from timeout signals, which add negligible monitoring overhead, and converts them into a stable and accurate fail-slow metric. IASO can quickly and accurately isolate a slow node within minutes. Within a 7-month period, IASO caught 232 fail-slow incidents in our large field deployment. In this paper, we also assemble a large dataset of 232 fail-slow incidents along with our analysis. We found that the annual fail-slow failure rate in our field is 1.02%.
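
The figures in the abstract can be cross-checked with a simple annualization. The abstract does not say how the 1.02% rate was derived, so the sketch below is only an assumption: it treats all 39,000 nodes as the observed population for the full 7 months and scales the 232 incidents linearly to 12 months. The function name `annualized_failure_rate` is hypothetical, and the paper may use a different denominator or method.

```python
def annualized_failure_rate(incidents: int, months_observed: float, nodes: int) -> float:
    """Scale an incident count to a 12-month period and divide by the node count.

    Assumption (not stated in the abstract): every node was observed for the
    whole window and incidents are spread evenly over time.
    """
    incidents_per_year = incidents * 12 / months_observed
    return incidents_per_year / nodes


# Figures taken from the abstract: 232 incidents, 7 months, 39,000 nodes.
afr = annualized_failure_rate(incidents=232, months_observed=7, nodes=39_000)
print(f"{afr:.2%}")  # prints "1.02%"
```

Under these assumptions the result matches the 1.02% quoted in the abstract, which suggests the three figures are mutually consistent.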


Biswaranjan Panda

Nutanix Inc.

Huan Ke

University of Chicago

Karan Gupta

Nutanix Inc.

Vinayak Khot

Nutanix Inc.

Haryadi S. Gunawi

University of Chicago

Wednesday July 10, 2019 12:20pm - 12:40pm PDT
USENIX ATC Track I: Grand Ballroom I–VI