HangFix: Automatically Fixing Software Hang Bugs for Production Cloud Systems
Background
Many production server systems (e.g., Cassandra, HBase, Hadoop) are migrated into cloud environments for lower upfront costs. However, when a production software bug is triggered in cloud environments, it is often difficult to diagnose and fix due to the lack of debugging information. Particularly, software hang bugs causing unresponsive or frozen systems instead of system crashing are extremely challenging to fix, which often cause prolonged service outages. For example, in 2015, Amazon DynamoDB experienced a five-hour service outage affecting many AWS customers including Netflix, Airbnb, and IMDb. The root cause of the service outage was a software hang bug where an improper error handling kept sending new requests to the overloaded metadata server, causing further cascading failures and retries. In 2017, British Airways experienced a serious service outage with a penalty of more than 100 million due to a software hang bug triggered by corrupted data during data center failover.
Unfortunately, software hang bugs are notoriously difficult to debug because they typically produce little diagnostic information. Recent studies have also shown that many hang bugs are caused by unexpected runtime data corruptions or inter-process communication failures, which makes those hang bugs particularly difficult to be caught during the testing phase. Although previous bug detection tools can detect those hang bugs, production service outage cannot be truly resolved until the hang bugs are correctly fixed. Otherwise, the service outage will happen again when the bug-triggering condition is met again in the production system.
Publications
Benchmark
- The following table shows 42 hang bugs fixed by HangFix. 14 of them are closed with manual patches.
- HangFix fixes 40 bugs, including both closed and open bugs.
Sponsors
- NSF CNS1513942 grant and NSF CNS1149445 grant