HangFix

HangFix: Automatically Fixing Software Hang Bugs for Production Cloud Systems

Background

Many production server systems (e.g., Cassandra, HBase, Hadoop) have been migrated to cloud environments for lower upfront costs. However, when a production software bug is triggered in a cloud environment, it is often difficult to diagnose and fix because debugging information is lacking. In particular, software hang bugs, which leave a system unresponsive or frozen rather than crashing it, are extremely challenging to fix and often cause prolonged service outages. For example, in 2015, Amazon DynamoDB experienced a five-hour service outage affecting many AWS customers including Netflix, Airbnb, and IMDb. The root cause of the outage was a software hang bug in which improper error handling kept sending new requests to the overloaded metadata server, causing further cascading failures and retries. In 2017, British Airways experienced a serious service outage with a penalty of more than 100 million due to a software hang bug triggered by corrupted data during data center failover.

Unfortunately, software hang bugs are notoriously difficult to debug because they typically produce little diagnostic information. Recent studies have also shown that many hang bugs are caused by unexpected runtime data corruption or inter-process communication failures, which makes them particularly difficult to catch during the testing phase. Although previous bug detection tools can detect those hang bugs, a production service outage cannot be truly resolved until the hang bug is correctly fixed. Otherwise, the outage will recur whenever the bug-triggering condition is met again in the production system.
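To make the corrupted-data case concrete, the following is a minimal sketch of how truncated input can turn an ordinary read loop into a hang, and how a progress check removes the hang. It is an illustration only, not HangFix's patching logic or code from any of the systems listed here; the function names `read_exact_buggy` and `read_exact_fixed` and the `max_iterations` cap are assumptions introduced so that the demonstration terminates.

```python
import io


def read_exact_buggy(stream, size, max_iterations=1000):
    """Read `size` bytes with a loop that makes no progress on truncated input.

    `max_iterations` is a hypothetical cap added only so this demo returns;
    without it the loop would spin forever once the stream is exhausted.
    """
    data = b""
    iterations = 0
    while len(data) < size:
        # At end of stream, read() returns b"", so len(data) never grows.
        data += stream.read(size - len(data))
        iterations += 1
        if iterations >= max_iterations:
            return None  # stand-in for "the process would hang here"
    return data


def read_exact_fixed(stream, size):
    """Same loop, but an empty read is treated as an error instead of retried."""
    data = b""
    while len(data) < size:
        chunk = stream.read(size - len(data))
        if not chunk:
            # No progress is possible: fail fast rather than loop forever.
            raise EOFError(f"expected {size} bytes, got {len(data)}")
        data += chunk
    return data


if __name__ == "__main__":
    truncated = b"abc"  # a record that should have been 8 bytes long
    print(read_exact_buggy(io.BytesIO(truncated), 8))
    try:
        read_exact_fixed(io.BytesIO(truncated), 8)
    except EOFError as err:
        print("detected:", err)
```

The buggy loop never raises an exception or logs anything, which is why a hang of this kind leaves so little diagnostic information behind.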

Publications

Jingzhu He, Ting Dai, Xiaohui Gu and Guoliang Jin,
"HangFix: Automatically Fixing Software Hang Bugs for Production Cloud Systems",
Proc. of ACM Symposium on Cloud Computing (SoCC), Virtual Event, October 2020.

Benchmark

The following table lists the 42 hang bugs in the benchmark. 14 of them are closed with manual patches.
HangFix fixes 40 of the 42 bugs, including both closed and open bugs.

#	Bug name	System version	Root cause pattern	Closed or open	Fixed
1	Cassandra-7330	v2.0.8	#1	closed	✔
2	Cassandra-9881	v2.0.8	#3	open	✔
3	Compress-87	v1.0	#1	closed	✔
4	Compress-451	v1.0	#2	closed	✔
5	Hadoop-8614	v0.23.0	#1	closed	✔
6	Hadoop-15088	v2.5.0	#1	open	✔
7	Hadoop-15415	v2.5.0	#2	open	✔
8	Hadoop-15417	v2.5.0	#2	open	✔
9	Hadoop-15424	v2.5.0	#1	open	✔
10	Hadoop-15425	v2.5.0	#1	open	✔
11	Hadoop-15429	v2.5.0	#2	open	✔
12	HDFS-4882	v0.23.0	#3	closed	✗
13	HDFS-5438	v0.23.0	#4	closed	✗
14	HDFS-10223	v2.7.0	#4	closed	✔
15	HDFS-13513	v2.5.0	#2	open	✔
16	HDFS-13514	v2.5.0	#2	open	✔
17	HDFS-14481	v2.5.0	#2	open	✔
18	HDFS-14501	v2.5.0	#2	open	✔
19	HDFS-14540	v0.23.0	#4	open	✔
20	Mapreduce-2185	v0.23.0	#3	closed	✔
21	Mapreduce-5066	v2.0.3	#4	open	✔
22	Mapreduce-6990	v0.23.0	#1	open	✔
23	Mapreduce-6991	v2.5.0	#3	open	✔
24	Mapreduce-7088	v2.5.0	#2	open	✔
25	Mapreduce-7089	v2.5.0	#2	open	✔
26	Yarn-163	v0.23.0	#1	open	✔
27	Yarn-1630	v2.2.0	#1	closed	✔
28	Yarn-2905	v2.5.0	#1	closed	✔
29	HBase-8389	v0.94.3	#1	closed	✔
30	Hive-5235	v1.0.0	#4	open	✔
31	Hive-13397	v1.0.0	#1	closed	✔
32	Hive-18142	v1.0.0	#1	open	✔
33	Hive-18216	v2.3.2	#3	open	✔
34	Hive-18217	v2.3.2	#3	open	✔
35	Hive-18219	v2.3.2	#1	open	✔
36	Hive-19391	v1.0.0	#4	open	✔
37	Hive-19392	v1.0.0	#2	open	✔
38	Hive-19395	v1.0.0	#2	open	✔
39	Hive-19406	v2.3.2	#4	open	✔
40	Kafka-6271	v0.10.0	#1	open	✔
41	Lucene-772	v2.1.0	#4	open	✔
42	Lucene-8294	v2.1.0	#2	closed	✔
