HangFix: Automatically Fixing Software Hang Bugs for Production Cloud Systems


Background

Many production server systems (e.g., Cassandra, HBase, Hadoop) have been migrated to cloud environments to lower upfront costs. However, when a software bug is triggered in a production cloud environment, it is often difficult to diagnose and fix because little debugging information is available. Software hang bugs, which leave a system unresponsive or frozen rather than crashing it, are particularly challenging to fix and often cause prolonged service outages. For example, in 2015, Amazon DynamoDB experienced a five-hour service outage affecting many AWS customers, including Netflix, Airbnb, and IMDb. The root cause was a software hang bug in which improper error handling kept sending new requests to an overloaded metadata server, causing further cascading failures and retries. In 2017, British Airways experienced a serious service outage, with a penalty of more than 100 million, due to a software hang bug triggered by corrupted data during a data center failover.

Unfortunately, software hang bugs are notoriously difficult to debug because they typically produce little diagnostic information. Recent studies have also shown that many hang bugs are caused by unexpected runtime data corruption or inter-process communication failures, which makes them particularly difficult to catch during testing. Although existing bug detection tools can detect such hang bugs, a production service outage is not truly resolved until the underlying bug is correctly fixed. Otherwise, the outage will recur whenever the bug-triggering condition is met again in the production system.
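As a hypothetical illustration of how corrupted data can turn into a hang, consider a loop that skips a fixed number of bytes in a stream. The sketch below is not taken from any of the systems or bugs listed on this page. The class name `SkipFullyDemo` and its methods are made up for this example, and the guarded version is only one possible way to break the loop. It relies on the documented behavior of `java.io.InputStream.skip`, which may skip fewer bytes than requested, possibly zero.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical example; not code from any benchmark system.
final class SkipFullyDemo {

    // Hang-prone version: if the stream is shorter than `len` (for example,
    // because the file was truncated or corrupted), skip() keeps returning 0,
    // `remaining` never shrinks, and the loop never exits.
    static void skipFullyHangProne(InputStream in, long len) throws IOException {
        long remaining = len;
        while (remaining > 0) {
            remaining -= in.skip(remaining);
        }
    }

    // Guarded version: when skip() makes no progress, probe with read() to
    // tell "end of stream" apart from "no bytes skipped this time".
    static void skipFully(InputStream in, long len) throws IOException {
        long remaining = len;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else if (in.read() == -1) {
                throw new EOFException(
                        "stream ended with " + remaining + " bytes left to skip");
            } else {
                remaining--; // read() consumed exactly one byte
            }
        }
    }

    // Small helper for trying the guarded version on in-memory data:
    // returns true if `data` ends before `len` bytes could be skipped.
    static boolean endsEarly(byte[] data, long len) {
        try {
            skipFully(new ByteArrayInputStream(data), len);
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a 4-byte input and a request to skip 8 bytes, `skipFullyHangProne` would spin forever, whereas `skipFully` turns the same condition into an `EOFException` that the caller can handle. The silent spin is the kind of symptom described above: no crash and no error message, only an unresponsive thread.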

Publications


Benchmark

  #    Bug name         System version   Root cause pattern   Closed or open   Fixed
  1    Cassandra-7330   v2.0.8           #1                   closed
  2    Cassandra-9881   v2.0.8           #3                   open
  3    Compress-87      v1.0             #1                   closed
  4    Compress-451     v1.0             #2                   closed
  5    Hadoop-8614      v0.23.0          #1                   closed
  6    Hadoop-15088     v2.5.0           #1                   open
  7    Hadoop-15415     v2.5.0           #2                   open
  8    Hadoop-15417     v2.5.0           #2                   open
  9    Hadoop-15424     v2.5.0           #1                   open
  10   Hadoop-15425     v2.5.0           #1                   open
  11   Hadoop-15429     v2.5.0           #2                   open
  12   HDFS-4882        v0.23.0          #3                   closed
  13   HDFS-5438        v0.23.0          #4                   closed
  14   HDFS-10223       v2.7.0           #4                   closed
  15   HDFS-13513       v2.5.0           #2                   open
  16   HDFS-13514       v2.5.0           #2                   open
  17   HDFS-14481       v2.5.0           #2                   open
  18   HDFS-14501       v2.5.0           #2                   open
  19   HDFS-14540       v0.23.0          #4                   open
  20   Mapreduce-2185   v0.23.0          #3                   closed
  21   Mapreduce-5066   v2.0.3           #4                   open
  22   Mapreduce-6990   v0.23.0          #1                   open
  23   Mapreduce-6991   v2.5.0           #3                   open
  24   Mapreduce-7088   v2.5.0           #2                   open
  25   Mapreduce-7089   v2.5.0           #2                   open
  26   Yarn-163         v0.23.0          #1                   open
  27   Yarn-1630        v2.2.0           #1                   closed
  28   Yarn-2905        v2.5.0           #1                   closed
  29   HBase-8389       v0.94.3          #1                   closed
  30   Hive-5235        v1.0.0           #4                   open
  31   Hive-13397       v1.0.0           #1                   closed
  32   Hive-18142       v1.0.0           #1                   open
  33   Hive-18216       v2.3.2           #3                   open
  34   Hive-18217       v2.3.2           #3                   open
  35   Hive-18219       v2.3.2           #1                   open
  36   Hive-19391       v1.0.0           #4                   open
  37   Hive-19392       v1.0.0           #2                   open
  38   Hive-19395       v1.0.0           #2                   open
  39   Hive-19406       v2.3.2           #4                   open
  40   Kafka-6271       v0.10.0          #1                   open
  41   Lucene-772       v2.1.0           #4                   open
  42   Lucene-8294      v2.1.0           #2                   closed


Sponsors