Real-World Timeout Bug Identification and Fixing in Production Server Systems


Background

Cloud server systems (e.g., Hadoop, Cassandra, HDFS, Spark) have become increasingly complex and often consist of many interdependent components. Achieving reliability in cloud server applications is challenging because 1) different components need to communicate frequently with each other over unreliable networks and 2) an individual component may fail at any time. Timeouts are one of the most commonly used mechanisms for handling unexpected failures in distributed computing environments, and they can be used for failover in both intra-node and inter-node communication. For example, when a component C1 sends a request to another component C2, C1 sets a timer with a timeout value and waits for the response from C2 until the timer expires. If C2 fails or a message is lost, the timeout event lets C1 break out of the waiting state and take proper action (e.g., retrying or skipping).
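The C1/C2 exchange described above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from any of the systems named here: the functions c2_handle and c1_request, the per-attempt delay list, and the use of threads and an in-process queue in place of a real network are all assumptions made for the example.

```python
import queue
import threading
import time


def c2_handle(request, reply_box, delay_s):
    """Hypothetical C2: replies after delay_s seconds.

    A delay longer than C1's timeout stands in for a failed C2 or a lost message.
    """
    time.sleep(delay_s)
    reply_box.put(f"ack:{request}")


def c1_request(request, timeout_s, max_retries, c2_delays):
    """Hypothetical C1: send a request and wait at most timeout_s per attempt.

    c2_delays[i] is the simulated response delay of C2 on attempt i.
    Returns the reply, or None if every attempt timed out (the "skip" action).
    """
    for attempt in range(max_retries + 1):
        # A fresh reply box per attempt, so a late reply to an earlier
        # attempt cannot be mistaken for the answer to this one.
        reply_box = queue.Queue(maxsize=1)
        threading.Thread(
            target=c2_handle,
            args=(request, reply_box, c2_delays[attempt]),
            daemon=True,
        ).start()
        try:
            # Block until C2 answers or the timer expires.
            return reply_box.get(timeout=timeout_s)
        except queue.Empty:
            # Timeout event: leave the waiting state and retry.
            continue
    return None


# First attempt times out, the retry succeeds.
recovered = c1_request("ping", timeout_s=0.2, max_retries=1, c2_delays=[1.0, 0.0])
# Both attempts time out, so C1 gives up and skips.
skipped = c1_request("ping", timeout_s=0.2, max_retries=1, c2_delays=[1.0, 1.0])
```

Without the timeout argument on the blocking wait, C1 would stay in the waiting state for as long as C2 is unresponsive, which is the situation the timeout mechanism is meant to prevent.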

However, many real-world cloud server applications lack proper configuration and handling of these timeout events. As the scale of server applications grows, the likelihood of timeout bugs also increases. In 2015, the Amazon DynamoDB service was down for five hours. The outage was caused by a timeout bug in the metadata server: when the metadata server was already overloaded, new requests from storage servers to the metadata server failed due to timeouts. The storage servers kept retrying, which caused further failures and retries and resulted in a cascading failure.
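The feedback loop between timeouts and retries can be illustrated with a toy model. The sketch below is not a model of DynamoDB; the function simulate_retry_storm, the fixed per-tick capacity, and the rule that every timed-out request is retried on the next tick are simplifying assumptions chosen only to show how the offered load keeps growing once it exceeds capacity.

```python
def simulate_retry_storm(new_per_tick, capacity, ticks):
    """Return the offered load (new plus retried requests) at each tick.

    Assumptions of this toy model: the server answers at most `capacity`
    requests per tick, every other request times out, and each timed-out
    request is retried exactly once on the next tick.
    """
    pending_retries = 0
    offered = []
    for _ in range(ticks):
        load = new_per_tick + pending_retries
        offered.append(load)
        served = min(load, capacity)
        # Requests that were not served time out and come back as retries.
        pending_retries = load - served
    return offered


overloaded = simulate_retry_storm(new_per_tick=12, capacity=10, ticks=5)
healthy = simulate_retry_storm(new_per_tick=8, capacity=10, ticks=5)
print(overloaded)  # [12, 14, 16, 18, 20]
print(healthy)     # [8, 8, 8, 8, 8]
```

In the overloaded run the backlog grows by the same amount every tick even though the rate of new requests never changes, whereas the run below capacity stays flat.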

Publications


#   Bug name        System version  Missing or misused timeout  Timeout-affected function               Misused timeout variable                    Fixed
1   Hadoop-9106     v2.0.3-alpha    Misused                     Client.setupConnection()                ipc.client.connect.timeout
2   Hadoop-11252    v2.6.4          Misused                     RPC.getProtocolProxy()                  ipc.client.rpc-timeout.ms
3   HDFS-4301       v2.0.3-alpha    Misused                     TransferImage.doGetUrl()                dfs.image.transfer.timeout
4   HDFS-10223      v2.8.0          Misused                     DFSUtilClient.peerFromSocketAndKey()    dfs.client.socket-timeout
5   MapReduce-6263  v2.7.0          Misused                     YARNRunner.killJob()                    yarn.app.mapreduce.am.hard-kill-timeout-ms
6   MapReduce-4089  v2.7.0          Misused                     TaskHeartbeatHandler.PingChecker.run()  mapreduce.task.timeout
7   HBase-15645     v1.3.0          Misused                     RpcRetryingCaller.callWithRetries()     hbase.client.operation.timeout
8   HBase-17341     v1.3.0          Misused                     ReplicationSource.terminate()           replication.source.maxretriesmultiplier
9   Hadoop-11252    v2.5.0          Missing                     --                                      --                                          --
10  HDFS-1490       v2.0.2-alpha    Missing                     --                                      --                                          --
11  MapReduce-5066  v2.0.3-alpha    Missing                     --                                      --                                          --
12  Flume-1316      v1.1.0          Missing                     --                                      --                                          --
13  Flume-1819      v1.3.0          Missing                     --                                      --                                          --


Sponsors