Hytrace: A Hybrid Approach to Performance Bug Diagnosis in Production Cloud Infrastructures


Background

Cloud computing infrastructures have become increasingly popular because they give users cost-effective access to computing resources. However, when a performance problem (e.g., a software hang or performance slowdown) occurs in a production cloud infrastructure, it is notoriously difficult to diagnose because the developer often has little diagnostic information (e.g., no error log or core dump) with which to localize the fault. A recent study has also shown that performance bugs are widespread across the server applications commonly used in production cloud environments. Previous work on performance bugs can be broadly classified into two groups: 1) static analysis schemes that detect bugs by searching for specific performance anti-patterns in software, such as inefficient call sequences or loop patterns; and 2) dynamic runtime analysis schemes that closely monitor runtime application behavior to infer the root causes of performance problems.
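As a rough illustration of the first group, the sketch below searches Python source for one loop anti-pattern: a `while` loop whose condition variables are never reassigned in the loop body and that contains no `break`, `return`, or `raise`. The function name `find_stuck_loops`, the sample program, and the use of Python's `ast` module are assumptions made for this example; it is not the checker used by Hytrace or by the prior work mentioned above.

```python
import ast
from typing import List

def find_stuck_loops(source: str) -> List[int]:
    """Return line numbers of `while` loops that look unable to terminate.

    A loop is flagged when no name used in its condition is assigned
    anywhere in its body and the body holds no break/return/raise.
    This is a purely syntactic, deliberately simplistic rule.
    """
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.While):
            continue
        # Names read by the loop condition, e.g. {"i", "n"} for `i < n`.
        cond_names = {n.id for n in ast.walk(node.test) if isinstance(n, ast.Name)}
        assigned = set()
        has_exit = False
        for stmt in node.body:
            for child in ast.walk(stmt):
                if isinstance(child, ast.Name) and isinstance(child.ctx, ast.Store):
                    assigned.add(child.id)
                elif isinstance(child, (ast.Break, ast.Return, ast.Raise)):
                    has_exit = True
        if cond_names and not (cond_names & assigned) and not has_exit:
            flagged.append(node.lineno)
    return sorted(flagged)

# Hypothetical input: the second loop never updates `i`.
SAMPLE = """
def ok(n):
    i = 0
    while i < n:
        i += 1

def stuck(n, buf):
    i = 0
    while i < n:
        buf.append(i)
"""

print(find_stuck_loops(SAMPLE))  # [9]
```

Because the rule looks only at syntax, it would also flag a loop such as `while not done():`, where the callee itself changes state on every call, so a checker of this kind can report code that is unrelated to any real performance problem.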

Both approaches have advantages but also limitations. The static analysis approach imposes no runtime overhead on production systems. However, without runtime information and without focusing on the specific anomaly that occurred in a production run, it inevitably suffers from excessive false alarms, reporting code regions that are unrelated to the production-run performance problem. To address this problem, previous work proposed specialized rule checkers that detect specific, known performance bugs. However, as our experiments show, specialized rule checkers cannot cover many real-world performance bugs. In contrast, a dynamic approach can target the specific problem that has occurred in the production environment. However, it must monitor production systems, which inevitably imposes overhead. To avoid excessive runtime overhead, previous research proposed performance diagnosis based on system-level metrics or events that can be collected with low overhead, such as CPU utilization, free memory, system calls, and performance-counter events. Unfortunately, without knowledge of program semantics, those dynamic techniques suffer from both false positives and false negatives.
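To make the metric-based dynamic approach concrete, the following minimal sketch flags samples of a system-level metric (here, CPU utilization) that deviate from a baseline by more than `k` standard deviations. The function name `anomalous_samples`, the threshold `k = 3.0`, and the sample values are all assumptions chosen for illustration; no specific prior system is being reproduced.

```python
from statistics import mean, stdev

def anomalous_samples(baseline, window, k=3.0):
    """Return indices in `window` that deviate from the baseline mean
    by more than k baseline standard deviations."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation; needs >= 2 points
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as abnormal.
        return [i for i, v in enumerate(window) if v != mu]
    return [i for i, v in enumerate(window) if abs(v - mu) > k * sigma]

# Hypothetical CPU utilization samples (percent).
baseline_cpu = [20, 22, 19, 21, 20, 23]
recent_cpu = [21, 22, 95, 97, 20]

print(anomalous_samples(baseline_cpu, recent_cpu))  # [2, 3]
```

A detector like this is cheap to run, but it only reports when a metric looks abnormal. It carries no information about which function or code region is responsible, and a legitimate load spike would be flagged just the same.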

Benchmark

Static rules: unsafe function, non-updated loop exit variable, constant parameter, null parameter, uncovered branch

Bug name          Root cause function(s)
Apache-10038      main, close_connect, start_connect, write_request
Apache-17928*     ap_proxy_http_handler
Apache-24403      apr_getnameinfo, call_resolver
Apache-36448*     ap_proxy_http_handler
Apache-37680      child_main, apr_socket_opt_set, ap_setup_listeners
Apache-38403      ap_proxy_initialize_worker, ap_proxy_initialize_worker_share
Apache-40883*     proxy_http_handler
Apache-43238      ssl_hook_pre_connection, ap_process_connection
Apache-45856      err_output, log_err
Apache-47645      apr_pollcb_poll, apr_pollset_poll, apr_wait_for_io_or_timeout
Apache-48029      apr_wait_for_io_or_timeout
Apache-49882*     impl_pollset_poll, apr_poll, apr_wait_for_io_or_timeout
Apache-53609      apr_table_mergen
Cassandra-487*    getPoolStatistics, deregisterAllVerbHandlers
Cassandra-2187    validateSchemaIsSettled
Cassandra-2240*   SSTableIdentityIterator, doScrub
Cassandra-2290    run
Cassandra-2797*   createPendingFiles, flushSSTables, transferRanges
Cassandra-2872    createColumnFamilyStore
Cassandra-2933    forceTableRepair
Cassandra-3302    MessagingService.listen
Cassandra-3369*   getSampleIndexesForRanges, prepare
Cassandra-3520    forceFlush
Cassandra-3543    recover
Cassandra-3626    applyStateLocally
Cassandra-3838*   connectAttempt, IncomingStreamReader
Cassandra-4492    computeNext
Cassandra-5064    ColumnFamilyStore.reload, maybeSwitchMemtable
Cassandra-5229    IncomingStreamReader.streamIn
Cassandra-5273*   CassandraDaemon.setup, EmbeddedCassandraService.start
Cassandra-5635    CustomTThreadPoolServer.serve, MemtableCleanerThread.run
Cassandra-6097*   forceRepairAsync, forceRepairRangeAsync, NodeCmd.optionalKSandCFs
Cassandra-6175    StorageProxy.sendMessagesToNonlocalDC
Cassandra-6210*   ConnectionHandler.initiate
Cassandra-6603    waitForStreaming, drain
Cassandra-6735*   runMayThrow, waitOnFutures
Cassandra-7088    AbstractType.getString
Cassandra-7330    drain
Cassandra-7401    maybeUpdateLiveRatio
Cassandra-7560*   MessagingService.listen, MessagingService.shutdown
Lighttpd-922      mod_proxy_core_start_backend
Lighttpd-1084*    lighty_mainloop, connection_state_machine
Lighttpd-1178     proxy_http_stream_decoder
Lighttpd-1212     fdevent_poll
Lighttpd-1245     fcgi_handle_fdevent
Lighttpd-1999     connection_handle_read_state
Lighttpd-2197*    connection_handle_fdevent
HDFS-415          writeBlock, receiveBlock
HDFS-723          close
HDFS-724          run
HDFS-1490*        doGetUrl
HDFS-1692         DataXceiverServer.run, Server.stop
HDFS-2525         verifyBlock
HDFS-3180*        WebHDFSFileSystem.create, WebHDFSFileSystem.append
HDFS-3318         SocketInputStream.read, IOUtils.copyBytes, SocketIOWithTimeout.doIO
HDFS-3541         BlockReceiver.receiveBlock, BlockReceiver.close
HDFS-3754*        BlockSender.manageOsCache, BlockSender.sendPacket, BlockSender.sendBlock, DataNode.startDataNode
HDFS-4816         doGetUrl
HDFS-4858         createNamenode
HDFS-5016*        recoverRbw, receiveBlock
HDFS-5806         dispatch
HDFS-5922*        reportReceivedDeletedBlocks, offerService
HDFS-6231         hedgedFetchBlockByteRange
HDFS-6378         RpcProgram.register
HDFS-6411*        access, getattr, fsstat
Hadoop-1862       JobInProgress.updateTaskStatus, TaskTracker.run, ReduceTask.run
Mapreduce-2489    verifyHostnames
Mapreduce-3005    assignContainers
Mapreduce-3058*   main, map, setup
Mapreduce-3186    heartbeat, run, init, getResources
Mapreduce-3226    run
Mapreduce-3228*   run, startContainer
Mapreduce-3339    getResources, init
Mapreduce-3355    run
Mapreduce-3460*   ApplicationMasterService.start, AppSchedulingInfo.allocate
Mapreduce-3596*   NodeStatusUpdaterImpl.getNodeStatus
Mapreduce-3714    run
Mapreduce-3721    Shuffle.run
Mapreduce-3738    AppLogAggregatorImpl.run, AppLogAggregatorImpl.join
Mapreduce-3862    stop
Mapreduce-3896    getDelegationToken
Mapreduce-3927*   computeProgress, copySucceeded
Mapreduce-4031    AsyncDispatcher.init
Mapreduce-4062*   launch, kill, init
Mapreduce-4152    getContainer, launch, kill, stop, run
Mapreduce-4252*   JobImpl.getTasks, JobImpl.scheduleTasks
Mapreduce-4299    assignContainers
Mapreduce-4733    constructTaskAttemptCompletionEvents
Mapreduce-4751    handle, transition
Mapreduce-4842    startMerge, waitForMerge, run
Mapreduce-4992    parse
Mapreduce-5279*   preemptReducesIfNeeded, scheduleReduces, getResources, assign
Mapreduce-5489*   computeIgnoreBlacklisting, containerFailedOnHost
Memcached-106     event_handler
MySQL-7858        ft_init_boolean_search
MySQL-9459        reload_acl_and_cache
MySQL-9992        dispatch_command
MySQL-11832       innobase_close_connection, row_search_for_mysql, read_cursor_view_close_for_mysql, read_cursor_view_create_for_mysql
MySQL-12423*      reload_acl_and_cache, grant_reload, grant_init, change_password, acl_reload, acl_init, main
MySQL-12739*      mysql_create_or_drop_trigger, prepare_for_repair, prepare_for_restore, reopen_name_locked_table
MySQL-13238       get_collation_number, add_collation
MySQL-17154       end_bulk_insert, mysql_load
MySQL-19047       main, execute_impl, connect
MySQL-20575*      execute_impl
MySQL-26938       fill_statistics_info, show, mysql_execute_command
MySQL-28000*      write_record
MySQL-29644       open_ltable, store_lock
MySQL-32436       val_int
MySQL-32559       type
MySQL-33414*      create, mysql_execute_command, Statement, check_DDL_blocker
MySQL-54332       fil_aio_wait
MySQL-56715       rw_pr_init
MySQL-65615       _mi_read_static_record
Squid-643         aclMatchExternal
Squid-1096        httpAccept
Squid-1484        ipcCreate
Squid-1968        idnsGrokReply
Squid-1991*       comm_select, commSetSelect
Squid-2271        clientSendHeaders
Squid-2425        strListGetItem
Squid-2541        strListGetItem
Squid-3084        prepareTransparentURL
Squid-3134*       makeSpaceAvailable, readSomeData
Squid-3205        clientParseRequest
Squid-3528        hostHeaderVerifyFailed
Squid-3685        update
Tomcat-42753      NioEndpoint$Poller.events, NioEndpoint$Poller.run
Tomcat-45453*     getPrincipal, getRoles
Tomcat-48470      unlockAccept
Tomcat-50078      get, put
Tomcat-53173      countUpOrAwaitConnection, countUpOrAwait
Tomcat-53450      ContainerBase.fireContainerEvent, StandardContext.createWrapper, StandardContext.removeApplicationListener
Tomcat-55177*     AbstractHttp11Processor.process

