DScope: Detecting Real-World Data Corruption Hang Bugs in Cloud Server Systems
Background
Cloud server systems such as Hadoop and Cassandra have enabled many real-world
data-intensive applications ranging from security attack detection to business intelligence.
However, due to their inherent complexity, these cloud server systems present many performance challenges.
In particular, previous studies have shown that many hard-to-diagnose performance bugs in cloud server systems are caused
by unexpected data corruption, which developers often overlook. For example, in May 2017, a data corruption bug triggered during a data center failover operation brought down the British Airways service for hours.
Performance bugs are notoriously difficult to debug because they typically produce little useful debugging information.
The problem is exacerbated in cloud server systems because developers typically have access to neither
the original input data that triggered the performance bug nor the large-scale infrastructure
needed to replay the failed production run.
Although previous work has extensively studied data corruption and performance bugs separately,
little research has examined the intersection of the two, that is, performance problems caused by data corruption.
In particular, our work focuses on detecting software hang bugs that are triggered by data corruption in cloud server systems.
Software hang bugs render the system unavailable to some or all of its users, which is one of the most severe performance problems that production systems try to avoid.
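To make this failure mode concrete, the following is a minimal illustrative sketch in Python. It is not code from DScope or from any of the systems studied; the function names and the iteration guard are hypothetical and exist only for this illustration. It assumes a loop that skips a number of bytes taken from a length field: if corruption makes that field larger than the data actually available, a loop that never checks for progress spins forever, while a variant that checks for an empty read fails fast instead.

```python
import io


def skip_fully_buggy(stream, n, max_iterations=1000):
    """Skip n bytes, assuming every read makes progress.

    Returns True if the skip completed. Returns False if the hypothetical
    iteration guard tripped, which stands in for a real hang: without the
    guard, this loop would never terminate on a short stream.
    """
    remaining = n
    iterations = 0
    while remaining > 0:
        if iterations >= max_iterations:
            return False
        chunk = stream.read(remaining)
        # At end of stream, read() returns b"" and remaining never shrinks.
        remaining -= len(chunk)
        iterations += 1
    return True


def skip_fully_fixed(stream, n):
    """Skip n bytes, raising EOFError when the stream stops making progress."""
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError(f"stream ended with {remaining} bytes left to skip")
        remaining -= len(chunk)
```

With a 5-byte payload, `skip_fully_buggy(io.BytesIO(b"hello"), 5)` completes normally, whereas a corrupted length of 500 on the same payload trips the guard; `skip_fully_fixed` raises `EOFError` in that case rather than looping.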
Publications
- Ting Dai, Jingzhu He, Xiaohui Gu, Shan Lu, and Peipei Wang,
"DScope: Detecting Real-World Data Corruption Hang Bugs in Cloud Server Systems",
Proc. of ACM Symposium on Cloud Computing (SoCC), Carlsbad, CA, October 2018.
[poster]
[slides]
Benchmark
- The following table shows 42 data corruption hang bugs detected by DScope.
- DScope detects 29 previously unknown bugs, which we have filed in JIRA.
Sponsors
- NSF grants CNS1513942 and CNS1149445