SysMD: Automatic 24x7 System Health Monitoring, Diagnosis, and Alleviation for Large-Scale Hosting Infrastructures
Background
Large-scale hosting infrastructures have become important platforms for many systems such as cloud computing, virtual computing labs, data centers, and multi-tier web servers. However, system administrators are often overwhelmed by the tasks of correcting various system health problems such as performance bottlenecks, resource hotspots, service level objective (SLO) violations, and various software/hardware failures. We are addressing this challenge through three synergistic techniques: i) an intelligent information management system that can adaptively and selectively collect important information and provide query support with low monitoring cost; ii) context-aware distributed anomaly prediction that can raise advance alerts for impending system anomalies; and iii) just-in-time anomaly remediation and diagnosis tools that can dynamically alleviate impending anomalies and produce informative diagnosis reports for the system administrator by applying time-traveling execution and analysis techniques to abnormal system components. Our research will lead to a fundamentally new predictive online health management approach that offers a more cost-effective self-healing solution for large-scale virtual computing environments than previous reactive or proactive approaches.
Projects
Publications
Jingzhu He, Ting Dai, and Xiaohui Gu, "TFix: Automatic Timeout Bug Fixing in Production Server Systems", Proc. of IEEE International Conference on Distributed Computing Systems (ICDCS), Dallas, Texas, July, 2019.
Ting Dai, Jingzhu He, Xiaohui Gu, Shan Lu, and Peipei Wang, "DScope: Detecting Real-World Data Corruption Hang Bugs in Cloud Server Systems", Proc. of ACM Symposium on Cloud Computing (SOCC), Carlsbad, CA, October, 2018.
Jingzhu He, Ting Dai, and Xiaohui Gu, "TScope: Automatic Timeout Bug Identification for Server Systems", Proc. of IEEE International Conference on Autonomic Computing (ICAC), Trento, Italy, September, 2018.
Ting Dai, Daniel Dean, Peipei Wang, Xiaohui Gu, Shan Lu, "Hytrace: A Hybrid Approach to Performance Bug Diagnosis in Production Cloud Infrastructures", IEEE Transactions on Parallel and Distributed Systems (TPDS), 2018 (view supplemental material here).
Ting Dai, Jingzhu He, Xiaohui Gu, Shan Lu, "Understanding Real-World Timeout Problems in Cloud Server Systems", Proc. of IEEE International Conference on Cloud Engineering (IC2E), Orlando, FL, April, 2018 (acceptance rate: 19%, best paper nominee).
Ting Dai, Daniel Dean, Peipei Wang, Xiaohui Gu, Shan Lu, "Hytrace: A Hybrid Approach to Performance Bug Diagnosis in Production Cloud Infrastructures", Proc. of ACM Symposium on Cloud Computing (SOCC), poster session, Santa Clara, CA, September, 2017.
Anwesha Das, Frank Mueller, Xiaohui Gu, Arun Iyengar, "Performance Analysis of a Multi-Tenant In-memory Data Grid", Proc. of the IEEE International Conference on Cloud Computing (IEEE CLOUD), San Francisco, CA, June/July, 2016.
Peipei Wang, Hiep Nguyen, Xiaohui Gu, Shan Lu, "RDE: Replay DEbugging for Diagnosing Production Site Failures", Proc. of IEEE International Symposium on Reliable Distributed Systems (SRDS), Budapest, Hungary, September 26-29, 2016.
Daniel J. Dean, Hiep Nguyen, Peipei Wang, Xiaohui Gu, Anca Sailer, Andrzej Kochut, "PerfCompass: Online Performance Anomaly Fault Localization and Inference in Infrastructure-as-a-Service Clouds", IEEE Transactions on Parallel and Distributed Systems (TPDS), 2015.
Daniel J. Dean, Peipei Wang, Xiaohui Gu, William Enck, Guoliang Jin, "Automatic Server Hang Bug Diagnosis: Feasible Reality or Pipe Dream?", Proc. of IEEE International Conference on Autonomic Computing (ICAC), Grenoble, France, July, 2015 (short paper, acceptance rate: 27%).
Peipei Wang, Daniel J. Dean, Xiaohui Gu, "Understanding Real World Data Corruptions in Cloud Systems", Proc. of IEEE International Conference on Cloud Engineering (IC2E), Tempe, AZ, March, 2015 (acceptance rate: 25%).
Daniel Dean, Hiep Nguyen, Xiaohui Gu, Hui Zhang, Junghwan Rhee, Nipun Arora and Geoff Jiang, "PerfScope: Practical Online Server Performance Bug Inference in Production Cloud Computing Infrastructures", Proc. of ACM Symposium on Cloud Computing (SOCC), Seattle, WA, November, 2014 (acceptance rate: 29/119 = 24%).
Daniel Dean, Hiep Nguyen, Peipei Wang and Xiaohui Gu, "PerfCompass: Toward Runtime Performance Anomaly Fault Localization for Infrastructure-as-a-Service Clouds", Proc. of USENIX Workshop on Hot Topics in Cloud Computing (HotCloud), Philadelphia, PA, June, 2014 (acceptance rate: 22/72 = 30.5%).
Hiep Nguyen, Daniel J. Dean, Kamal Kc and Xiaohui Gu, "Insight: In-situ Online Service Failure Path Inference in Production Computing Infrastructures", Proc. of USENIX Annual Technical Conference (USENIX ATC), Philadelphia, PA, June, 2014 (acceptance rate: 36/241 = 14.9%).
Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Sethuraman Subbiah, and John Wilkes, "AGILE: elastic distributed resource scaling for Infrastructure-as-a-Service", Proc. of USENIX International Conference on Autonomic Computing (ICAC), San Jose, CA, June, 2013 (full paper, acceptance rate: 16/73 = 21%).
Hiep Nguyen, Zhiming Shen, Yongmin Tan and Xiaohui Gu, "FChain: Toward Black-box Online Fault Localization for Cloud Systems", Proc. of International Conference on Distributed Computing Systems (ICDCS), Philadelphia, PA, July, 2013 (acceptance rate: 61/464 = 13%).
Daniel Dean, Hiep Nguyen and Xiaohui Gu, "UBL: Unsupervised Behavior Learning for Predicting Performance Anomalies in Virtualized Cloud Systems", Proc. of International Conference on Autonomic Computing (ICAC), San Jose, CA, September, 2012 (acceptance rate: 24%).
Yongmin Tan, Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Chitra Venkatramani and Deepak Rajan, "PREPARE: Predictive Performance Anomaly Prevention for Virtualized Cloud Systems", Proc. of International Conference on Distributed Computing Systems (ICDCS), Macau, China, June, 2012 (acceptance rate: 71/515 = 13.8%, Best Paper Award).
Hiep Nguyen, Yongmin Tan and Xiaohui Gu, "PAL: Propagation-aware Anomaly Localization for Cloud Hosted Distributed Applications", Proc. of ACM Workshop on Managing Large-Scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques (SLAML), in conjunction with SOSP, Cascais, Portugal, October, 2011.
Yongmin Tan, Xiaohui Gu, "On Predictability of System Anomalies in Real World", IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Miami Beach, Florida, August, 2010.
Yongmin Tan, Xiaohui Gu, Haixun Wang, "Adaptive Runtime Anomaly Prediction for Dynamic Hosting Infrastructures", ACM Symposium on Principles of Distributed Computing (PODC), Zurich, Switzerland, July, 2010 (acceptance rate: 21%).
Xiaohui Gu, Haixun Wang, "Online Anomaly Prediction for Robust Cluster Systems", IEEE International Conference on Data Engineering (ICDE), Shanghai, China, 2009 (long paper, acceptance rate: 93/554 = 17%).
Xiaohui Gu, Spiros Papadimitriou, Philip S. Yu, Shu-Ping Chang, "Toward Predictive Failure Management for Distributed Stream Processing Systems", IEEE International Conference on Distributed Computing Systems (ICDCS), Beijing, China, June, 2008 (acceptance rate: 102/638 = 16%).
Xiaohui Gu, Spiros Papadimitriou, Philip S. Yu, Shu-Ping Chang, "Online Failure Forecast for Fault-Tolerant Data Stream Processing", IEEE International Conference on Data Engineering (ICDE) (poster paper), Cancun, Mexico, April, 2008.
Sponsors
NSF CNS0915567 grant
NSF CNS0915861 grant
NSF CAREER Award CNS1149445
U.S. Army Research Office (ARO) under grant W911NF-10-1-0273
IBM Faculty Awards
Google Research Awards
Code & data release
Data used in the UBL paper can be downloaded here
Data used in the FChain paper can be downloaded here
SysMD v1.0 (available upon email request)