SysMD: Automatic 24x7 System Health Monitoring, Diagnosis, and Alleviation for Large-Scale Hosting Infrastructures
Background
Large-scale hosting infrastructures have become important platforms for many systems, such as cloud computing, virtual computing labs, data centers, and multi-tier web servers. However, system administrators are often overwhelmed by the task of correcting various system health problems such as performance bottlenecks, resource hotspots, service level objective (SLO) violations, and various software and hardware failures. We are addressing this challenge through three synergistic techniques: i) an intelligent information management system that adaptively and selectively collects important information and provides query support at low monitoring cost; ii) context-aware distributed anomaly prediction that raises advance alerts for impending system anomalies; and iii) just-in-time anomaly remediation and diagnosis tools that dynamically alleviate impending anomalies and produce informative diagnosis reports for the system administrator by applying time-traveling execution and analysis techniques to abnormal system components. Our research will lead to a fundamentally new predictive online health management approach that offers a more cost-effective self-healing solution for large-scale virtual computing environments than previous reactive or proactive approaches.
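To make the idea of an advance alert concrete, the following is a minimal sketch only; it is not the SysMD prediction algorithm, which this page does not describe. It assumes a single numeric metric sampled at a fixed interval, fits a least-squares line to a sliding window of recent samples, and raises an alert when the extrapolated value reaches a threshold. The class name `TrendAlert` and the parameters `window`, `lookahead`, and `threshold` are hypothetical names chosen for illustration.

```python
from collections import deque
from typing import Deque, Optional


class TrendAlert:
    """Hypothetical advance-alert helper; an illustration, not SysMD code."""

    def __init__(self, window: int, lookahead: int, threshold: float) -> None:
        if window < 2:
            raise ValueError("window must hold at least two samples")
        # Sliding window of the most recent metric samples.
        self.samples: Deque[float] = deque(maxlen=window)
        self.lookahead = lookahead    # steps ahead to extrapolate
        self.threshold = threshold    # e.g. an SLO limit on the metric

    def predict(self) -> Optional[float]:
        """Extrapolate the window's least-squares line `lookahead` steps ahead."""
        n = len(self.samples)
        if n < 2:
            return None  # not enough data to fit a line
        mean_x = (n - 1) / 2
        mean_y = sum(self.samples) / n
        var_x = sum((x - mean_x) ** 2 for x in range(n))
        cov = sum((x - mean_x) * (y - mean_y)
                  for x, y in enumerate(self.samples))
        slope = cov / var_x
        intercept = mean_y - slope * mean_x
        return intercept + slope * (n - 1 + self.lookahead)

    def observe(self, value: float) -> bool:
        """Record a sample; return True if the forecast reaches the threshold."""
        self.samples.append(value)
        forecast = self.predict()
        return forecast is not None and forecast >= self.threshold
```

For example, with `TrendAlert(window=4, lookahead=5, threshold=88.0)` and CPU utilization samples of 50, 55, 60, and 65 percent, `observe` returns `True` on the fourth sample: no sample has reached 88, but the fitted trend predicts 90 five steps ahead. A linear fit is chosen here only for brevity; any forecasting model could stand in its place.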
People
Faculty
Students
Collaborators
- Brent Miller (IBM RTP)
- Mike Wamboldt (IBM RTP)
- Haixun Wang (IBM Research)
- Spiros Papadimitriou (IBM Research)
Publications
- Xiaohui Gu, Haixun Wang, "Online Anomaly Prediction for Robust Cluster Systems", IEEE International Conference on Data Engineering (ICDE), Shanghai, China, 2009. (long paper, acceptance rate: 93/554 = 17%)
- Xiaohui Gu, Spiros Papadimitriou, Philip S. Yu, Shu-Ping Chang, "Toward Predictive Failure Management for Distributed Stream Processing Systems", IEEE International Conference on Distributed Computing Systems (ICDCS), Beijing, China, June 2008. (acceptance rate: 102/638 = 16%)
- Xiaohui Gu, Spiros Papadimitriou, Philip S. Yu, Shu-Ping Chang, "Online Failure Forecast for Fault-Tolerant Data Stream Processing", IEEE International Conference on Data Engineering (ICDE), Cancun, Mexico, April 2008. (poster paper)
Code Release
- SysMD v1.0 (available upon email request)