SysMD: Automatic 24x7 System Health Monitoring, Diagnosis, and Alleviation for Large-Scale Hosting Infrastructures


Large-scale hosting infrastructures have become important platforms for many systems, such as cloud computing, virtual computing labs, data centers, and multi-tier web servers. However, system administrators are often overwhelmed by the task of correcting system health problems such as performance bottlenecks, resource hotspots, service level objective (SLO) violations, and software and hardware failures. We address this challenge through three synergistic techniques: i) an intelligent information management system that adaptively and selectively collects important information and supports queries at low monitoring cost; ii) context-aware distributed anomaly prediction that raises advance alerts for impending system anomalies; and iii) just-in-time anomaly remediation and diagnosis tools that dynamically alleviate impending anomalies and produce informative diagnosis reports for the system administrator by applying time-traveling execution and analysis techniques to abnormal system components. Our research will lead to a fundamentally new predictive online health management approach that offers a more cost-effective self-healing solution for large-scale virtual computing environments than previous reactive or proactive approaches.




Code & data release