All Stories

  1. L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis
  2. A Survey on Failure Analysis and Fault Injection in AI Systems
  3. COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge
  4. Mint: Cost-Efficient Tracing with All Requests Collection via Commonality and Variability Analysis
  5. FaaSConf: QoS-aware Hybrid Resources Configuration for Serverless Workflows
  6. CTuner: Automatic NoSQL Database Tuning with Causal Reinforcement Learning
  7. Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data
  8. DiagConfig: Configuration Diagnosis of Performance Violations in Configurable Software Systems
  9. MARS: Fault Localization in Programmable Networking Systems with Low-cost In-Band Network Telemetry
  10. DeepPower: Deep Reinforcement Learning based Power Management for Latency Critical Applications in Multi-core Systems
  11. LogReducer: Identify and Reduce Log Hotspots in Kernel on the Fly
  12. Fighting against Incidents in Large-Scale Online Systems
  13. MicroRank: End-to-End Latency Issue Localization with Extended Spectrum Analysis in Microservice Environments