In the course of their day-to-day work, our development team relies on our metrics platform to confidently ship code to production and debug problems. They measure and correlate behavior between services on live production workloads, use real-time data to reason and hypothesize about production problems, and add or modify metrics and instrumentation in production to prove out their assumptions. Our own success in using the metrics stream from production to close our engineering feedback loop has convinced us that this practice, which we describe as Metrics-Driven Development (MDD), is a requirement for building web-scale systems. It is a discipline that development teams should adopt alongside other development paradigms such as Test-Driven Development (TDD) and Behavior-Driven Development (BDD).
Our talk will recount an episode in which we employed MDD to diagnose an actual problem encountered in our production system running at scale. The audience will follow as the developer identified an anomaly in a production KPI metric, developed a hypothesis about its cause, added instrumentation to the code in question, and finally confirmed the original hypothesis by observing real-time metrics. Along the way we’ll include references to specific tools and best practices that developers can adopt in their own MDD efforts. We’ll also demonstrate that MDD does not replace traditional debugging approaches like request logging or code profiling, but can often help narrow the focus of those efforts, which can be expensive or difficult to perform in web-scale systems.
This talk is a synthesis of cultural transformation, concrete engineering techniques, systems monitoring, scientific observation, and post-mortem. It will prove intellectually gratifying and valuable to anyone who is writing and shipping code to production systems, even if they are already following an MDD model. They’ll learn what requirements a metrics platform needs to support MDD, how to add lightweight instrumentation to code, and how to isolate problems by using metrics derived from that instrumentation. The audience will also see how MDD can be used in addition to traditional production debugging practices, and will come away with an understanding of how to ship better software through the use of MDD.
As applications move to a SaaS deployment model, the scale of deployments increases, often by orders of magnitude. At the same time, competitive pressures call for continuous deployment and the integration of development and operations functions, shortening the interval between application updates from months to days or even hours. Monitoring solutions that were sufficient for on-premises applications do not meet the needs of web-scale applications in terms of robustness, scalability, elasticity, and flexibility for dealing with custom metrics.
To understand and respond to problems, enterprises must work toward an integrated monitoring and alerting system that can ingest data from a variety of sources and understand the normal range of behaviors for different applications and deployment modes. Such a system improves the signal-to-noise ratio and makes operations staff more productive.
Join Gigaom Research and Librato for “Monitoring and Metrics for Web-Scale Applications,” a free analyst roundtable webinar that will help businesses understand the impact of the cloud on operations and help them take steps toward a more efficient, powerful monitoring and management system.
Monitoring the health of mission-critical applications and websites is key to maintaining uptime and ensuring your deployments are functioning as intended. In this webinar, you will learn about Librato’s solution for real-time application monitoring, alerting, and flexible correlation analysis, designed to integrate seamlessly with Amazon CloudWatch and provide a real-time view of your AWS infrastructure and application metrics.
Webinar topics include:
Correlating Amazon CloudWatch and custom metrics, adding asynchronous event markers and setting alerts to create actionable dashboards.
Integrating other existing tools with Librato for collaboration and escalation workflow.
How StatusPage.io uses Librato to correlate AWS infrastructure and application metrics for root cause analysis.
Kyle Lichtenberg, Solutions Architect, Amazon Web Services
Joe Ruscio, CTO and Co-Founder, Librato
Scott Klein, Co-Founder, StatusPage.io