Throw away your monitoring, it’s metrics that matter

Monitoring is not given the time and effort it deserves. Instead it’s seen as an afterthought, a tech debt item of very little importance. This is strange. Monitoring is quite literally the only way you’re going to know you have a problem with your system before your users or customers do. It’s your gateway to accurate scaling (otherwise known as not giving your accounting team sudden and explosive heart failure with AWS bills). It is, in fact, the only way you can measure the digital pulse of your system. Weirdly, the opposite appears to be true of the more ‘business’ side of things. Sales figures, spending habits and expenses are all put through the wringer, with many and varied people getting highly excited when a figure starts to trend up, down or, in some cases, sideways. Not so in the cosy world of technology, where it seems most people will slap in some state detection and call it a day. ‘Metrics?’ they sneer in surprised puzzlement. ‘What on earth do we need those for?’ In the realm of startups, disrupters and early adopters, metrics are the new normal, but in the risk-averse enterprise world they still haven’t caught on.

Here’s the dirty little secret: metrics are the only way to ensure your service is a success.

They should be your only window into the state of your tech stack. Take state detection as an example: at its most basic, a state check is a metric, albeit one that simply fluctuates between 0 and 1 (or False and True). It can be charted, alerted on and interrogated. Changes to a file? Metric. CPU usage? Obvious metric. Systems such as Sensu and Icinga have recognised that metrics are important, but don’t make them front and centre (yet). Instead, the focus is still on state over metrics.
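To make the "state is just a 0/1 metric" point concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the names `Sample`, `state_to_metric`, `availability` and `should_alert`, the 60-second interval and the 0.5 threshold are assumptions, not the API of Sensu, Icinga or any other tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One point in a time series (hypothetical structure)."""
    timestamp: int  # seconds since the epoch
    value: float    # 1.0 = check passed, 0.0 = check failed


def state_to_metric(is_up: bool) -> float:
    """Turn an up/down state into a number that can be charted."""
    return 1.0 if is_up else 0.0


def availability(samples: List[Sample]) -> float:
    """Fraction of samples in which the check passed."""
    if not samples:
        raise ValueError("no samples to average")
    return sum(s.value for s in samples) / len(samples)


def should_alert(samples: List[Sample], window: int, threshold: float) -> bool:
    """Alert when availability over the last `window` samples drops below `threshold`."""
    return availability(samples[-window:]) < threshold


# Made-up results of a health check run every 60 seconds.
results = [True, True, False, True, False, False]
series = [
    Sample(timestamp=1_700_000_000 + 60 * i, value=state_to_metric(ok))
    for i, ok in enumerate(results)
]

print(availability(series))                            # 0.5
print(should_alert(series, window=3, threshold=0.5))   # True
```

Once the state is stored as numbers over time, the same series answers more than "is it up right now?": it gives you availability over any period, and an alert rule that tolerates a single blip but fires on a sustained failure.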

[Image: laptop and desktop screens displaying metrics (Bitergia dashboards)]

The need for measurable systems is slowly being realised among the more forward-thinking techies. Most startups now use AWS and live and breathe CloudWatch for their monitoring. CloudWatch is essentially just a time series database, so these folk are ahead of the curve; whether they realise it or not, they’re already living the metrics dream. ‘Legacy’ or ‘corporate’ systems, on the other hand, not so much. I remember with a shudder some of the conversations I’ve had with clients, ranging from huge telecoms companies through to finance firms, that simply go along with the state change bandwagon. In fact, in one spectacular case, the reply to ‘Do you have monitoring?’ was a reflective pause, a grunt and ‘Well sure; we have customers who call when it goes wrong’. You can imagine the delicacy with which I chose my reply.

To be fair, tooling has lagged behind, but this is no longer an issue, with excellent tools such as InfluxDB, Elastic and even Splunk taking the time series bandwagon for a ride around the block. Again, when it comes to presenting data there is a wide and varied set of tools to choose from, ensuring that no matter how shiny you need that graph to be (and I notice the higher up the C-levels you go, the more inexplicably shiny people want to make them) there is a tool that can do it. With tooling solved, that leaves a perceived knowledge gap. Right now it’s a bafflingly common pattern to find ‘monitoring teams’. These fine folk sit in remote locations tweaking esoteric tools into producing more and more irrelevant data, giving the business the comfortable illusion of knowing what’s going on. I’m sure it was cold comfort to BA to know their systems were down, but it probably would have been more relevant to note that something more fundamental was wrong with the platform before a power outage triggered the issue. Interestingly, the skills needed to close this gap already exist within development teams. Most teams have highly skilled QA or test engineers who know the intricacies of their systems in fundamental detail. The app developers can easily add additional monitoring points, and the DevOps engineers can give chapter and verse on monitoring the state of the app. Generally speaking, this knowledge is getting dusty in the brains of the app delivery teams because their ability to push new monitoring to an external team is limited.

A new movement is gaining pace that coalesces the practices and people that already preach this heady form of monitoring, and brings our test and QA colleagues into the team as key members. My fine colleague Richard Donovan has established the term MetOps, a nice mashup of ‘Metrics’ and ‘Operations’. Metric-led Operations would be more accurate, but MetLedOps is a bit lengthy for today’s attention deficit. The term gives us all an umbrella to gather under. It’ll need to be one of those insanely big golf umbrellas, as this is going to be huge. In part two we’ll examine how you can start to build MetOps into your team, and mine the knowledge that’s already hidden in the grey matter of your existing team.