I want to share a couple of things that our team has learned along our journey into real-time production monitoring. Mainly, it’s about how important it is to shorten the feedback cycle when designing dashboards during development.
Much has been said and written about visualizing what your app does in production – I’m sure you get it by now. I’m talking here about the importance of having, at development time, the exact same view you would have in production, and what a powerful realization this has been for the team and me.
Maybe the easiest way to start is with a metaphor.
Imagine that you are a carpenter, building a wooden nightstand for someone. You are handy with the tools, maybe even a master carpenter – you really know your stuff, development-wise. Imagine, though, that while you are crafting each part to perfection – you have a great blueprint to work from and all the finest tools – all your work is being done in a darkened box behind a curtain. To shape and assemble the nightstand, you need to reach into the dark box, fumble with all the pieces, and then keep assembling them and testing out the nightstand to make sure it seems to be working properly – that the drawers are opening, the top is on top, the bottom is on the bottom, etc.
Doesn’t that sound…kind of stupid?
But let’s bring it into the dev world now: pretend that we’ve built a cutting-edge, successful training platform. We’re “capital A” Agile Development, we’ve got a tight set of unit tests, an acceptance test. We’ve also got some form of delivery pipeline and we’re delivering to production somewhat regularly. When suddenly it starts to hit us:
How could we have seen this coming?
I say that the usual methods of contrived testing would not have revealed the problem and that’s because the problem didn’t come in the form of an error, it came in the form of an omission. I’ve got this thought – mind you it isn’t fully formulated yet – that monitoring these types of metrics is akin to constantly running acceptance tests telling you that business value IS being delivered.
How Have We Done It?
We have been leaning on open source tools – they could be more mature, but they have been working great for us nonetheless. We push our metrics to statsd, which aggregates them and forwards them to Graphite. Graphite stores time-series data efficiently and lets you query it and apply some crazy interesting functions to it. We then use Grafana to create graphs from the Graphite data and put together entire dashboards to measure app activity. Keeping an eye on trends uncovers issues and opportunities you would otherwise not have seen.
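To make the first hop of that pipeline concrete, here is a minimal sketch of pushing a metric to statsd from Python using nothing but a UDP socket. It assumes statsd is listening on its default port, 8125, on your own machine; the metric names are made up for illustration, and in a real app you would more likely reach for an existing statsd client library than hand-roll this.

```python
import socket


def format_metric(name: str, value: int = 1, metric_type: str = "c") -> bytes:
    """Build one statsd line of the form <name>:<value>|<type>.

    "c" is a counter and "ms" is a timer in the statsd line protocol.
    """
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(name: str, value: int = 1, metric_type: str = "c",
                host: str = "127.0.0.1", port: int = 8125) -> bytes:
    """Fire-and-forget a metric over UDP.

    host and port are assumptions: a statsd daemon on localhost:8125.
    UDP means the app is not slowed down or broken if nothing is listening.
    """
    payload = format_metric(name, value, metric_type)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload


if __name__ == "__main__":
    # Hypothetical metric names for a training platform.
    send_metric("training.course.completed")         # count one completion
    send_metric("training.quiz.render_ms", 42, "ms")  # record a 42 ms timing
```

Because the send is a single UDP datagram, the same call works unchanged on a laptop and in production; only the host it points at differs.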
The dashboards need to be managed like code and you want them consistent across environments. For us, having new graphs show up in all environments is huge, so our dashboard nodes update on commit. To be clear, all of this is a developer task – you are going to be using it, you are going to benefit from it.
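One way to picture "dashboards like code" is to generate the dashboard definition from a script and commit the resulting JSON, so every environment loads the same file. The sketch below is only an illustration under assumptions: the JSON shape is a heavily simplified stand-in for a Grafana dashboard definition, the Graphite target path depends on how statsd is configured, and the file name is hypothetical.

```python
import json


def graph_panel(title: str, target: str) -> dict:
    """One graph backed by a single Graphite target (simplified, assumed shape)."""
    return {"type": "graph", "title": title, "targets": [{"target": target}]}


def build_dashboard(title: str, panels: list) -> dict:
    """Wrap panels in a single row; a real dashboard carries many more fields."""
    return {"title": title, "rows": [{"title": "Activity", "panels": panels}]}


def render(dashboard: dict) -> str:
    """Serialize deterministically so commits produce small, readable diffs."""
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    dashboard = build_dashboard(
        "App Activity",
        [
            # Hypothetical target; the prefix depends on your statsd settings.
            graph_panel("Courses completed",
                        "stats.counters.training.course.completed.count"),
        ],
    )
    # Hypothetical file name; commit this file alongside the application code.
    with open("app-activity.json", "w") as handle:
        handle.write(render(dashboard))
```

Sorting the keys and fixing the indentation means that adding a graph shows up in review as a few added lines rather than a reshuffled file.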
Capturing log entries and metrics is dead simple. It’s knowing what to capture that is difficult and we really don’t know if the data we’re capturing is going to be helpful until we visualize it. We get it wrong all the time, of course, but that’s the point here; what makes sense when you’re typing away in your editor often makes zero sense at the user level. Would you ship a new UI without ever looking at it?
The fact is that you need to know in advance what will happen in production and the best way to do that is to have a production-evolved model to refer to while you’re building it.
Once you accept this idea, you also need to know how easy it is to shorten the feedback cycle by scripting a production-like world on your laptop. When you realize the tiny investment to build this little world for yourself, you will never return to the old way of working.
I want everyone to rethink how we look at this – make that production insight a first-class citizen right from the get-go, day one of starting to write the code.
If you’d like to see another piece where I get into the weeds a bit more on this, let me know in the comments or ping me on Twitter and I’ll put something together. Here’s my deck from the talk I’m giving on this topic today:
Mario leads the Software Platform Team at 360incentives. Amid raising three kids under the age of five, Mario finds time to hack on various projects and play Aussie Rules football. Connect with him on Twitter at @mario_pareja.