Entries posted in May 2018

On collecting metrics

Saturday, 26 May 2018

Here are some brief notes about metric-collection, for my own reference.

Collecting server and service metrics is a good thing because it lets you spot degrading performance, and see the effect of any improvements you've made.

Of course it is hard to know what metrics you might need in advance, so the common approach is to measure everything, and the most common way to do that is via collectd.

To collect/store metrics the most common approach is to use carbon and graphite-web. I tend to avoid that as being a little more heavyweight than I'd prefer. Instead I'm all about the modern alternatives:

  • Collect metrics via go-carbon
    • This will listen on :2003 and write metrics beneath /srv/metrics
  • Export the metrics via carbonapi
    • This will talk to the go-carbon instance and export the metrics in a fashion compatible with what carbon would have done.
  • Finally you can view your metrics via grafana
    • This lets you make pretty graphs & dashboards.
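
To make the receiver on :2003 concrete, here is a minimal sketch of pushing a single value using the Graphite plaintext protocol (one "path value timestamp" line per metric), which carbon-compatible receivers accept. The metric name and the localhost address are assumptions for illustration, not something from my setup.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one plaintext-protocol line: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    """Send a single metric over TCP to a carbon-compatible receiver."""
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Hypothetical metric name; needs a receiver listening on localhost:2003.
    send_metric("test.example.uptime", 12345)
```

Running the script once should be enough to see whether a new series turns up beneath the data directory.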

Configuring all this is pretty simple. Install go-carbon, and give it a path to write data to (/srv/metrics in my world). Enable the receiver on :2003. Enable the carbonserver and make it bind to

Now configure the carbonapi with the backend of the server above:

  # Listen address, should always include hostname or ip address and a port.
  listen: "localhost:8080"

  # "http://host:port" array of instances of carbonserver stores
  # This is the *ONLY* config element in this section that MUST be specified.
    - ""

And finally you can add your data-source to grafana, and graph away.
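
Before pointing grafana at it, it can be worth checking that carbonapi answers at all. The sketch below builds a Graphite-style render query and fetches it as JSON; it assumes carbonapi is listening on localhost:8080 as in the config above, and the metric name is made up.

```python
import json
import urllib.parse
import urllib.request


def render_url(base, target, since="-1h"):
    """Build a Graphite-compatible /render URL that asks for JSON output."""
    query = urllib.parse.urlencode(
        {"target": target, "from": since, "format": "json"}
    )
    return f"{base.rstrip('/')}/render?{query}"


def fetch_series(base, target, since="-1h"):
    """Fetch the matching series; returns the decoded JSON response."""
    url = render_url(base, target, since)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical metric name; requires carbonapi on localhost:8080.
    for series in fetch_series("http://localhost:8080", "test.example.uptime"):
        print(series["target"], len(series["datapoints"]))
```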

The only part that I'm disliking at the moment is the sheer size of collectd. Getting metrics from your servers (uptime, I/O performance, etc) is very useful, but it feels like installing 10MB of software to do that is a bit excessive.

I'm sure there must be more lightweight systems out there for collecting "everything". On the other hand, I've added metrics-exporting to my puppet-master and similar tools very easily, so I have lightweight support for that in the tools themselves.

I have had a good look at metricsd which is exactly the kind of tool I was looking for, but I've not searched too far afield for other alternatives and choices just yet.

I should write more about application-specific metrics in the future, because I've quizzed a few people recently:

  • What's the average response-time of your application? What's the effectiveness of your (gzip) compression?
    • You don't know?
  • What was the quietest time over the past 24 hours for your server?
    • You don't know?
  • What proportion of your incoming HTTP-requests were for HTTP?
    • Do you monitor HTTP-status-codes? Can you see how many times people were served redirects to the SSL version of your site? Will using HSTS save you bandwidth, and if so, how much?
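
None of these questions are hard to answer once the raw numbers are recorded. As a rough sketch, assuming each request is logged as a (scheme, status, response-time, uncompressed-bytes, sent-bytes) tuple, the answers fall out of a few sums. The record layout and the sample values are invented purely for illustration.

```python
from collections import Counter

# Made-up sample records:
# (scheme, status code, response time in seconds, raw bytes, bytes sent)
SAMPLE = [
    ("https", 200, 0.120, 10000, 2500),
    ("http", 301, 0.010, 200, 200),
    ("https", 200, 0.080, 8000, 2000),
    ("https", 404, 0.030, 1000, 500),
]


def summarise(records):
    """Reduce request records to a handful of application-level metrics."""
    total = len(records)
    statuses = Counter(r[1] for r in records)
    plain_http = sum(1 for r in records if r[0] == "http")
    raw_bytes = sum(r[3] for r in records)
    sent_bytes = sum(r[4] for r in records)
    return {
        # Mean response time across all requests.
        "avg_response_time": sum(r[2] for r in records) / total,
        # Fraction of requests that arrived over plain HTTP.
        "http_share": plain_http / total,
        # Requests answered with a 301 or 302 redirect.
        "redirects": statuses[301] + statuses[302],
        # Fraction of bytes saved by compression.
        "compression_saving": 1 - sent_bytes / raw_bytes,
    }


if __name__ == "__main__":
    for name, value in summarise(SAMPLE).items():
        print(name, value)
```

Each of those values could then be submitted as a metric of its own, and graphed like anything else.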

Fun times. (Terrible pun is terrible, but I was talking to a guy called Tim. So I could have written "Fun Tims".)



This month has been mostly golang-based

Monday, 21 May 2018

This month has mostly been about golang. I've continued work on the protocol-tester that I recently introduced.

This has turned into a fun project, and now all my monitoring is done with it. I've simplified the operation, such that everything uses Redis for storage, and there are now new protocol-testers for finger, nntp, and more.

Sample tests are as basic as this:

  mail.steve.org.uk must run smtp
  mail.steve.org.uk must run smtp with port 587
  mail.steve.org.uk must run imaps
  https://webmail.steve.org.uk/ must run http with content 'Prayer Webmail service'
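
The format is simple enough that a parser fits in a few lines. What follows is only an illustrative sketch in Python, not the code the tester itself uses (that is written in golang), and it assumes that options always appear as "with <key> <value>" and that quoted values follow shell-style quoting.

```python
import shlex


def parse_test(line):
    """Parse '<target> must run <protocol> [with <key> <value>]...'."""
    tokens = shlex.split(line)
    if len(tokens) < 4 or tokens[1:3] != ["must", "run"]:
        raise ValueError(f"unrecognised test: {line!r}")

    target, protocol, rest = tokens[0], tokens[3], tokens[4:]

    # Options are assumed to come in triples: 'with', a key, and a value.
    if len(rest) % 3 != 0:
        raise ValueError(f"malformed options in: {line!r}")

    options = {}
    for i in range(0, len(rest), 3):
        if rest[i] != "with":
            raise ValueError(f"expected 'with' in: {line!r}")
        options[rest[i + 1]] = rest[i + 2]

    return {"target": target, "protocol": protocol, "options": options}


if __name__ == "__main__":
    print(parse_test("mail.steve.org.uk must run smtp with port 587"))
```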

Results are stored in a redis-queue, where they can be picked off and announced to humans via a small daemon. In my case alerts are routed to a central host, via HTTP POSTs, and eventually reach me via pushover.

Beyond the basic network testing though I've also reworked a bunch of code - so the markdown sharing site is now golang powered, rather than running on the previous perl-based code.

As a result of this rewrite, and a little more care, I now score 99/100 + 100/100 on Google's pagespeed testing service. A few more of my sites do the same now, thanks to inline-CSS, inline-JS, etc. Nothing I couldn't have done before, but this was a good moment to attack it.

Finally my "silly" Linux security module, for letting user-space decide if binaries should be executed, can-exec, has been forward-ported to v4.16.17. No significant changes.

Over the coming weeks I'll be trying to move more stuff into the cloud, rather than self-hosting. I'm doing a lot of trial-and-error at the moment with Lambdas, containers, and dynamic-routing to that end.

Interesting times.


