
 

Entries tagged overseer

This month has been mostly golang-based

21 May 2018 12:01

This month has mostly been about golang. I've continued work on the protocol-tester (overseer) that I recently introduced.

This has turned into a fun project, and now all my monitoring is done with it. I've simplified the operation, such that everything uses Redis for storage, and there are now new protocol-testers for finger, nntp, and more.

Sample tests are as basic as this:

  mail.steve.org.uk must run smtp
  mail.steve.org.uk must run smtp with port 587
  mail.steve.org.uk must run imaps
  https://webmail.steve.org.uk/ must run http with content 'Prayer Webmail service'

Results are stored in a redis-queue, where they can be picked off and announced to humans via a small daemon. In my case alerts are routed to a central host via HTTP POSTs, and eventually reach me via pushover.
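
For illustration, here's a minimal sketch of such a daemon in Go - this isn't the real code, and the queue name, result format, and alert URL below are placeholders of my own:

  package main

  import (
      "bytes"
      "context"
      "log"
      "net/http"
      "time"

      "github.com/go-redis/redis/v8"
  )

  func main() {
      ctx := context.Background()
      rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

      for {
          // Block until a test-result appears on the (placeholder) queue.
          res, err := rdb.BLPop(ctx, 30*time.Second, "overseer.results").Result()
          if err == redis.Nil {
              continue // timed out with nothing queued
          }
          if err != nil {
              log.Printf("redis error: %v", err)
              time.Sleep(5 * time.Second)
              continue
          }

          // res[0] is the queue name, res[1] the JSON-encoded result.
          // Forward it to the central host via an HTTP POST.
          resp, err := http.Post("https://alerts.example.com/notify",
              "application/json", bytes.NewReader([]byte(res[1])))
          if err != nil {
              log.Printf("post failed: %v", err)
              continue
          }
          resp.Body.Close()
      }
  }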

Beyond the basic network testing, though, I've also reworked a bunch of code - the markdown-sharing site is now golang-powered, rather than running on the previous perl-based code.

As a result of this rewrite, and a little more care, I now score 99/100 + 100/100 on Google's pagespeed testing service. A few more of my sites do the same now, thanks to inline-CSS, inline-JS, etc. Nothing I couldn't have done before, but this was a good moment to attack it.

Finally my "silly" Linux security module can-exec, which lets user-space decide whether binaries should be executed, has been forward-ported to v4.16.17. No significant changes.

Over the coming weeks I'll be trying to move more stuff into the cloud, rather than self-hosting. I'm doing a lot of trial-and-error at the moment with Lambdas, containers, and dynamic-routing to that end.

Interesting times.


 

Hosted monitoring

26 June 2018 19:01

I don't run hosted monitoring as a service; I just happen to do some monitoring for a few (local) people, in exchange for money.

Setting up some new tests today, I realised my monitoring software had an embarrassingly bad bug:

  • The IMAP probe would connect to an IMAP/IMAPS server.
  • Optionally it would log in with a username & password.
    • Thus it could test that the service was functional.

Unfortunately the IMAP probe would never logout after determining success/failure, which would lead to errors from the remote host after a few consecutive runs:

 dovecot: imap-login: Maximum number of connections from user+IP exceeded
          (mail_max_userip_connections=10)

Oops. Anyway that bug was fixed promptly once it manifested itself, and the probe also gained the ability to validate SMTP authentication as a result of a customer request.
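
The fix is basically "always log out". Here's a minimal sketch of the idea - not the real probe code - assuming the emersion/go-imap client library:

  package main

  import (
      "fmt"
      "log"

      "github.com/emersion/go-imap/client"
  )

  // probeIMAPS connects to an IMAPS server, optionally authenticates,
  // and - crucially - logs out again so the server-side session is freed.
  // Without the logout, connections pile up until dovecot's
  // mail_max_userip_connections limit is hit.
  func probeIMAPS(addr, user, pass string) error {
      c, err := client.DialTLS(addr, nil)
      if err != nil {
          return fmt.Errorf("connect failed: %w", err)
      }
      defer c.Logout()

      if user != "" {
          if err := c.Login(user, pass); err != nil {
              return fmt.Errorf("login failed: %w", err)
          }
      }
      return nil
  }

  func main() {
      // Host and credentials here are placeholders.
      if err := probeIMAPS("mail.example.com:993", "test", "secret"); err != nil {
          log.Fatal(err)
      }
      fmt.Println("imaps OK")
  }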

Otherwise I think things have been mixed recently:

  • I updated the webserver of Charlie Stross
  • Did more geekery with hardware.
  • Had a fun time in a sauna, on a boat.
  • Reported yet another security issue in an online PDF generator/converter
    • If you read a remote URL and convert the contents to PDF then be damn sure you don't let people submit file:///etc/passwd (a minimal check is sketched after this list).
    • I've talked about this previously.
  • Made plaited bread for the first time.
    • It didn't suck.
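
To make the PDF point concrete, here's a minimal sketch in Go of the kind of check I mean; the function name and the allow-list are my own illustration, not code from any particular converter:

  package main

  import (
      "fmt"
      "net/url"
  )

  // safeToFetch rejects URLs whose scheme would let a caller read
  // local files (file://), or use other surprising protocols.
  func safeToFetch(raw string) error {
      u, err := url.Parse(raw)
      if err != nil {
          return fmt.Errorf("invalid URL: %w", err)
      }
      switch u.Scheme {
      case "http", "https":
          return nil
      default:
          return fmt.Errorf("refusing to fetch %q URL", u.Scheme)
      }
  }

  func main() {
      for _, u := range []string{"https://example.com/", "file:///etc/passwd"} {
          fmt.Println(u, "->", safeToFetch(u))
      }
  }

A real converter also has to worry about redirects and internal addresses, but the scheme check alone stops the file:///etc/passwd trick.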

(Hosted monitoring is interesting; many people will give you ping/HTTP-fetch monitoring. If you want to remotely test your email service? Far far far fewer options. I guess firewalls get involved if you're testing self-hosted services, rather than cloud-based stuff. But still an interesting niche. Feel free to tell me your budget ;)


 

Offsite-monitoring, from my desktop.

20 October 2020 13:00

For the past few years I've had a bunch of virtual machines hosting websites, services, and servers. Of course I want them to be available - especially since I charge people money to access some of them (for example my dns-hosting service) - and that means I want to know when they're not.

The way I've gone about this is to have a bunch of machines running stuff, and then dedicate an entirely separate machine solely for monitoring and alerting. Sure you can run local monitoring, testing that services are available, the root-disk isn't full, and that kind of thing. But only by testing externally can you see if the machine is actually available to end-users, customers, or friends.

A local-agent might decide "I'm fine", but if the hosting-company goes dark due to a fibre cut you're screwed.

I've been hosting my services with Hetzner (cloud) recently, and their service is generally pretty good. Unfortunately I've started to see an increasing number of false alarms. I'd have a server in Germany, with the monitoring machine in Helsinki (coincidentally where I live!). For the past month I've been pinged with a failure every three or four days on average: "service down - dns failed", or "service down - timeout". When the notice woke me up I'd go check, and it would be fine - a very transient failure.

To be honest the reason for this is that my monitoring is just too damn aggressive; I like to be alerted immediately in case something is wrong. That means if a single test fails I get an alert, rather than only alerting after something more reasonable like three or more consecutive failures.

I'm experimenting with monitoring in a less aggressive fashion, from my home desktop. Since my monitoring tool is a single self-contained golang binary, and it is already packaged as a docker-based container, deployment was trivial. I did a little work writing an agent to receive failure-notices and ping me via telegram - instead of the previous approach where I had an online status-page which I could view via my mobile, and alerts via pushover.
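
The agent itself is only a sketch-worth of code. Something like the following - where the listen port, the endpoint path, and the token/chat-id environment variables are all placeholders of my own - since the Telegram Bot API call is just an HTTP POST to sendMessage:

  package main

  import (
      "fmt"
      "io"
      "log"
      "net/http"
      "net/url"
      "os"
  )

  // notify forwards a failure-message to Telegram via the Bot API.
  // TELEGRAM_TOKEN and TELEGRAM_CHAT are assumed environment variables.
  func notify(msg string) error {
      api := fmt.Sprintf("https://api.telegram.org/bot%s/sendMessage",
          os.Getenv("TELEGRAM_TOKEN"))
      resp, err := http.PostForm(api, url.Values{
          "chat_id": {os.Getenv("TELEGRAM_CHAT")},
          "text":    {msg},
      })
      if err != nil {
          return err
      }
      defer resp.Body.Close()
      if resp.StatusCode != http.StatusOK {
          return fmt.Errorf("telegram returned %s", resp.Status)
      }
      return nil
  }

  func main() {
      // The monitoring tool POSTs failure-notices to this endpoint.
      http.HandleFunc("/alert", func(w http.ResponseWriter, r *http.Request) {
          body, _ := io.ReadAll(r.Body)
          if err := notify(string(body)); err != nil {
              log.Printf("notify failed: %v", err)
              http.Error(w, err.Error(), http.StatusBadGateway)
              return
          }
          w.WriteHeader(http.StatusOK)
      })
      log.Fatal(http.ListenAndServe(":8080", nil))
  }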

So far it looks good. I've tweaked the monitoring to use a timeout of 15 seconds, instead of 5, and I've configured it to only alert me if there is an outage which lasts for >= 2 consecutive failures. I guess the TLDR is I now do offsite monitoring .. from my house, rather than from a different region.
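
The consecutive-failure rule is simple enough to show. This isn't my tool's real configuration or code, just an illustration of the idea:

  package main

  import "fmt"

  // alerter only fires once a test has failed `threshold` times in a row,
  // which filters out one-off transient blips.
  type alerter struct {
      threshold int
      failures  map[string]int
  }

  func (a *alerter) record(test string, ok bool) {
      if ok {
          a.failures[test] = 0
          return
      }
      a.failures[test]++
      if a.failures[test] == a.threshold {
          fmt.Printf("ALERT: %s failed %d times in a row\n", test, a.threshold)
      }
  }

  func main() {
      a := &alerter{threshold: 2, failures: map[string]int{}}
      a.record("imaps", false) // one transient blip: no alert yet
      a.record("imaps", true)  // recovered: counter resets
      a.record("smtp", false)
      a.record("smtp", false)  // second consecutive failure: alert fires
  }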

The only real reason to write this post was to say that the process of writing a trivial "notify me" gateway to interface with telegram was nice and straightforward, and to remind myself that transient failures are way more common than we expect.

I'll leave things alone for a moment, but it was a fun experiment. I'll keep the two systems in parallel for a while, but I guess I can already predict the outcome:

  • The desktop monitoring will report transient outages now and again, because home broadband isn't 100% available.
  • The Hetzner-based monitoring, in a different region, will report transient problems, because even hosting companies are not 100% available.
    • Especially at the cheap prices I'm paying.
  • The way to avoid being woken up by transient outages/errors is to be less aggressive.
    • I think my paying users will be OK if I find out a service is offline after 5 minutes, rather than after 30 seconds.
    • If they're not we'll have to talk about budgets ..
