Thursday, June 18, 2009

google app engine monitoring service lessons learned

As mentioned earlier, we needed some monitoring for our co-loc. After some experimentation, it turned out to be easiest to build the monitoring inside the co-loc almost entirely in nagios. For the external monitoring component, I deployed a web proxy called mirrorrr onto google app engine, which gives us basic checks from outside the co-loc for dns and other potential network issues. I still need to follow up with a single monitor of the co-loc itself from google app engine, but our existing monitor from siteuptime can suffice for now.

Nagios appears to have a couple of annoying quirks regarding DNS though. It seems rather insistent on querying IP addresses for low-level services, rather than doing more system-level monitoring.
  • check_http appears to resolve the hostname into an IP address before checking it, unless the -H option is used. This breaks google app engine and anything else that uses a virtual-host-like mechanism to decide which page to serve.
  • check_smtp does not seem to do an MX lookup for a host. Instead, it resolves to our web server and tries to open the web server for SMTP.
  • mirrorrr has a 1 hour cache by default. It should be minimized or disabled when used for monitoring.
  • TBD: The local mirrorrr install should probably get an IP range filter added to it so that it is more difficult to DoS.
  • nagios isn't too happy about passing messages around machines. My main options appear to involve choosing the least of three evils:
    • adding a private key to the nagios machine and sshing everywhere
    • installing NRPE as a daemon on every machine and querying them (it's not that lightweight and most of the servers need more memory badly)
    • calling send_nsca from some cron shell scripts to a relatively insecure mechanism on the nagios machine. I opted for send_nsca, but the port is only visible to the intranet, and there are a couple of machines in a dmz that can't reach it easily.
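
For anyone going the same route as the last option, here is a rough sketch of what one of those cron jobs has to hand to send_nsca. This is not my actual script. The host name, service name and config path below are made up for illustration, and I'm assuming send_nsca's default tab-delimited input (host, service description, return code, plugin output, one result per line).

```python
import subprocess

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def format_passive_result(host, service, code, output):
    """Build one passive service check result line for send_nsca.

    Assumes the default delimiter (tab):
    host<TAB>service<TAB>return code<TAB>plugin output<NEWLINE>
    """
    if code not in (OK, WARNING, CRITICAL, UNKNOWN):
        raise ValueError("return code must be 0, 1, 2 or 3")
    # Tabs and newlines separate fields and results, so collapse any
    # whitespace inside the plugin output into single spaces.
    clean_output = " ".join(str(output).split())
    return f"{host}\t{service}\t{code}\t{clean_output}\n"


def submit(line, nagios_host, config="/etc/send_nsca.cfg"):
    """Pipe one result line to send_nsca on stdin.

    The config path is an assumption; it varies by distribution.
    """
    subprocess.run(
        ["send_nsca", "-H", nagios_host, "-c", config],
        input=line,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical host and service names; they have to match the
    # passive service definitions on the nagios machine.
    print(format_passive_result("web1", "disk_root", WARNING, "82% used"), end="")
```

Called from cron, the script would run its local check, build the line and call submit() with the address of the nagios machine. The machines in the dmz still need a route to the NSCA port for that last step to work.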