Flexible Log Monitoring with Scribe, Esper, and Nagios

January 05, 2010

If you run a decent-sized cluster, there’s a good chance you’ve had the following experience: one day, while routinely browsing server logs, you stumble upon some concerning entries that you wish you had known about sooner. You could probably go back and write some custom scripts that… Read more

Baking availability SLA into the code

December 09, 2009

Availability and Partition Tolerance are essential for many distributed systems. A simple (though not comprehensive) way to measure both is to use response-time SLAs between services, as implied by Jeff Darcy’s observation: Lynch (referring to the 2002 SIGACT paper) also makes the point that unbounded delay is indistinguishable from failure. Time is therefore an essential… Read more