Tagged In monitoring :
Flexible Log Monitoring with Scribe, Esper, and Nagios
If you have yourself a pretty decent sized cluster, there’s probably a good chance that you’ve had the following experience: One day, while routinely browsing some server logs, you stumble upon some concerning entries that you wish you had been made aware of sooner. You could probably go back and write some custom scripts that… Read more
Baking availability SLA into the code
Availability and Partition Tolerance are essential for many distributed systems. A simple (though not comprehensive) way to measure both is using response time SLAs between services as implied from Jeff Darcy‘s observation: Lynch (referring to the 2002 SIGACT paper) also makes the point that unbounded delay is indistinguishable from failure. Time is therefore an essential… Read more