Flexible Log Monitoring with Scribe, Esper, and Nagios

January 05, 2010

If you have yourself a pretty decent sized cluster, there’s probably a good chance that you’ve had the following experience: One day, while routinely browsing some server logs, you stumble upon some concerning entries that you wish you had been made aware of sooner. You could probably go back and write some custom scripts that email out warnings, or perhaps here’s a solution that might scale better for your needs.

The idea here will be to use Scribe, Esper, and Nagios/NSCA to build a flexible way to monitor your logs for issues. First, a little background from each of their corresponding websites:

Scribe
Scribe is a server for aggregating log data streamed in real time from a large number of servers. It is designed to be scalable, extensible without client-side modification, and robust to failure of the network or any specific machine. Scribe was developed at Facebook and released as open source.

Esper
Esper is a component for CEP and ESP applications, available for Java as Esper, and for .NET as NEsper. Complex Event Processing, or CEP, is technology to process events and discover complex patterns among multiple streams of event data. ESP stands for Event Stream Processing and deals with the task of processing multiple streams of event data with the goal of identifying the meaningful events within those streams, and deriving meaningful information from them.

Nagios
Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes. Nagios monitors your entire IT infrastructure to ensure systems, applications, services, and business processes are functioning properly. In the event of a failure, Nagios can alert technical staff of the problem, allowing them to begin remediation processes before outages affect business processes, end-users, or customers.

Nagios Service Check Acceptor
NSCA allows you to integrate passive alerts and checks from remote machines and applications with Nagios. Useful for processing security alerts, as well as deploying redundant and distributed Nagios setups.

Now that we know a little bit about these technologies, here’s the game plan. Use Scribe to aggregate all of our logs of interest to a central location, process all of the log entries with Esper searching for dubious messages, and then forward on the alert to our monitoring infrastructure in Nagios.

Prerequisites
In order to save a little space in this post and to reuse the existing documentation around the net, we’re going to leave setting up Scribe and Nagios as an exercise to the reader. In the Scribe source tree on github, you’ll find a good tutorial for an example scribe configuration (see examples/README in the Scribe source). Nagios and NSCA will have packages available for most distributions, and there is plenty of documentation floating around for configuration and help if you should run into trouble. If you don’t feel like going through the hassle, or if you have some other infrastructure you’d prefer to use, just skip this part and I’ll outline some quick alternatives along the way.

Log Aggregation
The key first step to the process is collecting all our data to process into a single location. Admittedly, getting scribe installed can be a little tough, especially with the dependencies, but once you’ve set it up, there’s a return on investment. If you’ve set up scribe in a configuration like their 2nd example which sets up a scribe client and a central scribe server, a nice way to get your logs forwarded to the central scribe server is to use a nice python script called scribe_log by Silas Sewell. Another good suggestion for keeping scribe_log running smoothly is to use Supervisor, which is outlined on Silas’ blog post.

Log Processing
The next step is to process the aggregated logs looking for anomalies. For this, we’ll write some java code using the Esper libraries. We’ll create a class called SimpleEsperEngine and fill it with some configuration. Don’t worry about copy and pasting, at the end of the post there’s a link to the source on github.

private EPServiceProvider epService;
private EPStatement statement;

public SimpleEsperEngine(String esperStatement) {
Configuration config = new Configuration();
config.addEventType(LogLine.class);

this.epService = EPServiceProviderManager.getDefaultProvider(config); 
this.statement = epService.getEPAdministrator().createEPL(esperStatement);
}

public void addUpdateListener(UpdateListener listener) {
statement.addListener(listener);
}

private void processLine(String line) {
LogLine event = new LogLine(line);
epService.getEPRuntime().sendEvent(event);
}

This code sets up the Esper engine with a standard configuration and registers the LogLine event which we’ll define in the LogLine class later. Next we’ll define a simple notification that just prints out our status code and message, which will later correspond with the passive alert we’ll send Nagios/NSCA.

public void notifyCommand(NagiosStatus status, String message) {
System.out.println(String.format("Status: %s, Message: %s", status, message));
}

And here’s the LogLine class:

class LogLine {

private String line;

public LogLine(String line) {
this.line = line;
}

public String getLine() {
return line;
}

}

Finally, our main that pulls it all together. It’s likely that you’ll probably want to pass in your Esper statement or keyword in as parameters to the program, but I’ll leave that up for you to set up to your preferences. The key is the esperStatement, which says to count the number of times a LogLine event contains our keyword inside a 60-second sliding window. To be honest, the things you can describe in Esper can be quite complex, so this particular example is really only scratching the surface of what you can put together. (Also as a note, the 2 passed into the notifyCommand() method maps to the Critical status code for nagios, where 2=Critical, 1=Warning, 0=Okay)

public static void main(String[] args) {

final String keyword = "NullPointerException";
final String esperStatement = 
"select line, count(*) as cnt from LogLine.win:time(60 second) where line like '%"+keyword+"%'";
final String messageFormat = "CRITICAL: Found %s line(s) containing '%s'";

final SimpleEsperEngine test = new SimpleEsperEngine(esperStatement);
test.addUpdateListener(new UpdateListener() {
public void update(EventBean[] newEvents, EventBean[] oldEvents) {
EventBean event = newEvents[0];
long count = (Long) event.get("cnt");
if (count >= 1) {
try {
test.notifyCommand(NagiosStatus.Critical, 
String.format(messageFormat, event.get("cnt"), keyword));
} catch(Exception e) {
e.printStackTrace();
}
}
}
});

BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
try {
while (true) {
String line = in.readLine();
test.processLine(line);
}
} catch (IOException e) {
e.printStackTrace();
}
}

It’s also helpful to set up a little one-liner script like this one (don’t worry, it’s in the source tree as run-demo.sh)

java -cp classes:lib/antlr-runtime-3.1.1.jar:lib/cglib-nodep-2.2.jar:lib/commons-logging-1.1.1.jar:lib/esper-3.2.0.jar:lib/jsendnsca-core-1.2.jar:log4j-1.2.15.jar SimpleEsperEngine

As is, we can run this from the command line, and type input to System.in to test it out. In this case, whenever we type ‘NullPointerException’ on a line, we’ll see the CRITICAL message.

sh run-demo.sh

Adding Nagios to the Mix
Finally, to incorporate Nagios/NSCA into the mix, we’ll use the JSendNSCA library and modify our notifyCommand.

public void notifyCommand(NagiosStatus status, String message) {
NagiosSettings nagiosSettings = new NagiosSettings();
nagiosSettings.setNagiosHost("localhost");
nagiosSettings.setPort(5667);
nagiosSettings.setEncryptionMethod(NagiosSettings.NO_ENCRYPTION);

NagiosPassiveCheckSender sender = new NagiosPassiveCheckSender(
nagiosSettings);

MessagePayload payload = new MessagePayload();
payload.setHostname("localhost");
payload.setLevel(status.getStatusCode());

payload.setServiceName("TestMessage");
payload.setMessage(message);

try {
sender.send(payload);
} catch (NagiosException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}  
}

There’s plenty of Nagios configuration that needs to be done, but here’s the service I’ve defined that matches the alert that is sent from the program.

define service {
service_description     TestMessage
host_name               localhost
retry_check_interval    5
check_command           check_dummy!0
stalking_options        w,c,u
passive_checks_enabled  1
use generic-service
}

Once you’ve got it all configured you can put all the pieces together.
With scribe, just run:

tail -f /path/to/scribed/test/test_current | sh run-demo.sh &

and the send the exception through scribe with their included scribe_cat tool:

echo NullPointerException | scribe_cat test

Or, if you didn’t set up scribe, but want to tail your own log:

touch test.log
tail -f test.log | sh run-demo.sh &
echo NullPointerException >> test.log

So, that’s about the essence of it. Unfortunately, the topic of setting up Scribe and Nagios/NSCA were too long for this blog post, but at least they are fairly well documented out there. I’m sure you’ll find that there are countless uses for these three technologies, and that this has helped serve as an introduction to what might be possible. Happy coding. Thanks to the dev teams that put these things together, and another thanks for Silas for some handy utils.
– John H.

References
Source Code on GitHub: http://github.com/hitch17/esper-nsca-demo

Scribe: http://developers.facebook.com/scribe/
Esper: http://esper.codehaus.org/about/esper/esper.html
Nagios: http://www.nagios.org/
Nagios Service Check Acceptor: http://www.nagios.org/download/addons/
Supervisor: http://supervisord.org/
jsendnsca: http://code.google.com/p/jsendnsca/
scribe_log: http://www.silassewell.com/blog/tag/scribe_log/