Oct 29, 2013

Log Collection and Analysis with logstash (and Redis, ElasticSearch & Kibana)

In this post I want to show you how to set up a decent and complete infrastructure for centralized log management based on logstash - demonstrated with Apache Tomcat logs on Windows. This post is adapted from a Strata Conference 2013 tutorial by Israel Ekpo and the official logstash getting started guide.

Logstash is a lightweight and simple-to-use toolchain to collect, transform and analyze log data. To do so logstash requires a buffer to collect log events (Redis) and a fulltext search engine as the final destination for the collected events (ElasticSearch). Assembled together, it looks like this:

logstash architecture & data flow

A shipper detects new log events on its node and dispatches them to a central broker. The indexer polls the broker for new events and pushes them to the storage & search facility. The web interface is then able to query the storage & search facility in many ways. So how do we set up and wire all this together? These are the four necessary steps:

1) Basic setup

Sample application to produce logs
If you don't have your own Tomcat webapp to produce sample logs you can use this one: https://github.com/israelekpo/graph-service. Just check out the code, re-configure the path to store the data, build it with maven and deploy the WAR to a Tomcat installation (the path to the installation is referred to as <catalina-home>). Then start up Tomcat and perform some requests. Further details are provided on the mentioned GitHub page. If you don't want to use cURL you can also use the Chrome Advanced REST Client.
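To produce a few log entries you can simply fire some requests against the deployed webapp, e.g. with cURL. This is just a hypothetical call - host, port and path depend on your Tomcat setup and on the endpoints documented on the GitHub page:

  curl -i "http://localhost:8080/graph-service/"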

logstash
Download the jar (logstash-<version>-flatjar.jar) from here and place it in any folder (referred to as <logstash-home>).

ElasticSearch
Download the archive from here and uncompress it into a folder (referred to as <elasticsearch-home>).

Redis
Download the archive from here and uncompress it into a folder (referred to as <redis-home>).

It's that easy.

2) The shipper
Now we have to set up the agent which collects log data on a node and puts it into Redis. All you need to do is write a configuration file (e.g. <somewhere>/logstash-agent.conf, referred to as <agent-config-filepath>) with the following content:

input {
  file {
    type => "CATALINA"
    path => "<catalina-home>/logs/catalina.log"
  }
  file {
    type => "ACCESS"
    path => "<catalina-home>/logs/localhost_access_log*.txt"
  }
}

output {
  stdout { codec => json }
  redis {
    # host of the central Redis broker, e.g. an IP or hostname
    host => ""
    data_type => "list"
    key => "logstash"
  }
}

This configuration file makes logstash collect log entries from the Tomcat engine log and from all Tomcat access logs (matched by a glob). By default logstash performs a tail on each log file and processes only new entries. If you want to initially read the whole file (and then keep tailing it) you have to add start_position => "beginning" to the file block.
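The first file block from above would then look like this:

  file {
    type => "CATALINA"
    path => "<catalina-home>/logs/catalina.log"
    start_position => "beginning"
  }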

How to test it:

  1. Startup Redis: <redis-home>\redis-server --loglevel verbose
  2. Startup a logstash agent: java -jar <logstash-home>\logstash-<version>-flatjar.jar agent -f "<agent-config-filepath>"
  3. Make the app produce some logs.

On the Redis console you should then see the logstash agent connect:

To be really sure you can open the Redis command line client (on Windows: <redis-home>\redis-cli.exe) and check whether there is appropriate data inside Redis:
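A minimal check could look like this - LLEN returns the number of queued events in the "logstash" list, LRANGE shows the first one as a JSON string:

  <redis-home>\redis-cli.exe
  LLEN logstash
  LRANGE logstash 0 0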

3) The indexer
The next step is to set up the indexer which transfers the log data from Redis into ElasticSearch. As above, the only thing you have to do is write a configuration file (e.g. <somewhere>/logstash-indexer.conf, referred to as <indexer-config-filepath>) with the following content:

input {
  redis {
    # host of the central Redis broker (the same instance the shipper writes to)
    host => ""
    data_type => "list"
    key => "logstash"
    codec => json
  }
}

output {
  stdout { codec => json }
  elasticsearch {
    # host of the ElasticSearch node
    host => ""
  }
}

How to test it:

  1. Startup ElasticSearch: <elasticsearch-home>\bin\elasticsearch.bat
  2. Fire up a logstash agent with the right configuration: java -jar <logstash-home>\logstash-<version>-flatjar.jar agent -f "<indexer-config-filepath>"

Now the whole toolchain should be working. To test whether the log data really reaches ElasticSearch you can use its REST API or, better, let the Sense Chrome extension help you. Just install and open it and run the default query. You should see your logs as indexed documents:

Sense output
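If you prefer plain cURL over Sense, a quick query against the REST API could look like this (assuming ElasticSearch runs locally on its default port 9200):

  curl "http://localhost:9200/_search?q=type:ACCESS&pretty=true"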

4) The web interface
And now it's getting really professional - we also start a web user interface to analyze and visualize the collected log entries. It's that easy:

java -jar <logstash-home>\logstash-<version>-flatjar.jar web

Just point your browser to the logstash web interface and start searching logs. The web interface is called 'kibana' - you can learn more about kibana at http://kibana.org.

Kibana user interface
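The search field accepts Lucene query syntax, so filtering for the Tomcat access log entries shipped above could look like this (the field names depend on what your events actually contain):

  type:ACCESS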

The most important links:

Oct 28, 2013

The Big Data Puzzle

The big data ecosystem is currently in its expansion stage: a lot of technologies are popping up but little consolidation is happening. It's hard to keep track of the big picture. Today at the Strata Conference 2013 I attended some talks and participated in some discussions which helped me to better fit together some pieces of the big data technology puzzle:

HBase
  • High performance writes
  • Poor performance queries
  • Ideal partners for data logistics: Flume, Storm, Samza
  • Supports data updates / deletes but no SQL
  • Best used for data streams (a flow of single-entry inserts) and to store the most recent data
HDFS
  • High performance bulk data loading
  • High performance bulk data reading
  • Efficient data storage (if an efficient format like Parquet is used)
  • Ideal partners for data logistics: Pig, Cascading, Spring Batch
  • Best used as an eternal memory for data
Hive
  • Can access both HBase and HDFS stored data
  • Supports a subset of SQL
  • Best used for big-in / big-out queries e.g. large joins, data enrichment
  • Best used for batch processing (low CPU usage)
Impala
  • Can access both HBase and HDFS stored data and share metadata with Hive. Can be used side-by-side with Hive to complement it without replicating data between them.
  • Supports a subset of SQL and is compatible with the Hive API (but no real drop-in replacement).
  • Not as mature as Hive but some success stories are present
  • Commercial MPPs like Vertica and Teradata are faster and more mature, but Impala has a tighter integration into the Hadoop ecosystem and is therefore more flexible. Most important consequence: the data does not have to be replicated into Impala like it has to be in Vertica et al. Impala can directly access HDFS/HBase data.
  • Best used for big-in / small-out queries e.g. aggregations, groupings
  • Best used for realtime queries (sec-to-min)
Choreography: Oozie vs. Azkaban
  • Oozie: More mature and flexible. Larger set of features.
  • Azkaban: Nice and usable UI. Simpler to set up and use.
A possible outlook:

Storage & access layer
  • HDFS is and will remain the dominant virtual file system for big data.
  • The vast amount of (columnar) file formats (Parquet, HFile, RCFile, ...) will be consolidated. The beauty contest has already begun.
  • HBase will be the storage layer above HDFS for row-based access and data streams.
Query layer
  • There will be one major SQL-on-HBase/HDFS open source MPP database assembling the best of Impala, Hive, Shark, ...
  • The choreography tools will be extended with intelligent cost-based scheduling capabilities.