Oct 28, 2013

The Big Data Puzzle

The big data ecosystem is currently in its expansion stage: a lot of technologies are popping up, but too little consolidation is happening. It's hard to keep track of the big picture. Today at the Strata Conference 2013 I attended some talks and took part in some discussions that helped me fit together some pieces of the big data technology puzzle:

HBase
  • High-performance writes
  • Poor query performance
  • Ideal partners for data logistics: Flume, Storm, Samza 
  • Supports data updates / deletes but no SQL
  • Best used for data streams (a flow of single-entry inserts) and to store the most recent data
HDFS
  • High-performance bulk data loading
  • High performance bulk data reading
  • Efficient data storage (if an efficient format like Parquet is used)
  • Ideal partners for data logistics: Pig, Cascading, Spring Batch
  • Best used as an eternal memory for data
Hive
  • Can access data stored in both HBase and HDFS
  • Supports a subset of SQL
  • Best used for big-in / big-out queries, e.g. large joins and data enrichment
  • Best used for batch processing (low CPU usage)
Impala
  • Can access data stored in both HBase and HDFS and shares metadata with Hive. Can be used side by side with Hive to complement it without replicating data between them.
  • Supports a subset of SQL and is compatible with the Hive API (but is not a real drop-in replacement).
  • Not as mature as Hive, but some success stories exist
  • Commercial MPP databases like Vertica and Teradata are faster and more mature, but Impala is more tightly integrated into the Hadoop ecosystem and is therefore more flexible. The most important consequence: data does not have to be replicated into Impala as it does with Vertica et al. Impala can access HDFS/HBase data directly.
  • Best used for big-in / small-out queries, e.g. aggregations and groupings
  • Best used for real-time queries (response times of seconds to minutes)
Choreography
  • Oozie: More mature and flexible. Larger set of features.
  • Azkaban: Nice and usable UI. Simpler to set up and use.
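
The terms "big-in / big-out" and "big-in / small-out" used above are easier to see with a toy example. The following is only a minimal sketch in plain Python, not Hive or Impala code; the `events` and `users` data and both function names are made up for illustration. Both functions read every input row, but the enrichment returns about as many rows as it reads, while the aggregation collapses them into a handful.

```python
from collections import defaultdict

# Hypothetical toy data: page-view events and a user-to-country lookup.
events = [
    {"user_id": 1, "page": "/home"},
    {"user_id": 2, "page": "/home"},
    {"user_id": 1, "page": "/cart"},
    {"user_id": 3, "page": "/home"},
]
users = {1: "DE", 2: "US", 3: "DE"}


def enrich(events, users):
    """Big-in / big-out: join every event with its user's country."""
    return [dict(e, country=users[e["user_id"]]) for e in events]


def views_per_country(events, users):
    """Big-in / small-out: scan every event, return one count per country."""
    counts = defaultdict(int)
    for e in events:
        counts[users[e["user_id"]]] += 1
    return dict(counts)


enriched = enrich(events, users)
summary = views_per_country(events, users)
print(len(events), "->", len(enriched), "rows (enrichment)")
print(len(events), "->", len(summary), "rows (aggregation)")
```

With real data volumes the first shape has to write a result as large as its input, which fits batch processing, while the second only has to hand back a small result, which is what makes interactive response times plausible.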
A possible outlook:

Storage & access layer
  • HDFS is and will remain the dominant virtual file system for big data.
  • The large number of (columnar) file formats (Parquet, HFile, RCFile, ...) will be consolidated. The beauty contest has already begun.
  • HBase will be the storage layer above HDFS for row-based access and data streams.
Query layer
  • There will be one major open source SQL-on-HBase/HDFS MPP database combining the best of Impala, Hive, Shark, ...
  • The choreography tools will be extended with intelligent cost-based scheduling capabilities.
