Feb 3, 2015

Apache Solr as a compressed, scalable, and high-performance time series database for time-correlated data objects: how do you store that amount of data on your laptop computer and retrieve any point within a few milliseconds? We answered that question at the FOSDEM 2015 conference in Brussels.

A relational database management system (RDBMS) like Oracle, MySQL, or Microsoft SQL Server with a normalized data schema does not work well on 68 billion data objects in a time series. For us it has some unacceptable drawbacks:
  • long import duration,
  • slow querying and retrieval of data objects,
  • a high amount of hard drive space, and
  • limited scalability.
There are open source time series databases available, including InfluxDB, OpenTSDB, RRDtool, SciDB, and many more, but none of them fully complies with our major requirements:
  • fast imports and queries,
  • storing arbitrary metadata on time series as well as on data objects,
  • minimal hard drive space, and
  • everything should run on a laptop computer without performance drawbacks.
We decided to create our own solution instead of using one that meets only 50 percent of our requirements. We realized how easy it is to build a perfectly matching solution when choosing the right tools.

As we checked the features of Apache Solr against our requirements, we found many positive matches. That is why we chose Solr as the underlying storage component for our solution. Solr is a perfect tool for storing time series if you follow three rules:
  1. Split one time series into multiple chunks of data objects.
  2. Compress these chunks to reduce the storage footprint.
  3. Store each compressed chunk together with its metadata in one document.
For example: if we want to store a time series with 68 billion data objects, we can split it into 1 million documents, each containing 68,000 compressed data objects.
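The chunking and compression steps above can be sketched in plain Java. This is a minimal sketch using GZIP on a simple binary serialization; the class name, on-disk format, and compression codec are assumptions for illustration, not our actual implementation:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

public class ChunkDemo {

    // Serialize one chunk of (timestamp, value) pairs and gzip it. The result
    // would be stored as a binary field of a single Solr document.
    static byte[] compressChunk(long[] timestamps, double[] values) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bos))) {
                for (int i = 0; i < timestamps.length; i++) {
                    out.writeLong(timestamps[i]);
                    out.writeDouble(values[i]);
                }
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int chunkSize = 68_000;                // data objects per Solr document
        long totalPoints = 68_000_000_000L;    // 68 billion points overall
        long documents = totalPoints / chunkSize;
        System.out.println("documents needed: " + documents); // 1000000

        // Compress one chunk of regularly sampled monitoring data.
        long[] ts = new long[chunkSize];
        double[] vals = new double[chunkSize];
        for (int i = 0; i < chunkSize; i++) {
            ts[i] = 1_423_000_000_000L + i * 1000L; // one point per second
            vals[i] = 42.0 + (i % 10);              // repetitive metric values
        }
        byte[] compressed = compressChunk(ts, vals);
        int rawBytes = chunkSize * (Long.BYTES + Double.BYTES);
        System.out.println("raw bytes: " + rawBytes + ", compressed: " + compressed.length);
    }
}
```

Monitoring data is highly repetitive, so even a generic codec like GZIP shrinks a chunk considerably; specialized delta encodings would do even better.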

Here's a high-level overview of how the solution works, in three steps:
The storage of a continuous time series in a document-oriented store like Apache Solr
When dealing with such an amount of data you should remember one important concept: predicate push down, or in other words: send the query to the data. We implemented a custom ValueSourceParser that performs server-side decompression and aggregation to enable predicate push down. In the image above, the decompression and aggregation are now done in the storage by the ValueSourceParser.

A few words about performance:
Our solution allows us to store 68 billion data objects in 40 GB of storage and access any arbitrary data object within 20 to 30 milliseconds. Dynamically computing the maximum over 68 million data objects took 8 seconds.

We are planning to open source our solution for everyone:
Work is in progress and we are working hard to open source the very first version. To get a first impression, see the figure, which shows the query API.
An example of the query API in Java.
If you are interested, stay tuned.
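Since the query API is only shown as a figure here, a purely hypothetical sketch may help convey the flavor. Every name below (the class, the fields metric/start/end, the fluent methods) is an assumption for illustration; the real API may look quite different:

```java
// Hypothetical fluent builder that produces a Solr/Lucene query string
// selecting every document whose chunk overlaps a requested time range.
public class TimeSeriesQuery {

    private String metric;
    private long start;
    private long end = Long.MAX_VALUE;

    public TimeSeriesQuery metric(String metric) {
        this.metric = metric;
        return this;
    }

    public TimeSeriesQuery between(long start, long end) {
        this.start = start;
        this.end = end;
        return this;
    }

    // A chunk overlaps the range if it starts before the range ends
    // and ends after the range starts.
    public String toSolrQuery() {
        return "metric:" + metric
             + " AND start:[* TO " + end + "]"
             + " AND end:[" + start + " TO *]";
    }

    public static void main(String[] args) {
        String q = new TimeSeriesQuery()
                .metric("cpu.load")
                .between(1_420_070_400_000L, 1_422_748_800_000L)
                .toSolrQuery();
        System.out.println(q);
    }
}
```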

For more details and the presentation at FOSDEM 2015 itself, check out the slides on Speaker Deck.

A question from the audience was: how do you handle single data objects?
Our solution is primarily designed to store time series containing monitoring data that our tool (the EKG-Collector) has previously collected. It is easy to extend Solr with a RequestHandler that collects and compresses all single data objects once a day and stores them in one document.
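The batching logic such a handler could use is straightforward to sketch. This is a plain-Java sketch of the idea only (the class name and the day-boundary policy are assumptions); a real Solr RequestHandler would additionally implement Solr's plugin interfaces and write the flushed chunk into a document:

```java
import java.util.ArrayList;
import java.util.List;

// Buffers single data objects and hands back a finished batch once a day
// boundary is crossed; the batch would then be compressed into one document.
public class DailyBatcher {

    private static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

    // Each entry: {timestamp, Double.doubleToLongBits(value)}
    private final List<long[]> buffer = new ArrayList<>();
    private long currentDay = -1;

    /** Adds a point; returns the previous day's batch when a day ends, else null. */
    public List<long[]> add(long timestamp, double value) {
        long day = timestamp / DAY_MILLIS;
        List<long[]> finished = null;
        if (currentDay != -1 && day != currentDay) {
            finished = new ArrayList<>(buffer); // ready to compress and store
            buffer.clear();
        }
        currentDay = day;
        buffer.add(new long[]{timestamp, Double.doubleToLongBits(value)});
        return finished;
    }
}
```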
