May 17, 2017

ApacheCon / Apache BigData - Day 1

The Apache Software Foundation event management team is really excellent at choosing venues for their conferences. After Vancouver, BC last year, this year's ApacheCon and Apache BigData take place in beautiful Miami, FL. What follows is my conference coverage of day 1. See day 2 coverage here.

Notebooks for data analysis are very much en vogue. Apache Zeppelin and Jupyter are the superheroes in that area. PixieDust is a nice extension to Jupyter providing easy-to-use data visualization primitives. Helium is a new plugin system and package repository for Zeppelin providing various ready-to-use Zeppelin extensions (visualizations, interpreters, spells).

It's no surprise in itself, but the promotion of Apache CloudStack as an open source IaaS platform and competitor to OpenStack is surprisingly intensive. I thought this war was over and OpenStack was the clear winner - but Apache doesn't want to capitulate.

Flink and Spark ... and Beam
Flink seems to be on a par with Spark. Each time Spark is mentioned, Flink is mentioned too. Apache Beam is also very well covered at the conference; it provides an abstraction layer on top of both. But when it comes to Apache Beam, I'm very suspicious of abstraction frameworks on top of abstraction frameworks. Beam is also an abstraction for Google Cloud Dataflow, so it may also exist to give Google a "no vendor lock-in" argument. By the way, Google is one of the top contributing companies to Beam.
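To illustrate what "an abstraction layer on top of both" means, here is a minimal toy sketch in Python. It is not Beam's API and not how Beam is implemented. The names `batch_runner`, `streaming_runner` and `Pipeline` are made up for illustration. The only point is that the processing logic is defined once and then handed to interchangeable execution engines.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical model: a pipeline is an ordered list of element-wise functions.
Pipeline = List[Callable[[int], int]]


def batch_runner(pipeline: Pipeline, data: Iterable[int]) -> List[int]:
    """Apply each step to the whole collection before moving to the next step."""
    items = list(data)
    for step in pipeline:
        items = [step(x) for x in items]
    return items


def streaming_runner(pipeline: Pipeline, data: Iterable[int]) -> Iterator[int]:
    """Push one element at a time through all steps."""
    for x in data:
        for step in pipeline:
            x = step(x)
        yield x


# The pipeline is written once ...
pipeline: Pipeline = [lambda x: x + 1, lambda x: x * 2]

# ... and executed by two different "engines" with the same result.
batch_result = batch_runner(pipeline, [1, 2, 3])
stream_result = list(streaming_runner(pipeline, [1, 2, 3]))
```

The price of such a layer is that the pipeline can only use what every runner supports, which is one reason for my suspicion about abstractions on top of abstractions.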

There are two new players around in the field of messaging systems. In the range between Kafka and classical messaging systems like ActiveMQ and RabbitMQ, RocketMQ sits right in the middle. RocketMQ is an open source contribution of Alibaba - one of the largest web-scale companies on earth. You can find a nice comparison chart of RocketMQ with Kafka and ActiveMQ here. RocketMQ provides more guarantees than Kafka, like strict ordering, but at a price: it's based on a master/slave architecture, so it's not as scalable as Kafka. But compared with ActiveMQ and RabbitMQ it has a significantly higher throughput because it leverages the pull/distributed log principle of Kafka. As RocketMQ also provides a JMS interface, it could hit a real sweet spot between Kafka and ActiveMQ/RabbitMQ. Apache DistributedLog is not a full-fledged messaging solution but a building block for one. It provides a distributed log implementation; Kafka, for example, is also based on a distributed log. Allegro open-sourced Hermes, a message broker on top of Kafka that extends Kafka with REST publisher/consumer interfaces, message tracing and monitoring, and guaranteed message delivery at a sub-millisecond overhead on top of Kafka.
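To make the pull/distributed log principle a bit more concrete, here is a minimal single-process sketch in Python. It is not how Kafka, RocketMQ or DistributedLog are implemented (there is no distribution, replication or persistence here), and the names `Log` and `Consumer` are made up for illustration. It only shows the core idea: records are appended to an immutable log, and each consumer pulls at its own pace by tracking its own offset.

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    """Append-only log: records are only ever appended, never modified."""
    records: list = field(default_factory=list)

    def append(self, record):
        """Append a record and return its offset (its position in the log)."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self.records[offset:offset + max_records]


@dataclass
class Consumer:
    """Pull-based consumer that keeps track of its own read position."""
    log: Log
    offset: int = 0

    def poll(self, max_records=10):
        """Pull the next batch and advance the offset past it."""
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = Log()
for event in ["order-created", "order-paid", "order-shipped"]:
    log.append(event)

# Two independent consumers read the same log at different speeds.
fast = Consumer(log)
slow = Consumer(log)

first_batch = fast.poll()               # pulls all three records
slow_batch = slow.poll(max_records=1)   # pulls only the first record
```

Because the broker does not have to track per-message delivery state for every consumer, reads boil down to sequential scans from an offset, which is a large part of why this design achieves high throughput.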

Hardware Diversification
Spark and others are being prepared to support diverse hardware like GPUs, TPUs and non-volatile / durable RAM. There was also a talk on the QAware research project "how to leverage the GPU on Spark". In addition, there is a native library from Intel (Math Kernel Library) which is claimed to speed up ML use cases on Spark by 9x at no additional cost.

Dataservices are a new way to process data and an alternative to Spark and Flink if you want to implement and run data processing applications on top of a microservice platform. I gave a talk on how to implement dataservices with Spring Cloud Data Flow.
Others proposed using a serverless framework like OpenWhisk to implement dataservices.
