Apache Hadoop provides a useful implementation of the MapReduce paradigm for ETL (Extract, Transform, Load) processes. Apache Hive enhances Hadoop's capabilities by allowing users to interact with Hadoop through a structured query language. Even with Hadoop and Hive, building some solutions in a batch-driven system can be cumbersome, particularly under tight, time-sensitive latency constraints. Stream processing can be used in tandem with, or as an alternative to, batch processing. At web scale, stream processing brings its own set of challenges, including new software stacks, application programming interfaces, failure modes, and software life-cycle management.
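To make the batch-versus-stream contrast concrete, here is a minimal, self-contained sketch (plain Python standing in for Hadoop and Storm; the event data, `batch_counts`, and `ClickCounter` are hypothetical illustrations, not APIs from any of the projects mentioned): the batch function computes an answer only once the full dataset is available, while the stream version keeps running state and can serve an answer after every event.

```python
from collections import Counter

# Toy event log: (user, action) pairs. Purely illustrative data.
events = [("alice", "click"), ("bob", "view"),
          ("alice", "click"), ("bob", "click")]

def batch_counts(events):
    """Batch style: one pass over the complete dataset (the Hadoop/Hive model)."""
    counts = Counter()
    for user, action in events:
        if action == "click":
            counts[user] += 1
    return dict(counts)

class ClickCounter:
    """Stream style: update running state one event at a time (the Storm model),
    so an answer is available with low latency, before the batch is complete."""
    def __init__(self):
        self.counts = Counter()

    def on_event(self, user, action):
        if action == "click":
            self.counts[user] += 1
        return dict(self.counts)  # current answer after each event

stream = ClickCounter()
for user, action in events:
    snapshot = stream.on_event(user, action)

# Same final answer, different latency profile.
assert batch_counts(events) == snapshot
```

The trade-off the sketch highlights is exactly the one the talk addresses: the batch path is simple but answers arrive only after the full dataset is processed, while the streaming path must manage long-lived state in exchange for per-event latency.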
In this talk, Edward Capriolo, author of books on Hive and Cassandra and a Hive committer and PMC member, will discuss common big-data software challenges and how they can be solved using both batch and stream processing. The technology focus will be primarily on Apache Kafka for publish-subscribe messaging, Storm for stream processing, and Apache Cassandra as a NoSQL data store.