At Improve Digital we capture significant data volumes that we have traditionally processed with batch-based tools on Hadoop. But as demands grew for lower latency between data generation and availability, we began moving towards processing data in near real time. We have adopted Apache Kafka as our central system for data collection and distribution, and on top of it we use Apache Samza to implement stream-based processing.
This presentation will introduce Kafka and Samza in the context of our adoption of these technologies. We will discuss the architectural challenges and opportunities of incorporating Kafka and Samza, and how we integrate them with our broader infrastructure. We will also look at some of our specific use cases (particularly streaming aggregation, alerting, and machine learning) and describe some of the algorithmic approaches we have adopted.
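As a flavour of the streaming-aggregation approach, the sketch below shows a framework-free tumbling-window counter in Java. It is illustrative only: in a real Samza job this state would live in the task's local store and be flushed on window boundaries, and all class and key names here are hypothetical, not Improve Digital's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative tumbling-window count aggregation: events are bucketed by
// window start time, and windows older than a watermark are emitted and evicted.
public class WindowedCounter {
    private final long windowMillis;
    // window start timestamp -> (key -> count)
    private final TreeMap<Long, Map<String, Long>> windows = new TreeMap<>();

    public WindowedCounter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Assign each event to its tumbling window and increment the key's count.
    public void process(String key, long eventTimeMillis) {
        long windowStart = eventTimeMillis - (eventTimeMillis % windowMillis);
        windows.computeIfAbsent(windowStart, w -> new HashMap<>())
               .merge(key, 1L, Long::sum);
    }

    // Emit and evict all windows that closed strictly before the watermark.
    public Map<Long, Map<String, Long>> flushBefore(long watermarkMillis) {
        Map<Long, Map<String, Long>> closed = new TreeMap<>(windows.headMap(watermarkMillis));
        closed.keySet().forEach(windows::remove);
        return closed;
    }

    public static void main(String[] args) {
        WindowedCounter counter = new WindowedCounter(60_000); // 1-minute windows
        counter.process("ad-42", 5_000);
        counter.process("ad-42", 30_000);
        counter.process("ad-7", 65_000);
        // A watermark at t=60s closes only the first window.
        System.out.println(counter.flushBefore(60_000));
    }
}
```

Keeping per-window state small and evicting eagerly on watermark advance is what makes this pattern practical at high event rates.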