Talk "Test strategies for data processing pipelines" by Lars Albertsson at the High Load Strategy conference 2016 in Vilnius, Lithuania.
A good automated testing strategy is crucial for high productivity in product development and for quickly launching new features with continuous deployment. Although backend software development shows high awareness of the need for good test structure and discipline, testing is often added as an afterthought in data processing environments, resulting in slow code-test-debug cycles and long delays in getting data-driven features out the door. The time from deciding to collect a new type of data, through adapting the data collection and data processing pipelines involved, to creating a feature based on that data is often measured in weeks or months.
This talk will present recommended patterns and corresponding anti-patterns for testing data processing pipelines. We will suggest technology and architecture to improve testability for both batch and stream processing pipelines. We will focus primarily on testing for the purpose of development productivity and product iteration speed, but will also briefly cover data quality testing.