At AppNexus, we've experienced explosive growth over the last three years. Our data pipeline, horizontally scaled in Hadoop and HBase, now processes more than 15 terabytes every day. That growth has required us to rapidly scale and iterate on the optimization tools we use for big data exploration and aggregation. Python, unlike more complicated programming languages, is versatile enough for us to use both for offline analytical tasks and for production system development. Doing so lets us bridge the gap between prototypes and production by relying on the same code libraries and frameworks for both, thereby tightening our innovation loop.
We'd like to share our best practices and lessons learned from iterating and scaling with Python. We'll discuss rapid prototyping and the importance of tightly integrating research with production. We'll explore specific tools, including pandas, NumPy, and IPython, and how they have enabled us to quickly mine data across disparate sources, explore new algorithms, and rapidly bring new processes into production.
This talk was presented at PyData NYC 2012: nyc2012.pydata.org/. If you are interested in this topic, be sure to check out PyData Silicon Valley in March of 2013: sv2013.pydata.org/