Big data gets a lot of press these days, but even if you're not geocoding the Twitter firehose, "big enough" data can be a pain - whether you're crashing your database server or simply running out of RAM. Distributed geoprocessing can be even more painful, but for the right job it's a revelation!
This session will explore strategies you can use to unlock the power of distributed geoprocessing for the "big enough" datasets that make your life difficult. Granted, geospatial data doesn't always fit cleanly into Hadoop's MapReduce framework. But with a bit of creativity - think in-memory joins, hyper-optimized data schemas, and offloading work to API services or PostGIS - you too can get MapReduce working on your geospatial data!
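To make the in-memory join idea concrete, here's a minimal sketch written in Cascalog (the Clojure tool introduced below). All of the names and data here are hypothetical illustration, not GlobalForestWatch code: a small grid-cell-to-country lookup table ships to every mapper, each deforestation pixel picks up its country map-side, and loss is summed per country.

    (ns gfw.sketch
      (:use cascalog.api)
      (:require [cascalog.ops :as c]))

    ;; Hypothetical lookup table: coarse one-degree grid cell -> ISO country
    ;; code. Small enough to live in memory on every mapper, so the "join"
    ;; happens map-side with no shuffle.
    (def cell->iso
      {[6 -58] "GUY", [5 -61] "VEN", [4 -60] "BRA"})

    (defmapop lookup-country [lat lon]
      ;; Truncate coordinates to whole degrees and look up the country.
      (get cell->iso [(int lat) (int lon)] "unknown"))

    ;; Hypothetical generator: one tuple per pixel of forest loss,
    ;; [latitude longitude hectares-lost]. In production this would be
    ;; an HDFS tap rather than an in-memory vector.
    (def loss-pixels
      [[6.2 -58.7 1.5]
       [6.9 -58.1 0.25]
       [4.4 -60.3 3.0]])

    ;; Sum forest loss per country; each pixel's country arrives via the
    ;; in-memory lookup, so the only shuffle is the final aggregation.
    (?<- (stdout)
         [?iso ?total-loss]
         (loss-pixels ?lat ?lon ?loss)
         (lookup-country ?lat ?lon :> ?iso)
         (c/sum ?loss :> ?total-loss))

Because the lookup table lives in memory on each mapper, the join itself costs nothing at the reduce stage - the cluster only shuffles data for the per-country sum.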
Real-world examples will be taken from work on GlobalForestWatch.org, a new platform for exploring and analyzing global data on deforestation. I'll be demoing key concepts using Cascalog, a Clojure wrapper for the Cascading Java library that makes Hadoop and MapReduce a lot more palatable. If you prefer Python or Scala, there are wrappers for you too.
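For a quick taste of that palatability - reusing the hypothetical loss-pixels generator from the sketch above - here is a complete Hadoop job that finds every pixel losing more than a hectare of forest:

    ;; A complete query: pixels with more than one hectare of loss.
    ;; Plain Clojure predicates like > act directly as filters.
    (?<- (stdout)
         [?lat ?lon ?loss]
         (loss-pixels ?lat ?lon ?loss)
         (> ?loss 1.0))

The same query runs unchanged against an in-memory vector while you're testing or an HDFS tap in production, which is a big part of what makes Cascalog pleasant to work with.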
Hadoop is no silver bullet, but for the right geoprocessing job it's a powerful tool.