Python has long been used as a language for crawling the web -- perhaps the most successful example being the early web crawlers built for the Google search engine. In recent years, open source libraries for large-scale web crawling have improved dramatically. The web has also matured: many HTML pages now carry metadata that well-equipped spiders can extract, beyond basics such as the text content or document title. This talk covers Parse.ly's use of the open source Scrapy project and its own work on standardizing metadata extraction techniques for news stories.
This talk was presented at PyData NYC 2012 (nyc2012.pydata.org/). If you are interested in this topic, be sure to check out PyData Silicon Valley in March 2013 (sv2013.pydata.org/).