Named Entity Recognition and Disambiguation is the task of spotting
names of people, organizations, places, etc. in natural language text and
mapping them to unambiguous identifiers. Several probabilities
and context similarity measures are typically employed to solve this
problem. Apache Pig is a framework for analyzing large datasets using a
high-level dataflow language on top of Apache Hadoop.
This talk focuses on a concrete case of using Pig for estimating
probabilities related to Named Entity Recognition and Disambiguation
with Wikipedia as an input. The performance gain compared to a previous
single-machine implementation is significant, enabling more frequent
updates and more flexible evaluations and tuning.
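
To give a rough idea of the kind of probability involved, here is a minimal
sketch in Python. It rests on assumptions the abstract does not state: that
one of the estimated quantities is P(entity | surface form), and that it is
derived from counts of Wikipedia link anchor texts and their target pages.
The function name and the toy data are hypothetical and not taken from the
talk. In Pig, the same computation would roughly correspond to grouping and
counting the pairs.

```python
from collections import Counter


def estimate_link_probabilities(anchor_pairs):
    """Estimate P(entity | surface form) from (surface_form, entity) pairs.

    Assumption: each pair stands for one Wikipedia link, where the anchor
    text is the surface form and the link target is the entity.
    """
    pairs = list(anchor_pairs)
    # How often each surface form links to each entity.
    pair_counts = Counter(pairs)
    # How often each surface form is used as a link anchor at all.
    surface_totals = Counter(surface for surface, _ in pairs)

    probabilities = {}
    for (surface, entity), count in pair_counts.items():
        probabilities.setdefault(surface, {})[entity] = (
            count / surface_totals[surface]
        )
    return probabilities


# Made-up toy data for illustration only, not real Wikipedia counts.
toy_pairs = [
    ("Paris", "Paris"),
    ("Paris", "Paris"),
    ("Paris", "Paris"),
    ("Paris", "Paris_Hilton"),
    ("Pig", "Apache_Pig"),
]
probs = estimate_link_probabilities(toy_pairs)
```

On a single machine these counts are held in memory; a dataflow
implementation would distribute the grouping step across the cluster instead.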