During the last decade we have seen a tremendous increase in the quantity of text available on the Internet. Consequently, text mining faces multiple challenges in dealing with the vast number of documents and the size of modern data sets. One technique for addressing these challenges is a distributed computing environment that combines the resources of several workstations. Apache Hadoop provides such an open-source platform, which we use to interface with tm, the text mining environment in R. We will present the underlying framework for uniform access, show techniques for distributing text mining tasks across Hadoop, and provide examples of handling large-scale data sets in R.
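The abstract's actual interface is the R tm package backed by Hadoop; as a language-neutral illustration of the MapReduce pattern that such a distributed text mining task rests on, here is a minimal word-count sketch in plain Python. The corpus, function names, and tokenization are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the MapReduce pattern underlying distributed text mining.
# On Hadoop the map and reduce phases run on separate workers; here we
# simulate both phases locally. All names below are illustrative.
from collections import defaultdict

def map_phase(doc_id, text):
    """Map step: emit (term, 1) pairs for every token in one document."""
    for token in text.lower().split():
        yield token, 1

def reduce_phase(pairs):
    """Reduce step: sum the counts for each term across all documents."""
    counts = defaultdict(int)
    for term, n in pairs:
        counts[term] += n
    return dict(counts)

corpus = {
    "d1": "text mining with hadoop",
    "d2": "mining large text collections",
}
pairs = [p for doc_id, text in corpus.items() for p in map_phase(doc_id, text)]
term_counts = reduce_phase(pairs)
# e.g. "mining" and "text" each appear in both toy documents, so count 2
```

Because each map call touches only one document and each reduce key is independent, both phases parallelize naturally across Hadoop workers.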
Twitter provides so much data that we can easily model properties like gender, age, and geographical location. Strong correlations between language and such categories yield predictive models that can be highly accurate (e.g., we achieve state-of-the-art gender prediction of 88.9% on a corpus of 14,464 authors). But these models also paint an oversimplified and misleading picture of how language conveys personal identity. I'll present work done
jointly with Jacob Eisenstein and David Bamman that supports an alternative view. When we cluster authors based on their words alone (not their gender), we find that many clusters still have strong gender associations. But these clusters enact gender in a variety of ways that are more descriptively accurate and more interesting than the two styles ("male" and "female") that gender prediction work usually produces.
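The clustering idea can be sketched as follows: group authors purely by their word-usage vectors, with no gender labels in the loop, and only afterwards inspect what each cluster looks like. The toy vocabulary, corpus, and two-cluster k-means below are illustrative assumptions, not the authors' actual pipeline or features.

```python
# Illustrative sketch: cluster authors on word counts alone, so any gender
# association is read off afterwards rather than built into the model.
from collections import Counter

VOCAB = ["omg", "game", "totally", "score"]  # toy vocabulary (assumption)

def vectorize(text):
    """Term-frequency vector over the fixed toy vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]

def kmeans(points, centroids, iters=10):
    """Plain k-means with fixed initial centroids (deterministic)."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Recompute each centroid as the mean of its assigned points.
        centroids = [
            [sum(dim) / len(cl) for dim in zip(*cl)] if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return clusters

authors = ["omg totally omg", "game score game", "totally omg", "score game score"]
points = [vectorize(a) for a in authors]
clusters = kmeans(points, centroids=[points[0], points[1]])
# The two clusters fall out of the word counts themselves; nothing in the
# objective mentions gender or any other author attribute.
```

The point of the sketch is the direction of inference: a gender classifier imposes two categories up front, whereas clustering lets however many usage styles exist emerge first and leaves their gender associations as an empirical observation.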