Configuring Mahout Clustering Jobs

Presented by Frank Scholten, JTeam

For more than a decade, internet search engines have helped users find the documents they are looking for. But what if users aren't looking for anything specific, and instead want a summary of a large document collection and want to be surprised? One solution to this problem is document clustering. Clustering algorithms group documents that have similar content. Real-life examples of clustering include the clustered search results of Google News and tag clouds, which group documents under a shared label. Apache Mahout is a framework for scalable machine learning on top of Apache Hadoop and can be used for large-scale document clustering. This talk introduces clustering in general and shows you, step by step, how to configure Mahout clustering jobs to create a tag cloud from a document collection. It is suitable for people who have some experience with Hadoop and perhaps Mahout. Knowledge of clustering is not required.
Topics include:
Clustering introduction
Clustering in Mahout
Text pre-processing & analysis
Tag cloud demo
Tips & tricks
