Spotify uses a range of large-scale machine learning methods to find interesting music recommendations. Collaborative filtering over large amounts of implicit data is behind features such as radio, related artists, and a number of soon-to-be-released features. These are powered by matrix factorization and other methods that have been scaled up to hundreds of billions of data points.
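As a rough illustration of matrix factorization on implicit data, here is a minimal sketch in Python. It is not Spotify's system, whose details this abstract does not give. It assumes a common formulation in which a play count becomes a binary preference (played or not) plus a confidence weight, and it fits user and item vectors with plain stochastic gradient descent. The names `PLAYS`, `factorize`, and `weighted_loss`, the toy data, and all hyperparameters are made up for the example.

```python
import random

# Hypothetical toy data: (user, item) -> play count. Users 0-1 play items 0-1,
# users 2-3 play items 2-3. Missing pairs mean "never played".
PLAYS = {(0, 0): 3, (0, 1): 1, (1, 0): 2, (1, 1): 4,
         (2, 2): 5, (2, 3): 1, (3, 2): 2, (3, 3): 3}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_loss(counts, X, Y, alpha=1.0):
    """Confidence-weighted squared error over every (user, item) cell."""
    total = 0.0
    for u, xu in enumerate(X):
        for i, yi in enumerate(Y):
            c = counts.get((u, i), 0)
            pref = 1.0 if c > 0 else 0.0          # binary preference
            total += (1.0 + alpha * c) * (pref - dot(xu, yi)) ** 2
    return total


def factorize(counts, num_users, num_items, rank=2, alpha=1.0,
              reg=0.01, lr=0.02, epochs=300, seed=0):
    """Fit user factors X and item factors Y so that X[u] . Y[i] ~ preference."""
    rng = random.Random(seed)
    X = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(num_users)]
    Y = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(num_items)]
    for _ in range(epochs):
        for u in range(num_users):
            for i in range(num_items):
                c = counts.get((u, i), 0)
                pref = 1.0 if c > 0 else 0.0
                conf = 1.0 + alpha * c            # more plays -> more confidence
                err = pref - dot(X[u], Y[i])
                for f in range(rank):
                    xu, yi = X[u][f], Y[i][f]     # use pre-update values for both
                    X[u][f] += lr * (conf * err * yi - reg * xu)
                    Y[i][f] += lr * (conf * err * xu - reg * yi)
    return X, Y


X, Y = factorize(PLAYS, num_users=4, num_items=4)
# Score every item for user 0; higher means "more likely to be of interest".
print([round(dot(X[0], y), 2) for y in Y])
```

Looping over every cell is only workable on toy data. At the scale mentioned in the abstract, a different optimization strategy and distributed computation would be needed.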
This presentation shows how to build a search engine, using Apache Lucene/Solr and Apache UIMA, that understands queries written in a way much closer to spoken language than to keyword-based search. A system that can recognize semantics in natural language can be very handy for non-expert users, e-learning systems, customer care systems, etc. With such a system it's possible to submit queries such as "hotels near Rome" or "people working at Google" without having to manually transform a user-entered natural language query into a Lucene/Solr query.
The Solr - UIMA integration (available since Solr 3.1.0) can help build such intelligent systems by applying NLP and text mining algorithms to documents being indexed and to queries written by the user.
This module lets Solr call UIMA pipelines when documents are indexed, triggering automatic extraction of metadata (e.g. named entities such as people, places, and organizations) using existing and custom algorithms packaged as UIMA analysis engines. The talk will cover:
The Solr - UIMA integration
Introducing UIMA to Lucene's analysis phase
Running existing open source NLP algorithms in Lucene/Solr
Orchestrating blocks to build a sample system able to understand natural language queries
We'll introduce these points using examples (architectures & code) and a sample demo system.
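To make the idea of rewriting a natural language query concrete, here is a minimal Python sketch of the final step only: turning an already-understood request into Solr request parameters. It is not the system presented in the talk. Two regular expressions and a one-entry gazetteer stand in for the UIMA analysis, and the field names `type`, `employer`, and `location`, as well as the function `to_solr_params`, are hypothetical. The `{!geofilt ...}` filter with `sfield`, `pt`, and `d` (distance in kilometers) is Solr's spatial filter syntax.

```python
import re

# Hypothetical gazetteer: place name -> (latitude, longitude).
GAZETTEER = {"rome": (41.9028, 12.4964)}

# Hand-written patterns standing in for real NLP analysis.
WORKS_AT = re.compile(r"^people\s+working\s+at\s+(?P<org>.+)$", re.IGNORECASE)
NEAR = re.compile(r"^(?P<what>.+?)\s+near\s+(?P<place>.+)$", re.IGNORECASE)


def to_solr_params(text, radius_km=10):
    """Map a natural language request to a dict of Solr request parameters."""
    text = text.strip()

    m = WORKS_AT.match(text)
    if m:
        org = m.group("org").replace('"', '\\"')  # keep the phrase query valid
        return {"q": "type:person", "fq": f'employer:"{org}"'}

    m = NEAR.match(text)
    if m:
        coords = GAZETTEER.get(m.group("place").lower())
        if coords:
            lat, lon = coords
            geo = f"{{!geofilt sfield=location pt={lat},{lon} d={radius_km}}}"
            return {"q": m.group("what"), "fq": geo}

    # Nothing recognized: fall back to a plain keyword query.
    return {"q": text}


print(to_solr_params("hotels near Rome"))
print(to_solr_params("people working at Google"))
```

In a real deployment the patterns would be replaced by entity annotations produced by analysis engines, and the field names would come from the index schema.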
Jeff Ullman is the Stanford W. Ascherman Professor of Computer Science (Emeritus). His interests include database theory, database integration, data mining, and education using the information infrastructure.
Some of the most profound ways in which the Web changes our lives would not have happened without a heavy dose of computer-science theory. PageRank, and how it makes Google work, is a well-known example, but there are many others. We shall briefly explore some of the interesting algorithms, such as PageRank variants, minhashing, and locality-sensitive hashing, that have given us surprising capabilities.
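For readers unfamiliar with minhashing, the following Python sketch shows the core idea under simple assumptions. Each set is summarized by the minimum value of many hash functions, and the fraction of positions where two summaries agree estimates the sets' Jaccard similarity. The function names are made up for the example, and salting a `hashlib.blake2b` digest with a seed is just one convenient way to obtain a family of hash functions; it is not taken from the talk.

```python
import hashlib


def _hash(seed, item):
    """One member of a hash family: a 64-bit digest of the seed plus the item."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Signature = for each hash function, the smallest hash value in the set."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of agreeing positions, an estimate of |A & B| / |A | B|."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = {str(n) for n in range(0, 100)}
b = {str(n) for n in range(50, 150)}
exact = len(a & b) / len(a | b)  # 50 shared out of 150 distinct elements
approx = estimate_jaccard(minhash_signature(a, 256), minhash_signature(b, 256))
print(f"exact={exact:.3f} estimate={approx:.3f}")
```

Locality-sensitive hashing builds on such signatures by splitting them into bands and hashing each band to a bucket, so that only pairs of sets sharing a bucket need to be compared directly.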