S4: DISTRIBUTED STREAM COMPUTING PLATFORM
Leo Neumeyer & Anish Nair (Yahoo!)
Tuesday, January 25, 2011
S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous, unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events, update state, and optionally produce event streams of their own. The architecture resembles the Actors model, providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers. In this talk, we outline the S4 architecture in detail and describe various applications, including real-life deployments. Our design is primarily driven by large-scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to running on large clusters built with commodity hardware.
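The routing idea in the abstract can be illustrated with a minimal Python sketch. It is not the S4 API: the names Event, WordCountPE, Node, and Cluster are hypothetical, and hashing the key with CRC32 is only one assumed way to get key affinity. The sketch shows each key mapping to a fixed node, one PE instance per key holding that key's state, and a PE optionally emitting events of its own.

```python
import zlib


class Event:
    """A keyed data event (hypothetical structure, not the S4 event type)."""

    def __init__(self, stream, key, value):
        self.stream = stream
        self.key = key
        self.value = value


class WordCountPE:
    """One PE instance per key value; it owns the state for that key."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, event):
        # Consume the event and update local state.
        self.count += 1
        # Optionally produce an event on an output stream
        # (here, arbitrarily, on every second occurrence of the key).
        if self.count % 2 == 0:
            return [Event("counts", self.key, self.count)]
        return []


class Node:
    """A processing node that creates PEs on demand, one per key."""

    def __init__(self, name):
        self.name = name
        self.pes = {}

    def handle(self, event):
        pe = self.pes.get(event.key)
        if pe is None:
            pe = self.pes[event.key] = WordCountPE(event.key)
        return pe.process(event)


class Cluster:
    """Routes each event to a node chosen by a stable hash of its key."""

    def __init__(self, size):
        self.nodes = [Node(f"node-{i}") for i in range(size)]

    def node_for(self, key):
        # Assumption: CRC32 modulo cluster size gives key affinity.
        index = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def route(self, event):
        return self.node_for(event.key).handle(event)


cluster = Cluster(3)
emitted = []
for word in ["a", "b", "a", "c", "a", "b"]:
    emitted.extend(cluster.route(Event("words", word, 1)))
```

Because the sender only needs the key to find the destination, application code never names a node directly, which is the location transparency the abstract refers to. In this sketch all events for the key "a" reach the same WordCountPE instance, so its count can be kept without any shared state or locking.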
Leo Neumeyer studied electrical and computer engineering in Argentina and Canada. In 1992 he joined the Speech Technology and Research lab at SRI International (formerly Stanford Research Institute), where he helped build one of the most advanced speech recognition systems, which was commercialized by SRI's spin-off company, Nuance Communications. Leo did research in signal processing, speech recognition, and language learning technologies. In 1999, he co-founded Mindstech International, a startup that developed technology to teach spoken English in Asia over the Internet. In 2006 he joined Yahoo! Labs, where he led the search advertising optimization sciences group. More recently he championed S4, an open source distributed stream computing software platform that was developed to model user feedback in real time to improve search revenue and user experience. He has authored over 24 technical papers and 8 patents.
Anish Nair is an applied scientist at Yahoo! Labs, working mainly on prediction and optimization problems in search monetization. His areas of interest and prior experience are natural language processing, information retrieval, speech recognition, personalization, and, of late, stream computing. He has published work in various areas: computational linguistics, distributed systems, psychometrics, and cognitive science. Prior to Yahoo!, he developed algorithms for automatically evaluating people's spoken ability at Ordinate Corporation (now part of Pearson) and worked on various computational linguistics problems while a graduate student at USC. Anish's current focus is envisioning and developing applications for S4, the stream computing platform.