Vanilla, an Open Source business intelligence application by bpm-conseil.com, offers unique features such as report indexing through an embedded Lucene integration. Using Vanilla and Lucene, developers can manage both report indexing and external document indexing, which ultimately saves end users time when they search for specific keywords such as "product code" or "customer code." Vanilla can build upon an existing Solr/Lucene installation that takes care of all the indexing processes while Vanilla handles Reporting/Dashboard creation. During this presentation, attendees will learn how we moved from the embedded Lucene API to a Solr/Lucene platform, and about the technical and business benefits of this architecture in terms of clustering, caching and access modes.
Understanding and accessing large volumes of content often requires a multi-faceted approach that goes well beyond the basics of simple batch processing jobs. In many cases, one needs both ad hoc, real-time access to the content and the ability to discover interesting information based on a variety of features such as recommendations, summaries and other insights. Furthermore, analyzing how users interact with the content can both further enhance the quality of the system and deliver much-needed insight into the users and the content for the business. In this talk, we'll discuss a platform that enables large-scale search, discovery and analytics over a wide variety of content utilizing tools like Solr, Hadoop, Mahout and others.
Presented by Eric Pugh, Principal, OpenSource Connections
Got hundreds of millions of documents to search? DataImportHandler blowing up while indexing? Random thread errors thrown by Solr Cell during document extraction? Query performance collapsing? Then you're searching at Big Data scale. This talk will focus on the underlying principles of Big Data and how to apply them to Solr. This talk isn't a deep dive into SolrCloud, though we'll talk about it. Nor is it a talk on traditional scaling of Solr. Instead, we'll talk about how to apply principles of big data like "Bring the code to the data, not the data to the code" to Solr, and how to answer the question "How many servers will I need?" when your volume of data is exploding. We'll look at some example models for predicting server and data growth, and at how to look back and see how good your models are! You'll leave this session armed with an understanding of why Big Data is the buzzword of the year, and how you can apply some of its principles to your own search environment.
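As a rough illustration of the kind of growth model the abstract alludes to (this sketch is not from the talk itself; the document counts, growth assumptions and per-server capacity below are entirely hypothetical), one can fit a simple exponential curve to observed index sizes, project it forward, and back-test the prediction against what actually happened:

```python
import math

# Hypothetical monthly document counts (millions) observed so far.
observed = [120, 150, 190, 240, 300]

# Fit simple exponential growth, docs[t] ~ docs[0] * rate**t,
# using the average month-over-month growth ratio.
ratios = [b / a for a, b in zip(observed, observed[1:])]
rate = sum(ratios) / len(ratios)

def project(months_ahead):
    """Projected document count (millions) after `months_ahead` months."""
    return observed[-1] * rate ** months_ahead

def servers_needed(months_ahead, docs_per_server=100):
    """Servers required if each node comfortably holds `docs_per_server`
    million documents (an assumed capacity; measure yours empirically)."""
    return math.ceil(project(months_ahead) / docs_per_server)

# Back-test: "predict" the latest observation using only the earlier
# data points, then compare the prediction with what actually happened.
earlier = observed[:-1]
backtest_ratios = [b / a for a, b in zip(earlier, earlier[1:])]
backtest_rate = sum(backtest_ratios) / len(backtest_ratios)
predicted_last = earlier[-1] * backtest_rate
error_pct = 100 * (predicted_last - observed[-1]) / observed[-1]
```

The back-test step is the "look back and see how good your models are" part: a model that can't retrodict last month's index size shouldn't be trusted to size next year's cluster.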
This session addresses the biggest issue facing Big Data: search, discovery and analytics need to be integrated. Creating and maintaining separate Solr and Hadoop clusters is time-consuming, error-prone and difficult to keep in sync, yet most Hadoop installations do not integrate Solr within the same cluster. Find out how to easily integrate these capabilities into a single cluster. The session will also touch on some of the technical aspects of Big Data search, including how to: protect against the silent index corruption that permeates large distributed clusters, overcome the shard distribution problem by leveraging Hadoop to ensure accurate distributed search results, and provide real-time indexing for distributed search, including support for streaming data capture. Srivas will also share relevant experiences from his days at Google, where he ran one of the major search infrastructure teams and where GFS, BigTable and MapReduce were used extensively.
What is Big Data and why is it important? What are the sources of Big Data, and who gets to profit from it? What are the tricks, tools and techniques that turn engineers into data scientists? The Social Data Revolution is at the forefront of this discussion, taking a customer-centric approach to how data can be created, coaxed and combined to create mutual value for consumers and organizations. The Big Data War Stories Meetup, held in February 2012, gathered experts from business and technology, moderated by Andreas, to tell their stories from the frontline.
8 Rules for Big Data from Andreas Weigend, Social Data Lab
1. Start with the problem, not with the data
2. Share data to get data
3. Align interests of all parties
4. Make it trivially easy for people to contribute, connect, collaborate
5. Base the equation of your business on customer centric metrics
6. Decompose the business into its "atoms"
7. Let people do what people are good at, and computers what computers are good at
8. Thou shalt not blame technology for barriers of institutions and society
This conversation among real-life Big Data developers will be moderated by Dr. Andreas Weigend (http://www.weigend.com) who does happen to have a PhD in rocket science (ok, it's actually in physics).
Dr. Andreas Weigend (Moderator) directs the Social Data Lab at Stanford University. He was the chief scientist at Amazon, where he drove the customer-centric and measurement-focused culture that has been central to Amazon's success. Today Andreas speaks at top conferences around the globe, and most recently he shared his vision on the future of data at the United Nations. He's a great speaker who challenges and inspires his audiences.
Mark Torrance is Chief Technology Officer of Rocket Fuel, a web display advertising company that uses big data systems built on Hadoop, Hive, HBase, and MongoDB to analyze and optimize ad campaigns for their clients' performance metrics. He was the founder and CEO of StockMaster.com, the first financial service on the Web back in 1993, and since that time has continued to deliver compelling web applications and services to consumers and businesses. At his web consultancy Vinq, his team designed and implemented large scale web applications such as DataPlace.org, Knowledgeplex.org, a patent search application for Stanford University, and an AI based webmail system for DARPA. Mark holds degrees from Stanford and MIT.
Chuck Lam is the author of Hadoop in Action, a best-selling book for understanding and using Hadoop. He first got interested in data when he studied signal processing at San Jose State University. He went on to get a PhD in Electrical Engineering from Stanford, with a thesis on "computational data acquisition." He gained practical experience in large social data as a tech lead at RockYou, analyzing 100M+ users on its application platform and 20B+ monthly impressions on its social ad network. In the last two years he founded a mobile group coordination startup named RollCall, with funding from Charles River Ventures, Storm Ventures, and Kapor Capital.
Pete Warden is the founder and CTO of Jetpac, a company whose mission is to inspire travelers through their friends' experiences. After spending over a decade as a software engineer, including 5 years at Apple, Pete is now focused on a career as a mad scientist. He enjoys gathering, analyzing and visualizing the flood of web data that's recently emerged, trying to turn it into useful information without trampling on people's privacy. Pete is a serial entrepreneur who founded Mailana and OpenHeatMap in addition to Jetpac.
Raj Venkat is the Lead Engineer of the Item Catalog team @WalmartLabs, focusing on building the next-generation Global Product Catalog that aims to position Walmart as a leader in the multi-channel eCommerce space. He has worked for over 15 years in a wide variety of business domains including Supply Chain, Financial Services, Product Management and eCommerce. The common challenge across all of these domains has been data structure, volume and scale. Backed by a rockstar team, he is working to tackle hundreds of millions of items in an n-dimensional space using a deadly combination of Cassandra, Hadoop, Hive, Solr and related technologies. He has previously held positions as Chief Architect at Third Pillar Systems and Senior Architect at Accept Corporation and has degrees from PSG Tech and IIT.
Thank you to Walmart Labs, First Retail & O'Reilly Strata Conference for providing us the venue & food.