Presentation hosted by DCJUG/Data Driven DC, Cloud DC, Nova Hadoop.
Description: As organizations move more applications to the cloud, there is an increased need for logging and monitoring of a heterogeneous software and infrastructure stack. In this meetup we want to explore some of the tools and technologies that can be used to perform "BigOps" - the ability to collect and process large amounts of data using some robust open source tools such as Apache Flume and its ability to integrate with Hadoop/HDFS.
Overview of Apache Flume
- What is Flume (a walkthrough of its component parts)
- Common Flume Architectures including use of Hadoop/HDFS as a sink
- Performance tuning tips with Flume
- Architecting for different levels of guarantees
- Working through different types of sinks and what they can offer
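To give a concrete flavour of the source/channel/sink model the talk covers, a minimal Flume agent configuration that tails a log file into HDFS might look like the following (the agent, source, channel, and sink names here are illustrative, not taken from the talk):

```properties
# Illustrative Flume agent: exec source -> memory channel -> HDFS sink
agent1.sources  = tail-src
agent1.channels = mem-ch
agent1.sinks    = hdfs-sink

# Source: tail an application log. The exec source offers no delivery
# guarantees, which is why the talk's "levels of guarantees" discussion
# matters: spooling-directory or Avro sources are safer in production.
agent1.sources.tail-src.type = exec
agent1.sources.tail-src.command = tail -F /var/log/app/app.log
agent1.sources.tail-src.channels = mem-ch

# Channel: in-memory buffer (fast, but events are lost on agent restart;
# a file channel trades throughput for durability)
agent1.channels.mem-ch.type = memory
agent1.channels.mem-ch.capacity = 10000

# Sink: write events into HDFS, partitioned by date
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.channel = mem-ch
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.hdfs.rollInterval = 300
```

Swapping the memory channel for a file channel, or the exec source for a spooling-directory source, is exactly the kind of guarantee-versus-throughput trade-off the talk addresses.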
Speaker: Ted Malaska, Sr Solution Architect at Cloudera
Ted has worked on close to 60 clusters across two to three dozen clients, spanning hundreds of use cases. He has 18 years of professional experience working for start-ups, the US government, a number of the world's largest banks, commercial firms, bio firms, retail firms, hardware appliance firms, and the US's largest non-profit financial regulator. He has architecture experience across topics such as Hadoop, Web 2.0, Mobile, SOA (ESB, BPM), and Big Data. Ted is a regular committer to Flume, Avro, Pig, and YARN.
There is so much text in our lives, we are practically drowning in it. Fortunately, there are innovative tools and techniques for managing unstructured information that can throw the smart developer a much-needed lifeline. In this talk, based on the outline of the book of the same name, I'll provide an introduction to a variety of Java-based open source tools that aid in the development of search and NLP applications.
Book Abstract: Taming Text is a practical, example-driven guide to working with text in real applications. This book introduces you to useful techniques like full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. You'll explore real use cases as you systematically absorb the foundations upon which they are built. Written in a clear and concise style, this book avoids jargon, explaining the subject in terms you can understand without a background in statistics or natural language processing. Examples are in Java, but the concepts can be applied in any language.
Drew Farris is a software developer and technology consultant at Booz Allen Hamilton where he focuses on large scale analytics, distributed computing and machine learning. Previously, he worked at TextWise where he implemented a wide variety of text exploration, management and retrieval applications combining natural language processing, classification and visualization techniques. He has contributed to a number of open source projects including Apache Mahout, Lucene and Solr, and holds a master's degree in Information Resource Management from Syracuse University's iSchool and a B.F.A. in Computer Graphics.
Cassandra is a distributed, massively scalable, fault tolerant, columnar data store, and if you need the ability to make fast writes, the only thing faster than Cassandra is /dev/null! In this fast-paced presentation, we'll briefly describe big data, and the area of big data that Cassandra is designed to fill. We will cover Cassandra's unique, every-node-the-same architecture. We will reveal Cassandra's internal data structure and explain just why Cassandra is so darned fast. Finally, we'll wrap up with a discussion of data modeling using the new standard protocol: CQL (Cassandra Query Language).
When the audience leaves, they will understand the typical use cases for Cassandra and will have the knowledge necessary to start playing with it on their own.
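To give a flavour of the CQL data-modeling discussion, a hypothetical time-series schema and queries might look like the following (the table and column names are illustrative, not taken from the talk):

```sql
-- Illustrative CQL: a time-series table partitioned by sensor_id,
-- with readings clustered newest-first within each partition
CREATE TABLE readings (
    sensor_id  text,
    reading_ts timestamp,
    value      double,
    PRIMARY KEY (sensor_id, reading_ts)
) WITH CLUSTERING ORDER BY (reading_ts DESC);

-- Writes are append-only at the storage layer, which is a big part of
-- why Cassandra's write path is so fast
INSERT INTO readings (sensor_id, reading_ts, value)
VALUES ('sensor-42', '2013-11-01 12:00:00', 21.5);

-- Reads within a single partition are efficient ordered scans
SELECT reading_ts, value FROM readings
WHERE sensor_id = 'sensor-42' LIMIT 10;
```

The choice of partition key (here `sensor_id`) and clustering column (here `reading_ts`) is the heart of Cassandra data modeling: it determines which queries are fast and which are impossible.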
Matt Overstreet: Usability is Matt Overstreet’s mission. He has worked with Federal, Fortune 500, and small businesses to help collect, mine and interact with data. He solves problems by synthesizing his experiences drawn from a liberal arts and technical background.
Getting good search results is hard; maintaining good relevancy is even harder. Fixing one problem can easily create many others. Without good tools to measure the impact of relevancy changes, there's no way to know if the "fix" that you've developed will cause relevancy problems with other queries. Ideally, much like we have unit tests for code to detect when bugs are introduced, we would like to create ways to measure changes in relevancy. This is exactly what we've done at OpenSource Connections. We've developed a tool, Quepid, that allows us to work with content experts to define metrics for search quality. Once defined, we can instantly measure the impact of modifying our relevancy strategy, allowing us to iterate quickly on very difficult relevancy problems. Get an in-depth look at the tools we use not only to solve a relevancy problem, but to make sure it stays solved!
Doug Turnbull is a Search Relevancy Expert at OpenSource Connections. A frequent blogger and speaker, Doug enjoys the intersection of usability and systems programming. That's exactly what he finds in search — low-level code that directly impacts users' lives. In his search work, Doug bridges the gap between content experts and technologists. To help bridge the gap, Doug created [Quepid](quepid.io), a search relevancy collaboration canvas used extensively in OpenSource Connections' search work.
Hadoop is about more than MapReduce these days. How can you use new languages like Clojure, F#, Pig and HQL to get the best out of huge amounts of data? How can you use massive clusters of CPUs to build real-time apps with new frameworks like YARN and Tez that make up Hadoop 2.0? By the end of this session you'll know how.
Simon Elliston Ball is the head of the Big Data team at Red Gate, focusing on researching and building tools to interact with Big Data platforms. Previously he has worked in the data-intensive worlds of hedge funds and financial trading, ERP and e-commerce, as well as designing and running nationwide networks and websites. These days his head is in Big Data and visualisation.
In the course of those roles, he’s designed and built several organisation-wide data and networking infrastructures, headed up research and development teams, and designed (and implemented) numerous digital products and high-traffic transactional websites.
For a change of technical pace, he writes and produces screencasts on front-end web technologies such as ExtJS, and is an avid NodeJS programmer. In the past he has also edited novels, written screenplays, developed web sites and built a photography business.