For a number of decades, I've been saying "Computing Is Like Hubble's Universe, Everything Is Getting Farther Away from Everything Else". It used to be that everything you cared about ran on a single database, and the transaction system presented you with the abstraction of a singularity: your transaction happened at a single point in space (the database) and a single point in time (it looked like it was before or after all other transactions).
Now, we see a more complicated world. Across the Internet, we put up HTML documents or send SOAP calls, and these are not in a transaction. Within a cluster, we typically write files in a file system and then read them later in a big map-reduce job that sucks up read-only files, crunches, and writes files as output. Even inside the emerging many-core systems, we see high-performance computation on shared memory but an increasing cost to using semaphores. Indeed, it is clear that "Shared Memory Works Great as Long as You Don't Actually SHARE Memory".
There are emerging solutions based on immutable data. It seems we need to look back to our grandparents and how they managed distributed work in the days before telephones. We realize that "Accountants Don't Use Erasers" but rather accumulate immutable knowledge and then offer interpretations of their understanding based on the limited knowledge presented to them. This talk will explore a number of the ways in which our new distributed systems leverage write-once and read-many immutable data.
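The accountant's discipline can be sketched in a few lines. This is a toy illustration, not anything from the talk itself: the `Entry` record, the ledger list, and the `balance` function are all hypothetical names. The point is that a mistake is corrected by appending a compensating entry, never by erasing a fact, and that a "balance" is merely an interpretation derived from the accumulated facts.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)          # frozen: an entry is immutable once recorded
class Entry:
    when: date
    account: str
    amount: int                  # in cents; negative values are debits

ledger: list[Entry] = []         # append-only: entries are never modified or deleted

ledger.append(Entry(date(2024, 1, 5), "sales", 10_000))
ledger.append(Entry(date(2024, 1, 6), "sales", 2_500))
# A mistake is fixed by a new, compensating entry -- no erasers.
ledger.append(Entry(date(2024, 1, 7), "sales", -2_500))

def balance(account: str) -> int:
    """An interpretation computed from the immutable facts so far."""
    return sum(e.amount for e in ledger if e.account == account)

print(balance("sales"))   # 10000
```

Because the facts themselves never change, the same ledger can be re-read at any time to answer new questions, which is exactly the property the distributed systems above exploit.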
Databases are global, shared, mutable state. That’s the way it has been since the 1960s, and no amount of NoSQL has changed that. However, most self-respecting developers got rid of mutable global variables in their code long ago. So why do we tolerate databases as they are?
A more promising model, used in some systems, is to think of a database as an always-growing collection of immutable facts. You can query it as of some point in time, but that is still old, imperative-style thinking. A more fruitful approach is to take the streams of facts as they come in, and functionally process them in real-time.
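The idea that a database is just a fold over a stream of facts can be shown in miniature. This is a hedged sketch under assumed names (the `facts` list and `apply_fact` function are illustrative, not Samza's API): each incoming fact is applied by a pure function to produce a new state, and the current "table" is simply the reduction of all facts seen so far.

```python
import functools

# Immutable facts as they arrive on the stream (hypothetical events).
facts = [
    {"user": "alice", "event": "deposit",  "amount": 50},
    {"user": "bob",   "event": "deposit",  "amount": 30},
    {"user": "alice", "event": "withdraw", "amount": 20},
]

def apply_fact(state: dict, fact: dict) -> dict:
    """Pure step: old state + one fact -> new state (no in-place mutation)."""
    delta = fact["amount"] if fact["event"] == "deposit" else -fact["amount"]
    return {**state, fact["user"]: state.get(fact["user"], 0) + delta}

# The queryable "database" is a fold over the stream of facts so far.
view = functools.reduce(apply_fact, facts, {})
print(view)  # {'alice': 30, 'bob': 30}
```

In a real stream processor the fold runs continuously as facts arrive, so the materialized view is always up to date; replaying the same facts from the start reproduces the same view.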
This talk introduces Apache Samza, a distributed stream processing framework developed at LinkedIn. At first it looks like yet another tool for computing real-time analytics, but really it’s a surreptitious attempt to take the database architecture we know, and turn it inside out.
After this talk, you'll see the architecture of your own applications in a completely new light.
Today there are a host of data-centric challenges that need more than a single technology to solve. Data platforms step in, blending different technologies to achieve a common goal. But to compose such platforms we need an understanding of the tradeoffs of each constituent part: their sweet spots, how they complement one another, and what sacrifices they make in return.

This talk is really a grand tour of these evolutionary forces. We’ll cover a lot of ground, building up from disk formats right through to fully distributed, streaming, and batch-driven architectures. In the end we should see how these various pieces come together to form a pleasant and useful whole.