In this episode I’m with Jonathan Ellis (co-founder and CTO of DataStax) and we talk about:

– DataStax
– NoSQL and Cassandra
– What’s new in DataStax Enterprise v.3.0

Here is the full transcript of the episode:

Enrico: Hi everyone and welcome to a new episode of Juku.beats. Today, I am here with Jonathan Ellis, co-founder and CTO of DataStax. Hi Jonathan, how are you?

Jonathan: I’m doing well. Thanks for having me on the program.

Enrico: Thank you for being here with me. Jonathan, the first question is about you and DataStax. I would like you to introduce yourself and the company, if you can.

Jonathan: I got involved with Cassandra. I’m not one of the original authors out of Facebook, but I got involved after they open sourced it in the summer of 2008. The short version is, we got Cassandra into the Apache Foundation. I became the first external committer and later the project chair of the Apache project. Subsequently, about a year after we became a top-level project, I started DataStax to commercialize Cassandra.

Enrico: This leads me to the second question. What is the difference between Cassandra, the open source project, and DataStax?

Jonathan: DataStax provides a distribution that includes Cassandra, called DataStax Enterprise. It’s more than just commercially supporting Cassandra; it includes enhancements around security and management tools. It includes integrated analytics and search. So, it’s really a broader data platform than Cassandra itself, which is focused on being the best operational database in the industry.

Enrico: Okay. This sounds really interesting. I probably should have asked this question at the beginning. Why NoSQL, and why Cassandra in particular?

Jonathan: So, I think what you’re seeing is really a third wave of infrastructure in the industry. In the ’80s you had people moving to relational databases. You had Oracle and DB2 and SQL Server come out of that. In the ’90s you had people moving from minicomputers and mainframes to a client-server architecture. Now, you are seeing the migration to distributed, scale-out infrastructure. NoSQL is leading that on the database side of things. What you have is Oracle and the traditional relational databases, including the open source ones like Postgres and MySQL. These are designed to deal with one company’s worth of data. That’s really what they are designed to deal with. When you’re moving to a world where you are dealing with an entire country’s worth of data, that data set doesn’t fit well in the relational paradigm. You start to have to shard it across multiple machines, and once you do that, you’ve given up ACID. You have to reinvent a lot of the techniques that NoSQL is designed around.

One of the pioneers of this new era was eBay, the auction site in the United States. In 2002, they published their infrastructure around how they were sharding across multiple machines, and they called it BASE instead of ACID. BASE stands for Basically Available, Soft state, Eventually consistent. As people started to understand this new set of requirements, they were able to formalize that into new products that were designed around these principles. You have products like Cassandra. Even when people lost an entire data center in Hurricane Sandy, in the United States recently, Cassandra was able to keep going through even that level of disruption. That’s because it’s designed around these … It’s not a master-slave architecture; it’s designed around scaling across lots of machines. So, even if you lose a lot of those machines, the rest of them can keep going.
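To make the idea of sharding without a master concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not Cassandra's actual token ring or replication strategy: the `Ring` class, the node names and the replication factor `rf` are all hypothetical. Each key is hashed to a starting node and copied to the next `rf - 1` peers, so reads can still succeed when some nodes are down.

```python
import hashlib


class Ring:
    """Toy masterless cluster: every node is a peer and each key lives on `rf` nodes.

    Illustrative only; this is an assumption-laden model, not Cassandra's implementation.
    """

    def __init__(self, node_names, rf=3):
        self.order = sorted(node_names)
        self.nodes = {name: {} for name in self.order}  # node name -> local key/value store
        self.rf = rf
        self.down = set()  # names of nodes that are currently unreachable

    def replicas(self, key):
        # Hash the key to pick a starting node, then take the next rf - 1 peers.
        digest = hashlib.md5(key.encode()).hexdigest()
        start = int(digest, 16) % len(self.order)
        return [self.order[(start + i) % len(self.order)] for i in range(self.rf)]

    def write(self, key, value):
        # Any live replica accepts the write; there is no single master to lose.
        live = [n for n in self.replicas(key) if n not in self.down]
        for name in live:
            self.nodes[name][key] = value
        return len(live)

    def read(self, key):
        # Return the value from the first live replica that holds the key.
        for name in self.replicas(key):
            if name not in self.down and key in self.nodes[name]:
                return self.nodes[name][key]
        return None


ring = Ring(["n1", "n2", "n3", "n4", "n5", "n6"], rf=3)
for i in range(100):
    ring.write(f"user{i}", i)

ring.down.update({"n1", "n4"})  # lose two machines
still_readable = all(ring.read(f"user{i}") == i for i in range(100))
print(still_readable)
```

With three copies of every key and only two nodes down, at least one replica of each key is always reachable. The sketch leaves out what happens to writes that a downed node missed; bringing those replicas back in sync is the "eventually consistent" part of BASE.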

Enrico: I have just one more question. Usually I ask only three questions. In this case I’m just going to use three scenarios. I know that you are going to release a new version of the product really soon, 3.0. What’s the difference between 2.0 and 3.0?

Jonathan: Our goal with Cassandra is to be the best database for web, mobile and IoT applications. With 3.0, we’ve pretty much nailed the core requirements of scale, availability and performance, so we are trying to focus more on the developer story and making you more productive with Cassandra. One of the things that you’ll see in common across the NoSQL systems … that’s part of what gives it the name … is that we don’t do joins. If you are in a system where your data is scattered across lots of machines in a cluster, doing a join means I have to pull data from the entire cluster and pull it together, shuffle it and sort it. That’s very expensive to do at run time.

Instead, we emphasize de-normalizing your data and writing multiple copies of it that are organized according to the different queries you would want to run. By de-normalizing it, I will only have to touch a single partition on a single machine to get my query, and it delivers the high performance. Doing the de-normalization by hand in your application is kind of onerous. It’s not fun. It’s just grunt work that’s not adding value to your organization. For 3.0 we’re adding materialized views. That’s the headline feature, where you take a table that you’ve created and you’ve got data inserted into it. You can say, create materialized view against this table. I want to re-partition it by a different key. I only want data where this column is not null. So, I can do that and de-normalize that, and Cassandra maintains that for me, instead of me doing it in my application code.
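As a rough illustration of what a materialized view saves the application from doing, the following Python sketch keeps a second copy of a table, re-partitioned by a different column and restricted to rows where that column is not null. It is an in-memory toy, not how Cassandra implements the feature; `BaseTable`, `MaterializedView` and the `users`/`city` example data are hypothetical names chosen for the illustration.

```python
from collections import defaultdict

# In CQL the equivalent request would look roughly like this
# (table and column names are hypothetical):
#   CREATE MATERIALIZED VIEW users_by_city AS
#     SELECT * FROM users
#     WHERE city IS NOT NULL AND user_id IS NOT NULL
#     PRIMARY KEY (city, user_id);


class MaterializedView:
    """Copy of the base rows, partitioned by `key_column` instead of the base key."""

    def __init__(self, key_column):
        self.key_column = key_column
        self.partitions = defaultdict(dict)  # view key -> {base key -> row}

    def apply(self, base_key, old_row, new_row):
        # Drop the stale entry if the row used to live under another view key.
        if old_row is not None and old_row.get(self.key_column) is not None:
            self.partitions[old_row[self.key_column]].pop(base_key, None)
        # Only rows whose view key is not null appear in the view.
        if new_row.get(self.key_column) is not None:
            self.partitions[new_row[self.key_column]][base_key] = new_row


class BaseTable:
    """Rows partitioned by their base key; views are updated on every insert."""

    def __init__(self):
        self.rows = {}   # base key -> row
        self.views = []

    def create_view(self, key_column):
        view = MaterializedView(key_column)
        for key, row in self.rows.items():  # backfill rows that already exist
            view.apply(key, None, row)
        self.views.append(view)
        return view

    def insert(self, key, row):
        old = self.rows.get(key)
        self.rows[key] = row
        for view in self.views:  # the table, not the application, maintains the copies
            view.apply(key, old, row)


users = BaseTable()
by_city = users.create_view("city")
users.insert("u1", {"name": "Ann", "city": "Austin"})
users.insert("u2", {"name": "Bob", "city": None})    # null city: never enters the view
users.insert("u3", {"name": "Cy", "city": "Austin"})
users.insert("u1", {"name": "Ann", "city": "Rome"})  # update moves u1 to another partition

print(sorted(by_city.partitions["Austin"]))  # ['u3']
```

Finding everyone in a given city from `users.rows` means scanning every partition, which is the cluster-wide work a join or secondary lookup would cost. The same question against `by_city` touches one partition, and the application code never had to write the second copy itself.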

Enrico: Where can we find you on the web? Also, where can we find DataStax?

Jonathan: On Twitter my handle, for historical reasons, is spyced. Apache Cassandra, of course, is at Cassandra.Apache.org. DataStax.com is my company.

Enrico: That’s great. Thank you very much for being with us today.

Jonathan: Thanks for your time. Great to be with you.

Enrico: Bye.