In this episode, I’m with Reynold Xin (Architect and founder of Databricks) and we talk about:

– Databricks and its relationship with Spark
– Project Tungsten
– The future of Spark

Full transcript of this episode

Enrico:Hi, everybody, and welcome to another episode of Juku.beats. Today, I’m with Reynold Xin, founder and architect of Databricks. Hi, Reynold, how are you?

Reynold Xin:Hey, good, Enrico, how are you doing?

Enrico:Good. So, as I said, you are one of the founders of Databricks, and Databricks is working on Spark, which is one of the most interesting technologies in the data space today. So, can you say a few words about Databricks and its relationship with Spark?

Reynold Xin:Yeah, absolutely. So when we started Databricks, basically the whole team was originally at UC Berkeley, where Spark was born. This team actually created Spark, open sourced it, and then founded a company, and to this day Databricks still drives Spark’s development, so a lot of the major developments in Spark are started by Databricks and done by the engineers here. On top of the open source project, we also provide a platform as a service. This is our business model, and it’s based on Spark. The goal of the platform as a service, called Databricks Cloud, is to make it much, much easier for organizations [inaudible 00:01:13] big data.

Enrico:Good, so I don’t want to simplify too much, but Spark is a very powerful processing engine that is changing the way big data sets are processed thanks to its in-memory capabilities. There is a lot going on around it. In fact, I just read about a new project called Tungsten, which should improve Spark’s efficiency even more. Can you tell us how this project will impact the future of Spark?

Reynold Xin:Yeah, Tungsten is, as we wrote recently in a blog post, probably one of the largest changes to the Spark execution engine since the project started. The goal is to improve Spark’s efficiency so we can squeeze every last bit of performance out of the underlying hardware. What we found is that, initially, Spark applications were substantially faster than what you could write on Hadoop MapReduce, and it was fairly easy to saturate your network and your hard drives. So essentially a lot of the Spark jobs were I/O bound. But with the advances in hardware now, [inaudible 00:02:32] deployment of 10-gigabit networks and a lot of SSDs, we are starting to see the trend shifting a little toward CPU and memory. So a lot of Spark jobs now are either CPU bound or memory bound. The goal of Tungsten is, through three different areas of improvement, to push the performance boundary, shifting the bottleneck from CPU and memory back to I/O and network. To our users, this shouldn’t be much of a visible change, except that applications become faster. All of this is hidden under the same API, so they will still be using and interacting with the same API, except it’s faster. The other thing is that one of the major efforts in Tungsten is that we are managing memory in Spark ourselves rather than relying on the JVM garbage collector. Because of this, users no longer need to master the black magic of tuning JVM garbage collection, so as a matter of fact it makes Spark both easier to use and more performant.

Enrico:Wow, so it looks like Spark is going to change a lot at the core. I know it’s quite hard to predict the future, but what can we expect from all these changes? What is the future of Spark, and what other features will Project Tungsten enable? And again, I also want to be sure that if I have invested in Spark, I will get the advantages of this new technology too.

Reynold Xin:Absolutely! That’s a very important question. We care deeply about compatibility of applications going forward, which means we are very cautious about changing APIs, especially the core APIs that users are building on. So if your applications today are built on the current [inaudible 00:04:44] API, Tungsten should have no impact on the application itself. You don’t need to rewrite it, you don’t need to rebuild it, you can just deploy it, and then you automatically see the benefits of the Tungsten effort because it is all hidden under the hood in Spark itself. The other thing that’s very important to the whole Spark project, even going beyond Tungsten, is that we want Spark, which is a powerful engine, to benefit as wide a range of users as possible. We want to go beyond the traditional hardcore big data engineers and also enable the scientists, the statisticians, to mine big data. There are a lot of efforts in Spark beyond Tungsten to make the Spark API even easier to use for those guys. So, basically, along with Tungsten, Spark will become easier to use and more performant for those guys.

Enrico:Thank you very much. I would like to ask you many more questions, but hopefully we will record another episode soon. So, thank you very much for your time.

Reynold Xin:Yep. Thanks a lot Enrico.

Enrico:Thank you, bye bye.

Reynold Xin:Bye.