The case for Spark


I have been following Big Data, and Hadoop in particular, for the past four years. A lot has changed since then. I fell into this because I was looking for the next big wave of technologies, especially on the backend, which was my forte at the time. I started looking into Hadoop, and the more I saw, the more I felt it was the right tool to get the job done. Focusing on just the development of jobs, rather than the actual maintenance of the servers and cluster, was always a no-brainer to me. But now that Hadoop has been adopted by lots of industries and companies and has grown quite a bit, it has become apparent that some things are lacking. The two areas where I see the most need, and where a lot of work is currently being done in Hadoop, are Security and Real-Time Analytics, which covers real-time processing of the data as well as real-time reporting of the results. A lot of big data vendors are spending a ton of resources in these two arenas. You can see that in tools such as YARN, Tez, Storm, Impala, Spark, Drill, etc. The acquisitions of XA Secure and Gazzang were similarly made to address the security aspects of Hadoop.

That being said, MapReduce has always been the focal point of Hadoop, since it was involved in processing most of the data. But there has always been the underlying idea that there could be something better than MapReduce. There has been a lot of talk of MapReduce going away. Maybe this is true, maybe not. From my perspective, people will always look at tools that claim to deliver the golden unicorn. By no means am I saying Spark is the unicorn everyone has been waiting for, but it has a lot of promise. Some of the things that make Spark a viable choice are Spark SQL, Spark Streaming, MLlib (machine learning) and GraphX. I believe Spark is part of the toolset that will make real-time processing more achievable.

Being an evangelist, part of my job is to understand how the market shifts with time. With most major vendors on board with Spark (Cloudera, MapR, DataStax, Pivotal), a lot of the uncertainty has been removed. I truly believe that Spark is on the same upward trajectory that Hadoop was on 4-5 years ago. There was a great article I posted last week, "What is Spark?" You should read it if you haven’t.
