Wow, Hadoop World has come and gone and, as predicted, Spark was the main topic on everyone’s mind. Just to give you a taste of the various Spark announcements:
- Paxata announces their Adaptive Data Prediction product built atop Spark.
- ClearStory Data announces Collaborative Storyboards, which is also built atop Spark.
- GraphLab announces their tool GraphLab Create 1.0 integrated with Spark, Avro and Hadoop.
- Platfora announces version 4 of their product, which, you guessed it, is built atop Spark.
- Trifacta teams with Databricks to extend Trifacta’s Data Transformation Platform to Spark.
- Cloudera announces major support for Spark.
- Strata/Hadoop World announces Spark as a separate track in their future conferences.
Remember, this is all in addition to major support from Hadoop vendors like Cloudera, Hortonworks, MapR, Pivotal and IBM.
But for all that is being said about Spark and its sexiness, one concern came up again and again: its scalability. Lots of people I spoke to said the same thing: it hasn’t been proven in large-scale environments. There is a lot of work currently going on to prove that Spark is scalable, not least the recent blog post by Databricks announcing their record sort times of 23 minutes for 100 TB and 234 minutes for 1 PB (pretty sweet, if you ask me). Even so, lots of folks are waiting to see scalable use cases from major companies. I see this as a chicken-and-egg problem, where everyone is waiting for someone else to jump onto the platform first and face all the issues that come with it. This is where I think startups lead: they can jump ahead of the curve and provide the much-needed verification and validation, as well as be quicker to bring their insights and analysis to market. Great times ahead!!
Also, as an update, Datameer has implemented Tez as their in-memory engine for now. Spark is also on their horizon but, for the same reasons as the rest of the community, they will wait.