Apache Spark and the future of big data analytics
It’s the age of big data innovation, and the open source community is no stranger to producing breakthrough platforms that compete with the immensely expensive proprietary technology market. One name that made the list of the most active open source big data projects worldwide in 2014 was Apache Spark. For beginners, Apache Spark is an open source cluster computing framework originally created at the AMPLab at UC Berkeley. Its developers claim performance up to 100 times faster than Hadoop MapReduce for certain in-memory applications, which makes it well suited to iterative workloads such as machine learning algorithms.
2014 was perhaps the most eventful year for the project, with more than 450 contributors collaborating to make the framework better suited to present-day applications. Of late, several high-profile industries have begun to realize the impact Spark can have when deployed in their real-time IT ecosystems. From estimating financial risk in stock markets to configuring environment parameters in deep space exploration, Apache Spark is opening up a wave of new opportunities for data scientists and analysts to get more meaningful insights out of data.
Apache Spark is seen as the next big thing in data analytics and is perceived by many as a worthy competitor, or should we say successor, to MapReduce, the data processing engine powering Hadoop. Lack of speed and the absence of in-memory processing have been described as the biggest drawbacks plaguing MapReduce, and Spark turns exactly these two features into its biggest selling points. Spark can also process data streams, unlike MapReduce, which processes data in batches and so introduces queuing delays that are unacceptable in many real-time, data-intensive applications.
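To make the in-memory point concrete, here is a toy sketch in plain Python. It is not Spark or Hadoop code, and the names DiskBackedJob, InMemoryJob and compare are hypothetical, made up for this illustration. It only counts how often a job that makes several passes over the same data has to go back to disk.

```python
import json
import os
import tempfile


class DiskBackedJob:
    """Hypothetical job that re-reads its input from disk on every pass,
    roughly how a chain of batch jobs hands data over via storage."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0

    def load(self):
        self.disk_reads += 1
        with open(self.path) as f:
            return json.load(f)

    def run(self, passes):
        # Each pass asks for the data again and sums it.
        total = 0
        for _ in range(passes):
            total += sum(self.load())
        return total


class InMemoryJob(DiskBackedJob):
    """Hypothetical job that reads the input once and keeps it in RAM."""

    def __init__(self, path):
        super().__init__(path)
        self._cache = None

    def load(self):
        if self._cache is None:
            self._cache = super().load()  # the only disk read
        return self._cache


def compare(passes=5):
    """Run both jobs on the same file; return results and disk-read counts."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.json")
        with open(path, "w") as f:
            json.dump(list(range(1000)), f)
        disk_job = DiskBackedJob(path)
        memory_job = InMemoryJob(path)
        return (disk_job.run(passes), memory_job.run(passes),
                disk_job.disk_reads, memory_job.disk_reads)


if __name__ == "__main__":
    print(compare())
```

Both jobs return the same answer, but the disk-backed one reads the file once per pass while the in-memory one reads it only once. Real engines are far more involved, so treat this purely as an analogy for why iterative workloads benefit from keeping data in RAM.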
While Spark may be basking in the glory of in-memory processing, or in simple terms RAM processing, many experts consider it not yet enterprise-ready. They believe that, for now, Spark is a preferable option only for a select set of operational analytics workloads because it is still at an early, if not nascent, stage. A couple of years down the line, it could be a fitting platform for the likes of machine-to-machine (M2M) communication and the Internet of Things (IoT). Until then, Spark is best described as the future of big data. But that future is something we should all look forward to.
Hadoop Vs Spark
So how does Spark fare against Hadoop MapReduce? Well, let us examine a few areas.
Spark is definitely going to put up a tough challenge to Hadoop’s MapReduce, as the speed comparisons show. In benchmark tests, Spark sorted 100 TB of data in just 23 minutes on Amazon Elastic Compute Cloud (EC2) machines, compared with the 72 minutes Hadoop MapReduce took for the same task. Spark accomplished the feat with roughly one tenth of the machines: 206, compared with 2,100 for Hadoop.
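A quick back-of-the-envelope calculation shows what those figures imply. The sketch below uses only the four numbers quoted above; the "machine-minutes" measure and the function name summarize are our own assumptions for illustration, and the measure is crude because it ignores any differences in hardware between the two runs.

```python
# Figures quoted in the article for the 100 TB sort.
SPARK_MINUTES, SPARK_MACHINES = 23, 206
HADOOP_MINUTES, HADOOP_MACHINES = 72, 2100


def summarize():
    """Return simple ratios derived from the quoted benchmark figures."""
    # Machine-minutes = machines * wall-clock minutes (a rough cost proxy).
    spark_mm = SPARK_MINUTES * SPARK_MACHINES
    hadoop_mm = HADOOP_MINUTES * HADOOP_MACHINES
    return {
        "time_ratio": HADOOP_MINUTES / SPARK_MINUTES,
        "machine_ratio": HADOOP_MACHINES / SPARK_MACHINES,
        "spark_machine_minutes": spark_mm,
        "hadoop_machine_minutes": hadoop_mm,
        "machine_minute_ratio": hadoop_mm / spark_mm,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.2f}")
```

By this rough measure Spark finished about three times sooner on about a tenth of the machines, which works out to roughly 32 times fewer machine-minutes for the same sort.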
Spark can run on Hadoop just as MapReduce does, with the difference that MapReduce runs only on Hadoop. Spark, on the other hand, works with resource managers such as YARN or Mesos. This ability to run and exist without Hadoop is what data enthusiasts say is the biggest risk Spark poses to Hadoop’s dominance in present-day big data projects.
This time it’s Hadoop’s turn to fire. While Hadoop already has an established set of tools and best practices that are universally recognized, Spark is relatively young. Even though it boasts a thriving community at present, it will take time for a comprehensive body of practices and support resources to grow around it.
Readiness for deployment
Though several firms are jumping on the Spark bandwagon, experts hold that Spark is not as ready for full-fledged operations as the established Hadoop MapReduce. Most of the time, these organizations have to build enhancements on the platform to make it work for them, which can cost them the precious time saved in processing.
Apache Spark can be likened to a brand-new cockpit with knobs, switches and levers that have not been tested in rough skies. Those piloting it for the first time may have to check the manual again and again. MapReduce, on the other hand, is quite easy to configure given the time and exposure data scientists have had in configuring it in the past. In due time, Spark too will rise to the forefront, but for now, configuring it is no child’s play.
Though it is not yet all smooth sailing, Apache Spark is here to stay. It only needs a little more time to mature and grow into its full capacity. The last year and a half has seen explosive growth in contributors and in the framework’s penetration into new application scenarios. No slowdown has been reported as yet, which puts Spark in a very favorable position to overtake Hadoop in the near future. For the time being, Spark is a very exciting opportunity for enterprises, at least on paper. It is being met with the same enthusiasm that arose when solid state drives started to outperform ordinary hard disk drives. We, along with the entire open source community, are keeping a close watch on Spark as we see a future with many possibilities surrounding faster data analytics. Watch this space for more.