Think you need Hadoop? Think again
Today’s post is written by Rick DelGado. His guest post looks at whether you really need Hadoop. Thanks Rick for your input.
Amid the big data buzz, Hadoop has been the solution of choice, leaving many feeling it is their only option for harnessing big data. However, there are many other big data tools out there that offer different features than Hadoop and may actually fit your business needs better. Flash array storage, in particular, has made it easier to create fast, affordable storage options, so check out these other big data solutions before settling on Hadoop.
HPCC, a system developed by LexisNexis, is similar to Hadoop in that it is used to build clusters of servers for the purpose of analysing large data sets. HPCC uses Enterprise Control Language (ECL) to make the process of writing parallel-processing workflows easier and, also like Hadoop, has an ecosystem of tools built around it, including the Roxie Rapid Data Delivery Cluster, a data warehouse similar to HBase, and the Thor Data Refinery Cluster, a data processor.
While HPCC is still in its preliminary stages, its similarities to Hadoop make it a serious alternative as a big data solution.
Storm, first developed by BackType before being acquired by Twitter, was dubbed the “Hadoop of real-time processing” in a blog post by Nathan Marz. The distinction matters because Hadoop is a batch processor that works with fixed data sets, whereas Storm can process data as it streams in. Of course, real-time processing is also being addressed within the Hadoop ecosystem, which gives Storm some competition.
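To make the batch-versus-streaming distinction concrete, here is a minimal sketch in plain Python (not the actual Storm or Hadoop APIs, and all function names are hypothetical): a batch word count produces one answer after reading the whole data set, while a streaming word count has an up-to-date answer after every record.

```python
from collections import Counter

# Batch model (Hadoop-style): the full, fixed data set is available up front.
def batch_word_count(lines):
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts  # one final answer after the whole batch is read

# Streaming model (Storm-style): results are updated as each record arrives.
def streaming_word_count(stream):
    counts = Counter()
    for line in stream:
        counts.update(line.split())
        yield dict(counts)  # an up-to-date answer after every record

data = ["big data", "big hadoop"]
print(batch_word_count(data)["big"])           # 2, counted once at the end
print(next(streaming_word_count(iter(data))))  # available after the first record
```

The streaming version never "finishes"; it simply keeps yielding refreshed results as long as records keep arriving, which is the essence of Storm's model.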
Developed by Nokia Research, the Disco Project has been around for a while without attracting much attention. With Disco, data is distributed and replicated in a manner similar to Hadoop, and Disco has its own job-scheduling features. However, Disco doesn’t use its own file system. The advantage of Disco is that its backend is written in Erlang, a language with built-in support for fault tolerance, distribution and concurrency.
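Disco jobs are written in Python, so the MapReduce pattern that Disco and Hadoop share is worth sketching. Below is a minimal, self-contained illustration in plain Python (not the Disco API; the function names are hypothetical) of the three phases a framework runs for you: map, shuffle, and reduce.

```python
from itertools import groupby
from operator import itemgetter

# Map phase: emit (key, value) pairs from each input record.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

# Shuffle: group emitted pairs by key (normally done by the framework).
def shuffle(pairs):
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [v for _, v in group]

# Reduce phase: combine all values for each key into one result.
def reduce_phase(grouped):
    for key, values in grouped:
        yield key, sum(values)

lines = ["to be or not to be"]
result = dict(reduce_phase(shuffle(map_phase(lines))))
print(result)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In a real cluster, the map and reduce functions run in parallel across machines and the shuffle moves data between them; the program structure, however, stays this simple.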
Spark, one of the newest options on the market, was created at UC Berkeley specifically to make it faster to write and run data analytics. A key difference from other MapReduce systems is that Spark permits in-memory querying of data instead of relying on disk I/O. Spark also performs better than Hadoop on several iterative algorithms and is written in Scala, an object-oriented language that lets users make queries directly from the Scala interpreter.
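Why does in-memory data help iterative algorithms in particular? A disk-oriented MapReduce job re-reads its input on every pass, while Spark can cache the data set once and iterate over it in memory. The following plain-Python sketch (not the Spark API; `load_dataset` is a hypothetical stand-in for an expensive read) simulates that difference.

```python
import time

def load_dataset():
    """Hypothetical stand-in for an expensive read from disk."""
    time.sleep(0.01)  # simulate disk I/O latency
    return list(range(1000))

# Disk-oriented style: re-read the data set on every iteration.
def iterate_from_disk(n_iters):
    total = 0
    for _ in range(n_iters):
        data = load_dataset()  # pays the I/O cost on every pass
        total = sum(x * 2 for x in data)
    return total

# Spark-style: cache the data set in memory once, then iterate over it.
def iterate_in_memory(n_iters):
    data = load_dataset()      # pays the I/O cost exactly once
    total = 0
    for _ in range(n_iters):
        total = sum(x * 2 for x in data)
    return total
```

Both functions compute the same result, but the in-memory version pays the load cost once instead of once per iteration; with dozens of iterations over a large data set, that difference dominates the runtime.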
GraphLab was created to make designing and implementing parallel machine-learning algorithms easier. It differs from MapReduce in that its update phase can read and modify data sets that overlap, while MapReduce requires that all data sets be kept separate. GraphLab also offers its own version of the reduce stage, called the sync operation, in which the output is global rather than local.
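The update-and-sync idea can be sketched in a few lines of plain Python (not the GraphLab API; the graph and functions are hypothetical): each vertex update reads its neighbours' current values, so the data touched by different updates overlaps, and the sync computes one global result over the whole graph.

```python
# A tiny graph: each vertex stores a value and a list of neighbours.
graph = {
    "a": {"value": 1.0, "neighbors": ["b", "c"]},
    "b": {"value": 3.0, "neighbors": ["a"]},
    "c": {"value": 5.0, "neighbors": ["a"]},
}

# Update phase (GraphLab-style): a vertex reads overlapping data --
# its neighbours' current values -- which strict MapReduce would forbid.
def update(vertex):
    node = graph[vertex]
    neighbor_values = [graph[n]["value"] for n in node["neighbors"]]
    node["value"] = (node["value"] + sum(neighbor_values)) / (1 + len(neighbor_values))

# Sync operation: a reduce-like step whose output is global, not per-key.
def sync():
    return sum(v["value"] for v in graph.values()) / len(graph)

for v in graph:
    update(v)
print(round(sync(), 2))  # 3.33, the global average after one round of updates
```

Because updates read values that other updates may be writing, a real GraphLab engine also has to schedule them consistently; that scheduling, not shown here, is a large part of what the framework provides.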
Microsoft offers three Hadoop alternatives.
Azure Table Storage is offered in Microsoft’s cloud and is meant to serve as an alternative data store to the one provided by Hadoop. It is not an analytics system.
LINQ to HPC lets you build clusters of servers using Microsoft’s LINQ programming model and run data analytics on unstructured data, just as Hadoop does.
Azure Project Daytona is a research project that is based on MapReduce. It runs as a service that provides ready-to-use algorithms and can be delivered through Excel.
As you can see, there is actually plenty to choose from when looking for a big data solution, and an added benefit is that many of these tools can work together, including with Hadoop, to create a customized solution for your individual needs.
“I’ve been blessed to have a successful career and have recently taken a step back to pursue my passion of writing. I’ve started doing freelance writing and I love to write about new technologies and how they can help us and our planet.” – Rick DelGado