The craze around Big Data has led to an increasing number of startups and a lot of venture funding being poured in. In this multi-part series we will take a look at the complete Big Data technology stack and map the most important players and startups leading the industry. To give you an idea of how densely crowded it is, GigaOM estimates that there are more than a hundred big and small players. The tech giants are not sleeping either – IBM, Oracle, Microsoft and EMC are all trying to preserve their market share in this surprisingly overcrowded space.
We will not get into the details of what exactly Big Data is and the challenges we face while working with it. But for the context of this article, and to give a tweet-style idea of what it is, let me put it this way – “Big Data is the problem experienced when the volume of data grows beyond what current-generation RDBMSs can handle. The basic idea for solving it is scaling out using distributed computing.” If you want a deep dive into what exactly it is, I highly recommend this article.
Let’s take a look at the following Big Data stack.
Note that I have not included each and every player here – it would be overwhelming if I tried. What do you notice first? A bunch of references to Apache. Yes, most of the projects are managed by the Apache Software Foundation and are open source. How did that happen? Most of the companies that first ran into the Big Data problem were web companies whose crucial mission was solving their own business problems. They later contributed their toolsets back to Apache for the general good.
How is this slide organized?
On the left pane, you can see the group called major NoSQL databases. For those with a SQL background, here is an excellent video to watch before worrying about or rejecting NoSQL databases. There are different types of NoSQL databases, each focusing on different use cases. MongoDB, DynamoDB and Cassandra are the most popular among them. Neo4j is a graph database and has been receiving a lot of attention lately. The most common use case for a graph database is modeling relationships among users when building any kind of network. Some of the current implementations can be found here.
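To make the graph-database use case concrete, here is a minimal Python sketch of the friend-of-friend traversal that a graph database like Neo4j is optimized for. This uses plain dictionaries rather than any graph database, and the users and network are invented for illustration.

```python
# A tiny social graph as an adjacency list (the shape a graph DB stores natively).
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}

def friend_suggestions(user):
    """Suggest friends-of-friends who are not already direct friends."""
    direct = friends.get(user, set())
    suggestions = set()
    for friend in direct:
        suggestions |= friends.get(friend, set())
    # Exclude people the user already knows, and the user themselves.
    return suggestions - direct - {user}

print(friend_suggestions("alice"))  # {'dave'} – dave is bob's friend, not alice's
```

In a relational database this kind of traversal turns into self-joins that get slower as the hops increase; a graph database walks the edges directly, which is why social-network use cases gravitate towards it.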
Another thing you may notice is the “Google stack” here. Though none of Google’s technologies are open source, the reason for their inclusion is that several of these projects were inspired by Google white papers. For each level in the Big Data stack that I detail below, you can find the equivalent Google term for reference.
Even though there is a lot of difference of opinion and competition at the other levels of the stack, there is increasing unanimity about the basic framework – Hadoop. A basic building block of distributed computing, Hadoop is a framework that encompasses HDFS and MapReduce. HDFS is responsible for providing a fault-tolerant, distributed file system. Like the various distributions of Linux, you also get vendor-enhanced distributions of Hadoop. The major players in this space are Cloudera, MapR, Hortonworks and Amazon EMR; of these, Amazon EMR is Hadoop on the AWS (Amazon Web Services) cloud. The beauty of this is that you can spawn EMR instances on the fly. “Hadoop as a Service” is the industry term for a vendor providing Hadoop instances in the cloud on a subscription basis.
MapReduce, Hive and Pig
Just having your data on multiple servers is not going to solve your problems. You need to process that data, and these technologies help you do it. MapReduce is a framework in which you process your data on multiple nodes and later combine the intermediate results into a single result set. You can write your MapReduce programs in Java or other programming languages: Hadoop also has a streaming framework that supports languages like Perl, Python and even good old shell scripts. Interested? Check this awesome tutorial.
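To give a flavor of the model, here is the classic word-count example in Python. On a real cluster the mapper and reducer would run as separate streaming scripts across many nodes; below they are wired together in one file, with the shuffle/sort phase simulated, so the logic can be followed end to end. The sample input is made up.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: pairs arrive grouped by key; sum the counts per word."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Simulate the shuffle/sort step Hadoop performs between map and reduce.
lines = ["big data big deal", "data everywhere"]
shuffled = sorted(mapper(lines))
counts = dict(reducer(shuffled))
print(counts)  # {'big': 2, 'data': 2, 'deal': 1, 'everywhere': 1}
```

The key point is that the mapper and reducer never see the whole dataset at once, which is what lets Hadoop spread the work across nodes.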
Writing custom MapReduce programs is not always fun. Thankfully, we have more and more alternatives now; Pig and Hive are the ones you should definitely be aware of. Pig can be viewed as an ETL tool for unstructured data. Developed at Yahoo, this tool helps you write program snippets for data transformation.
Hive lets you write SQL-like queries on unstructured data lying on HDFS. Hive has got a lot of flak because of its performance, but I absolutely love it – it is one of the reasons Big Data has become popular. Hive has an independent metastore where it preserves your defined data structures. Both Hive and Pig spawn MapReduce jobs in the background to process the data. If you are a BI developer and new to Big Data, these are the ones you should concentrate on!
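To see why this is convenient, consider a simple grouped count – in HiveQL you would write something like `SELECT status, COUNT(*) FROM visits GROUP BY status`, and Hive compiles it into a MapReduce job for you. The Python sketch below shows roughly the work that query replaces; the table layout and data are invented for illustration.

```python
from collections import Counter

# Rows as they might sit in a raw file on HDFS: tab-separated, one visit per line.
raw_rows = [
    "2013-01-01\t/home\t200",
    "2013-01-01\t/login\t404",
    "2013-01-02\t/home\t200",
]

# Map step: project out the grouping column (the HTTP status, field 3).
statuses = (line.split("\t")[2] for line in raw_rows)

# Reduce step: count occurrences per key.
status_counts = Counter(statuses)
print(dict(status_counts))  # {'200': 2, '404': 1}
```

A one-line query versus a hand-written job for every ad-hoc question – that trade is why Hive resonated so strongly with the data-warehousing crowd.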
Apache Accumulo and HBase
Apache Accumulo and HBase are non-relational column-store databases that store sorted, distributed key-value pairs. Note that these are essentially NoSQL databases, but I have categorized them separately because of their close association with HDFS: both support HDFS for storing their data. Accumulo was developed by the now-infamous US NSA (National Security Agency) and contributed to the Apache Foundation. The use case is to store huge amounts of unstructured data on HDFS and fetch records based on the key – consider any web email service as an example. The fundamental difference between Accumulo and HBase is Accumulo’s emphasis on tighter security (surprise, eh? 🙂). Accumulo supports ACL security at the cell level.
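A rough Python sketch of the cell-level security idea: every cell carries a visibility label, and a scan returns only the cells the reader’s authorizations satisfy. The row keys, labels and values here are invented, and real Accumulo visibility labels are boolean expressions rather than single tokens – this is just the intuition.

```python
# Each cell: (row key, column) -> (value, visibility label).
table = {
    ("msg001", "subject"): ("quarterly report", "public"),
    ("msg001", "body"): ("numbers attached", "secret"),
    ("msg002", "subject"): ("lunch?", "public"),
}

def scan(row_key, authorizations):
    """Return only the cells of a row the reader is cleared to see."""
    return {
        col: value
        for (row, col), (value, label) in sorted(table.items())  # keys stay sorted
        if row == row_key and label in authorizations
    }

print(scan("msg001", {"public"}))            # only the subject is visible
print(scan("msg001", {"public", "secret"}))  # the body is visible as well
```

Because the label lives on the cell rather than the table, two users scanning the same row can legitimately see different data – which is exactly the property an intelligence agency would want.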
Impala, Drill, CitusDB and Hadapt
Impala, Drill, CitusDB and Hadapt are all interactive analytical tools for large datasets.
Everyone liked Hadoop because of its emphasis on distributed processing. But when it comes to data analytics there is a drawback: MapReduce, the fundamental foundation of Hadoop, is geared towards batch processing. MapReduce is great for a traditional data warehousing ETL pipeline, but you cannot fire real-time queries and get fast responses. By this time Hadoop was popular, and early aspirants wanted to solve the problem on their own and make revenue out of the solution. Hadapt and CitusDB are proprietary solutions that let you query HDFS in real time. Cloudera came out with Impala recently and has open sourced it under the Apache license; reviews of the product are positive. I have not got hands-on with it yet, but I am excited to learn that it supports most SQL functions. Drill is still an Apache Incubator project but has the broader ambition of querying any NoSQL database instead of focusing only on HDFS. If you would like to dive deeper, here is a good conversation on Quora.
Mahout
If you know R, this one is a no-brainer: think of Mahout as an elder brother of R that can leverage Hadoop’s capabilities. For others, Mahout is machine-learning software built with Hadoop integration. The most common use cases are recommendation systems (Amazon’s recommendations are an example).
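Mahout’s recommenders are far more sophisticated, but the core intuition of item-based collaborative filtering fits in a few lines of Python: recommend items that frequently co-occur with what the user already has. The users and purchases below are toy data invented for illustration.

```python
from collections import Counter

# Who bought what (toy data).
purchases = {
    "ann": {"book", "lamp"},
    "bob": {"book", "pen"},
    "cat": {"book", "pen", "lamp"},
}

def recommend(user):
    """Score unseen items by how often users with overlapping taste own them."""
    owned = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other != user and owned & items:  # shares at least one item
            scores.update(items - owned)     # count their other items
    return [item for item, _ in scores.most_common()]

print(recommend("ann"))  # ['pen'] – pen co-occurs with ann's items via bob and cat
```

Mahout’s value-add is doing this kind of computation over millions of users by expressing it as MapReduce jobs, rather than the single-machine loop shown here.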
Oozie Workflow Engine
As you have already read, there are many moving parts within the Hadoop framework. To ease maintenance, the Oozie workflow engine was introduced. You can compare it to an enterprise job scheduler or the control flow in an ETL engine.
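At its heart, a workflow engine like Oozie runs actions in dependency order. The Python sketch below shows that core idea with a toy topological sort; the action names are invented, and real Oozie workflows are defined in XML and run Hadoop jobs rather than Python functions.

```python
# A workflow as a DAG: each action maps to the actions it depends on.
workflow = {
    "import": [],
    "clean": ["import"],
    "aggregate": ["clean"],
    "export": ["aggregate"],
    "report": ["aggregate"],
}

def run_order(dag):
    """Topologically sort the workflow so every dependency runs first.
    Assumes the graph is acyclic, as a valid workflow must be."""
    done, order = set(), []
    while len(done) < len(dag):
        for action, deps in sorted(dag.items()):
            if action not in done and all(d in done for d in deps):
                order.append(action)
                done.add(action)
    return order

print(run_order(workflow))  # 'import' first, 'export'/'report' after 'aggregate'
```

Oozie adds the parts that matter in production on top of this: time- and data-based triggers, retries, and recovery when an individual action fails.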
Hue
One of the main drawbacks of the Hadoop framework has been its reliance on the good old command shell. Cloudera realized that mainstream acceptance requires the framework to have good UI support, so they introduced Hue, which provides a GUI for the Hadoop framework. Other vendors have their own proprietary UIs. Hue comes along with Cloudera’s CDH4 demo VM (a virtual machine with the whole Hadoop framework preinstalled), which you can download and try.
Spark and Shark
Spark was designed to solve one of the bottlenecks of Hadoop: slow disk I/O. Spark is an in-memory cluster computing system, and Shark is a Hive-like interface on top of Spark. Interestingly, Shark can read from and write to HDFS.
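The gain Spark targets can be sketched with a caching analogy: chained MapReduce jobs go back to disk on every pass over the data, while Spark loads the working set into memory once and iterates on it cheaply. The toy Python below simulates the “disk read” with a counter; the numbers and dataset are invented.

```python
reads_from_disk = 0

def load_dataset():
    """Simulated expensive disk read, like a MapReduce pass over HDFS."""
    global reads_from_disk
    reads_from_disk += 1
    return list(range(10))

# MapReduce-style: every iteration re-reads the data from disk.
for _ in range(3):
    total = sum(load_dataset())

# Spark-style: load once, cache in memory, then iterate cheaply.
cached = load_dataset()
for _ in range(3):
    total = sum(cached)

print(reads_from_disk)  # 4 – three reads for the first loop, one for the cached loop
```

This is why iterative workloads such as machine learning, where the same dataset is scanned many times, were the early sweet spot for Spark.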
In the next part of the series I will talk about the remaining players, unless you would like me to describe any particular Big Data subject in more detail.
I hope this short introduction was of benefit to newcomers to the field. I am open to suggestions, feedback and critique. In a forthcoming part I will also touch on how these toolsets relate to traditional data warehousing.