An Introduction to the Hadoop Ecosystem
Everyone who is learning about Big Data will, within a short period of time, encounter the Hadoop software framework. When I first encountered it, I found the ecosystem (full of highly interesting names) very elaborate, yet at the same time confusing. The many components (or, to stick to Apache terminology, ‘related projects’) sometimes add to the confusion, because each of these related projects has a different focus or orientation. In this article, I will therefore aim to provide a comprehensive overview of the Hadoop Ecosystem as I have encountered it.
Background, History and Objectives of the Apache Hadoop Project
Before we start to explore the Hadoop Ecosystem in further detail, it would be good to understand why Hadoop was created. As with many open-source inventions that scale quickly, Hadoop clearly solves a specific problem. So what is the problem that led to its creation?
Hadoop was originally created by Doug Cutting and Mike Cafarella in 2005, when they were working on the Nutch Search Engine project. Nutch is a highly extensible and scalable web crawler, used to search and index web pages. In order to search and index the billions of pages that exist on the internet, vast computing resources are required. Instead of relying on (centralized) hardware to deliver high availability, Cutting and Cafarella developed software (now known as Hadoop) that was able to distribute the workload across clusters of computers.
With the development of Hadoop, Cutting and Cafarella solved the following problems:
- Instead of using centralized hardware, Hadoop distributes the workload over a cluster (i.e. a network of computers) that consists of commodity hardware. Because commodity hardware can be used, Hadoop provides a cost-effective solution for companies that need to process Big Data.
- Hadoop is designed to scale up (and down) from single servers to thousands of machines, making it highly efficient for batch processing. In most enterprise organizations, high-processing workloads are only required for relatively short periods of time (i.e. not continuously). The scalability of Hadoop makes it extremely suitable for batch processing, and consequently for users of Cloud Computing environments, which frequently work on a pay-per-use model.
- Because Hadoop distributes workloads over large clusters of commodity hardware, the library has been designed to detect and handle failures at the application layer. If nodes in the network fail for whatever reason, Hadoop has built-in redundancy that will still complete the processing job. This makes Hadoop more resilient, or fault-tolerant, than traditional (centralized) solutions.
When you think about these benefits, it is not difficult to see why Hadoop has quickly grown to become the industry standard for distributed storage and processing. Add to these benefits that Hadoop is an open-source solution (i.e. non-proprietary), and you will understand why it has been embraced by almost all enterprise organizations.
And the name Hadoop? This remains one of the most simple, yet fascinating stories in computing. Hadoop was named after the stuffed yellow toy elephant of Doug Cutting’s two-year-old son, who was just beginning to talk and called the elephant ‘Hadoop.’ In the words of Doug Cutting:
“The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.”
The Core Components of Hadoop: Common, HDFS, MapReduce and YARN
When we take a more in-depth look at the Hadoop ecosystem, we can see that Hadoop can be broken down into four main components. These four components (which are in fact independent software modules) together compose a Hadoop cluster. In other words, when people talk about ‘Hadoop,’ they actually mean that they are using (at least) these four components:
- Hadoop Common – Contains libraries and utilities needed by all the other Hadoop modules;
- Hadoop Distributed File System – A distributed file-system that stores data on commodity machines, providing high aggregate bandwidth across a cluster;
- Hadoop MapReduce – A programming model for large-scale data processing;
- Hadoop YARN – A resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications.
In the next paragraphs, we will provide an overview of these four main components.
Hadoop Common

Hadoop is a Java-based solution, meaning it has been written in Java. Hadoop Common provides the tools (in Java) that allow a user’s computer to read the data stored in a Hadoop file system (discussed next). No matter what operating system a user runs (Windows, Linux, etc.), Hadoop Common ensures that the data can be correctly interpreted by the machine.
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. This means that HDFS stores files that typically range from many terabytes up to petabytes of data. The streaming nature of HDFS means that it stores data under the assumption that the data will need to be read multiple times, and that the speed with which it can be read matters most. Lastly, HDFS is designed to run on commodity hardware: inexpensive hardware that can be sourced from different vendors.
In order to achieve these properties, HDFS breaks data down into smaller ‘blocks,’ typically of 64 MB (later Hadoop versions default to 128 MB). Because of this abstraction into blocks, there is no requirement that the blocks of a file be stored on the same disk. Instead, they can be stored anywhere, and that is exactly what HDFS does: it stores data at different locations in a network (i.e. a cluster). For that reason, it is referred to as a distributed file system.
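The block abstraction can be illustrated with a short Python sketch (a toy illustration only; HDFS itself is implemented in Java, and the function name here is my own invention):

```python
# Toy illustration of HDFS-style block splitting, not actual HDFS code.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Return the list of blocks that would be stored for this file."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# A 150 MB file becomes two 64 MB blocks plus one 22 MB block; unlike a
# disk sector, the last block only occupies as much space as it needs.
blocks = split_into_blocks(b"x" * (150 * 1024 * 1024))
print([len(b) // (1024 * 1024) for b in blocks])  # [64, 64, 22]
```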
Because the blocks are stored in a cluster, the question of fault tolerance arises. What happens if one of the connections in the network fails? Does this mean that the data becomes incomplete? To address this potential problem of distributed storage, HDFS stores multiple (typically three) redundant copies of each block in the network. If a block for whatever reason becomes unavailable, a copy can be read from an alternative location. Due to this useful property, HDFS is a very fault-tolerant, or robust, storage system.
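The idea of redundant copies can be sketched as follows (again a toy model, not HDFS’s real placement policy, which also takes rack topology into account): each block is assigned to three distinct nodes, so losing any single node never loses data.

```python
# Toy sketch of replica placement; assumes replication <= number of nodes.
import itertools

REPLICATION_FACTOR = 3

def place_replicas(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Round-robin each block onto `replication` distinct nodes."""
    ring = itertools.cycle(range(len(nodes)))
    placement = {}
    for block in block_ids:
        start = next(ring)
        placement[block] = [nodes[(start + k) % len(nodes)]
                            for k in range(replication)]
    return placement

placement = place_replicas(["blk_1", "blk_2"],
                           ["nodeA", "nodeB", "nodeC", "nodeD"])
print(placement)
# {'blk_1': ['nodeA', 'nodeB', 'nodeC'], 'blk_2': ['nodeB', 'nodeC', 'nodeD']}
```

If nodeB fails, every block it held is still readable from two other nodes, which is exactly the fault tolerance described above.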
Hadoop MapReduce

MapReduce is a processing technique and programming model that enables distributed processing of large quantities of data, in parallel, on large clusters of commodity hardware. Similar to the way HDFS stores blocks in a distributed manner, MapReduce processes data in a distributed manner. In other words, MapReduce uses the processing power of local nodes within the cluster, instead of centralized processing.
In order to accomplish this, a processing query needs to be expressed as a MapReduce job. A MapReduce job works by breaking the processing down into two distinct phases: the ‘Map’ operation and the ‘Reduce’ operation. The Map operation takes in a set of data and converts it into a new data set, in which individual elements are broken down into key/value pairs.
The output of the Map function is processed by the MapReduce framework before being sent to the Reduce operation. This intermediate processing sorts and groups the key/value pairs, a step also known as shuffling. Shuffling is technically embedded in the Reduce operation. The Reduce operation subsequently processes the (shuffled) output data from the Map operation and converts it into a smaller set of key/value pairs. This smaller set is the end result: the output of the MapReduce job.
In summary, we could say that MapReduce executes in three stages:
- Map stage – The goal of the Map operation is to process the input data. Generally, the input data is in the form of a file or directory stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The Map operation processes the data and creates several small chunks of data.
- Shuffle stage – The goal of the Shuffle operation is to sort and group the key/value pairs so that they are ordered into the right sequence.
- Reduce stage – The goal of the Reduce operation is to process the data that comes from the Map operation. After processing, it produces a new set of output, which is stored in HDFS.
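The three stages above can be sketched as a word-count job in plain Python. This is a single-process toy simulation of the model only; real Hadoop jobs implement the Map and Reduce phases as Java classes that the framework runs across many nodes.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: turn each input line into (word, 1) key/value pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each group into a single (word, count) pair."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
result = reduce_phase(shuffle_phase(map_phase(lines)))
print(result)
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In a real cluster, the map calls would run on the nodes that already hold the relevant HDFS blocks, and the shuffle would move key/value pairs across the network to the reducers.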
The main benefit of using MapReduce is that it is able to scale quickly over large networks of computing nodes, making the processing highly efficient and quick.
Hadoop YARN

Hadoop YARN (Yet Another Resource Negotiator) is responsible for allocating system resources to the various applications running in a Hadoop cluster and for scheduling tasks to be executed on different cluster nodes. It was developed because in very large clusters (with more than 4,000 nodes), the original MapReduce system began to hit scalability bottlenecks.
Hadoop YARN solves the scalability problem by introducing a resource manager that manages the use of resources across the cluster. The resource manager takes over the responsibilities of the earlier JobTracker, which in turn scheduled jobs (deciding which data is processed when) and monitored tasks (tracking which processing jobs have been completed). With the addition of YARN to Hadoop, the scalability of processing data with Hadoop becomes virtually unlimited.
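The core idea of a central resource manager can be sketched in a few lines. This is a hypothetical, heavily simplified model (the class and method names are mine, not the YARN API): the manager tracks free memory per node and grants ‘containers’ to applications until the cluster is full.

```python
# Toy model of YARN-style resource management, not the real YARN API.
class ResourceManager:
    def __init__(self, node_capacities):
        # node name -> free memory in MB
        self.free = dict(node_capacities)

    def allocate(self, app, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for node, free in self.free.items():
            if free >= memory_mb:
                self.free[node] -= memory_mb
                return (app, node, memory_mb)
        return None  # saturated; in real YARN the request waits in a queue

rm = ResourceManager({"node1": 4096, "node2": 2048})
grant1 = rm.allocate("mapreduce-job", 3072)
grant2 = rm.allocate("spark-job", 2048)
print(grant1)  # ('mapreduce-job', 'node1', 3072)
print(grant2)  # ('spark-job', 'node2', 2048)
```

Because scheduling decisions like these are made by one dedicated component rather than being entangled with MapReduce itself, YARN can also host non-MapReduce applications on the same cluster.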
Other components of the Hadoop Ecosystem
Besides the four core components of Hadoop (Common, HDFS, MapReduce and YARN), the Hadoop Ecosystem has greatly developed with other tools and solutions that complement the four main components. Some of the more popular solutions are Pig, Hive, HBase, ZooKeeper and Sqoop. Each of these components is discussed in further detail in separate articles.
The architecture of HDFS is described in “The Hadoop Distributed File System” by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler (Proceedings of MSST2010, May 2010, http://storageconference.org/2010/Papers/MSST/Shvachko.pdf).