As to answer the question of what is Hadoop, we first must understand the underlying issues related to Big Data and the traditional processing system. If you aren’t sure if you understand what Big Data is, we recommend reading our article to gain a better understanding of it, as it is a prerequisite to understanding Hadoop.
Going into this article, we will explain what is Hadoop, and how it serves as a solution to the underlying problems of Big Data.
What is Hadoop?
Hadoop is a collection of open-source software utilities that distribute the storing and processing of data across a network of computers instead of one central location.
Applications created using Hadoop are run on large data sets that are distributed across clusters of commodity computers. Commodity computers are cheap and widely available, in contrast to more powerful and more expensive super computers.
“The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.” – Doug Cutting
Hadoop was originally created by Doug Cutting at Yahoo! and inspired by the MapReduce function developed by Google in the early 2000s for indexing web traffic. By now, Hadoop has grown to become the standard software framework for processing Big Data and is used by most major Big Data solutions providers.
The original idea of Hadoop was to create something that would enable the storing and processing of data in a distributed, automated way so that it would be faster and more efficient. It provides a cost-effective solution for storing and processing massive amounts of structured, semi- and unstructured data with no format requirements. This makes Hadoop perfect for building data lakes to support big data analytics.
The importance of Hadoop:
Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that’s a key consideration.
Computing power. Hadoop’s distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required. Within the limits of power and cost, essentially any number of nodes can be incorporated within a single supercomputer.
What is a Commodity Cluster?
Clusters of commodity computers exploit the economy of scale of their mass produced subsystems and components to deliver the best performance relative to cost in high performance computing for many user workloads. This method of computing is more efficient than using fewer high-performance but high-costing computers. While cluster computing doesn’t deliver the absolute best performance in the field, is it the method more likely to be encountered in a typical machine room, even when a data centre may contain a few high performance and expensive parallel processors among its other computing resources.
Figure 1: Example of a commodity cluster data storage room.
The Four Modules of Hadoop
In the Hadoop system, there are four ‘modules’ that each carry out a particular task essential for a computer system designed for big data analytics. These four modules are:
Distributed File System
1. Distributed File System
In order to analyse massive quantities of structured and unstructured data, data needs to be broken into “parts,” which are then loaded into a distributed storage system made up of multiple nodes running on commodity hardware.
The Hadoop Distributed File System (HDFS) is the file system that enables these data parts to be stored on different machines (commodity hardware) in a cluster. HDFS therefore enables distributed storage.
A file system is a method of which a computer stores data so it can be found and used at any given time. It is usually determined by a computer’s operating system as different operating systems may have different distributed file systems.
In contrast, a Hadoop system uses its own file system which, in a sense, sits ‘above’ the file system of a computer. This means it can be accessed using any computer running any support operating system.
The Hadoop distributed file system takes care of the storage part of Hadoop applications. MapReduce applications consume data from HDFS and then the HDFS creates multiple replicas of data blocks distributes them on compute nodes in a cluster. This method of distribution enables reliable and extremely fast computations.
One of the core properties of the HDFS is that each of the data parts is replicated multiple times and distributed across multiple nodes within the cluster. If one node fails, another node has a copy of that specific data package that can be used for processing. Due to this, data can still be processed and analysed even when one of the nodes fails due to a hardware failure. This makes HDFS and Hadoop a very robust system.
Since HDFS stores multiple copies of the data parts across different nodes in the cluster, it is very important to keep track of where the data parts are stored, and which nodes are available or have failed. The NameNode performs this task. It acts as a facilitator that communicates where data parts are stored and if they are available. The NameNode is the centrepiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Figure 2: How NameNode works
Once the data parts are stored across different nodes in the cluster, it can be processed. The MapReduce framework ensures that these tasks are completed by enabling the parallel distributed processing of the data parts across the multiple nodes in the cluster.
Figure 3: MapReduce diagram showcases how data is processed with the MapReduce function
The node that initiates the Map procedure is labelled the Job Tracker (discussed next). The Job Tracker then refers to the NameNode to determine which data is needed to execute the request and where the data parts are located in the cluster. Once the location of necessary data parts is established, the Job Tracker submits the query to the individual nodes, where they are processed. The processing thus takes place locally within each node, establishing the key characteristic of distributed processing.
The Job Tracker is the node in the cluster that initiates and coordinates processing jobs. Additionally, the Job Tracker invokes the Map procedure and the Reduce method.
Figure 4: Diagram further explaining the Job Tracker function in MapReduce
The first operation of the MapReduce framework is to perform a “Map” procedure. One of the nodes in the cluster requests the Map procedure – usually in the form of a Java query – in order to process some data.
The second operation of the MapReduce framework is to execute the “Reduce” method. This operation happens after processing. When the Reduce job is executed, the Job Tracker will locate the local results (from the Map procedure) and aggregate these components together into a single final result.
This final result is the answer to the original query and can be loaded into any number of analytics and visualisation environments.
Simply put, MapReduce reads data from the database and puts into a format suitable for analysis (map) and performs mathematical operations i.e. counting the number of customers aged 20+ in a database (reduce).
3. Hadoop Common
The third module is the Hadoop Common. This is the module that provides the tools needed for the user’s computer systems to read data stored under the Hadoop File system. Like the other modules, Hadoop common assumes that hardware failures are common and that these should be automatically handled in software by the Hadoop framework. It can be considered that the Hadoop Common module provides the base/foundation of the framework as it provides the essential services and procedures for the framework.
The last of the four modules is YARN. Yarn stands for ‘Yet Another Resource Negotiator’. This is the module that manages the resources of the systems storing the data and running the analysis. It is the resource management and job scheduling technology in the Hadoop distributed processing framework. It is responsible for allocating system resources to the various applications running in a Hadoop Cluster and scheduling tasks to be executed on different cluster nodes.
Real World Example of Hadoop – Facebook
“Facebook runs the world’s largest Hadoop cluster” – Jay Parikh, Vice President Infrastructure Engineering, Facebook.
135 terabytes of compressed data is processed daily and 4 terabytes of compressed data is added daily. Facebook has a Hadoop warehouse with two level network topology having 4800 cores, 5.5 PB storing up to 12 terabytes per node. 7500+ Hadoop hive jobs run in production cluster per day with an average of 80,000 compute hours. Non-engineers i.e. analysts at Facebook use Hadoop through hive and approximately 200 people/month run jobs on Apache Hadoop.
This extensive cluster provides some key abilities to developers:
The developers can freely write map-reduce programs in any language.
SQL has been integrated to process extensive data sets, as most of the data in Hadoop’s file system are in table format. Hence, it becomes easily accessible to the developers with small subsets of SQL.
Hadoop/Hive warehouse at Facebook uses a two level network topology –