The Four Modules of Hadoop
In the Hadoop system, there are four ‘modules’ that each carry out a particular task essential for a computer system designed for big data analytics. These four modules are:
- Distributed File System (HDFS)
- MapReduce
- Hadoop Common
- YARN
1. Distributed File System
In order to analyse massive quantities of structured and unstructured data, data needs to be broken into “parts,” which are then loaded into a distributed storage system made up of multiple nodes running on commodity hardware.
The Hadoop Distributed File System (HDFS) is the file system that enables these data parts to be stored on different machines (commodity hardware) in a cluster. HDFS therefore enables distributed storage.
A file system is the method by which a computer stores data so that it can be found and used at any given time. It is usually determined by the computer’s operating system, as different operating systems may use different file systems.
In contrast, a Hadoop system uses its own file system which, in a sense, sits ‘above’ the file system of the host computer. This means it can be accessed from any computer running any supported operating system.
The Hadoop Distributed File System takes care of the storage part of Hadoop applications. HDFS creates multiple replicas of each data block and distributes them across the compute nodes in a cluster, and MapReduce applications then consume this data directly from HDFS. This method of distribution enables reliable and extremely fast computation.
One of the core properties of HDFS is that each of the data parts is replicated multiple times and distributed across multiple nodes within the cluster. If one node fails, another node has a copy of that specific data part that can be used for processing. Because of this, data can still be processed and analysed even when one of the nodes suffers a hardware failure. This makes HDFS, and Hadoop as a whole, a very robust system.
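To make the storage side more concrete, the following is a minimal sketch of how an application might write a file to HDFS using the standard Hadoop FileSystem Java API. The file path, its contents and the replication factor of 3 are illustrative assumptions, not details taken from the text above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // dfs.replication controls how many copies of each block HDFS keeps.
            conf.set("dfs.replication", "3");

            // Connects to the cluster configured in core-site.xml.
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/data/customers.csv");   // hypothetical path

            try (FSDataOutputStream out = fs.create(file)) {
                // HDFS splits the written file into blocks and replicates each block
                // across several DataNodes in the cluster.
                out.writeBytes("1,34\n2,19\n3,27\n");
            }
        }
    }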
Since HDFS stores multiple copies of the data parts across different nodes in the cluster, it is very important to keep track of where the data parts are stored, and which nodes are available or have failed. The NameNode performs this task. It acts as a facilitator that communicates where data parts are stored and whether they are available. The NameNode is the centrepiece of HDFS: it keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Figure 2: How NameNode works
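The sketch below shows one way a client can ask the NameNode, through the Hadoop FileSystem API, which blocks make up a file and on which nodes each replica is stored. The file path is again a hypothetical example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/customers.csv");   // hypothetical file
            FileStatus status = fs.getFileStatus(file);

            // The NameNode answers this query: which blocks make up the file,
            // and on which DataNodes does each replica live?
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + ", length " + block.getLength()
                        + ", hosts " + String.join(",", block.getHosts()));
            }
        }
    }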
2. MapReduce
Once the data parts are stored across different nodes in the cluster, they can be processed. The MapReduce framework ensures that these tasks are completed by enabling the parallel, distributed processing of the data parts across the multiple nodes in the cluster.
Figure 3: MapReduce diagram showing how data is processed with the MapReduce framework
The node that initiates the Map procedure is labelled the Job Tracker (discussed next). The Job Tracker refers to the NameNode to determine which data is needed to execute the request and where the data parts are located in the cluster. Once the location of the necessary data parts is established, the Job Tracker submits the query to the individual nodes, where it is processed. The processing thus takes place locally within each node, which is the key characteristic of distributed processing.
The Job Tracker is the node in the cluster that initiates and coordinates processing jobs. Additionally, the Job Tracker invokes the Map procedure and the Reduce method.
Figure 4: Diagram further explaining the Job Tracker function in MapReduce
The first operation of the MapReduce framework is to perform a “Map” procedure. One of the nodes in the cluster requests the Map procedure – usually in the form of a Java query – in order to process some data.
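As an illustration, here is a minimal sketch of a Map procedure written against the Hadoop Java MapReduce API. It assumes a hypothetical input in which each line is a customer record of the form "id,age", and it emits a count of 1 for every customer aged 20 or over (anticipating the counting example given at the end of this section).

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map step: runs locally on each node, against the data blocks stored there.
    public class AgeCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text KEY = new Text("customers_20_plus");
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            // Assumed record layout: "id,age"; malformed lines are skipped.
            if (fields.length == 2 && fields[1].trim().matches("\\d+")
                    && Integer.parseInt(fields[1].trim()) >= 20) {
                context.write(KEY, ONE);   // emit a 1 for every matching record
            }
        }
    }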
The second operation of the MapReduce framework is to execute the “Reduce” method. This operation happens after processing. When the Reduce job is executed, the Job Tracker will locate the local results (from the Map procedure) and aggregate these components together into a single final result.
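A matching sketch of the Reduce method, which gathers the 1s emitted locally by the Map procedure above and aggregates them into a single total:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reduce step: aggregates the local Map outputs into one final result.
    public class AgeCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable value : values) {
                total += value.get();   // sum the 1s emitted by the Mappers
            }
            // Writes the single aggregated result, e.g. "customers_20_plus  42".
            context.write(key, new IntWritable(total));
        }
    }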
This final result is the answer to the original query and can be loaded into any number of analytics and visualisation environments.
Simply put, MapReduce reads data from the database and puts it into a format suitable for analysis (map), and then performs mathematical operations on that data, e.g. counting the number of customers aged 20 or over in a database (reduce).
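Tying the two halves together, the (hypothetical) driver below configures such a job, points it at illustrative input and output paths in HDFS, and submits it to the cluster; the class names AgeCountMapper and AgeCountReducer refer to the sketches above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class AgeCountJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "customers aged 20+");
            job.setJarByClass(AgeCountJob.class);
            job.setMapperClass(AgeCountMapper.class);      // Map sketch above
            job.setReducerClass(AgeCountReducer.class);    // Reduce sketch above
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path("/data/customers.csv"));   // illustrative paths
            FileOutputFormat.setOutputPath(job, new Path("/results/age-count"));

            // Submit the job to the cluster and wait for the final, aggregated result.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }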
3. Hadoop Common
The third module is Hadoop Common. This is the module that provides the tools needed for a user’s computer systems to read data stored under the Hadoop file system. Like the other modules, Hadoop Common assumes that hardware failures are common and that these should be handled automatically in software by the Hadoop framework. Hadoop Common can be considered the base, or foundation, of the framework, as it provides the essential services and procedures on which the other modules rely.
4. YARN
The last of the four modules is YARN, which stands for ‘Yet Another Resource Negotiator’. This is the module that manages the resources of the systems storing the data and running the analysis. It is the resource management and job scheduling technology in the Hadoop distributed processing framework, responsible for allocating system resources to the various applications running in a Hadoop cluster and for scheduling tasks to be executed on different cluster nodes.
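As a small illustration of how a client interacts with YARN, the snippet below sets a few standard client-side properties that tell a MapReduce job to run on YARN and how much memory YARN should allocate to each map and reduce container. The memory values are arbitrary examples, not recommendations.

    import org.apache.hadoop.conf.Configuration;

    public class YarnSettingsSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.set("mapreduce.framework.name", "yarn");   // run the job on YARN rather than locally
            conf.set("mapreduce.map.memory.mb", "2048");    // memory YARN allocates per map container
            conf.set("mapreduce.reduce.memory.mb", "4096"); // memory YARN allocates per reduce container
            System.out.println("Framework: " + conf.get("mapreduce.framework.name"));
        }
    }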
At Facebook, for example, 135 terabytes of compressed data are processed and 4 terabytes of compressed data are added every day. Facebook’s Hadoop warehouse uses a two-level network topology with 4,800 cores and 5.5 PB of storage, holding up to 12 terabytes per node. More than 7,500 Hadoop Hive jobs run in the production cluster per day, consuming an average of 80,000 compute hours. Non-engineers, i.e. analysts, at Facebook use Hadoop through Hive, and approximately 200 people per month run jobs on Apache Hadoop.