Hadoop Introduction


In this article, we'll take an introductory look at Hadoop.

What is Hadoop?

Hadoop is a platform for storing and processing large amounts of data in a distributed, scalable fashion.

Based on two well-known Google papers, on MapReduce and the Google File System, Hadoop was originally created by Doug Cutting and Mike Cafarella, with much of its early development taking place at Yahoo. Individuals and businesses use Hadoop as part of their analytics pipelines to discover customer behaviors and business insights previously hidden in mountains of data.

Hadoop can be broken up into two main systems: storage and computation. Each system is organized in a master-slave configuration, with a single master and several slave nodes.


Storage

Storage in Hadoop is handled by the Hadoop Distributed File System (HDFS), which coordinates storage across several machines so that they appear and act as a single storage device.

HDFS has two primary components: the NameNode and the DataNodes.

The NameNode is responsible for keeping all metadata about the filesystem, such as the file and directory structure, as well as which DataNodes hold which blocks (blocks are fixed-size pieces of data, and a file in HDFS may be stored as one or more blocks). The DataNodes store the actual blocks of data and communicate with each other and with the NameNode to respond to queries and to replicate data.
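
To make this division of labor concrete, here is a minimal sketch using the HDFS Java API. It asks the NameNode for a file's metadata and block locations, then prints which DataNodes hold each block. It assumes a running cluster whose configuration (core-site.xml) is on the classpath, and the path /data/example.txt is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlockInfo {
        public static void main(String[] args) throws Exception {
            // Reads fs.defaultFS (e.g. hdfs://namenode:8020) from the
            // core-site.xml found on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file; replace with a path that exists in your cluster.
            Path file = new Path("/data/example.txt");
            FileStatus status = fs.getFileStatus(file);

            // The NameNode answers from its metadata: which DataNodes
            // hold each block of the file.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }

Note that the NameNode only serves metadata; when a client actually reads a file, the bytes stream directly from the DataNodes.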


Computation

With the advent of MapReduce 2 (MR2), the previously monolithic functionality of MapReduce (MR1) has been separated into resource management (YARN) and the actual computation (applications such as MapReduce).

MapReduce has been reimplemented to run on top of YARN as an application, and most users won't notice any difference.
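
In particular, the programming model is unchanged. As a sketch of what a MapReduce application looks like, here is the classic word count, essentially the example that ships with Apache Hadoop: the mapper emits a (word, 1) pair for each word, and the reducer sums the counts. Input and output paths are passed as arguments.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // The mapper emits a (word, 1) pair for every token in its input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // The reducer receives every count emitted for a word and sums them.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a JAR, the job would be submitted with something like hadoop jar wordcount.jar WordCount /input /output, and YARN schedules the map and reduce tasks across the cluster.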

YARN is structured similarly to HDFS, with a central ResourceManager and several slave NodeManagers.

The ResourceManager has two responsibilities: accepting applications, and scheduling their execution based on the computation resources available in the cluster and those each application requires. The NodeManager is analogous to the DataNode and manages the execution of application tasks on each machine.
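
You can see this master-slave structure directly through the YARN client API. The following sketch, which assumes a reachable ResourceManager and a yarn-site.xml on the classpath, asks the ResourceManager for a report on each running NodeManager:

    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ClusterNodes {
        public static void main(String[] args) throws Exception {
            // Connects to the ResourceManager named in yarn-site.xml.
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());
            yarn.start();

            // One report per running NodeManager: its address plus the
            // resources it offers and those currently in use.
            for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
                System.out.printf("%s capacity=%s used=%s%n",
                    node.getNodeId(), node.getCapability(), node.getUsed());
            }
            yarn.stop();
        }
    }

This is roughly the same per-node picture the ResourceManager's scheduler consults when deciding where to place application tasks.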


Choosing a Hadoop distribution

Due to Hadoop's utility and popularity, many businesses are now trying to make it more accessible and easier to manage. Three of the best-known options are Cloudera, Hortonworks, and Amazon Elastic MapReduce (EMR).

To the end-user, Cloudera and Hortonworks are quite similar: both provide a Hadoop distribution and management tools for you to install and run on your own machines.

Amazon EMR is different in that the cluster exists solely in EC2, which saves you from having to set up and manage your own machines. The advantages of going with one of these options are obvious: they provide proprietary tools that abstract away much of the setup and management cost of running a Hadoop cluster, and they offer tight integration with other tools for ingesting and analyzing data.

The downside is becoming "locked" into a specific ecosystem; bug fixes and new features that land in Apache Hadoop often take far longer to make it into each company's distribution. That being said, each of these companies regularly contributes back to the Hadoop project, and many of their employees are active members of the Hadoop community.

Why go with raw Apache Hadoop? The primary reasons are control and knowledge. Bringing up a cluster yourself will familiarize you with many aspects of Hadoop that stay hidden behind a commercial distribution. And if you want to use an experimental feature or a recent bug fix, you can simply install it rather than waiting for it to be pulled into a specific distribution.

This guide will not cover setting up Hadoop with any of these commercial distributions in detail; each company already has excellent documentation for getting up and running with its flavor of Hadoop. However, if you'd like to get hands-on with Hadoop, read on.


 
