Setting up a Hadoop cluster on AWS

Creating a NameNode, a DataNode and a ClientNode using the AWS EC2 service.

What is Hadoop?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

What is HDFS?

HDFS stands for Hadoop Distributed File System. Hadoop works on the concept of distributed systems. HDFS uses replication of data for high availability and fault tolerance. It can be easily deployed on low-cost hardware, provides high-throughput access to application data, and is suitable for applications that have large data sets.

Hadoop Installation

Hadoop is written in Java, so to run Hadoop we need the Java JDK installed. Each Hadoop release is compatible only with particular JDK versions. In this practical, I am using Java SE Development Kit 8u171 and Hadoop 1.2.1. First install Java, then install Hadoop.

We use the rpm -ivh command to install RPM packages.

rpm -ivh jdk-8u171-linux-x64.rpm
rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force
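
A quick sanity check (not part of the original steps) is to print the versions and confirm both installs succeeded:

java -version      # should report 1.8.0_171
hadoop version     # should report Hadoop 1.2.1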

Hadoop Architecture

Hadoop uses a master-slave architecture. A Hadoop cluster consists of a master node called the NameNode. Several nodes connect to this NameNode and act as storage; these are called DataNodes. To upload data into the cluster, we need another type of node called the ClientNode.

NameNode Configuration

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. All the data is stored only in DataNodes.

Here, I used a Red Hat Linux EC2 instance. On that instance, we need to install Java and Hadoop.

After installing, we first need to create a directory. Let's create a directory called /nn. All the files related to the NameNode, i.e. the metadata of the cluster, will live in this directory. Next, we need to go to the /etc/hadoop directory. There are two main files we use to configure any node: hdfs-site.xml and core-site.xml.
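
On the NameNode instance, these two steps look like this (using the /nn path chosen above):

mkdir /nn          # will hold the NameNode metadata (directory tree, block locations)
cd /etc/hadoop     # contains hdfs-site.xml and core-site.xml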

Now, for the NameNode, we need to edit the hdfs-site.xml file. In this file, we need to create a property. A property contains two tags: one is name and the other is value.

For the name tag, as this is the NameNode, we should write dfs.name.dir, and in the value tag we specify the directory we just created, i.e., /nn.

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>

Now we have to edit the core-site.xml file. In this file also we have to write a property. The name tag is fs.default.name. In the value tag, we have to mention the master IP, which all other nodes will use to connect. Here we can give one of two IPs: if we give the master's private IP, then only the nodes in that network are able to connect to the master; if we want nodes from any network to connect to the master, we can use the wildcard address, i.e., 0.0.0.0, so the NameNode listens on all interfaces.

So in the value tag we use hdfs://0.0.0.0:9001, where 9001 is the port number we usually use for Hadoop.

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>

After that, we have to format the NameNode using the command below:

hadoop namenode -format

NOTE: There is no need to format the DataNodes, because only the NameNode holds the metadata about the actual data that is going to be stored.

Now, to start the NameNode, we use the command below:

hadoop-daemon.sh start namenode

To check whether the NameNode has started, you can run the jps command.
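
For example, a running NameNode shows up in the jps listing (the process IDs below are only illustrative):

jps
# 2451 NameNode
# 2533 Jps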

DataNode Configuration

Configuring a DataNode is similar to configuring the NameNode. First, we need to create a directory on the DataNode; this is the directory in which all the data is going to be stored. Let's name the directory /dn1. Then go to the /etc/hadoop folder on the DataNode and open the hdfs-site.xml file. Now write a property with the name tag dfs.data.dir and its value tag set to the directory name, /dn1.
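
On the DataNode instance, these steps look like this (using the /dn1 path chosen above):

mkdir /dn1         # HDFS blocks for this DataNode will be stored here
cd /etc/hadoop     # contains hdfs-site.xml and core-site.xml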

Now go to the core-site.xml file. Create a property with the name tag fs.default.name, and here, in the value tag, we need to enter the master endpoint, i.e., hdfs://<master-public-ip>:<port>. Here the port is 9001.

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn1</value>
  </property>
</configuration>

core-site.xml (replace 15.206.166.252 with your master's IP)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://15.206.166.252:9001</value>
  </property>
</configuration>

After that, we can start the DataNode by running the command below:

hadoop-daemon.sh start datanode

Also run the jps command to check whether it has started.

To check whether the DataNode is connected to the NameNode, we use the command:

hadoop dfsadmin -report

Running the above command shows which DataNodes are connected, along with their details such as IP addresses, available storage, etc.

ClientNode Configuration

We use the ClientNode to feed data into the cluster; that is, we use this node to read and write data to the Hadoop cluster. On this node we need to install Hadoop, and then we only need to edit the core-site.xml file. This file is the same as the DataNode's core-site.xml file. And that's it: there is no need to create a directory or start any service.
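
For reference, the ClientNode's core-site.xml therefore looks exactly like the DataNode's (again, replace 15.206.166.252 with your master's IP):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://15.206.166.252:9001</value>
  </property>
</configuration>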

Now, to upload data into the cluster, we use the put option. Suppose you have a file d1.txt. To upload it, we use:

hadoop fs -put d1.txt /

You can list all the uploaded files using the ls option, as shown below.
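
For example, listing the root directory of the cluster shows the file we just uploaded:

hadoop fs -ls /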

Hadoop also provides a web UI for the cluster. It runs on a different port; the URL is of the form <master-ip>:50070.

When you click on the Browse the filesystem option, you will find all the files uploaded to the cluster.

When you click on a file, you can see the data present in it and which node the file is stored on.

In this way, you can create a Hadoop cluster on AWS.

Important points

  • I used 3 EC2 instances (Red Hat image), one for each node.

Thank You

Tech Explorer
