How to configure Hadoop Cluster using Ansible?



Shubham Rasal [SRE]
6 min read · Dec 17, 2020


In this article, we will configure the Hadoop name node and data node using Ansible.

Introduction

We are going to write playbooks for installing Hadoop and configuring both the name node and data nodes.

I am assuming you have installed ansible and set the inventory file path.

In a previous article, I gave high-level details about what Hadoop is and why it is used. Check it out if you want to know more about Hadoop.

So, without further delay, let's start writing our playbooks.


Action mode 🔥

We will use SSH public key authentication to connect to the managed/target nodes. The motivation to use public-key authentication over simple passwords is security: public key authentication provides cryptographic strength that even extremely long passwords cannot offer. With SSH keys, we also don't need to remember long passwords.

So let's create an SSH key pair using the commands below.
Go to the .ssh folder:

$ cd ~/.ssh
$ ssh-keygen -t rsa -b 4096 -f ansible_key

Now it will generate two files in the .ssh directory: a private key (ansible_key) and a public key (ansible_key.pub).

Now we have to copy the public key to the managed node for SSH authentication. For that, we will use the ssh-copy-id command, or you can use the scp command to copy it into the authorized_keys file yourself.

$ ssh-copy-id -i ansible_key.pub username@managed_node_ip

It will ask for the user's password the first time; enter it and you are ready to go. The above command appends your public key to the authorized_keys file of the managed node.

In my case, my managed node's IP address is 192.168.225.182.
You can check the IP address using the $ ifconfig command.

Now let's test that we can connect manually to the managed node before moving to the Ansible setup.

$ ssh username@hostname

Using the above command, you can verify whether the SSH public key was transferred successfully. You can see that the managed node's authorized_keys file now contains your generated public key.

Now that we have set up the SSH connection, we are ready to connect to and configure the managed node using Ansible.

Follow the same process for the other nodes to get secure connections. In our case, we have two managed nodes: one is the Hadoop name node and the other is the data node.

Now create two groups in the inventory file and put the IPs in those groups.

Before moving forward, let's test whether we have connectivity using the ping module.
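For reference, the inventory could look like this (the group names and the second IP are assumptions; only the name node IP 192.168.225.182 is from my setup):

```
[namenode]
192.168.225.182

[datanode]
192.168.225.183
```

You can then test connectivity with `ansible all -m ping`; each host should reply with "pong".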

Yes… now we can move forward.

Now we are all set to write the playbooks, so let's start.

Now I have partitioned this task into two parts,
1. Installation
2. Configuration

The installation will be the same on both machines, while the configuration may change according to need.

This is the directory structure, where hadoop-install.yml will contain the installation code and hadoop-configuration.yml will contain the configuration code.

We have two directories: 1. namenode, which will contain files for the name node, and 2. datanode, which will have data-node-specific files.
We also have the JDK and Hadoop rpm files which we are going to install.
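The layout, roughly (the rpm file names here are placeholders, not the exact versions I used):

```
.
├── hadoop-install.yml
├── hadoop-configuration.yml
├── vars.yml
├── namenode/
│   ├── vars.yml
│   ├── core-site.xml
│   └── hdfs-site.xml
├── datanode/
│   ├── vars.yml
│   ├── core-site.xml
│   └── hdfs-site.xml
├── jdk.rpm
└── hadoop.rpm
```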

Let's begin the playbooks, then, by installing Hadoop.
//hadoop-install.yml

The above playbook will detect the user's home directory, copy the JDK and Hadoop rpms, and, if they are transferred successfully, install them.
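In case the embedded playbook doesn't render here, a minimal sketch of the idea (the rpm file names and the use of `ansible_env.HOME` to detect the home directory are my assumptions):

```yaml
# hadoop-install.yml (sketch; rpm file names are placeholders)
- hosts: all
  tasks:
    - name: Copy the JDK rpm to the user's home directory
      copy:
        src: jdk.rpm
        dest: "{{ ansible_env.HOME }}/jdk.rpm"
      register: jdk_copy

    - name: Copy the Hadoop rpm to the user's home directory
      copy:
        src: hadoop.rpm
        dest: "{{ ansible_env.HOME }}/hadoop.rpm"
      register: hadoop_copy

    - name: Install the JDK only if the copy succeeded
      become: yes
      command: "rpm -ivh {{ ansible_env.HOME }}/jdk.rpm"
      when: jdk_copy is succeeded

    - name: Install Hadoop only if the copy succeeded
      become: yes
      command: "rpm -ivh --force {{ ansible_env.HOME }}/hadoop.rpm"
      when: hadoop_copy is succeeded
```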

Now let’s run the above playbook

$ ansible-playbook hadoop-install.yml

As you can see, we have now installed Hadoop on both the name node and the data node. Let's confirm on both nodes whether the installation succeeded.
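A quick ad-hoc command works for this check (a sketch, assuming the hadoop binary landed on the PATH):

```
$ ansible all -a "hadoop version"
```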

Yes, now we have Hadoop installed… it's time to configure the name node and data node.

Now here we have to declare a few variables.

//vars.yml

This file has only one variable for now, but you can add more variables that are common to both nodes here.
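A sketch of what it might hold (the variable name and port value are assumptions):

```yaml
# vars.yml (shared by both nodes)
dfs_port: 9001
```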

Configure Namenode:

Now we have a separate vars.yml file for the name node, where we will store name-node-specific variables.
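For example (the variable name and path are assumptions):

```yaml
# namenode/vars.yml
namenode_directory: /namenode
```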

We have to edit two files on the name node: 1. core-site.xml and 2. hdfs-site.xml.
We will make them dynamic using Jinja templating.

The above file will read the name node IP from the inventory file and the port from the global vars.yml file.
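The template could look like this (the `namenode` group name and `dfs_port` variable are assumptions; `fs.default.name` is the Hadoop 1.x property name):

```xml
<!-- namenode/core-site.xml (Jinja template, sketch) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ groups['namenode'][0] }}:{{ dfs_port }}</value>
  </property>
</configuration>
```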

The above file reads the namenode_directory variable that we declared in namenode/vars.yml.
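A sketch of that template (`dfs.name.dir` is the Hadoop 1.x property for the name node's storage directory):

```xml
<!-- namenode/hdfs-site.xml (Jinja template, sketch) -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>{{ namenode_directory }}</value>
  </property>
</configuration>
```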

Configure Datanode:

//datanode/vars.yml
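A sketch mirroring the name node file (the variable name and path are assumptions):

```yaml
# datanode/vars.yml
datanode_directory: /datanode
```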

//datanode/core-site.xml
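This template points the data node at the name node, so it can be the same sketch as the name node's core-site.xml (group and variable names are assumptions):

```xml
<!-- datanode/core-site.xml (Jinja template, sketch) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ groups['namenode'][0] }}:{{ dfs_port }}</value>
  </property>
</configuration>
```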

//datanode/hdfs-site.xml
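A sketch, using the data node's storage directory (`dfs.data.dir` is the Hadoop 1.x property):

```xml
<!-- datanode/hdfs-site.xml (Jinja template, sketch) -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>{{ datanode_directory }}</value>
  </property>
</configuration>
```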

Now that we have created the templates and variables, let's configure the nodes.

//hadoop-configuration.yml

Here we are creating a directory, transferring the templates, and starting the Hadoop service on both nodes according to their type.
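A sketch of how such a play could be structured (the group names, the /etc/hadoop destination, and the format/start commands are assumptions based on a typical Hadoop 1.x setup):

```yaml
# hadoop-configuration.yml (sketch)
- hosts: namenode
  tags: namenode
  vars_files:
    - vars.yml
    - namenode/vars.yml
  tasks:
    - name: Create the name node directory
      file:
        path: "{{ namenode_directory }}"
        state: directory

    - name: Transfer the templates
      template:
        src: "namenode/{{ item }}"
        dest: "/etc/hadoop/{{ item }}"
      loop:
        - core-site.xml
        - hdfs-site.xml

    - name: Format the name node and start the service
      shell: |
        echo Y | hadoop namenode -format
        hadoop-daemon.sh start namenode

- hosts: datanode
  tags: datanode
  vars_files:
    - vars.yml
    - datanode/vars.yml
  tasks:
    - name: Create the data node directory
      file:
        path: "{{ datanode_directory }}"
        state: directory

    - name: Transfer the templates
      template:
        src: "datanode/{{ item }}"
        dest: "/etc/hadoop/{{ item }}"
      loop:
        - core-site.xml
        - hdfs-site.xml

    - name: Start the data node service
      command: hadoop-daemon.sh start datanode
```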

#to run only for name node
$ ansible-playbook hadoop-configuration.yml --tags namenode
#to run only for data node
$ ansible-playbook hadoop-configuration.yml --tags datanode
#to run full playbook
$ ansible-playbook hadoop-configuration.yml

And we are done… we have successfully completed the Hadoop configuration using Ansible.

You can confirm using the output below.
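For example, on the name node (a sketch; assumes the Hadoop 1.x commands are on the PATH):

```
$ jps                      # should list NameNode (and DataNode on the data node)
$ hadoop dfsadmin -report  # should show the live data node
```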

You can find the above playbook on this GitHub repository. Bookmark or star it for future use.

If you have any doubts, or if something in this blog needs improvement, please feel free to reach out to me on LinkedIn.

I hope you learned something new and find Ansible more interesting.
Let me know your thoughts about Ansible and how you plan to use it.

Thank you.

About the writer:
Shubham loves technology and challenges, and is open to learning and reinventing himself. He loves to share his knowledge and is passionate about constant improvement.
He writes blogs about
Cloud Computing, Automation, DevOps, AWS, Infrastructure as code.
Visit his Medium home page to read more insights from him.
