Configure Hadoop and start cluster services using Ansible Playbook.

Prince Raj
3 min read · Dec 12, 2020

For making an HDFS cluster, we need to:
🔸Install JDK Software
🔸Install HADOOP Software
🔸Configure Namenode
🔸Configure Datanode
🔸Format Namenode Directory
🔸Start Both Node Services.

First of all, we write the Ansible playbook code for the DataNode and the NameNode separately.

After writing the code, run the Ansible playbook for the NameNode first.
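A minimal sketch of what such a NameNode playbook might look like. The host group `namenode`, the Hadoop version, the `/nn` metadata directory, and the paths are all assumptions for illustration, not the exact code used here:

```yaml
# namenode.yml -- hypothetical sketch; hosts, paths and versions are assumptions
- hosts: namenode
  become: yes
  tasks:
    - name: Install JDK
      yum:
        name: java-1.8.0-openjdk-devel
        state: present

    - name: Download and extract Hadoop
      unarchive:
        src: https://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
        dest: /opt
        remote_src: yes
        creates: /opt/hadoop-1.2.1

    - name: Create NameNode metadata directory
      file:
        path: /nn
        state: directory

    - name: Copy core-site.xml and hdfs-site.xml to the conf directory
      template:
        src: "{{ item }}.j2"
        dest: "/opt/hadoop-1.2.1/conf/{{ item }}"
      loop:
        - core-site.xml
        - hdfs-site.xml

    - name: Format the NameNode directory (runs only if not already formatted)
      command: /opt/hadoop-1.2.1/bin/hadoop namenode -format -force
      args:
        creates: /nn/current

    - name: Start the NameNode service
      command: /opt/hadoop-1.2.1/bin/hadoop-daemon.sh start namenode
```

The `creates:` guard on the format task keeps the playbook idempotent, so re-running it does not wipe the metadata directory.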

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. … The NameNode is a Single Point of Failure for the HDFS Cluster.

  • NameNode is the centerpiece of HDFS.
  • NameNode is also known as the Master
  • NameNode only stores the metadata of HDFS — the directory tree of all files in the file system, and tracks the files across the cluster.
  • NameNode does not store the actual data or the dataset. The data itself is actually stored in the DataNodes.
  • NameNode knows the list of the blocks and its location for any given file in HDFS. With this information NameNode knows how to construct the file from blocks.
  • NameNode is so critical to HDFS and when the NameNode is down, HDFS/Hadoop cluster is inaccessible and considered down.
  • NameNode is a single point of failure in Hadoop cluster.
  • NameNode is usually configured with a lot of memory (RAM), because the block locations are held in main memory.
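On the NameNode, two configuration files do most of the work: core-site.xml fixes the address that clients and DataNodes use to reach the master, and hdfs-site.xml points at the metadata directory. A hedged example, with the bind address, port, and directory as placeholder assumptions:

```xml
<!-- core-site.xml : address clients and DataNodes use to reach the NameNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>

<!-- hdfs-site.xml : where the NameNode keeps its directory tree and block map -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
```

On the DataNodes, the same two files are used, but `fs.default.name` is set to the NameNode's real IP, and `dfs.data.dir` replaces `dfs.name.dir`.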

DataNode

DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode runs on commodity hardware, that is, inexpensive systems that are not high-end or highly available.

A DataNode stores data in the Hadoop file system. A functional filesystem has more than one DataNode, with data replicated across them.

On startup, a DataNode connects to the NameNode; spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations.

Client applications can talk directly to a DataNode once the NameNode has provided the location of the data. Similarly, MapReduce operations farmed out to TaskTracker instances near a DataNode talk directly to the DataNode to access the files. TaskTracker instances can, and indeed should, be deployed on the same servers that host DataNode instances, so that MapReduce operations are performed close to the data.

DataNode instances can talk to each other, which is what they do when they are replicating data.

  • There is usually no need to use RAID storage for DataNode data, because data is designed to be replicated across multiple servers, rather than multiple disks on the same server.
  • An ideal configuration is for a server to have a DataNode, a TaskTracker, one TaskTracker slot per CPU, and separate physical disks. This gives every TaskTracker 100% of a CPU and separate disks to read and write data.
  • Avoid using NFS for data storage in a production system.
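The DataNode playbook mirrors the NameNode one: same JDK and Hadoop installation, but it creates a storage directory instead of a metadata directory, skips the format step, and starts the datanode daemon. A sketch under the same assumptions (host group `datanode`, Hadoop under /opt, NameNode address already set in the templated core-site.xml):

```yaml
# datanode.yml -- hypothetical sketch; hosts, paths and versions are assumptions
- hosts: datanode
  become: yes
  tasks:
    - name: Install JDK
      yum:
        name: java-1.8.0-openjdk-devel
        state: present

    - name: Download and extract Hadoop
      unarchive:
        src: https://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
        dest: /opt
        remote_src: yes
        creates: /opt/hadoop-1.2.1

    - name: Create DataNode storage directory
      file:
        path: /dn
        state: directory

    - name: Copy configuration pointing at the NameNode
      template:
        src: "{{ item }}.j2"
        dest: "/opt/hadoop-1.2.1/conf/{{ item }}"
      loop:
        - core-site.xml
        - hdfs-site.xml

    - name: Start the DataNode service
      command: /opt/hadoop-1.2.1/bin/hadoop-daemon.sh start datanode
```

Once both playbooks have run, `hadoop dfsadmin -report` on the NameNode should show the DataNodes that have registered.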
