1. Introduction

Apache Hadoop is an open-source software framework that efficiently stores and processes large datasets ranging from gigabytes to petabytes in size. It stores data in a distributed file system called the Hadoop Distributed File System (HDFS) and processes it with the MapReduce programming model.

In this tutorial, we’ll walk through the step-by-step process of installing and configuring Hadoop on a Linux system. In particular, we’ll discuss everything from downloading the necessary files to setting up the required dependencies. Finally, we’ll access Hadoop via a Web interface.

2. Installing Java

To install Hadoop successfully, we first obtain and install the Java Development Kit (JDK) on the Linux system. Additionally, ensuring that the system is up-to-date is a good practice before proceeding with the installation.

2.1. Updating System Packages

First, let’s start by updating the system to fetch the latest package information from all configured sources:

$ sudo apt update && sudo apt upgrade -y

The update subcommand refreshes the package lists from all configured sources, while the upgrade subcommand installs any available upgrades. The -y flag confirms those upgrades automatically.

2.2. Installing an Open-Source Java Development Kit (JDK)

After updating the system, we proceed to install the JDK package. In particular, we utilize an open-source package.

For an open-source JDK, let’s use the default-jdk package, which is a meta package for OpenJDK in most Debian-based systems:

$ sudo apt install default-jdk

This command starts the installation of OpenJDK. Additionally, the default-jdk package includes a comprehensive set of tools and libraries essential for Java development.

Once the installation is complete, we verify the installed version of Java on the system:

$ java -version
Picked up _JAVA_OPTIONS: -Dawt.useSystemAAFontSettings=on -Dswing.aatext=true
java version "21.0.1" 2023-10-17 LTS
Java(TM) SE Runtime Environment (build 21.0.1+12-LTS-29)
Java HotSpot(TM) 64-Bit Server VM (build 21.0.1+12-LTS-29, mixed mode, sharing)

The output of this command displays the Java version, release date, and information about the JRE (Java Runtime Environment).
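Since we'll use the javac binary later on to locate the JDK installation, we can additionally confirm that the compiler is available. We show only the command here, as the exact version string depends on the installed JDK:

$ javac -version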

With the JDK successfully installed, we can now create a hadoop user and configure password-less SSH access.

3. Create hadoop User

Moving on, we create a dedicated user for Hadoop. In particular, this user is used to run Hadoop services and perform administrative tasks.

To add a new user called hadoop, we use the adduser command:

$ sudo adduser hadoop

Furthermore, to grant administrative privileges to the hadoop user, we add them to the sudo group:

$ sudo usermod -aG sudo hadoop

Finally, we switch to the newly created hadoop user:

$ sudo su - hadoop

By switching to the hadoop user, we ensure that all subsequent commands are executed with the appropriate permissions.
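As a quick sanity check, we can confirm that we're now operating as the new user:

$ whoami
hadoop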

4. Configure Password-less SSH Access

To enable seamless communication between nodes in a Hadoop cluster, let’s configure password-less SSH access. Specifically, doing so enables the Hadoop daemons to securely communicate with each other without requiring manual authentication.

4.1. Install OpenSSH

First, we install the OpenSSH server and client:

$ sudo apt install openssh-server openssh-client -y

This command installs OpenSSH on the Linux system and ensures the server and client components are available.

On most systems, the OpenSSH service starts automatically after installation. Let's verify the SSH service status:

$ systemctl status ssh
● sshd.service - OpenSSH server daemon
     Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; preset: disabled)
     Active: active (running) since Mon 2024-03-18 15:04 WAT; 1h 1min ago
       Docs: man:sshd(8)
             man:sshd_config(5)
   Main PID: 4308 (sshd)
      Tasks: 1 (limit: 18957)
     Memory: 2.2M
        CPU: 38ms
     CGroup: /system.slice/sshd.service
             └─4308 "sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups"

This command shows the sshd service is active and running.
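If the service shows as inactive on a particular machine, we can usually start and enable it via systemctl. This assumes a systemd-based distribution, and the unit may be named sshd instead of ssh on some systems:

$ sudo systemctl enable --now ssh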

4.2. Generate SSH Key Pair

Next, to generate an SSH key, we use the ssh-keygen command:

$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
/home/hdoop/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Your identification has been saved in /home/hdoop/.ssh/id_rsa
Your public key has been saved in /home/hdoop/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:09K2smH7S2nmCAG8PpYM7jkNVCvYJ8HbDtWbO7U+Hes hdoop@kali
The key's randomart image is:
+---[RSA 3072]----+
| .   .           |
|  o.o .          |
| o *o. o         |
|. B +oo .o       |
| ..*. .oS.+      |
| ..+..o..+.o     |
|  .o* .o+.*o     |
| ..o.. oo@o      |
|  o.    ++E.     |
+----[SHA256]-----+

This command generates an RSA key pair with an empty passphrase. The private key is saved in the ~/.ssh/id_rsa file, while the public key is stored in the ~/.ssh/id_rsa.pub file.

4.3. Copy Public Key

Further, we copy the contents of the public key to the authorized_keys file on the local machine:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

This command appends the contents of id_rsa.pub to the ~/.ssh/authorized_keys file.
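Alternatively, the ssh-copy-id utility achieves the same result and sets sensible permissions in one step. Here, we assume we're copying the default key to our own hadoop account on the local machine, so the command prompts for the hadoop password once:

$ ssh-copy-id hadoop@localhost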

4.4. Configure SSH Permissions

Then, to ensure proper access, we modify the permissions of the ~/.ssh directory and the authorized_keys file:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys

With the SSH key pair generated and the permissions set correctly, we have successfully configured password-less SSH access.

4.5. Configure Firewall for SSH Access

In some cases, we may need to adjust firewall settings to allow SSH access from outside the local machine. This step is crucial for enabling remote connections to the Hadoop cluster.

For example, if we use Uncomplicated Firewall (UFW), we can open port 22 fairly easily:

$ sudo ufw allow ssh

Alternatively, if we use another firewall management tool or directly manipulate firewall rules, we usually need to ensure that inbound connections to port 22 are permitted.
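For instance, with plain iptables, a rule along these lines typically allows inbound SSH traffic, although the exact chains and the way rules are persisted vary from setup to setup:

$ sudo iptables -A INPUT -p tcp --dport 22 -j ACCEPT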

4.6. Testing SSH Connection

Finally, to test the SSH connection, we leverage the ssh command:

$ ssh localhost
Linux kali 6.6.9-amd64 #1 SMP PREEMPT_DYNAMIC Kali 6.6.9-1kali1 (2024-01-08) x86_64

The programs included with the Kali GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
...

Logging in without a password prompt shows that SSH is properly configured for the Linux user we created earlier.

5. Install Apache Hadoop

Now, let’s proceed with the installation of Apache Hadoop on the Linux system.

First, we make sure we're logged in as the hadoop user, switching with the su command if necessary:

$ sudo su - hadoop

This command ensures that all subsequent commands are executed with the appropriate permissions.

5.1. Download and Extract Hadoop

Then, we download the latest stable version of Hadoop from the official Apache Hadoop download page using the wget command:

$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
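Optionally, before extracting the archive, we can verify its integrity against the checksum Apache publishes next to the release. This assumes the .sha512 file is available at the same location and uses a format that sha512sum can parse:

$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
$ sha512sum -c hadoop-3.3.6.tar.gz.sha512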

Once the download is complete, we extract the downloaded file with the tar command:

$ tar -xvzf hadoop-3.3.6.tar.gz

Here, the -x flag extracts the archive, -v lists the files as they're processed, -z handles the gzip compression, and -f specifies the archive file.

5.2. Move Hadoop to the Installation Directory

Next, we move the extracted contents of the Hadoop package to the /usr/local/hadoop installation directory:

$ sudo mv hadoop-3.3.6 /usr/local/hadoop

Then, let’s create a directory to store system logs for Hadoop:

$ sudo mkdir /usr/local/hadoop/logs

Finally, we change the ownership of the Hadoop directory:

$ sudo chown -R hadoop:hadoop /usr/local/hadoop

Next, we proceed to set the Hadoop environment variables.

5.3. Configure Hadoop Environment Variables

To configure the Hadoop environment variables, let’s open the ~/.bashrc file with a text editor:

$ sudo nano ~/.bashrc

Then, we add several lines to the end of the file:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Finally, we activate the updated environment variables:

$ source ~/.bashrc

By sourcing the script, we apply the changes made to the ~/.bashrc file to the current shell session.
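To confirm that the variables are visible in the current session, we can print one of them:

$ echo $HADOOP_HOME
/usr/local/hadoop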

6. Configure Java Environment Variables

To enable Hadoop to utilize its various components such as YARN, HDFS, MapReduce, and other related project settings, we define the Java environment variables in the hadoop-env.sh configuration file.

First, let’s find the path to the Java compiler (javac):

$ which javac
/usr/bin/javac

Additionally, we use the readlink command to resolve the symbolic link and determine the directory where the JDK is actually installed:

$ readlink -f /usr/bin/javac
/usr/lib/jvm/jdk-21-oracle-x64/bin/javac

Next, we edit the hadoop-env.sh file to define the Java environment variables:

$ sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

In particular, we add a couple of lines to the file, with JAVA_HOME pointing to the JDK installation directory, that is, the resolved path without the trailing /bin/javac:

export JAVA_HOME=/usr/lib/jvm/jdk-21-oracle-x64
export HADOOP_CLASSPATH+=" $HADOOP_HOME/lib/*.jar"

Finally, we save and close the text editor.
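As a side note, instead of hard-coding the path, we can derive the JDK root directory directly from the javac location. This is a small sketch that simply strips the trailing /bin/javac from the resolved path:

$ dirname $(dirname $(readlink -f $(which javac)))
/usr/lib/jvm/jdk-21-oracle-x64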

7. Configure Hadoop Environment

Before starting the Hadoop cluster, it’s essential to configure the Hadoop environment to define various settings and parameters.

7.1. Download Required Libraries

We start by navigating to the Hadoop lib directory:

$ cd /usr/local/hadoop/lib

Then, we use the wget command to download the javax.activation API library from Maven Central:

$ sudo wget https://repo1.maven.org/maven2/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar

Once the download is complete, let’s verify the installed Hadoop version:

$ hadoop version
Hadoop version: 3.3.6
Subversion revision: rXXXXXXX
Compiled by: username on date at time
Compiled with flags: ...

The result confirms that the installed Hadoop version is 3.3.6.

7.2. Edit core-site.xml Configuration

Next, let’s edit the core-site.xml configuration file to specify the URL for the NameNode:

$ sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Then, we add a few lines to the file and save it:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://0.0.0.0:9000</value>
        <description>The default file system URI</description>
    </property>
</configuration>

Additionally, let’s create directories for storing the NameNode and DataNode data and change their ownership to hadoop:

$ sudo mkdir -p /home/hadoop/hdfs/{namenode,datanode}
$ sudo chown -R hadoop:hadoop /home/hadoop/hdfs

These commands create the directories that store the node metadata and block data, and then assign ownership of them to the hadoop user.

7.3. Edit hdfs-site.xml Configuration

Moving forward, let’s also edit the hdfs-site.xml configuration file to define the location for storing node metadata and the replication factor:

$ sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Then, we add some lines to the file and save it:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///home/hadoop/hdfs/namenode</value>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///home/hadoop/hdfs/datanode</value>
    </property>
</configuration>

This way, we point HDFS at the NameNode and DataNode directories we created earlier and set the replication factor to 1, which is appropriate for a single-node setup.

7.4. Edit mapred-site.xml Configuration

Additionally, we also edit the mapred-site.xml configuration file to define MapReduce values:

$ sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Again, we add a few lines to the file and save it:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

This way, we ensure that YARN is used as the framework.

7.5. Edit yarn-site.xml Configuration

Lastly, let’s edit the yarn-site.xml configuration file and define YARN-related settings:

$ sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Then, we add a few lines to the file and save it:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

This configuration establishes the auxiliary services of YARN.

7.6. Validate Hadoop Configuration

Still logged in as the Hadoop user, we validate the Hadoop configuration and format the HDFS NameNode:

$ hdfs namenode -format

This command initializes a new HDFS filesystem by formatting the NameNode. Notably, it erases any existing NameNode metadata, including information about files, directories, and block locations.
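To double-check that the format step worked, we can list the NameNode metadata directory we configured earlier. A freshly formatted NameNode typically contains a current/ subdirectory holding a VERSION file and an initial fsimage:

$ ls /home/hadoop/hdfs/namenode/current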

8. Start the Apache Hadoop Cluster

To bring up Apache Hadoop, we start all of the cluster daemons.

8.1. Start the NameNode and DataNode

First, let’s start the NameNode and DataNode via the start-dfs.sh script:

$ start-dfs.sh
Starting namenode on [namenode_host]... started
Starting secondarynamenode on [secondarynamenode_host]... started
Starting datanode on [datanode_host1]... started
... (similar messages for all datanodes)

This command initiates the HDFS cluster by starting the NameNode and DataNode services.

8.2. Start the YARN Resource and Node Managers

Next, we start the YARN resource and node managers:

$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /path/to/yarn-resourcemanager.log ... started
starting nodemanager on [nodemanager_host1]... started
... (similar messages for all nodemanagers)

This command launches the YARN resource manager and node manager, enabling the cluster to manage resource allocation and job execution.

8.3. Verify All Running Components

After we start the various components of the Apache Hadoop cluster, it’s important to verify that they’re running as expected.

The Java Virtual Machine Process Status Tool (jps) command lists all Java processes running on the system. In the context of Hadoop, running jps provides a convenient way to check whether the Hadoop components are up and running:

$ jps
3214 SecondaryNameNode
4320 Jps
3854 ResourceManager
3456 DataNode
4084 NodeManager
3274 NameNode

As expected, the output of this command provides a list of all the Java processes running on the system, including the Hadoop components.
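As a further check beyond jps, we can run a small HDFS smoke test by creating a directory and listing the filesystem root. The directory name /test is just an example:

$ hdfs dfs -mkdir /test
$ hdfs dfs -ls /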

Finally, we can open a Web browser to access Hadoop’s NameNode (http://localhost:9870) and ResourceManager (http://localhost:8088) interfaces.

9. Conclusion

In this article, we successfully installed, configured, and started an Apache Hadoop cluster on a Linux system. Furthermore, by following the step-by-step process, we set up the NameNode, the DataNode, and the YARN ResourceManager and NodeManager.