This guide provides a step-by-step approach to deploying HDFS (Hadoop Distributed File System) on CentOS, covering both standalone and cluster setups. Follow these steps to set up a robust distributed file system.
Before starting, ensure the following requirements are met:
Hostname resolution: populate /etc/hosts on every node with the node details (e.g., 192.168.1.10 namenode, 192.168.1.11 datanode1); a sample hosts file is shown after the SSH setup below.
Firewall: use firewall-cmd to open the NameNode RPC port (9000) and, for Hadoop 3.x, the web UI port (9870):
sudo firewall-cmd --permanent --zone=public --add-port=9000/tcp
sudo firewall-cmd --permanent --zone=public --add-port=9870/tcp
sudo firewall-cmd --reload
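On DataNode hosts you will likely also want to open the DataNode ports. Assuming default Hadoop 3.x settings (dfs.datanode.* addresses not overridden), these are 9866 (data transfer), 9864 (HTTP UI), and 9867 (IPC):
sudo firewall-cmd --permanent --zone=public --add-port=9866/tcp
sudo firewall-cmd --permanent --zone=public --add-port=9864/tcp
sudo firewall-cmd --permanent --zone=public --add-port=9867/tcp
sudo firewall-cmd --reload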
Passwordless SSH: generate a key on the NameNode and copy it to each DataNode so the start scripts can launch remote daemons:
ssh-keygen -t rsa
ssh-copy-id datanode1
ssh-copy-id datanode2
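For reference, a minimal sketch of /etc/hosts for a three-node layout (one NameNode, two DataNodes); the IP addresses are placeholders and should match your own network:
192.168.1.10 namenode
192.168.1.11 datanode1
192.168.1.12 datanode2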
Hadoop depends on Java. Install OpenJDK 8 using yum:
sudo yum install -y java-1.8.0-openjdk-devel
Verify installation:
java -version
Ensure the output shows Java 1.8.0.
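The JAVA_HOME value used in a later step can be confirmed on your system. One way, assuming the OpenJDK package installed above, is to resolve the real location of the java binary:
readlink -f /usr/bin/java
# The reported path sits under the JDK install directory; on CentOS the
# /usr/lib/jvm/java-1.8.0-openjdk symlink used below points into the same install.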
Download the latest stable Hadoop release from the Apache website. For example, to download Hadoop 3.3.4:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
Extract the tarball to /usr/local and rename the directory for simplicity:
sudo tar -xzvf hadoop-3.3.4.tar.gz -C /usr/local/
sudo mv /usr/local/hadoop-3.3.4 /usr/local/hadoop
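If you plan to run the daemons as a dedicated user rather than root (a common setup, and the one assumed by the ownership tip in the troubleshooting notes at the end), create the user and give it ownership of the installation; the name hadoop here is just a convention:
sudo useradd -m hadoop        # skip if the user already exists
sudo chown -R hadoop:hadoop /usr/local/hadoop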
Set up environment variables to make Hadoop commands accessible globally. Create a new file /etc/profile.d/hadoop.sh:
sudo nano /etc/profile.d/hadoop.sh
Add the following lines (adjust paths if Hadoop is installed elsewhere):
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Make the file executable and apply changes:
sudo chmod +x /etc/profile.d/hadoop.sh
source /etc/profile.d/hadoop.sh
Verify Hadoop installation:
hadoop version
Edit Hadoop configuration files in $HADOOP_HOME/etc/hadoop to define HDFS behavior.
core-site.xml configures the default file system and the NameNode address. Replace namenode with your NameNode’s hostname:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
</configuration>
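Because hadoop.tmp.dir points at /usr/local/hadoop/tmp, it can help to create that directory up front and make sure the user running Hadoop owns it (a small precaution; Hadoop can usually create it itself):
sudo mkdir -p /usr/local/hadoop/tmp
sudo chown -R $(whoami):$(whoami) /usr/local/hadoop/tmp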
hdfs-site.xml sets HDFS-specific parameters such as the replication factor and data directories. First, create directories for NameNode and DataNode data:
sudo mkdir -p /usr/local/hadoop/data/namenode
sudo mkdir -p /usr/local/hadoop/data/datanode
sudo chown -R $(whoami):$(whoami) /usr/local/hadoop/data
Add the following properties to hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value> <!-- Adjust based on your cluster size (e.g., 1 for standalone) -->
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/datanode</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value> <!-- Disable permissions for testing (enable in production) -->
</property>
</configuration>
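Once the file is saved, you can sanity-check that Hadoop picks up the values with hdfs getconf (run after the environment variables from the earlier step are loaded):
hdfs getconf -confKey dfs.replication
hdfs getconf -confKey dfs.namenode.name.dir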
If using YARN for resource management, configure mapred-site.xml and yarn-site.xml as well. In mapred-site.xml, set the MapReduce framework to YARN:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
In yarn-site.xml, configure the shuffle service and the ResourceManager hostname:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>namenode</value> <!-- Replace with your ResourceManager hostname -->
</property>
</configuration>
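If you do configure YARN, its daemons are started separately from HDFS. After formatting and starting HDFS (next steps), you can bring up the ResourceManager and NodeManagers with the bundled script:
start-yarn.sh
jps   # should now also list ResourceManager and NodeManager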
The NameNode must be formatted before first use to initialize its storage. Run this command on the NameNode:
hdfs namenode -format
Follow the prompts to complete formatting. This step creates the necessary directory structure and metadata files.
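A quick way to confirm that formatting succeeded is to check that the metadata directory configured earlier now contains a current/VERSION file (the path assumes the dfs.namenode.name.dir value above):
ls /usr/local/hadoop/data/namenode/current
cat /usr/local/hadoop/data/namenode/current/VERSION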
Start the HDFS services using the start-dfs.sh script (run from the NameNode):
start-dfs.sh
Check the status of HDFS daemons with:
jps
You should see NameNode, DataNode, and SecondaryNameNode processes running.
Confirm HDFS is operational. Open the NameNode web UI at http://<namenode-ip>:9870 (replace <namenode-ip> with your NameNode’s IP; Hadoop 3.x serves the UI on port 9870, while 2.x used 50070). You should see the HDFS dashboard with cluster information. Then list the root of the file system from the command line:
hdfs dfs -ls /
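As a further check, a short smoke test (the /tmp/hdfs-test path and sample file name here are just examples) exercises writes and reads and prints a cluster report:
hdfs dfs -mkdir -p /tmp/hdfs-test
echo "hello hdfs" > /tmp/sample.txt
hdfs dfs -put /tmp/sample.txt /tmp/hdfs-test/
hdfs dfs -cat /tmp/hdfs-test/sample.txt
hdfs dfsadmin -report        # shows live DataNodes and capacity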
To stop HDFS services, run:
stop-dfs.sh
If you run into problems: make sure the Hadoop user owns the installation (e.g., chown -R hadoop:hadoop /usr/local/hadoop); confirm the expected ports are listening with netstat -tuln; and inspect the logs in $HADOOP_HOME/logs for errors (the NameNode and DataNode each write their own log file there).
By following these steps, you’ll have a fully functional HDFS deployment on CentOS, ready to store and manage large datasets in a distributed environment.