This guide provides a step-by-step approach to deploying HDFS (Hadoop Distributed File System) on CentOS, covering both standalone and cluster setups. Follow these steps to set up a robust distributed file system.
Before starting, ensure the following requirements are met:
Hostname resolution: populate /etc/hosts on every node with the node details (e.g., 192.168.1.10 namenode, 192.168.1.11 datanode1); a sample hosts file is shown after the SSH setup below.
Firewall: use firewall-cmd to open the NameNode RPC port (9000) and, for Hadoop 3.x, the web UI port (9870):
sudo firewall-cmd --permanent --zone=public --add-port=9000/tcp
sudo firewall-cmd --permanent --zone=public --add-port=9870/tcp
sudo firewall-cmd --reload
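On DataNode hosts you will likely also want to open the DataNode ports. Assuming default Hadoop 3.x settings (dfs.datanode.* addresses not overridden), these are 9866 (data transfer), 9864 (HTTP UI), and 9867 (IPC):
sudo firewall-cmd --permanent --zone=public --add-port=9866/tcp
sudo firewall-cmd --permanent --zone=public --add-port=9864/tcp
sudo firewall-cmd --permanent --zone=public --add-port=9867/tcp
sudo firewall-cmd --reload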
Passwordless SSH: generate a key on the NameNode and copy it to each DataNode so the start scripts can launch remote daemons:
ssh-keygen -t rsa
ssh-copy-id datanode1
ssh-copy-id datanode2
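For reference, a minimal sketch of /etc/hosts for a three-node layout (one NameNode, two DataNodes); the IP addresses are placeholders and should match your own network:
192.168.1.10 namenode
192.168.1.11 datanode1
192.168.1.12 datanode2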
Hadoop depends on Java. Install OpenJDK 8 using yum:
sudo yum install -y java-1.8.0-openjdk-devel
Verify installation:
java -version
Ensure the output shows Java 1.8.0.
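The JAVA_HOME value used in a later step can be confirmed on your system. One way, assuming the OpenJDK package installed above, is to resolve the real location of the java binary:
readlink -f /usr/bin/java
# The reported path sits under the JDK install directory; on CentOS the
# /usr/lib/jvm/java-1.8.0-openjdk symlink used below points into the same install.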
Download the latest stable Hadoop release from the Apache website. For example, to download Hadoop 3.3.4:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
Extract the tarball to /usr/local and rename the directory for simplicity:
sudo tar -xzvf hadoop-3.3.4.tar.gz -C /usr/local/
sudo mv /usr/local/hadoop-3.3.4 /usr/local/hadoop
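If you plan to run the daemons as a dedicated user rather than root (a common setup, and the one assumed by the ownership tip in the troubleshooting notes at the end), create the user and give it ownership of the installation; the name hadoop here is just a convention:
sudo useradd -m hadoop        # skip if the user already exists
sudo chown -R hadoop:hadoop /usr/local/hadoop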
Set up environment variables to make Hadoop commands accessible globally. Create a new file /etc/profile.d/hadoop.sh:
sudo nano /etc/profile.d/hadoop.sh
Add the following lines (adjust paths if Hadoop is installed elsewhere):
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Make the file executable and apply changes:
sudo chmod +x /etc/profile.d/hadoop.sh
source /etc/profile.d/hadoop.sh
Verify Hadoop installation:
hadoop version
Edit Hadoop configuration files in $HADOOP_HOME/etc/hadoop to define HDFS behavior.
core-site.xml configures the default file system and the NameNode address. Replace namenode with your NameNode’s hostname:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
</configuration>
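Because hadoop.tmp.dir points at /usr/local/hadoop/tmp, it can help to create that directory up front and make sure the user running Hadoop owns it (a small precaution; Hadoop can usually create it itself):
sudo mkdir -p /usr/local/hadoop/tmp
sudo chown -R $(whoami):$(whoami) /usr/local/hadoop/tmp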
hdfs-site.xml sets HDFS-specific parameters such as the replication factor and data directories. First, create directories for NameNode and DataNode data:
sudo mkdir -p /usr/local/hadoop/data/namenode
sudo mkdir -p /usr/local/hadoop/data/datanode
sudo chown -R $(whoami):$(whoami) /usr/local/hadoop/data
Add the following properties to hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value> <!-- Adjust based on your cluster size (e.g., 1 for standalone) -->
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/datanode</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value> <!-- Disable permissions for testing (enable in production) -->
</property>
</configuration>
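Once the file is saved, you can sanity-check that Hadoop picks up the values with hdfs getconf (run after the environment variables from the earlier step are loaded):
hdfs getconf -confKey dfs.replication
hdfs getconf -confKey dfs.namenode.name.dir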
If using YARN for resource management, configure mapred-site.xml and yarn-site.xml as well. In mapred-site.xml, set the MapReduce framework to YARN:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
In yarn-site.xml, configure the shuffle service and the ResourceManager hostname:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>namenode</value> <!-- Replace with your ResourceManager hostname -->
</property>
</configuration>
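If you do configure YARN, its daemons are started separately from HDFS. After formatting and starting HDFS (next steps), you can bring up the ResourceManager and NodeManagers with the bundled script:
start-yarn.sh
jps   # should now also list ResourceManager and NodeManager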
The NameNode must be formatted before first use to initialize its storage. Run this command on the NameNode:
hdfs namenode -format
Follow the prompts to complete formatting. This step creates the necessary directory structure and metadata files.
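A quick way to confirm that formatting succeeded is to check that the metadata directory configured earlier now contains a current/VERSION file (the path assumes the dfs.namenode.name.dir value above):
ls /usr/local/hadoop/data/namenode/current
cat /usr/local/hadoop/data/namenode/current/VERSION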
Start the HDFS services using the start-dfs.sh script (run from the NameNode):
start-dfs.sh
Check the status of HDFS daemons with:
jps
You should see NameNode, DataNode, and SecondaryNameNode processes running.
Confirm HDFS is operational. Open the NameNode web UI at http://<namenode-ip>:9870 (replace <namenode-ip> with your NameNode’s IP; Hadoop 3.x serves the UI on port 9870, while 2.x used 50070). You should see the HDFS dashboard with cluster information. Then list the root of the file system from the command line:
hdfs dfs -ls /
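As a further check, a short smoke test (the /tmp/hdfs-test path and sample file name here are just examples) exercises writes and reads and prints a cluster report:
hdfs dfs -mkdir -p /tmp/hdfs-test
echo "hello hdfs" > /tmp/sample.txt
hdfs dfs -put /tmp/sample.txt /tmp/hdfs-test/
hdfs dfs -cat /tmp/hdfs-test/sample.txt
hdfs dfsadmin -report        # shows live DataNodes and capacity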
To stop HDFS services, run:
stop-dfs.sh
If you run into problems: make sure the Hadoop user owns the installation (e.g., chown -R hadoop:hadoop /usr/local/hadoop); confirm the expected ports are listening with netstat -tuln; and inspect the logs in $HADOOP_HOME/logs for errors (the NameNode and DataNode each write their own log file there).
By following these steps, you’ll have a fully functional HDFS deployment on CentOS, ready to store and manage large datasets in a distributed environment.