Prerequisites for Hadoop HA on Debian
Before configuring Hadoop High Availability (HA), ensure the following prerequisites are met:
- Install Java on all nodes (sudo apt install openjdk-11-jdk).
- Install the same Hadoop release in the same location (e.g., /opt/hadoop) on all nodes.
- Assign descriptive hostnames (e.g., namenode1, journalnode1) and update /etc/hosts with IP-hostname mappings (e.g., 192.168.1.10 namenode1).
- Generate SSH keys (ssh-keygen -t rsa) and distribute public keys (ssh-copy-id user@node-ip) to enable passwordless login between nodes.
1. Configure ZooKeeper Cluster (Critical for Coordination)
ZooKeeper ensures consistent failover by managing locks and leader election for NameNode/ResourceManager.
- Install ZooKeeper on each quorum node: sudo apt install zookeeper zookeeperd.
- Edit /etc/zookeeper/conf/zoo.cfg on all nodes to include server entries (replace 1, 2, 3 with your node IDs and IPs):
server.1=192.168.1.10:2888:3888
server.2=192.168.1.11:2888:3888
server.3=192.168.1.12:2888:3888
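Each quorum node also needs a myid file whose value matches its server.N line above. On Debian's zookeeperd package the data directory is typically /var/lib/zookeeper (an assumption here; check the dataDir setting in zoo.cfg), so a minimal sketch for node 1 is:
echo 1 | sudo tee /var/lib/zookeeper/myid   # use 2 and 3 on the other nodes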
- Start ZooKeeper with sudo systemctl start zookeeper on all nodes and verify it is running (sudo systemctl status zookeeper).
- Run hdfs zkfc -formatZK to create a znode for HA coordination (run this once the HDFS HA settings in step 2 are in place).
2. Configure HDFS High Availability (NameNode HA)
HDFS HA uses an Active/Passive NameNode pair with Quorum Journal Manager (QJM) for shared edits.
core-site.xml: Add the default file system and ZooKeeper quorum:
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>192.168.1.10:2181,192.168.1.11:2181,192.168.1.12:2181</value>
</property>
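Note that these <property> entries (here and in the files below) belong inside the <configuration> root element of the named file in Hadoop's configuration directory. A minimal sketch, assuming Hadoop is installed under /opt/hadoop:
<!-- /opt/hadoop/etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster</value>
  </property>
  <!-- remaining properties from this guide go here -->
</configuration>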
hdfs-site.xml: Define NameNode roles, RPC addresses, shared edits, and failover settings:
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>192.168.1.10:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>192.168.1.11:8020</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://192.168.1.13:8485;192.168.1.14:8485;192.168.1.15:8485/mycluster</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
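Two further hdfs-site.xml settings are commonly needed alongside the above: the ZKFC will not perform automatic failover without a fencing method, and each JournalNode needs a local directory for its edit logs. A hedged sketch (the path is an assumption; adjust to your layout):
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- shell(/bin/true) is a common choice with QJM; sshfence is an alternative -->
  <value>shell(/bin/true)</value>
</property>
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/opt/hadoop/journalnode</value>
</property>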
- Start the JournalNodes (hadoop-daemon.sh start journalnode on each JournalNode host) so they can store shared edit logs.
- If the cluster is new, format the active NameNode (hdfs namenode -format) and start it.
- On the standby NameNode (namenode2), run hdfs namenode -bootstrapStandby to sync metadata from the active NameNode.
- Start HDFS (start-dfs.sh). Verify status with hdfs haadmin -getServiceState nn1 (should return "active") and hdfs haadmin -getServiceState nn2 (should return "standby").
3. Configure YARN High Availability (ResourceManager HA)
YARN HA enables failover for ResourceManager, which schedules resources for applications.
yarn-site.xml: Enable ResourceManager HA and define roles:
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>192.168.1.10</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>192.168.1.11</value>
</property>
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>192.168.1.10:2181,192.168.1.11:2181,192.168.1.12:2181</value>
</property>
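ResourceManager HA also expects a cluster ID that is identical on both ResourceManagers, so they join the same leader election in ZooKeeper. A minimal addition to yarn-site.xml (the value is an assumption; any consistent string works):
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-ha-cluster</value>
</property>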
- Start the ResourceManager on both nodes with yarn-daemon.sh start resourcemanager. Verify status with yarn rmadmin -getServiceState rm1 (should return "active" or "standby").
4. Configure Data Redundancy and Backup
Ensure data availability via replication and snapshots:
- In hdfs-site.xml, configure dfs.replication (default is 3) to store multiple copies of data blocks across nodes.
- Enable point-in-time snapshots of important HDFS directories (hdfs dfsadmin -allowSnapshot /path), then create them with hdfs dfs -createSnapshot /path.
- Run hdfs dfsadmin -saveNamespace to take periodic namespace backups, and copy critical data to external storage.
5. Set Up Monitoring and Alerting
Proactively monitor cluster health to detect failures early:
- Use the Hadoop web UIs (e.g., the NameNode UI at http://namenode1:9870) to track metrics like node status, disk usage, and job progress.
6. Validate High Availability
Test failover to ensure automatic recovery:
- Stop the active NameNode (hadoop-daemon.sh stop namenode on namenode1) and verify that the standby becomes active (hdfs haadmin -getServiceState nn2 should return "active").
- Stop the active ResourceManager (yarn-daemon.sh stop resourcemanager on resourcemanager1) and check that the standby takes over (yarn rmadmin -getServiceState rm2 should return "active").
- Write test data to HDFS (hdfs dfs -put /local/file /test) and read it back after failover to confirm data integrity. A consolidated test sequence is sketched below.
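For convenience, the checks above can be strung together into one test pass. A minimal sketch, assuming the Hadoop commands are on PATH on each host and that /test is an existing HDFS directory (both assumptions):
# On namenode1: write test data, then stop the active NameNode
hdfs dfs -put /local/file /test
hadoop-daemon.sh stop namenode
# From any node: confirm failover and data integrity
hdfs haadmin -getServiceState nn2        # expect "active"
hdfs dfs -cat /test/file | head          # data should still be readable
# On resourcemanager1: repeat the exercise for YARN
yarn-daemon.sh stop resourcemanager
yarn rmadmin -getServiceState rm2        # expect "active"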