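# Setting Up Hadoop and Analyzing a WordCount Run
## 1. Hadoop Overview
### 1.1 Introduction to Hadoop
Hadoop is a distributed-systems infrastructure developed by the Apache Software Foundation. Its core components are:
- **HDFS** (Hadoop Distributed File System): distributed file storage
- **MapReduce**: distributed computation framework
- **YARN**: resource scheduling and management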
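### 1.2 Core Strengths
- Fault tolerance: multiple replicas of the data are maintained automatically
- Scalability: clusters can be deployed on inexpensive hardware
- Efficiency: petabyte-scale data is processed in parallel
- Reliability: automatic failover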
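## 2. Setting Up the Hadoop Environment
### 2.1 Prerequisites
**Hardware requirements:**
- At least 3 nodes (1 master, 2 workers)
- 4 GB of memory and 50 GB of disk per node
- Gigabit network connectivity
**Software requirements:**
- JDK 1.8+
- Passwordless SSH login
- Linux (CentOS or Ubuntu recommended)
### 2.2 Detailed Installation Steps
#### 2.2.1 System Configuration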
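```bash
# Stop and disable the firewall
systemctl stop firewalld
systemctl disable firewalld

# Set the hostname: run only the matching line on each node
hostnamectl set-hostname master   # on the master node
hostnamectl set-hostname slave1   # on the first worker node
hostnamectl set-hostname slave2   # on the second worker node

# Add the cluster hostnames to the hosts file on every node
echo "192.168.1.100 master
192.168.1.101 slave1
192.168.1.102 slave2" >> /etc/hosts
```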
#### 2.2.2 Install the JDK
```bash
# Unpack the JDK and export JAVA_HOME for login shells
tar -zxvf jdk-8u341-linux-x64.tar.gz -C /usr/local/
echo 'export JAVA_HOME=/usr/local/jdk1.8.0_341
export PATH=$PATH:$JAVA_HOME/bin' >> /etc/profile
source /etc/profile
```
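#### 2.2.3 Install Hadoop
```bash
# Download Hadoop 3.3.4 and unpack it to /usr/local/hadoop
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar -zxvf hadoop-3.3.4.tar.gz -C /usr/local/
mv /usr/local/hadoop-3.3.4 /usr/local/hadoop
# Put hdfs, hadoop, and the start-*.sh scripts on PATH (the later steps call them directly)
echo 'export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> /etc/profile
source /etc/profile
```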
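#### 2.2.4 Edit the Configuration Files
The files below are in `/usr/local/hadoop/etc/hadoop/`.

`core-site.xml`:
```xml
<configuration>
  <!-- Default file system: the NameNode on master, port 9000 -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <!-- Base directory for Hadoop's temporary files -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>
```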
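`hdfs-site.xml`:
```xml
<configuration>
  <!-- Number of replicas kept for each block -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- Local directory where the NameNode stores its metadata -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hadoop/namenode</value>
  </property>
</configuration>
```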
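
With `dfs.replication` set to 2, HDFS keeps two copies of every block, so usable capacity is roughly the raw DataNode capacity divided by two. A minimal sketch of that arithmetic follows. The class `ReplicationCapacity` is hypothetical, and the figures assume that both worker nodes run a DataNode and offer their full 50 GB to HDFS, which overstates real capacity because the OS and non-HDFS files also need space.

```java
// Hypothetical helper: upper bound on usable HDFS capacity under replication.
class ReplicationCapacity {

    // Every block is stored `replication` times, so divide the raw capacity by it.
    static long usableGb(int dataNodes, long diskPerNodeGb, int replication) {
        long rawGb = dataNodes * diskPerNodeGb;
        return rawGb / replication;
    }

    public static void main(String[] args) {
        // Assumed layout: 2 DataNodes x 50 GB, replication factor 2
        System.out.println(usableGb(2, 50, 2)); // prints 50
    }
}
```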
`mapred-site.xml`:
```xml
<configuration>
  <!-- Run MapReduce jobs on YARN -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```
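#### 2.2.5 Format HDFS and Start the Services
```bash
# Format HDFS (run once on master; reformatting erases the existing HDFS metadata)
hdfs namenode -format
# Start the services
start-dfs.sh
start-yarn.sh
# Verify the services
jps
# Expected: NameNode/DataNode/ResourceManager/NodeManager
# (which daemons appear on a given node depends on its role)
```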
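## 3. WordCount Example
### 3.1 Processing Flow
WordCount is the "Hello World" of MapReduce. The processing flow is:
1. InputSplit: the input files are divided into splits
2. Map phase: extract a `<word, 1>` key-value pair for each word
3. Shuffle phase: group the pairs by word
4. Reduce phase: sum the counts for each word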
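
The flow above can be sketched without a cluster. The following is a minimal single-process illustration in plain Java, not Hadoop code: the class `WordCountFlow` and its method names are hypothetical, each input line stands in for one record, and a sorted map stands in for the shuffle.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process illustration of the word-count flow.
// It uses no Hadoop classes; a TreeMap stands in for the shuffle and sort.
class WordCountFlow {

    // Map step: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle step: group the emitted values by key, with keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum the values collected for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Run map, shuffle, and reduce over a list of input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        // prints {Hadoop=1, Hello=2, World=1}
        System.out.println(run(Arrays.asList("Hello World Hello Hadoop")));
    }
}
```

On a real cluster the map calls run in parallel, one task per input split, and the shuffle moves data between nodes; the sketch only shows the order of the steps and the shape of the data at each one.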
### 3.2 Code Implementation
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token in a line
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts received for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
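
The driver calls `job.setCombinerClass(IntSumReducer.class)`, reusing the reducer as a combiner. That is safe here because addition is associative and commutative, so partial sums computed on the map side do not change the final totals. The sketch below is a plain-Java illustration of the effect, counting how many records one map task would hand to the shuffle with and without local combining. The class `CombinerSketch` and its methods are hypothetical and use no Hadoop APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical illustration of what a summing combiner does for word count.
class CombinerSketch {

    // Without a combiner, a map task emits one (word, 1) record per token.
    static int recordsWithoutCombiner(String split) {
        return new StringTokenizer(split).countTokens();
    }

    // With a summing combiner, records sharing a key are merged locally,
    // so at most one (word, partialSum) record per distinct word leaves the task.
    static Map<String, Integer> combine(String split) {
        Map<String, Integer> partialSums = new LinkedHashMap<>();
        StringTokenizer itr = new StringTokenizer(split);
        while (itr.hasMoreTokens()) {
            partialSums.merge(itr.nextToken(), 1, Integer::sum);
        }
        return partialSums;
    }

    public static void main(String[] args) {
        String split = "Hello World Hello Hadoop";
        System.out.println(recordsWithoutCombiner(split)); // prints 4
        System.out.println(combine(split));                // prints {Hello=2, World=1, Hadoop=1}
    }
}
```

Hadoop treats the combiner as an optional optimization and may run it zero or more times, so a job has to produce the same result without it.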
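### 3.3 Running the Job
```bash
# Create a sample input file and upload it to HDFS
echo "Hello World Hello Hadoop" > input.txt
hdfs dfs -mkdir /input
hdfs dfs -put input.txt /input
# Run the job (the /output directory must not exist yet)
hadoop jar wordcount.jar WordCount /input /output
# Print the result
hdfs dfs -cat /output/part-r-00000
# Sample output:
# Hadoop 1
# Hello 2
# World 1
```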
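## 4. Performance Tuning
```xml
<!-- Example tuning properties for mapred-site.xml; place them inside <configuration> -->
<!-- Map-side sort buffer size, in MB -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>200</value>
</property>
<!-- Memory requested for each map task, in MB -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
```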
| Component | Share of resources | Notes |
|---|---|---|
| Map tasks | 40-60% | CPU-intensive |
| Reduce tasks | 20-30% | Need more network resources |
| System reserve | 10-20% | OS and other services |
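
As a worked example of the table above, the following sketch converts the percentage ranges into megabytes for a single node. It is an illustration only: the class `ResourceSplit` is hypothetical, and applying the shares to the memory of one 4 GB node (the minimum from section 2.1) is an assumption, since the table does not say which resource the percentages refer to.

```java
// Hypothetical helper: turn a percentage range into a memory range in MB.
class ResourceSplit {

    // Returns {low, high} in MB; integer division rounds down.
    static long[] rangeMb(long totalMb, int lowPercent, int highPercent) {
        return new long[] { totalMb * lowPercent / 100, totalMb * highPercent / 100 };
    }

    public static void main(String[] args) {
        long nodeMb = 4096; // assumed: one node with 4 GB of memory
        long[] map = rangeMb(nodeMb, 40, 60);
        long[] reduce = rangeMb(nodeMb, 20, 30);
        long[] system = rangeMb(nodeMb, 10, 20);
        System.out.println("Map tasks:      " + map[0] + "-" + map[1] + " MB");       // 1638-2457 MB
        System.out.println("Reduce tasks:   " + reduce[0] + "-" + reduce[1] + " MB"); // 819-1228 MB
        System.out.println("System reserve: " + system[0] + "-" + system[1] + " MB"); // 409-819 MB
    }
}
```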
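## 5. Troubleshooting
- The permissions on `~/.ssh/authorized_keys` should be `600`
- Check whether `ssh master` logs in without a password
- NameNode log location: `/usr/local/hadoop/logs/hadoop-*-namenode-*.log`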
```bash
# List applications and their status
yarn application -list
# Kill an application
yarn application -kill application_123456789_0001
```
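## 6. Summary
Hadoop is a cornerstone of the big data ecosystem. Its core value lies in:
1. Enabling large-scale computation on inexpensive hardware
2. Providing a reliable data storage solution
3. Pioneering a paradigm for distributed computing

Future trends:
- Convergence with cloud-native technology (Kubernetes scheduling)
- Stronger real-time computing capabilities (Flink integration)
- A more complete machine learning ecosystem (TensorFlow on YARN)

Note: this article was verified against Hadoop 3.3.4. The complete experiment code and configuration files are available in the GitHub sample repository.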