Hadoop Streaming on Ubuntu lets you write MapReduce programs in non-Java languages such as Python, Ruby, or PHP. The following is a step-by-step guide. First, make sure Python is installed:
sudo apt-get update
sudo apt-get install python3 python3-pip
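After installing, you can confirm that the interpreter the streaming job will invoke is actually on the PATH (the job command later in this guide calls it as `python3`, and every worker node needs the same interpreter available):

```shell
# Check that python3 resolves on PATH and report its version
command -v python3
python3 --version
```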
As an example, let's write a simple WordCount program.
Mapper (mapper.py):
#!/usr/bin/env python3
import sys

# Read input lines from stdin; Hadoop Streaming feeds each input
# split to the mapper one line at a time.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        # Emit tab-separated key/value pairs: word<TAB>1
        print(f"{word}\t1")
Reducer (reducer.py):
#!/usr/bin/env python3
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip malformed lines
        continue
    # Hadoop sorts mapper output by key before it reaches the
    # reducer, so identical words arrive on adjacent lines.
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_count = count
        current_word = word

# Emit the final word
if current_word == word:
    print(f"{current_word}\t{current_count}")
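Before submitting to the cluster, you can reproduce the whole map → shuffle → reduce flow in a few lines of plain Python. This is a minimal sketch of what Hadoop Streaming does with the two scripts above: `sorted` pairs stand in for the shuffle phase, and the `groupby` mirrors the current_word/current_count loop in reducer.py.

```python
from itertools import groupby

def simulate_wordcount(text):
    # Map phase: emit (word, 1) for every word, like mapper.py
    pairs = [(word, 1) for word in text.split()]
    # Shuffle/sort phase: Hadoop sorts mapper output by key
    # before any reducer sees it
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: sum the counts of each run of identical keys,
    # just as reducer.py does over adjacent lines
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=lambda kv: kv[0])}

print(simulate_wordcount("foo bar foo baz foo bar"))
# → {'bar': 2, 'baz': 1, 'foo': 3}
```

On the command line, the equivalent local test of the real scripts is `cat file.txt | ./mapper.py | sort | ./reducer.py`.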
Make both scripts executable:
chmod +x mapper.py
chmod +x reducer.py
Upload your input data to HDFS:
hdfs dfs -mkdir /input
hdfs dfs -put /path/to/your/local/input/file.txt /input
Run the Hadoop Streaming job with the hadoop jar command:
hadoop jar /path/to/hadoop-streaming.jar \
-files mapper.py,reducer.py \
-mapper "python3 mapper.py" \
-reducer "python3 reducer.py" \
-input /input/file.txt \
-output /output
The flags mean the following:

-files: ships the listed files to every MapReduce task.
-mapper: the command to run as the mapper.
-reducer: the command to run as the reducer.
-input: HDFS path of the input data.
-output: HDFS path for the results; this directory must not already exist, or the job will fail.

Once the job completes, view the output:
hdfs dfs -cat /output/part-r-00000
Make sure the path to hadoop-streaming.jar is correct; it is typically found under $HADOOP_HOME/share/hadoop/tools/lib/. With the steps above, you can run MapReduce programs written in non-Java languages on Ubuntu using Hadoop Streaming.