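On Ubuntu, Hadoop Streaming lets you write MapReduce programs in languages other than Java, such as Python, Ruby, or PHP; any executable that reads from standard input and writes to standard output will work. The detailed steps are as follows: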
sudo apt-get update
sudo apt-get install python3 python3-pip
Suppose we want to write a simple WordCount program.
Mapper (mapper.py):
#!/usr/bin/env python3
import sys
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print(f"{word}\t1")
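To make the mapper's behavior concrete, here is a minimal sketch that applies the same split-and-emit logic to two made-up sample lines instead of reading standard input. The function name map_line and the sample data are illustrative assumptions, not part of the original script.

```python
def map_line(line):
    """Yield (word, 1) for each whitespace-separated token, mirroring mapper.py."""
    for word in line.strip().split():
        yield word, 1


# Hypothetical sample input; a real job receives lines from HDFS via stdin.
sample = ["hello world", "hello hadoop"]
pairs = [pair for line in sample for pair in map_line(line)]

# Print in the same tab-separated "key<TAB>value" form that Hadoop Streaming expects.
for word, count in pairs:
    print(f"{word}\t{count}")
```

Note that the mapper does no aggregation: a word that appears twice is emitted twice, and the counting is left entirely to the reducer.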
Reducer (reducer.py):
#!/usr/bin/env python3
import sys
current_word = None
current_count = 0
word = None
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_count = count
        current_word = word
# Emit the last word; skip this when the input was empty (current_word is still None)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
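The reducer only works because Hadoop sorts the mapper output by key before handing it over, so all lines for one word arrive consecutively. The following is a rough local simulation of that map, sort, reduce flow, assuming small in-memory input; the function name run_local and the sample lines are hypothetical, and itertools.groupby stands in for the reducer's manual key-change detection.

```python
from itertools import groupby
from operator import itemgetter


def run_local(lines):
    """Simulate map -> sort (shuffle) -> reduce for WordCount in memory."""
    # Map phase: one (word, 1) pair per token.
    mapped = [(word, 1) for line in lines for word in line.split()]
    # Shuffle phase: sorting by key puts equal words next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: sum the counts of each run of identical keys.
    return [
        (word, sum(count for _, count in group))
        for word, group in groupby(mapped, key=itemgetter(0))
    ]


result = run_local(["hello world", "hello hadoop"])
for word, total in result:
    print(f"{word}\t{total}")
```

If the sort step is removed, identical words are no longer adjacent and the same word can be reported more than once, which is exactly the failure the streaming reducer would show on unsorted input.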
Make sure both scripts are executable:
chmod +x mapper.py
chmod +x reducer.py
Upload your input data to HDFS:
hdfs dfs -mkdir /input
hdfs dfs -put /path/to/your/local/input/file.txt /input
Run the Hadoop Streaming job with the hadoop jar command:
hadoop jar /path/to/hadoop-streaming.jar \
    -files mapper.py,reducer.py \
    -mapper "python3 mapper.py" \
    -reducer "python3 reducer.py" \
    -input /input/file.txt \
    -output /output
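-files: the files to ship to the MapReduce tasks.
-mapper: the command that runs the Mapper script.
-reducer: the command that runs the Reducer script.
-input: the HDFS path of the input data.
-output: the HDFS path for the output data; this directory must not already exist.
Once the job has finished, you can view the output: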
hdfs dfs -cat /output/part-*
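If you want to post-process the result, the part files contain one tab-separated word and count per line. The sketch below parses such text into a dictionary and ranks the words by count; the function name parse_counts and the sample text are made up for illustration and are not real job output.

```python
def parse_counts(text):
    """Parse 'word<TAB>count' lines into a dict, summing duplicate words."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        word, count = line.split("\t", 1)
        counts[word] = counts.get(word, 0) + int(count)
    return counts


# Hypothetical text, standing in for what `hdfs dfs -cat` would print.
sample_output = "hadoop\t1\nhello\t2\nworld\t1\n"

# Rank by descending count, breaking ties alphabetically.
top = sorted(parse_counts(sample_output).items(), key=lambda kv: (-kv[1], kv[0]))
print(top)
```

Summing duplicates is a deliberate choice here: a job with several reducers writes several part files, and concatenating them should still give one total per word.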
Make sure the path to hadoop-streaming.jar is correct. With the steps above, you can use Hadoop Streaming on Ubuntu to run MapReduce programs written in languages other than Java.
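One way to check the jar path is to search for it under your Hadoop installation. The sketch below assumes the common tarball layout, where the jar sits in share/hadoop/tools/lib under HADOOP_HOME; package-based installs may place it elsewhere, and find_streaming_jar is a hypothetical helper name.

```python
import glob
import os


def find_streaming_jar(hadoop_home):
    """Return the first hadoop-streaming jar under hadoop_home, or None.

    Assumes the tarball layout <hadoop_home>/share/hadoop/tools/lib/,
    which may not match every installation.
    """
    pattern = os.path.join(
        hadoop_home, "share", "hadoop", "tools", "lib", "hadoop-streaming-*.jar"
    )
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None


if __name__ == "__main__":
    # HADOOP_HOME is read from the environment; an unset variable finds nothing.
    home = os.environ.get("HADOOP_HOME", "")
    print(find_streaming_jar(home) or "hadoop-streaming jar not found")
```

The returned path can then be used in place of /path/to/hadoop-streaming.jar in the hadoop jar command.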