# SQL Hive中的Select From語句使用詳解
## 1. Hive SQL基礎概述
Apache Hive是構建在Hadoop之上的數據倉庫工具,它提供了類似SQL的查詢語言——HiveQL(簡稱HQL),允許用戶對存儲在HDFS中的大規模數據集進行查詢和分析。其中`SELECT FROM`語句作為最基礎且核心的查詢語法,承擔著數據檢索的重要功能。
### 1.1 Hive與傳統SQL的異同
相同點:
- 語法結構高度相似
- 支持標準SQL操作(SELECT, JOIN, WHERE等)
- 使用表、列等關系型概念
不同點:
- Hive處理PB級數據,傳統SQL通常處理GB/TB級
- Hive執行延遲較高(分鐘級),傳統SQL是毫秒級響應
- Hive底層是MapReduce/Tez/Spark作業
## 2. 基本SELECT FROM語法結構
### 2.1 最簡查詢形式
```sql
SELECT column1, column2
FROM database_name.table_name;
示例:
SELECT employee_id, name
FROM hr.employees;
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
-- 選擇單列
SELECT product_name FROM products;
-- 選擇多列
SELECT product_id, category, price FROM products;
-- 選擇所有列(生產環境慎用)
SELECT * FROM sales_records;
-- 選擇特定前綴的列
SELECT sales.* FROM sales_2023 sales;
SELECT
user_id AS id,
registration_date AS reg_date
FROM users;
SELECT
product_id,
quantity,
price,
quantity * price AS total_amount
FROM order_items;
-- 數值比較
SELECT * FROM products WHERE price > 100;
-- 字符串比較
SELECT * FROM customers WHERE name = '張三';
-- 日期比較
SELECT * FROM orders WHERE order_date >= '2023-01-01';
-- AND 運算符
SELECT * FROM employees
WHERE department = 'Sales' AND salary > 5000;
-- OR 運算符
SELECT * FROM products
WHERE category = 'Electronics' OR category = 'Appliances';
-- NOT 運算符
SELECT * FROM customers
WHERE NOT (age BETWEEN 18 AND 30);
-- IN 操作符
SELECT * FROM products
WHERE category IN ('Books', 'Music', 'Movies');
-- BETWEEN 范圍查詢
SELECT * FROM transactions
WHERE amount BETWEEN 100 AND 1000;
-- LIKE 模糊匹配
SELECT * FROM users
WHERE email LIKE '%@gmail.com';
-- IS NULL 空值判斷
SELECT * FROM employees
WHERE manager_id IS NULL;
-- 默認升序(ASC)
SELECT * FROM products
ORDER BY price;
-- 顯式降序
SELECT * FROM sales
ORDER BY sale_date DESC;
SELECT employee_id, name, department, salary
FROM employees
ORDER BY department ASC, salary DESC;
-- 獲取價格最高的10個產品
SELECT * FROM products
ORDER BY price DESC
LIMIT 10;
-- 按部門統計員工數
SELECT department, COUNT(*) AS emp_count
FROM employees
GROUP BY department;
常用聚合函數: - COUNT() - 計數 - SUM() - 求和 - AVG() - 平均值 - MAX()/MIN() - 最大/最小值 - VARIANCE()/STDDEV() - 方差/標準差
示例:
SELECT
product_category,
AVG(price) AS avg_price,
MAX(price) AS max_price,
COUNT(*) AS product_count
FROM products
GROUP BY product_category;
-- 篩選平均價格大于100的類別
SELECT
category,
AVG(price) AS avg_price
FROM products
GROUP BY category
HAVING avg_price > 100;
-- 獲取唯一的部門列表
SELECT DISTINCT department FROM employees;
-- 多列去重
SELECT DISTINCT department, job_title FROM employees;
SELECT
product_id,
price,
CASE
WHEN price > 1000 THEN 'Premium'
WHEN price > 500 THEN 'Standard'
ELSE 'Budget'
END AS price_tier
FROM products;
-- 計算部門內薪資排名
SELECT
employee_id,
name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees;
-- 查詢特定分區的數據
SELECT * FROM sales
WHERE dt = '2023-01-15' AND region = 'North';
-- 動態分區查詢
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- 創建分桶表
CREATE TABLE bucketed_users (
id INT,
name STRING
) CLUSTERED BY (id) INTO 4 BUCKETS;
-- 分桶查詢
SELECT * FROM bucketed_users
TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
-- 數據塊抽樣
SELECT * FROM large_table TABLESAMPLE(10 PERCENT);
-- 行數抽樣
SELECT * FROM large_table TABLESAMPLE(100 ROWS);
SELECT *
SET hive.exec.parallel=true;
-- 分析每月銷售趨勢
SELECT
DATE_FORMAT(order_date, 'yyyy-MM') AS month,
COUNT(DISTINCT order_id) AS order_count,
SUM(amount) AS total_sales,
AVG(amount) AS avg_order_value
FROM orders
WHERE order_date BETWEEN '2022-01-01' AND '2022-12-31'
GROUP BY DATE_FORMAT(order_date, 'yyyy-MM')
ORDER BY month;
-- 識別高價值用戶
SELECT
user_id,
COUNT(DISTINCT session_id) AS session_count,
SUM(CASE WHEN page_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_count,
SUM(duration) AS total_time_spent
FROM user_sessions
GROUP BY user_id
HAVING purchase_count >= 3
ORDER BY total_time_spent DESC
LIMIT 100;
數據傾斜:使用SKEW JOIN
優化
SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=100000;
內存不足:調整Reducer數量
SET mapreduce.job.reduces=200;
小文件問題:合并輸出
SET hive.merge.mapfiles=true;
SET hive.merge.size.per.task=256000000;
Hive的SELECT FROM語句作為數據查詢的基石,雖然語法與傳統SQL相似,但在大數據環境下有其獨特的優化方式和執行特性。掌握其使用方法和優化技巧,能夠顯著提升海量數據查詢的效率。隨著Hive版本的迭代,更多現代SQL特性如CTE、高級窗口函數等也被逐步支持,使得HiveQL在處理復雜分析場景時更加得心應手。
注意:實際使用時請根據具體Hive版本調整語法,本文示例基于Hive 3.x版本。 “`
這篇文章共計約2900字,全面介紹了Hive中SELECT FROM語句的使用方法,從基礎語法到高級特性,包含了大量實用示例和優化建議。內容采用Markdown格式,包含代碼塊、列表、標題等標準元素,適合技術文檔的編寫和閱讀。
免責聲明:本站發布的內容(圖片、視頻和文字)以原創、轉載和分享為主,文章觀點不代表本網站立場,如果涉及侵權請聯系站長郵箱:is@yisu.com進行舉報,并提供相關證據,一經查實,將立刻刪除涉嫌侵權內容。