# Python中如何把Excel文件讀取為DF并用SQL查詢、分析
## 引言
在數據分析領域,Excel文件是最常見的數據源之一,而Python的pandas庫和SQL查詢的結合能夠提供強大的數據處理能力。本文將詳細介紹如何:
1. 使用pandas讀取Excel文件為DataFrame
2. 通過SQL語法查詢DataFrame數據
3. 進行復雜的數據分析操作
---
## 一、環境準備
### 1.1 安裝必要庫
```bash
pip install pandas openpyxl sqlalchemy
pandas
: 數據處理核心庫openpyxl
: 處理xlsx格式的Excel文件sqlalchemy
: 提供SQL查詢接口創建示例Excel文件sales_data.xlsx
,包含以下字段:
order_id | product | quantity | unit_price | order_date |
---|---|---|---|---|
1001 | Laptop | 2 | 1200 | 2023-01-15 |
import pandas as pd
# 讀取整個Excel文件
df = pd.read_excel('sales_data.xlsx', engine='openpyxl')
# 讀取特定工作表
df = pd.read_excel('sales_data.xlsx', sheet_name='Sheet1')
# 指定列數據類型
dtype = {'order_id': str, 'quantity': float}
df = pd.read_excel('sales_data.xlsx', dtype=dtype)
# 處理空值
df = pd.read_excel('sales_data.xlsx', na_values=['NA', 'NULL'])
from pandasql import sqldf
# 基本查詢
query = "SELECT product, SUM(quantity) as total_qty FROM df GROUP BY product"
result = sqldf(query, globals())
# 多表連接(需提前加載多個DF)
query = """
SELECT a.product, b.category
FROM df a
JOIN product_info b ON a.product = b.product_name
"""
from sqlalchemy import create_engine
# 創建內存數據庫
engine = create_engine('sqlite:///:memory:')
# 將DataFrame存入臨時表
df.to_sql('sales', engine, index=False)
# 執行SQL查詢
query = """
SELECT strftime('%Y-%m', order_date) as month,
SUM(quantity * unit_price) as revenue
FROM sales
GROUP BY month
"""
result = pd.read_sql(query, engine)
query = """
SELECT
product,
SUM(quantity) as total_units,
SUM(quantity * unit_price) as total_revenue,
AVG(unit_price) as avg_price
FROM sales
WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY product
ORDER BY total_revenue DESC
"""
query = """
WITH monthly_stats AS (
SELECT
strftime('%Y-%m', order_date) as month,
COUNT(DISTINCT order_id) as orders_count
FROM sales
GROUP BY month
)
SELECT
month,
orders_count,
LAG(orders_count, 1) OVER (ORDER BY month) as prev_month,
(orders_count - LAG(orders_count, 1) OVER (ORDER BY month)) /
LAG(orders_count, 1) OVER (ORDER BY month) as growth_rate
FROM monthly_stats
"""
# 讀取時只加載必要列
cols = ['product', 'quantity', 'order_date']
df = pd.read_excel('sales_data.xlsx', usecols=cols)
# 提前轉換日期類型
df['order_date'] = pd.to_datetime(df['order_date'])
# 創建數據庫索引
with engine.connect() as conn:
conn.execute("CREATE INDEX idx_product ON sales(product)")
# 使用CTE代替子查詢
query = """
WITH high_value_orders AS (
SELECT order_id
FROM sales
WHERE quantity * unit_price > 1000
)
SELECT * FROM sales WHERE order_id IN high_value_orders
"""
import pandas as pd
from sqlalchemy import create_engine
# 讀取數據
df = pd.read_excel('sales_data.xlsx', parse_dates=['order_date'])
# 創建SQL引擎
engine = create_engine('sqlite:///:memory:')
df.to_sql('sales', engine, index=False)
# 執行復雜分析
query = """
SELECT
product,
SUM(quantity) as total_units,
SUM(quantity * unit_price) as total_revenue,
ROUND(SUM(quantity * unit_price) / SUM(quantity), 2) as effective_price
FROM sales
GROUP BY product
HAVING total_revenue > 5000
ORDER BY total_revenue DESC
"""
result = pd.read_sql(query, engine)
result.to_excel('analysis_result.xlsx', index=False)
通過結合pandas的Excel處理能力和SQL的查詢靈活性,我們可以: 1. 輕松處理百萬行級別的Excel數據 2. 復用現有的SQL知識進行復雜分析 3. 輸出結果到新的Excel文件或其他格式
這種工作流特別適合需要定期處理Excel報表的數據分析師,既保留了Excel的易用性,又獲得了SQL的強大查詢能力。 “`
免責聲明:本站發布的內容(圖片、視頻和文字)以原創、轉載和分享為主,文章觀點不代表本網站立場,如果涉及侵權請聯系站長郵箱:is@yisu.com進行舉報,并提供相關證據,一經查實,將立刻刪除涉嫌侵權內容。