# Scraping Weibo Hot Search with Python and Visualizing the Data
## Table of Contents
1. [Project Background and Goals](#project-background-and-goals)
2. [Tech Stack and Tooling](#tech-stack-and-tooling)
3. [Scraping Weibo Hot Search Data](#scraping-weibo-hot-search-data)
   - [3.1 Analyzing the Weibo Page Structure](#31-analyzing-the-weibo-page-structure)
   - [3.2 Fetching Data with Requests](#32-fetching-data-with-requests)
   - [3.3 Parsing and Cleaning the Data](#33-parsing-and-cleaning-the-data)
4. [Data Storage Options](#data-storage-options)
   - [4.1 Saving to CSV](#41-saving-to-csv)
   - [4.2 Saving to MySQL](#42-saving-to-mysql)
5. [Data Visualization](#data-visualization)
   - [5.1 Basic Charts with Matplotlib](#51-basic-charts-with-matplotlib)
   - [5.2 Interactive Visualization with Pyecharts](#52-interactive-visualization-with-pyecharts)
   - [5.3 Generating a Word Cloud](#53-generating-a-word-cloud)
6. [Complete Code Example](#complete-code-example)
7. [Project Extension Ideas](#project-extension-ideas)
8. [Common Issues and Fixes](#common-issues-and-fixes)
---
## Project Background and Goals
Weibo's hot-search list is one of the most active barometers of public opinion on the Chinese internet, drawing the attention of over 300 million users every day. This project uses the Python stack to:
- scrape the hot-search ranking in near real time
- store the results in a structured form
- produce visualizations from several angles
- lay the groundwork for a public-opinion analysis framework
---
## Tech Stack and Tooling
| Tool/Library   | Purpose                    | Version |
|----------------|----------------------------|---------|
| Python         | Primary language           | 3.7+    |
| Requests       | HTTP client                | 2.26+   |
| BeautifulSoup  | HTML parsing               | 4.10+   |
| PyMySQL        | MySQL access               | 1.0+    |
| Pandas         | Data processing            | 1.3+    |
| Matplotlib     | Static charts              | 3.5+    |
| Pyecharts      | Interactive charts         | 1.9+    |
| WordCloud      | Word cloud generation      | 1.8+    |
| Jieba          | Chinese word segmentation  | 0.42+   |

Installation:
```bash
pip install requests beautifulsoup4 pymysql pandas matplotlib pyecharts wordcloud jieba
```
---
## Scraping Weibo Hot Search Data
### 3.1 Analyzing the Weibo Page Structure
Inspecting https://s.weibo.com/top/summary in the browser shows that the ranking lives in a `<tbody>` with the id `pl_top_realtimehot`, and each row carries the rank (`td.td-01`), a keyword link, and a `<span>` holding the hot score:

```html
<tbody id="pl_top_realtimehot">
  <tr>
    <td class="td-01">…</td>      <!-- rank -->
    <a href="/weibo?q=...">…</a>  <!-- keyword -->
    <span>…</span>                <!-- hot score -->
  </tr>
</tbody>
```
### 3.2 Fetching Data with Requests

```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Cookie': 'your Weibo cookie'  # obtained after logging in
}

def fetch_weibo_hot():
    url = 'https://s.weibo.com/top/summary'
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return None
```
### 3.3 Parsing and Cleaning the Data

```python
def parse_hot_data(html):
    soup = BeautifulSoup(html, 'html.parser')
    tbody = soup.find('tbody', {'id': 'pl_top_realtimehot'})
    hot_items = []
    for tr in tbody.find_all('tr')[1:]:  # skip the pinned/header row
        rank = tr.find('td', class_='td-01').get_text(strip=True)
        keyword = tr.find('a').get_text(strip=True)
        try:
            hot_score = tr.find('span').get_text(strip=True)
        except AttributeError:  # some rows carry no score span
            hot_score = '0'
        hot_items.append({
            'rank': int(rank),
            'keyword': keyword,
            'hot_score': int(hot_score) if hot_score.isdigit() else 0
        })
    return hot_items
```
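The `int(hot_score) if hot_score.isdigit() else 0` fallback above silently zeroes any score whose `<span>` text is not purely numeric. If a span ever carries extra text around the number, a small regex helper (the name `extract_score` is illustrative, not from the original project) can recover the digits instead:

```python
# If a hot-score span reads e.g. "劇 7654321" rather than a bare number,
# int() would raise and isdigit() would discard it; pull out the first
# run of digits instead.
import re

def extract_score(text):
    """Return the first run of digits in text as an int, or 0 if none."""
    m = re.search(r'\d+', text)
    return int(m.group()) if m else 0

print(extract_score('1234567'))      # plain numeric score
print(extract_score('劇 7654321'))   # score with a label prefix
print(extract_score(''))             # empty/missing span text
```

Swapping this in for the `isdigit()` check keeps rows with annotated scores instead of zeroing them out.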
## Data Storage Options
### 4.1 Saving to CSV

```python
import pandas as pd

def save_to_csv(data, filename='weibo_hot.csv'):
    df = pd.DataFrame(data)
    # utf_8_sig writes a BOM so Excel opens the Chinese text correctly
    df.to_csv(filename, index=False, encoding='utf_8_sig')
```
### 4.2 Saving to MySQL

```python
import pymysql

def save_to_mysql(data):
    conn = pymysql.connect(
        host='localhost',
        user='root',
        password='123456',
        database='weibo_data'
    )
    try:
        with conn.cursor() as cursor:
            # `rank` is a reserved word in MySQL 8.0+, so it must be backtick-quoted
            sql = """CREATE TABLE IF NOT EXISTS hot_search (
                id INT AUTO_INCREMENT PRIMARY KEY,
                `rank` INT,
                keyword VARCHAR(100),
                hot_score INT,
                create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )"""
            cursor.execute(sql)
            insert_sql = ("INSERT INTO hot_search (`rank`, keyword, hot_score) "
                          "VALUES (%s, %s, %s)")
            for item in data:
                cursor.execute(insert_sql,
                               (item['rank'], item['keyword'], item['hot_score']))
        conn.commit()
    finally:
        conn.close()
```
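Before pointing the function above at a live MySQL server, the same insert-and-read-back flow can be rehearsed locally with the standard library's `sqlite3` module — a stand-in used here for illustration only, not part of the project's stack. Note the placeholder syntax differs (`?` for sqlite3 versus `%s` for PyMySQL), and `rank` is quoted because it is a keyword in some SQL dialects:

```python
# Local rehearsal of the storage flow with stdlib sqlite3 standing in
# for MySQL. An in-memory database keeps the check self-contained.
import sqlite3

data = [
    {'rank': 1, 'keyword': 'topic A', 'hot_score': 1234567},
    {'rank': 2, 'keyword': 'topic B', 'hot_score': 765432},
]

conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE IF NOT EXISTS hot_search (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    "rank" INTEGER,
    keyword TEXT,
    hot_score INTEGER
)""")
conn.executemany(
    'INSERT INTO hot_search ("rank", keyword, hot_score) VALUES (?, ?, ?)',
    [(d['rank'], d['keyword'], d['hot_score']) for d in data]
)
conn.commit()
rows = conn.execute(
    'SELECT keyword, hot_score FROM hot_search ORDER BY "rank"'
).fetchall()
print(rows)
conn.close()
```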
## Data Visualization
### 5.1 Basic Charts with Matplotlib

```python
import matplotlib.pyplot as plt

# the keyword labels are Chinese, so pick a CJK-capable font
# (SimHei ships with Windows; substitute any installed CJK font)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

def plot_top10(data):
    top10 = sorted(data, key=lambda x: x['hot_score'], reverse=True)[:10]
    keywords = [x['keyword'] for x in top10]
    scores = [x['hot_score'] for x in top10]

    plt.figure(figsize=(12, 6))
    bars = plt.barh(keywords, scores, color='#FF6B81')
    plt.xlabel('Hot score')
    plt.title('Weibo Hot Search Top 10')
    # annotate each bar with its value
    for bar in bars:
        width = bar.get_width()
        plt.text(width, bar.get_y() + bar.get_height() / 2, f'{width:,}')
    plt.tight_layout()
    plt.savefig('top10.png', dpi=300)
```
### 5.2 Interactive Visualization with Pyecharts

```python
from pyecharts.charts import Bar
from pyecharts import options as opts

def create_interactive_chart(data):
    top20 = sorted(data, key=lambda x: x['hot_score'], reverse=True)[:20]
    bar = (
        Bar()
        .add_xaxis([x['keyword'] for x in top20])
        .add_yaxis("Hot score", [x['hot_score'] for x in top20])
        .reversal_axis()  # horizontal bars read better for long keywords
        .set_global_opts(
            title_opts=opts.TitleOpts(title="Weibo Hot Search Top 20"),
            xaxis_opts=opts.AxisOpts(name="Hot score"),
            yaxis_opts=opts.AxisOpts(
                name="Keyword",
                axislabel_opts=opts.LabelOpts(font_size=8)
            )
        )
    )
    return bar.render("hot_search.html")
```
### 5.3 Generating a Word Cloud

```python
from wordcloud import WordCloud
import jieba
import matplotlib.pyplot as plt

def generate_wordcloud(data):
    text = ' '.join(x['keyword'] for x in data)
    word_list = jieba.cut(text)  # segment the Chinese keywords
    word_str = ' '.join(word_list)
    wc = WordCloud(
        font_path='msyh.ttc',  # replace with a CJK font path on your system
        background_color='white',
        width=800,
        height=600
    ).generate(word_str)
    plt.imshow(wc)
    plt.axis('off')
    plt.savefig('wordcloud.png', dpi=300)
```
## Complete Code Example
(For brevity, only the core driver is shown here; a full version should add richer exception handling, logging, and so on.)

```python
# weibo_hot_crawler.py
import logging

def main():
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s: %(message)s'
    )
    try:
        html = fetch_weibo_hot()
        if html:
            data = parse_hot_data(html)
            # storage
            save_to_csv(data)
            save_to_mysql(data)
            # visualization
            plot_top10(data)
            create_interactive_chart(data)
            generate_wordcloud(data)
            logging.info(f"Processed {len(data)} hot-search entries")
    except Exception as e:
        logging.error(f"Crawler failed: {e}")

if __name__ == '__main__':
    main()
```
## Common Issues and Fixes
- **Anti-scraping measures:** Weibo only serves the full list to logged-in sessions, so the request must carry a valid `Cookie` alongside a realistic `User-Agent`, and requests should be spaced out.
- **Missing data:** some rows (e.g. pinned entries) carry no hot-score `<span>`; the parser skips the first row and falls back to a score of 0 rather than crashing.
- **Encoding problems:** write CSVs with `encoding='utf_8_sig'` so the BOM lets Excel display the Chinese keywords correctly.
- **Visualization rendering:** Matplotlib and WordCloud both need a font containing CJK glyphs (e.g. `msyh.ttc`); point them at a font file that actually exists on your system.
- **Database connection:** the `pymysql.connect` parameters must match your environment, and the `weibo_data` database must already exist before the table can be created.
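When requests start failing because of anti-scraping measures, one common mitigation is to retry with exponential backoff. A minimal generic sketch follows (the helper and the stub are illustrative, not from the original project); `sleep` is injectable so the logic can be exercised without real waiting:

```python
# Generic retry-with-exponential-backoff helper. fetch_weibo_hot (or any
# zero-argument callable) can be passed as fn.
import time

def retry_with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            sleep(base_delay * (2 ** attempt))

# Demonstration with a stub that fails twice, then succeeds.
calls = {'n': 0}
def flaky_fetch():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated block')
    return '<html>ok</html>'

delays = []
result = retry_with_backoff(flaky_fetch, retries=5, base_delay=0.5,
                            sleep=delays.append)
print(result, delays)
```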
By working through this project, readers not only pick up the core skills of Python scraping and visualization, but also end up with the skeleton of a complete public-opinion monitoring system. Collect data only in a lawful, compliant manner, and observe Weibo's terms of service.
Notes for running the code:
1. Weibo requires login to serve the full list — replace the placeholder with a real Cookie.
2. Adjust the MySQL connection parameters to your environment.
3. Point the font path at a font that exists on your system.
4. Add appropriate delays to avoid requesting too frequently.
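Point 4 above (spacing out requests) can be folded into a simple polling loop. A sketch with injectable `fetch`, `handle`, and `sleep` callables (all names hypothetical) keeps the interval logic explicit and testable:

```python
# Periodic crawl driver: fetch, hand off for processing, then wait
# `interval` seconds so requests stay infrequent. sleep is injectable
# so the loop can be exercised without real waiting.
import time

def poll(fetch, handle, interval=300, rounds=3, sleep=time.sleep):
    for i in range(rounds):
        handle(fetch())
        if i < rounds - 1:  # no pointless wait after the last round
            sleep(interval)

# Demonstration with stubs in place of the real crawler functions.
seen = []
waits = []
poll(lambda: 'payload', seen.append, interval=60, rounds=3,
     sleep=waits.append)
print(seen, waits)
```

In real use, `fetch` would be `fetch_weibo_hot` and `handle` the parse-store-visualize pipeline from the complete code example.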