# Scraping Weibo Hot Search with Python and Visualizing the Data
## Table of Contents
1. [Project Background and Goals](#project-background-and-goals)
2. [Tech Stack and Tooling](#tech-stack-and-tooling)
3. [Scraping Weibo Hot Search Data](#scraping-weibo-hot-search-data)
   - [3.1 Analyzing the Weibo Page Structure](#31-analyzing-the-weibo-page-structure)
   - [3.2 Fetching Data with Requests](#32-fetching-data-with-requests)
   - [3.3 Parsing and Cleaning the Data](#33-parsing-and-cleaning-the-data)
4. [Data Storage Options](#data-storage-options)
   - [4.1 Saving to CSV](#41-saving-to-csv)
   - [4.2 Saving to MySQL](#42-saving-to-mysql)
5. [Data Visualization](#data-visualization)
   - [5.1 Basic Charts with Matplotlib](#51-basic-charts-with-matplotlib)
   - [5.2 Interactive Visualization with Pyecharts](#52-interactive-visualization-with-pyecharts)
   - [5.3 Generating a Word Cloud](#53-generating-a-word-cloud)
6. [Complete Code Example](#complete-code-example)
7. [Project Extension Ideas](#project-extension-ideas)
8. [Common Issues and Fixes](#common-issues-and-fixes)
---
## Project Background and Goals
Weibo's hot-search list is one of the most active barometers of public opinion on the Chinese internet, drawing the attention of over 300 million users every day. This project uses the Python stack to:
- scrape the hot-search ranking in near real time
- store the results in a structured form
- produce visualizations from several angles
- lay the groundwork for a public-opinion analysis framework
---
## Tech Stack and Tooling
| Tool/Library   | Purpose                    | Version |
|----------------|----------------------------|---------|
| Python         | Primary language           | 3.7+    |
| Requests       | HTTP client                | 2.26+   |
| BeautifulSoup  | HTML parsing               | 4.10+   |
| PyMySQL        | MySQL access               | 1.0+    |
| Pandas         | Data processing            | 1.3+    |
| Matplotlib     | Static charts              | 3.5+    |
| Pyecharts      | Interactive charts         | 1.9+    |
| WordCloud      | Word cloud generation      | 1.8+    |
| Jieba          | Chinese word segmentation  | 0.42+   |

Installation:
```bash
pip install requests beautifulsoup4 pymysql pandas matplotlib pyecharts wordcloud jieba
```
---
## Scraping Weibo Hot Search Data
### 3.1 Analyzing the Weibo Page Structure
Inspecting https://s.weibo.com/top/summary in the browser shows that the ranking lives in a `<tbody>` with the id `pl_top_realtimehot`, and each row carries the rank (`td.td-01`), a keyword link, and a `<span>` holding the hot score:

```html
<tbody id="pl_top_realtimehot">
  <tr>
    <td class="td-01">…</td>      <!-- rank -->
    <a href="/weibo?q=...">…</a>  <!-- keyword -->
    <span>…</span>                <!-- hot score -->
  </tr>
</tbody>
```
### 3.2 Fetching Data with Requests

```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Cookie': 'your Weibo cookie'  # obtained after logging in
}

def fetch_weibo_hot():
    url = 'https://s.weibo.com/top/summary'
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return None
```
### 3.3 Parsing and Cleaning the Data

```python
def parse_hot_data(html):
    soup = BeautifulSoup(html, 'html.parser')
    tbody = soup.find('tbody', {'id': 'pl_top_realtimehot'})
    hot_items = []
    for tr in tbody.find_all('tr')[1:]:  # skip the pinned/header row
        rank = tr.find('td', class_='td-01').get_text(strip=True)
        keyword = tr.find('a').get_text(strip=True)
        try:
            hot_score = tr.find('span').get_text(strip=True)
        except AttributeError:  # some rows carry no score span
            hot_score = '0'
        hot_items.append({
            'rank': int(rank),
            'keyword': keyword,
            'hot_score': int(hot_score) if hot_score.isdigit() else 0
        })
    return hot_items
```
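The `int(hot_score) if hot_score.isdigit() else 0` fallback above silently zeroes any score whose `<span>` text is not purely numeric. If a span ever carries extra text around the number, a small regex helper (the name `extract_score` is illustrative, not from the original project) can recover the digits instead:

```python
# If a hot-score span reads e.g. "劇 7654321" rather than a bare number,
# int() would raise and isdigit() would discard it; pull out the first
# run of digits instead.
import re

def extract_score(text):
    """Return the first run of digits in text as an int, or 0 if none."""
    m = re.search(r'\d+', text)
    return int(m.group()) if m else 0

print(extract_score('1234567'))      # plain numeric score
print(extract_score('劇 7654321'))   # score with a label prefix
print(extract_score(''))             # empty/missing span text
```

Swapping this in for the `isdigit()` check keeps rows with annotated scores instead of zeroing them out.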
## Data Storage Options
### 4.1 Saving to CSV

```python
import pandas as pd

def save_to_csv(data, filename='weibo_hot.csv'):
    df = pd.DataFrame(data)
    # utf_8_sig writes a BOM so Excel opens the Chinese text correctly
    df.to_csv(filename, index=False, encoding='utf_8_sig')
```
### 4.2 Saving to MySQL

```python
import pymysql

def save_to_mysql(data):
    conn = pymysql.connect(
        host='localhost',
        user='root',
        password='123456',
        database='weibo_data'
    )
    try:
        with conn.cursor() as cursor:
            # `rank` is a reserved word in MySQL 8.0+, so it must be backtick-quoted
            sql = """CREATE TABLE IF NOT EXISTS hot_search (
                id INT AUTO_INCREMENT PRIMARY KEY,
                `rank` INT,
                keyword VARCHAR(100),
                hot_score INT,
                create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )"""
            cursor.execute(sql)
            insert_sql = ("INSERT INTO hot_search (`rank`, keyword, hot_score) "
                          "VALUES (%s, %s, %s)")
            for item in data:
                cursor.execute(insert_sql,
                               (item['rank'], item['keyword'], item['hot_score']))
        conn.commit()
    finally:
        conn.close()
```
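Before pointing the function above at a live MySQL server, the same insert-and-read-back flow can be rehearsed locally with the standard library's `sqlite3` module — a stand-in used here for illustration only, not part of the project's stack. Note the placeholder syntax differs (`?` for sqlite3 versus `%s` for PyMySQL), and `rank` is quoted because it is a keyword in some SQL dialects:

```python
# Local rehearsal of the storage flow with stdlib sqlite3 standing in
# for MySQL. An in-memory database keeps the check self-contained.
import sqlite3

data = [
    {'rank': 1, 'keyword': 'topic A', 'hot_score': 1234567},
    {'rank': 2, 'keyword': 'topic B', 'hot_score': 765432},
]

conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE IF NOT EXISTS hot_search (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    "rank" INTEGER,
    keyword TEXT,
    hot_score INTEGER
)""")
conn.executemany(
    'INSERT INTO hot_search ("rank", keyword, hot_score) VALUES (?, ?, ?)',
    [(d['rank'], d['keyword'], d['hot_score']) for d in data]
)
conn.commit()
rows = conn.execute(
    'SELECT keyword, hot_score FROM hot_search ORDER BY "rank"'
).fetchall()
print(rows)
conn.close()
```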
## Data Visualization
### 5.1 Basic Charts with Matplotlib

```python
import matplotlib.pyplot as plt

# the keyword labels are Chinese, so pick a CJK-capable font
# (SimHei ships with Windows; substitute any installed CJK font)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

def plot_top10(data):
    top10 = sorted(data, key=lambda x: x['hot_score'], reverse=True)[:10]
    keywords = [x['keyword'] for x in top10]
    scores = [x['hot_score'] for x in top10]

    plt.figure(figsize=(12, 6))
    bars = plt.barh(keywords, scores, color='#FF6B81')
    plt.xlabel('Hot score')
    plt.title('Weibo Hot Search Top 10')
    # annotate each bar with its value
    for bar in bars:
        width = bar.get_width()
        plt.text(width, bar.get_y() + bar.get_height() / 2, f'{width:,}')
    plt.tight_layout()
    plt.savefig('top10.png', dpi=300)
```
### 5.2 Interactive Visualization with Pyecharts

```python
from pyecharts.charts import Bar
from pyecharts import options as opts

def create_interactive_chart(data):
    top20 = sorted(data, key=lambda x: x['hot_score'], reverse=True)[:20]
    bar = (
        Bar()
        .add_xaxis([x['keyword'] for x in top20])
        .add_yaxis("Hot score", [x['hot_score'] for x in top20])
        .reversal_axis()  # horizontal bars read better for long keywords
        .set_global_opts(
            title_opts=opts.TitleOpts(title="Weibo Hot Search Top 20"),
            xaxis_opts=opts.AxisOpts(name="Hot score"),
            yaxis_opts=opts.AxisOpts(
                name="Keyword",
                axislabel_opts=opts.LabelOpts(font_size=8)
            )
        )
    )
    return bar.render("hot_search.html")
```
### 5.3 Generating a Word Cloud

```python
from wordcloud import WordCloud
import jieba
import matplotlib.pyplot as plt

def generate_wordcloud(data):
    text = ' '.join(x['keyword'] for x in data)
    word_list = jieba.cut(text)  # segment the Chinese keywords
    word_str = ' '.join(word_list)
    wc = WordCloud(
        font_path='msyh.ttc',  # replace with a CJK font path on your system
        background_color='white',
        width=800,
        height=600
    ).generate(word_str)
    plt.imshow(wc)
    plt.axis('off')
    plt.savefig('wordcloud.png', dpi=300)
```
## Complete Code Example
(For brevity, only the core driver is shown here; a full version should add richer exception handling, logging, and so on.)

```python
# weibo_hot_crawler.py
import logging

def main():
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s: %(message)s'
    )
    try:
        html = fetch_weibo_hot()
        if html:
            data = parse_hot_data(html)
            # storage
            save_to_csv(data)
            save_to_mysql(data)
            # visualization
            plot_top10(data)
            create_interactive_chart(data)
            generate_wordcloud(data)
            logging.info(f"Processed {len(data)} hot-search entries")
    except Exception as e:
        logging.error(f"Crawler failed: {e}")

if __name__ == '__main__':
    main()
```
## Common Issues and Fixes
- **Anti-scraping measures:** Weibo only serves the full list to logged-in sessions, so the request must carry a valid `Cookie` alongside a realistic `User-Agent`, and requests should be spaced out.
- **Missing data:** some rows (e.g. pinned entries) carry no hot-score `<span>`; the parser skips the first row and falls back to a score of 0 rather than crashing.
- **Encoding problems:** write CSVs with `encoding='utf_8_sig'` so the BOM lets Excel display the Chinese keywords correctly.
- **Visualization rendering:** Matplotlib and WordCloud both need a font containing CJK glyphs (e.g. `msyh.ttc`); point them at a font file that actually exists on your system.
- **Database connection:** the `pymysql.connect` parameters must match your environment, and the `weibo_data` database must already exist before the table can be created.
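When requests start failing because of anti-scraping measures, one common mitigation is to retry with exponential backoff. A minimal generic sketch follows (the helper and the stub are illustrative, not from the original project); `sleep` is injectable so the logic can be exercised without real waiting:

```python
# Generic retry-with-exponential-backoff helper. fetch_weibo_hot (or any
# zero-argument callable) can be passed as fn.
import time

def retry_with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            sleep(base_delay * (2 ** attempt))

# Demonstration with a stub that fails twice, then succeeds.
calls = {'n': 0}
def flaky_fetch():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated block')
    return '<html>ok</html>'

delays = []
result = retry_with_backoff(flaky_fetch, retries=5, base_delay=0.5,
                            sleep=delays.append)
print(result, delays)
```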
By working through this project, readers not only pick up the core skills of Python scraping and visualization, but also end up with the skeleton of a complete public-opinion monitoring system. Collect data only in a lawful, compliant manner, and observe Weibo's terms of service.
Notes for running the code:
1. Weibo requires login to serve the full list — replace the placeholder with a real Cookie.
2. Adjust the MySQL connection parameters to your environment.
3. Point the font path at a font that exists on your system.
4. Add appropriate delays to avoid requesting too frequently.
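Point 4 above (spacing out requests) can be folded into a simple polling loop. A sketch with injectable `fetch`, `handle`, and `sleep` callables (all names hypothetical) keeps the interval logic explicit and testable:

```python
# Periodic crawl driver: fetch, hand off for processing, then wait
# `interval` seconds so requests stay infrequent. sleep is injectable
# so the loop can be exercised without real waiting.
import time

def poll(fetch, handle, interval=300, rounds=3, sleep=time.sleep):
    for i in range(rounds):
        handle(fetch())
        if i < rounds - 1:  # no pointless wait after the last round
            sleep(interval)

# Demonstration with stubs in place of the real crawler functions.
seen = []
waits = []
poll(lambda: 'payload', seen.append, interval=60, rounds=3,
     sleep=waits.append)
print(seen, waits)
```

In real use, `fetch` would be `fetch_weibo_hot` and `handle` the parse-store-visualize pipeline from the complete code example.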