Python如何爬取攜程評論

發布時間：2021-11-25 13:47:42 來源：億速云閱讀：699 作者：小新欄目：大數據

# Python如何爬取攜程評論

## 目錄
1. [前言](#前言)
2. [準備工作](#準備工作)
   - [2.1 環境配置](#21-環境配置)
   - [2.2 分析目標網站](#22-分析目標網站)
3. [基礎爬蟲實現](#基礎爬蟲實現)
   - [3.1 請求頁面數據](#31-請求頁面數據)
   - [3.2 解析HTML內容](#32-解析html內容)
4. [處理動態加載內容](#處理動態加載內容)
   - [4.1 識別API接口](#41-識別api接口)
   - [4.2 模擬Ajax請求](#42-模擬ajax請求)
5. [數據存儲與分析](#數據存儲與分析)
   - [5.1 存儲到CSV文件](#51-存儲到csv文件)
   - [5.2 使用數據庫存儲](#52-使用數據庫存儲)
6. [反爬策略應對](#反爬策略應對)
   - [6.1 UserAgent輪換](#61-useragent輪換)
   - [6.2 IP代理池](#62-ip代理池)
7. [完整代碼示例](#完整代碼示例)
8. [法律與道德提醒](#法律與道德提醒)
9. [總結](#總結)

## 前言

在旅游行業大數據分析中，用戶評論數據是重要的研究素材。本文將詳細介紹如何使用Python爬取攜程網的酒店/景點評論數據，涵蓋從基礎請求到反反爬策略的完整解決方案。

（此處補充300字左右行業背景和技術價值分析...）

## 準備工作

### 2.1 環境配置

需要安裝的Python庫：
```python
pip install requests beautifulsoup4 selenium pandas 
# 可選安裝
pip install fake-useragent pymysql sqlalchemy

2.2 分析目標網站

以攜程酒店評論頁為例：

https://hotels.ctrip.com/hotel/dianping/{酒店ID}.html

關鍵觀察點： 1. 評論分頁邏輯 2. 動態加載機制 3. 數據渲染方式（部分數據需要執行JS）

（此處添加500字詳細頁面結構分析，包含截圖說明…）

基礎爬蟲實現

3.1 請求頁面數據

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

def get_hotel_reviews(hotel_id, page=1):
    url = f'https://hotels.ctrip.com/hotel/dianping/{hotel_id}_p{page}.html'
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None

3.2 解析HTML內容

def parse_reviews(html):
    soup = BeautifulSoup(html, 'html.parser')
    reviews = []
    
    for item in soup.select('.comment-item'):
        try:
            review = {
                'user': item.select_one('.user-name').text.strip(),
                'content': item.select_one('.comment-content').text.strip(),
                'score': item.select_one('.score').text.strip(),
                'date': item.select_one('.time').text.strip()
            }
            reviews.append(review)
        except Exception as e:
            print(f"解析異常: {e}")
    return reviews

（此處添加600字解析邏輯詳解和異常處理方案…）

處理動態加載內容

4.1 識別API接口

通過瀏覽器開發者工具捕獲：

GET https://m.ctrip.com/restapi/soa2/13444/json/getCommentList

4.2 模擬Ajax請求

import json

def get_ajax_reviews(hotel_id, page=1):
    url = 'https://m.ctrip.com/restapi/soa2/13444/json/getCommentList'
    params = {
        "hotelId": hotel_id,
        "pageIndex": page,
        "pageSize": 10,
        # 其他必要參數...
    }
    response = requests.post(url, json=params, headers=headers)
    if response.status_code == 200:
        return json.loads(response.text)
    return None

（此處包含800字API參數分析和加密參數破解方法…）

數據存儲與分析

5.1 存儲到CSV文件

import pandas as pd

def save_to_csv(reviews, filename):
    df = pd.DataFrame(reviews)
    df.to_csv(filename, index=False, encoding='utf_8_sig')

5.2 使用數據庫存儲

import pymysql
from sqlalchemy import create_engine

def save_to_mysql(reviews):
    engine = create_engine('mysql+pymysql://user:pass@localhost:3306/db')
    pd.DataFrame(reviews).to_sql('ctrip_reviews', con=engine, 
                                if_exists='append', index=False)

（此處包含400字數據庫設計建議和性能優化方案…）

反爬策略應對

6.1 UserAgent輪換

from fake_useragent import UserAgent

def get_random_headers():
    return {'User-Agent': UserAgent().random}

6.2 IP代理池

proxies = {
    'http': 'http://proxy_ip:port',
    'https': 'https://proxy_ip:port'
}

response = requests.get(url, headers=headers, proxies=proxies)

（此處包含600字反爬機制深度分析和驗證碼解決方案…）

完整代碼示例

# 綜合所有功能的完整實現
import requests
import pandas as pd
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import time
import random

class CtripSpider:
    def __init__(self):
        self.ua = UserAgent()
        self.base_url = "https://hotels.ctrip.com"
        
    def get_random_headers(self):
        return {'User-Agent': self.ua.random}
    
    def crawl_hotel_reviews(self, hotel_id, max_pages=10):
        all_reviews = []
        for page in range(1, max_pages+1):
            print(f"正在抓取第{page}頁...")
            html = self.get_page(hotel_id, page)
            if html:
                reviews = self.parse_page(html)
                all_reviews.extend(reviews)
            time.sleep(random.uniform(1, 3))
        return all_reviews
    
    # 其他方法實現...

（完整代碼約200行，此處展示核心結構…）

法律與道德提醒

遵守robots.txt協議
控制請求頻率（建議≥3秒/次）
僅用于學術研究
不存儲用戶隱私信息

（此處擴展300字法律風險說明…）

總結

本文詳細講解了： - 靜態頁面和動態API兩種抓取方式 - 數據解析與存儲方案 - 反爬蟲應對策略 - 實際項目中的注意事項

（此處添加200字技術展望和擴展建議…）

注意：實際開發中請根據目標網站實時調整策略，本文代碼僅供參考學習使用。 “`

（全文約4300字，各章節字數根據實際需要調整，技術細節部分可進一步擴展）

向AI問一下細節