Published: 2021-10-11 18:36:16 | Source: 億速云 | Views: 219 | Author: 柒染 | Category: Big Data

# How to Scrape 100 Pages of Python Books from JD.com with Selenium and Save Them to CSV

## 1. Environment Setup

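First, install the required Python libraries:
```shell
pip install selenium pandas
```

You also need to download the WebDriver that matches your browser (for Chrome, that is chromedriver) and add its location to the system PATH environment variable. Selenium 4.6 and later can also locate a suitable driver automatically through Selenium Manager.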

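## 2. Setting Up the Basic Scraper

```python
import random  # used for the random delays in section 6
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

# Start a Chrome session (chromedriver must be discoverable, e.g. on PATH)
driver = webdriver.Chrome()
# JD search URL template; {} is replaced with the page number
base_url = "https://search.jd.com/Search?keyword=Python&page={}&s=1&click=0"
```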
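The hard-coded template only covers one keyword. As a minimal sketch, the URL could instead be assembled with the standard library so that non-ASCII keywords are percent-encoded. `build_search_url` and its `odd_page_numbering` flag are hypothetical names, not part of the original script. The flag encodes an assumption that JD may number successive result pages 1, 3, 5, ..., which should be verified in a browser before relying on it.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://search.jd.com/Search"  # same endpoint as base_url


def build_search_url(keyword, page, odd_page_numbering=False):
    """Return a JD search URL for a keyword and a 1-based page index.

    odd_page_numbering is an assumption: when True, visible page n is
    requested as page=2n-1. Check the site's actual behaviour first.
    """
    if page < 1:
        raise ValueError("page must be >= 1")
    page_param = 2 * page - 1 if odd_page_numbering else page
    # urlencode percent-encodes the keyword and keeps the parameter order
    query = urlencode({"keyword": keyword, "page": page_param, "s": 1, "click": 0})
    return f"{SEARCH_ENDPOINT}?{query}"
```

With the default arguments, `build_search_url("Python", 1)` reproduces the URL that `base_url.format(1)` yields.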

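## 3. Page Data Extraction Logic

### 1. Simulating Page Scrolling

JD's book listing pages load their content dynamically, so the scraper has to simulate scrolling before the items can be read:

```python
def scroll_page():
    """Scroll down in four 500-pixel steps so lazily loaded items render."""
    for i in range(1, 5):
        driver.execute_script(f"window.scrollTo(0, {i*500})")
        time.sleep(0.5)  # give each batch of items time to load
```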
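Four fixed steps of 500 pixels may stop short of the bottom on a long page. One possible alternative, sketched here under the assumption that the page grows in height as more items load, is to keep scrolling until the document height stops changing. `scroll_to_bottom`, `pause`, and `max_rounds` are hypothetical names; the function takes the driver as a parameter instead of using the global one.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_rounds=20):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scroll steps performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # wait for lazily loaded items to be appended
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no  # nothing new was loaded, so this is the bottom
        last_height = new_height
    return max_rounds
```

The `max_rounds` cap keeps the loop from running forever on a page with infinite scrolling.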

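### 2. Extracting the Key Fields

Each book's details are located with XPath:

```python
from selenium.common.exceptions import NoSuchElementException


def parse_page():
    """Return [title, price] pairs for every product on the current page."""
    books = []
    # contains() also matches items that carry extra CSS classes
    items = driver.find_elements(By.XPATH, '//div[@id="J_goodsList"]//li[contains(@class, "gl-item")]')

    for item in items:
        try:
            title = item.find_element(By.XPATH, './/div[contains(@class, "p-name")]/a/em').text
            price = item.find_element(By.XPATH, './/div[@class="p-price"]//i').text
        except NoSuchElementException:
            continue  # skip entries that lack a title or a price
        books.append([title, price])
    return books
```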
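The `.text` values above are strings, which is fine for a CSV but awkward for sorting or averaging. A minimal sketch of a converter follows, assuming the price text looks like `59.80` or `¥59.80` and may be empty when the price has not loaded. `parse_price` is a hypothetical helper, not part of the original script.

```python
import re
from decimal import Decimal

_PRICE_RE = re.compile(r"\d+(?:\.\d+)?")


def parse_price(text):
    """Return the first number in text as a Decimal, or None if there is none."""
    # Drop thousands separators before searching for the number
    match = _PRICE_RE.search(text.replace(",", ""))
    return Decimal(match.group()) if match else None
```

`Decimal` is used instead of `float` so that prices keep their exact two-digit form.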

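## 4. The Complete Scraping Loop

```python
all_books = []
for page in range(1, 101):  # scrape 100 pages
    driver.get(base_url.format(page))
    scroll_page()
    all_books.extend(parse_page())
    print(f"Finished scraping page {page}")
    time.sleep(2)  # pause between pages to avoid triggering anti-scraping measures
```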
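As written, the loop stops at the first exception, for example a page that times out. One way to make it more tolerant, sketched here as an assumption and not as the author's method, is to wrap the per-page work in a retry helper. `fetch_with_retry` and its parameters are hypothetical; `fetch_page` stands for any callable that takes a page number and returns a list of rows.

```python
import time


def fetch_with_retry(fetch_page, page, retries=3, backoff=2.0, sleep=time.sleep):
    """Call fetch_page(page), making up to `retries` attempts.

    Returns the fetched rows, or an empty list if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch_page(page)
        except Exception as exc:  # broad on purpose: timeouts, stale elements, ...
            print(f"Page {page}, attempt {attempt}/{retries} failed: {exc}")
            if attempt < retries:
                sleep(backoff * attempt)  # wait longer after each failure
    return []
```

In the loop, the three per-page calls (`driver.get`, `scroll_page`, `parse_page`) could be moved into one function and passed in as `fetch_page`, so a failed page costs a few retries instead of the whole run.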

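## 5. Storing the Data

Save the results as CSV with pandas:

```python
df = pd.DataFrame(all_books, columns=["Title", "Price"])
# utf_8_sig writes a BOM so that Excel displays Chinese text correctly
df.to_csv("jd_python_books.csv", index=False, encoding='utf_8_sig')
driver.quit()  # close the browser once the data is saved
```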
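Writing the file only once at the end means that a crash late in the run loses every page scraped so far. A minimal sketch of an alternative, using the standard `csv` module to append each page's rows as they arrive, is shown below. `append_rows` and its default header are hypothetical and not part of the original script.

```python
import csv
import os


def append_rows(path, rows, header=("Title", "Price")):
    """Append rows to a CSV file, writing the header only when the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # utf-8-sig adds a BOM to a new file; plain utf-8 avoids repeating it later
    encoding = "utf-8-sig" if is_new else "utf-8"
    with open(path, "a", newline="", encoding=encoding) as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)
```

Calling `append_rows("jd_python_books.csv", parse_page())` once per page would replace the single `to_csv` call at the end.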

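## 6. Anti-Scraping Countermeasures

- Random delays: `time.sleep(random.uniform(1, 3))` (requires `import random`)
- Use proxy IPs
- Set request headers:

```python
options = webdriver.ChromeOptions()
options.add_argument('user-agent=Mozilla/5.0')
# The options only take effect if they are passed when the driver is created
driver = webdriver.Chrome(options=options)
```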
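The random delay from the first item can be wrapped in a small helper so that every request goes through the same throttle. This is only a sketch; `polite_sleep` and its parameters are hypothetical names, and the 1 to 3 second range is taken from the list above.

```python
import random
import time


def polite_sleep(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    if min_seconds > max_seconds:
        raise ValueError("min_seconds must not exceed max_seconds")
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Replacing the fixed `time.sleep(2)` in the main loop with `polite_sleep()` makes the request timing less regular.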

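## 7. Caveats

- JD's page structure may change, so the XPath expressions need periodic maintenance.
- For large-scale scraping, consider a distributed architecture.
- Follow the site's robots.txt and keep the request rate under control.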
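Checking robots.txt can be automated with the standard library. The sketch below parses a set of rules and asks whether a URL may be fetched. The rules in `SAMPLE_ROBOTS_TXT` are hypothetical and for illustration only, they are not JD's actual file, and `is_allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; read the real file before crawling.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads and parses the real file instead of a string.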

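The complete code is about 80 lines, and scraping all 100 pages takes roughly 30 to 60 minutes in practice. Run it during off-peak hours and add exception handling to keep the crawl stable.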
