# How to Build a Web Crawler in Python
## 1. Overview of Crawler Technology
A web crawler is a program that automatically fetches web page content; crawlers are widely used in search engines, data analysis, information aggregation, and similar fields. With its rich library ecosystem and concise syntax, Python is one of the languages of choice for building them.
### Core Components
1. **HTTP request libraries**: e.g. `requests`, `urllib`
2. **HTML parsing libraries**: e.g. `BeautifulSoup`, `lxml`
3. **Data storage modules**: e.g. `csv`, `sqlite3`
4. **Concurrency handling**: e.g. `asyncio`, the `Scrapy` framework (a concurrent-fetch sketch follows this list)
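As a taste of item 4, here is a minimal concurrent-fetch sketch using `asyncio`. It assumes the third-party `aiohttp` package (`pip install aiohttp`), and the URL list is a placeholder:

```python
import asyncio

import aiohttp  # third-party: pip install aiohttp

async def fetch(session, url):
    # issue the request and return the URL with its status code
    async with session.get(url) as response:
        return url, response.status

async def main():
    urls = ["https://example.com", "https://example.org"]  # placeholder URLs
    async with aiohttp.ClientSession() as session:
        # fire all requests concurrently and wait for every result
        results = await asyncio.gather(*(fetch(session, u) for u in urls))
    for url, status in results:
        print(url, status)

asyncio.run(main())
```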
---
## 2. Steps to Build a Basic Crawler
### 1. Environment Setup
Install the required packages:
```bash
pip install requests beautifulsoup4
```

### 2. Sending an HTTP Request
```python
import requests

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}  # basic User-Agent so the request looks like a browser
response = requests.get(url, headers=headers)
print(response.status_code)  # 200 means success
print(response.text[:500])   # print the first 500 characters
```

### 3. Parsing the HTML
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
titles = soup.find_all('h1')
for title in titles:
    print(title.get_text())
```

### 4. Storing the Data
```python
import csv

with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Link'])
    for link in soup.find_all('a'):
        writer.writerow([link.get_text(), link.get('href')])
```
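The component list above also mentions `sqlite3`; as a sketch, the same title/link pairs could go into a SQLite database instead of a CSV file (this reuses the `soup` object from step 3):

```python
import sqlite3

conn = sqlite3.connect('output.db')
conn.execute('CREATE TABLE IF NOT EXISTS links (title TEXT, href TEXT)')
rows = [(a.get_text(), a.get('href')) for a in soup.find_all('a')]
conn.executemany('INSERT INTO links VALUES (?, ?)', rows)  # parameterized inserts
conn.commit()
conn.close()
```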
## 3. Advanced Techniques
### Handling Dynamic Pages
For JavaScript-rendered pages, use selenium to drive a real browser:
```python
from selenium import webdriver

driver = webdriver.Chrome()  # launches a Chrome instance (Chrome must be installed)
driver.get("https://dynamic-site.com")
dynamic_content = driver.page_source  # HTML after JavaScript has run
driver.quit()
```
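Note that `page_source` read immediately after `get()` may not yet include content that JavaScript renders later; one common remedy is an explicit wait. A sketch, where the CSS selector is a placeholder:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://dynamic-site.com")
# wait up to 10 seconds for the JavaScript-rendered element to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.content"))
)
dynamic_content = driver.page_source
driver.quit()
```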
### Adding Random Delays
Pause between requests to reduce server load and avoid tripping anti-crawling defenses:
```python
import time
import random

time.sleep(random.uniform(1, 3))  # wait a random 1-3 seconds
```
### Using the Scrapy Framework
Create a project:
```bash
pip install scrapy
scrapy startproject myproject
```
Define a spider:
```python
import scrapy

class MySpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url
        }
```
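The spider can then be run from inside the project directory, exporting the scraped items (the output filename here is arbitrary):

```bash
scrapy crawl example -o results.json
```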
## 4. Legal and Compliance Notes
Before crawling a site, check its `/robots.txt` file (e.g. `https://example.com/robots.txt`) to see which paths crawlers are allowed to visit, and respect those rules.
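The standard library can check these rules programmatically; a minimal sketch, with placeholder URLs:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file
print(rp.can_fetch("Mozilla/5.0", "https://example.com/some/page"))  # True if allowed
```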
## 5. Complete Example
```python
import csv

import requests
from bs4 import BeautifulSoup

def simple_crawler():
    url = "https://books.toscrape.com/"
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()  # raise on 4xx/5xx responses
        soup = BeautifulSoup(response.text, 'lxml')  # requires: pip install lxml
        books = soup.select('article.product_pod')
        with open('books.csv', 'w', encoding='utf-8', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['Title', 'Price', 'Rating'])
            for book in books:
                title = book.h3.a['title']
                price = book.select('p.price_color')[0].get_text()
                rating = book.p['class'][1]  # e.g. 'Three' from class="star-rating Three"
                writer.writerow([title, price, rating])
        print("Scraping complete")
    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    simple_crawler()
```
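The example stops at the first page. Here is a sketch of following pagination, assuming the site's "next" link sits inside an `<li class="next">` element (true of books.toscrape.com at the time of writing):

```python
import random
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://books.toscrape.com/"
headers = {"User-Agent": "Mozilla/5.0"}
while url:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    for book in soup.select('article.product_pod'):
        print(book.h3.a['title'])
    next_link = soup.select_one('li.next a')  # None on the last page
    url = urljoin(url, next_link['href']) if next_link else None
    time.sleep(random.uniform(1, 3))  # polite delay between pages
```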
## 6. Common Issues
**SSL certificate errors.** Verification can be disabled, though this is discouraged outside of testing:
```python
requests.get(url, verify=False)  # not recommended in production
```
**Garbled text.** Set the response encoding explicitly before reading `response.text`:
```python
response.encoding = 'gbk'  # or 'utf-8', depending on the site
```
**Pages behind a login.** Use a session so cookies persist across requests:
```python
session = requests.Session()
session.post(login_url, data={'user': 'name', 'pass': 'word'})
```
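If certificate verification really must be disabled, urllib3 emits an `InsecureRequestWarning` on every request; it can be silenced explicitly (again, for testing only):

```python
import urllib3

# suppress the warning that verify=False triggers on each request
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
```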
## 7. Summary
Python crawlers can be built in many ways, from simple scripts to full frameworks. Recommendations:
1. Start with the basic requests + BeautifulSoup combination
2. Move on to frameworks such as Scrapy as your needs grow
3. Always comply with applicable laws and regulations
4. Keep an eye on how anti-crawling techniques evolve
Tip: in real-world development, add exception handling and logging to make the crawler more robust; a minimal logging setup is sketched below.
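For instance (the log file name is arbitrary):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    filename="crawler.log",  # arbitrary file name
)
logging.info("crawl started")
```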