In Python, network code built on the requests library can be optimized for performance in the following ways:
Connection pooling: mount an HTTPAdapter on a Session so connections are reused across requests. pool_connections sets how many per-host connection pools the adapter caches, and pool_maxsize caps the number of connections kept in each pool.

import requests
from urllib3.util.retry import Retry  # requests.packages.urllib3 also works, but is a legacy alias
from requests.adapters import HTTPAdapter

session = requests.Session()
adapter = HTTPAdapter(max_retries=Retry(total=3), pool_connections=100, pool_maxsize=100)
session.mount('http://', adapter)
session.mount('https://', adapter)
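
Once the adapter is mounted, every request made through the session draws from the pool; a minimal usage sketch, with http://example.com standing in for a real target:

# Each call reuses a pooled connection instead of opening a new one
for _ in range(10):
    response = session.get('http://example.com')
    print(response.status_code)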
Multithreading: use ThreadPoolExecutor from the concurrent.futures module to build a multithreaded crawler. Several requests are then in flight at once, which improves throughput for I/O-bound work.

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(url):
    response = requests.get(url)
    return response.text

urls = ['http://example.com'] * 10
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, urls))
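
The pooling and threading techniques combine naturally. Since requests.Session is not documented as thread-safe, one common pattern, sketched below as an option rather than the only approach, is to give each worker thread its own Session via threading.local:

import threading
import requests
from concurrent.futures import ThreadPoolExecutor

thread_local = threading.local()

def get_session():
    # Lazily create one Session per worker thread so its connections are reused
    if not hasattr(thread_local, 'session'):
        thread_local.session = requests.Session()
    return thread_local.session

def fetch(url):
    session = get_session()
    return session.get(url).text

urls = ['http://example.com'] * 10
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, urls))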
Asynchronous I/O: use the asyncio and aiohttp libraries to build an asynchronous crawler. While one request is waiting on the server, the event loop runs other tasks, which improves performance.

import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = ['http://example.com'] * 10
    # Share one ClientSession across all requests rather than creating one per fetch
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)

results = asyncio.run(main())  # replaces the older get_event_loop()/run_until_complete() pattern
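
An unbounded gather over many URLs can open too many sockets at once. A minimal sketch of capping concurrency with asyncio.Semaphore, where the limit of 10 is an illustrative value:

import aiohttp
import asyncio

async def fetch(session, semaphore, url):
    # The semaphore caps how many requests are in flight at any moment
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()

async def main():
    urls = ['http://example.com'] * 100
    semaphore = asyncio.Semaphore(10)  # at most 10 concurrent requests
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, url) for url in urls]
        return await asyncio.gather(*tasks)

results = asyncio.run(main())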
Caching: store fetched pages locally with a timestamp, so repeated requests for the same URL within an expiry window are served from disk instead of the network.

import requests
import time
import json

url = 'http://example.com'
cache_file = 'cache.json'

def save_cache(url, text):
    # Record the body together with a timestamp so the entry can expire
    with open(cache_file, 'w') as f:
        json.dump({'url': url, 'text': text, 'time': time.time()}, f)

def load_cache():
    try:
        with open(cache_file) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None

def get_response(url):
    cached = load_cache()
    # Serve from cache when the entry matches the URL and is under an hour old
    if cached and cached['url'] == url and time.time() - cached['time'] < 3600:
        return cached['text']
    response = requests.get(url)
    save_cache(url, response.text)
    return response.text
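
A short usage check, assuming the same url as above:

text = get_response(url)        # fetched over the network and written to cache.json
text_again = get_response(url)  # served from the cache file, no HTTP request made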
Rate limiting: pause between requests so the crawler does not overwhelm the server; the third-party ratelimit library can implement more advanced policies. A simple sleep-based version:

import time
import requests

url = 'http://example.com'

def rate_limited_request(url, delay=1):
    response = requests.get(url)
    time.sleep(delay)  # wait before the next request is allowed
    return response

for _ in range(10):
    response = rate_limited_request(url)
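
For the ratelimit library mentioned above, a minimal sketch, assuming the package is installed (pip install ratelimit); the 5-calls-per-10-seconds limit is an illustrative value:

import requests
from ratelimit import limits, sleep_and_retry

@sleep_and_retry              # sleep until the call is permitted instead of raising
@limits(calls=5, period=10)   # allow at most 5 calls per 10-second window
def fetch(url):
    return requests.get(url)

for _ in range(10):
    response = fetch('http://example.com')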
Taken together, these methods can substantially improve the performance of a Python crawler; in practice, choose the optimization strategies that match your workload.