How to cancel and resume tasks in a multithreaded Python crawler

python

小樊

2024-12-12 03:36:51

Category: Programming Languages

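In Python, you can use the threading module to build a multithreaded crawler. To support cancelling and resuming tasks, you can use a threading.Event object. An Event carries a signal between threads, such as a request to cancel or resume work. The signal is cooperative: a thread only reacts to it if its own code checks the event.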
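Before looking at a crawler, it may help to see the Event signal on its own. The following is a minimal sketch, not part of the original answer; the names `worker` and `run_demo` are hypothetical. It shows a thread that keeps working until another thread calls `set()`, and shows that `clear()` resets the flag so the same Event can be reused.

```python
import threading
import time


def worker(stop_event, ticks):
    # Keep working until another thread sets the event.
    while not stop_event.is_set():
        ticks.append(time.monotonic())
        # wait() acts as an interruptible sleep: it returns early once set() is called.
        stop_event.wait(0.01)


def run_demo():
    stop_event = threading.Event()
    ticks = []
    thread = threading.Thread(target=worker, args=(stop_event, ticks))
    thread.start()
    time.sleep(0.05)      # let the worker run briefly
    stop_event.set()      # signal the worker to stop
    thread.join()
    stop_event.clear()    # reset the flag so the event can be reused
    return thread, stop_event, ticks


if __name__ == "__main__":
    thread, stop_event, ticks = run_demo()
    print(f"worker alive: {thread.is_alive()}, iterations: {len(ticks)}")
```

The worker never gets interrupted from outside; it stops only because its loop checks `is_set()` on every pass.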

The following simple example shows how to use threading.Event to cancel and resume tasks:

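import threading
import time

import requests
from bs4 import BeautifulSoup

class WebCrawler:
    def __init__(self, urls, event):
        self.urls = urls
        self.event = event  # set = cancel requested, clear = free to run
        self.threads = []
        self.done = set()  # URLs that have been crawled successfully
        self.lock = threading.Lock()  # protects self.done

    def crawl(self, url):
        # Cancellation is cooperative: check the event before doing any work.
        if self.event.is_set():
            print(f"Skipped {url} (cancelled)")
            return
        try:
            # A timeout keeps a stalled request from blocking the thread forever.
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                # Parse the page; extract whatever you need from soup here.
                soup = BeautifulSoup(response.text, 'html.parser')
                with self.lock:
                    self.done.add(url)
                print(f"Crawled {url}")
            else:
                print(f"Failed to crawl {url}")
        except Exception as e:
            print(f"Error while crawling {url}: {e}")

    def start(self):
        # Launch one thread per URL that has not been crawled yet, so that
        # calling start() again after resume() picks up the remaining work.
        with self.lock:
            pending = [url for url in self.urls if url not in self.done]
        for url in pending:
            thread = threading.Thread(target=self.crawl, args=(url,))
            thread.start()
            self.threads.append(thread)

    def cancel(self):
        self.event.set()

    def resume(self):
        self.event.clear()

    def join(self):
        for thread in self.threads:
            thread.join()

if __name__ == "__main__":
    urls = [
        "https://www.example.com",
        "https://www.example2.com",
        "https://www.example3.com"
    ]

    event = threading.Event()
    crawler = WebCrawler(urls, event)

    # Start crawling
    crawler.start()

    # Wait for a while and cancel the task
    time.sleep(5)
    crawler.cancel()

    # Wait for all threads to finish
    crawler.join()

    # Resume: clear the event and crawl any URLs that are still pending
    crawler.resume()
    crawler.start()
    crawler.join()

In this example, we create a class named WebCrawler that takes a list of URLs and an Event object. The crawl method fetches one URL, but first checks the event and returns immediately if cancellation has been requested. The start method launches a thread for every URL that has not yet been crawled successfully, cancel sets the event to request cancellation, and resume clears the event so that work can continue. The join method waits for all threads to finish.

To use the class, create a WebCrawler instance and pass in the URL list and an Event object. Call start to begin crawling and cancel to stop threads that have not yet begun their request. A request that is already in flight is not interrupted; it either completes or times out. To resume, call resume and then start again, which crawls only the URLs that are still pending.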
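With one thread per URL, every request starts almost at once, so a cancel signal rarely arrives in time to skip anything. One possible alternative, sketched below under stated assumptions, is a small worker pool that pulls URLs from a queue and uses two events: one to pause and resume, one to cancel for good. The class `PausableCrawler`, its `pause` method, and the injected `fetch` callable are hypothetical names introduced for illustration and are not part of the original answer.

```python
import queue
import threading


class PausableCrawler:
    """Hypothetical worker-pool crawler with pause, resume and cancel."""

    def __init__(self, urls, fetch, num_workers=3):
        self.fetch = fetch                  # callable(url) -> result, supplied by the caller
        self.tasks = queue.Queue()
        for url in urls:
            self.tasks.put(url)
        self.results = {}
        self.lock = threading.Lock()        # protects self.results
        self.stop = threading.Event()       # set -> workers exit permanently
        self.running = threading.Event()    # set -> run, clear -> paused
        self.running.set()
        self.threads = [
            threading.Thread(target=self._work) for _ in range(num_workers)
        ]

    def _work(self):
        while not self.stop.is_set():
            # While paused, block here; wake up periodically to notice a cancel.
            if not self.running.wait(timeout=0.1):
                continue
            try:
                url = self.tasks.get_nowait()
            except queue.Empty:
                return                      # no work left
            try:
                result = self.fetch(url)
            except Exception as exc:        # record the failure instead of crashing
                result = exc
            with self.lock:
                self.results[url] = result

    def start(self):
        for thread in self.threads:
            thread.start()

    def pause(self):
        self.running.clear()

    def resume(self):
        self.running.set()

    def cancel(self):
        self.stop.set()

    def join(self):
        for thread in self.threads:
            thread.join()
```

In a real crawler, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`. Pausing takes effect between URLs: a worker that has already taken a URL from the queue finishes it before it blocks, and URLs left in the queue stay there until `resume()` is called.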
