This article shows how to scrape the Zhihu hot list asynchronously in Python. Many readers are probably not yet familiar with the technique, so the steps and code are shared here for reference.

    import asyncio

    import aiohttp
    from bs4 import BeautifulSoup

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'referer': 'https://www.baidu.com/s?tn=02003390_43_hao_pg&isource=infinity&iname=baidu&itype=web&ie=utf-8&wd=%E7%9F%A5%E4%B9%8E%E7%83%AD%E6%A6%9C'
    }


    async def getPages(url):
        async with aiohttp.ClientSession(headers=headers) as session:
            async with session.get(url) as resp:
                print(resp.status)  # print the HTTP status code
                html = await resp.text()
        soup = BeautifulSoup(html, 'lxml')
        for item in soup.select('.HotList-item'):
            title = item.select('.HotList-itemTitle')[0].text
            try:
                abstract = item.select('.HotList-itemExcerpt')[0].text
            except IndexError:
                abstract = 'No Abstract'
            hot = item.select('.HotList-itemMetrics')[0].text
            # select() returns a list, so indexing it with 'src' always failed;
            # select_one() returns a single tag, or None when nothing matches.
            img_tag = item.select_one('.HotList-itemImgContainer img')
            img = img_tag.get('src', 'No Img') if img_tag is not None else 'No Img'
            print("{}\n{}\n{}".format(title, abstract, img))


    if __name__ == '__main__':
        url = 'https://www.zhihu.com/billboard'
        # asyncio.run() creates the event loop, runs the coroutine and closes the loop.
        asyncio.run(getPages(url))

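Inspecting the page shows that the detail link, the image link, the question excerpt and similar fields are all stored in the JavaScript embedded in the page (the CSDN developer assistant browser plugin is really handy for this).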

A regular expression can pull this information out of the page source:
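
As a minimal sketch of that idea, the snippet below runs the same pattern used in the full code against a made-up, heavily shortened page fragment. The `sample_html` string and the `example.com` link are assumptions for illustration only; the real page source is far larger and its layout may differ.

```python
import json
import re

# Hypothetical, shortened stand-in for the page source (not real Zhihu data).
sample_html = (
    '<script>{"hotList":[{"target":{"titleArea":{"text":"Example question"},'
    '"link":{"url":"https://example.com/question/1"}}}],"guestFeeds":[]}</script>'
)

# Capture everything between the "hotList" key and the "guestFeeds" key.
pattern = re.compile(r'"hotList":(.*?),"guestFeeds":')
match = pattern.search(sample_html)

# The captured text is a JSON array, so json.loads() turns it into a list of dicts.
hot_list = json.loads(match.group(1)) if match else []

for entry in hot_list:
    print(entry['target']['titleArea']['text'], entry['target']['link']['url'])
```

The non-greedy `(.*?)` stops at the first `,"guestFeeds":`, so this only works as long as that key directly follows the list in the page source.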

Here is the full code:

    import asyncio
    import json
    import re

    import aiohttp

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'referer': 'https://www.baidu.com/s?tn=02003390_43_hao_pg&isource=infinity&iname=baidu&itype=web&ie=utf-8&wd=%E7%9F%A5%E4%B9%8E%E7%83%AD%E6%A6%9C'
    }


    async def getPages(url):
        async with aiohttp.ClientSession(headers=headers) as session:
            async with session.get(url) as resp:
                print(resp.status)  # print the HTTP status code
                html = await resp.text()
        # The hot list is embedded in the page's JavaScript as JSON.
        regex = re.compile(r'"hotList":(.*?),"guestFeeds":')
        match = regex.search(html)
        if match is None:
            print('hotList data not found in the page')
            return
        # json.loads() turns the captured JSON text into a list of dicts.
        for item in json.loads(match.group(1)):
            title = item['target']['titleArea']['text']
            question = item['target']['excerptArea']['text']
            hot = item['target']['metricsArea']['text']
            link = item['target']['link']['url']
            img = item['target']['imageArea']['url']
            if not img:
                img = 'No Img'
            if not question:
                question = 'No Abstract'
            print("Title:{}\nPopular:{}\nQuestion:{}\nLink:{}\nImg:{}".format(title, hot, question, link, img))


    if __name__ == '__main__':
        url = 'https://www.zhihu.com/billboard'
        # asyncio.run() creates the event loop, runs the coroutine and closes the loop.
        asyncio.run(getPages(url))
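
The script above awaits a single request, so it gains little from being asynchronous. As a rough sketch of where the async style pays off, the example below runs several requests concurrently with `asyncio.gather`. The `fake_fetch` coroutine, the `fetch_all` helper and the `example.com` URLs are hypothetical stand-ins; in a real scraper `fake_fetch` would wrap a `session.get(...)` call like the one in `getPages`.

```python
import asyncio


async def fake_fetch(url, delay=0.1):
    # Hypothetical stand-in for an HTTP request: just sleeps, then reports success.
    await asyncio.sleep(delay)
    return url, 200


async def fetch_all(urls):
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


if __name__ == '__main__':
    urls = ['https://example.com/a', 'https://example.com/b']
    for url, status in asyncio.run(fetch_all(urls)):
        print(url, status)
```

Because the waits overlap, the total time stays close to the slowest single request rather than the sum of all of them.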