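# How to Scrape Bilibili Video Danmaku with Python
## Introduction
Danmaku (bullet comments) are one of the most distinctive features of watching videos on Bilibili. These comments scroll across the screen in real time, which makes videos more interactive and also leaves behind a large amount of viewer feedback. For data analysts, content creators, and hobbyists, collecting this data makes it possible to analyze audience sentiment, trending topics, and more. This article walks through how to scrape the danmaku of a Bilibili video with Python.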
---
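## 1. Preparation
### 1.1 Understanding Bilibili's danmaku mechanism
Bilibili keeps a separate danmaku file for each video part, identified by a `cid`; the endpoint used in this article returns that file as XML. To fetch danmaku, first resolve the video's `bvid` (or `aid`) to a `cid`, then request the danmaku with that `cid`.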
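To make the `bvid` → `cid` → danmaku chain concrete, here is a minimal sketch that only builds the two request URLs and sends nothing. The endpoints are the ones this article uses; the helper names `pagelist_url` and `danmaku_url` and the cid `12345` are placeholders made up for illustration.

```python
from urllib.parse import urlencode

PAGELIST_API = "https://api.bilibili.com/x/player/pagelist"
DANMAKU_API = "https://api.bilibili.com/x/v1/dm/list.so"

def pagelist_url(bvid: str) -> str:
    """URL that resolves a bvid to its list of parts (each part has a cid)."""
    return f"{PAGELIST_API}?{urlencode({'bvid': bvid})}"

def danmaku_url(cid: int) -> str:
    """URL of the XML danmaku file for one cid."""
    return f"{DANMAKU_API}?{urlencode({'oid': cid})}"

print(pagelist_url("BV1GJ411x7h7"))
print(danmaku_url(12345))  # 12345 is a placeholder, not a real cid
```

Note that the danmaku endpoint takes the `cid` in a query parameter named `oid`.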
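### 1.2 Installing the required Python libraries

```bash
pip install requests beautifulsoup4 lxml
```

The storage, asynchronous, and analysis examples later in the article additionally import `pandas`, `pymysql`, `aiohttp`, `jieba`, `wordcloud`, and `matplotlib`.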
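## 2. Basic workflow
Taking the Bilibili video `BV1GJ411x7h7` as an example:

1. Get the video's `cid`
2. Request the danmaku endpoint with the `cid`
3. Parse and store the danmaku data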
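The three steps can be sketched as a single function before looking at each one in detail. This is only an outline under stated assumptions: `fetch` stands for any callable that returns a response body as text, and `fake_fetch` with its canned responses is invented here so the sketch runs without network access.

```python
import json
import re
from typing import Callable, List

def scrape_danmaku(bvid: str, fetch: Callable[[str], str]) -> List[str]:
    """Run the three steps using `fetch(url) -> response body as text`."""
    # Step 1: resolve the bvid to the cid of the first part.
    pagelist = json.loads(fetch(f"https://api.bilibili.com/x/player/pagelist?bvid={bvid}"))
    cid = pagelist["data"][0]["cid"]
    # Step 2: request the danmaku XML for that cid.
    xml_text = fetch(f"https://api.bilibili.com/x/v1/dm/list.so?oid={cid}")
    # Step 3: pull the text out of every <d p="...">...</d> element.
    return re.findall(r'<d p="[^"]*">(.*?)</d>', xml_text)

def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP client; returns canned (made-up) responses."""
    if "pagelist" in url:
        return json.dumps({"code": 0, "data": [{"cid": 1, "page": 1}]})
    return ('<i><d p="1.0,1,25,16777215,0,0,abc,1">hello</d>'
            '<d p="2.5,1,25,16777215,0,0,def,2">world</d></i>')

print(scrape_danmaku("BV1GJ411x7h7", fake_fetch))
```

Passing the HTTP call in as a parameter keeps the flow testable offline; in real use, `fetch` could be a thin wrapper around `requests.get(url).text`.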
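### 2.1 Getting the video's CID
Bilibili provides a public API for video information, and its response includes the `cid`. Build the request as follows:

```python
import requests

def get_cid(bvid):
    """Return the cid of the video's first part."""
    url = f"https://api.bilibili.com/x/player/pagelist?bvid={bvid}&jsonp=jsonp"
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        data = response.json()
        # 'data' is a list with one entry per part; take the first one
        return data['data'][0]['cid']
    else:
        raise Exception("Failed to get CID")

bvid = "BV1GJ411x7h7"
cid = get_cid(bvid)
print(f"CID: {cid}")
```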
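The lookup `data['data'][0]['cid']` fails with an unhelpful error if the API reports a problem or the list is empty. The sketch below is a more defensive variant that works on the already-decoded JSON. It assumes the response carries a `code` field that is `0` on success plus a `message` field; the helper name `extract_cid` and the sample payload are invented for illustration.

```python
from typing import Any, Dict

def extract_cid(payload: Dict[str, Any], page: int = 1) -> int:
    """Return the cid for the given 1-based part number."""
    # Assumption: a non-zero "code" signals an API-level error.
    if payload.get("code") != 0:
        raise ValueError(f"API error: {payload.get('message')}")
    parts = payload.get("data") or []
    if not 1 <= page <= len(parts):
        raise IndexError(f"video has {len(parts)} part(s), requested part {page}")
    return parts[page - 1]["cid"]

# Made-up payload in the shape the pagelist endpoint is assumed to return.
sample = {"code": 0, "message": "0",
          "data": [{"cid": 111, "page": 1}, {"cid": 222, "page": 2}]}
print(extract_cid(sample))
print(extract_cid(sample, page=2))
```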
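If the API is unavailable, the `cid` can be extracted from the video page instead:

```python
import re

import requests
from bs4 import BeautifulSoup

def get_cid_from_html(bvid):
    url = f"https://www.bilibili.com/video/{bvid}"
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'lxml')
    # Find the inline script that defines window.__playinfo__
    script = soup.find(
        "script", string=lambda t: t is not None and "window.__playinfo__" in t)
    if script is None:
        raise ValueError("playinfo script not found; the page layout may have changed")
    # Extract the cid with a regular expression
    match = re.search(r'"cid":(\d+)', script.string)
    if match is None:
        raise ValueError("cid not found in the page script")
    return int(match.group(1))
```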
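The regular-expression step can be tried without downloading a page. The sketch below applies the same idea to a string and returns `None` instead of raising when nothing matches. The helper name `find_cid_in_html` and the sample HTML are made up for illustration; real pages are far larger and their layout may differ.

```python
import re
from typing import Optional

# Allow optional whitespace around the colon, in case the JSON is formatted.
CID_PATTERN = re.compile(r'"cid"\s*:\s*(\d+)')

def find_cid_in_html(html: str) -> Optional[int]:
    """Return the first "cid" value embedded in the page source, or None."""
    match = CID_PATTERN.search(html)
    return int(match.group(1)) if match else None

# Hypothetical page fragment, for illustration only.
sample_html = '<html><script>window.__playinfo__={"cid":123456,"quality":80}</script></html>'
print(find_cid_in_html(sample_html))
print(find_cid_in_html("<html></html>"))
```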
### 2.2 Requesting the danmaku
Bilibili's danmaku endpoint is:

```
https://api.bilibili.com/x/v1/dm/list.so?oid={cid}
```

```python
import requests
from bs4 import BeautifulSoup

def get_danmaku(cid):
    url = f"https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"
    response = requests.get(url, timeout=10)
    response.encoding = 'utf-8'
    # Each danmaku is a <d> element; keep only its text
    soup = BeautifulSoup(response.text, 'lxml')
    danmus = soup.find_all('d')
    return [danmu.text for danmu in danmus]

danmaku_list = get_danmaku(cid)
print(f"Fetched {len(danmaku_list)} danmaku")
```
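Because the response is XML, the text can also be extracted with the standard library's `xml.etree.ElementTree`, which avoids an HTML parser and decodes entities such as `&amp;` automatically. This is an alternative sketch, not the method used above; the function name `danmaku_texts` and the sample document are invented for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

def danmaku_texts(xml_data: bytes) -> List[str]:
    """Return the text of every <d> element in a danmaku XML document."""
    root = ET.fromstring(xml_data)
    # d.text is None for an empty element, so fall back to "".
    return [d.text or "" for d in root.iter("d")]

# Made-up document in the same shape as the endpoint's response.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<i><d p="1.0,1,25,16777215,0,0,abc,1">前方高能</d>'
    '<d p="2.0,1,25,16777215,0,0,def,2">a &amp; b</d></i>'
).encode("utf-8")
print(danmaku_texts(sample))
```

With `requests`, the raw bytes to pass in would be `response.content`.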
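### 2.3 Parsing danmaku attributes
Each danmaku in the XML has the following form:

```xml
<d p="time in video,danmaku type,font size,color,send time,danmaku pool,user hash,database ID">danmaku text</d>
```

These attributes can be parsed further:

```python
import re

def parse_danmaku(xml_text):
    """Parse the raw danmaku XML into a list of dicts.

    Pass the XML text itself, not the list returned by get_danmaku(),
    which no longer contains the <d> tags.
    """
    pattern = r'<d p="(.*?)">(.*?)</d>'
    matches = re.findall(pattern, str(xml_text))
    result = []
    for m in matches:
        attrs = m[0].split(',')
        item = {
            'time': float(attrs[0]),     # position in the video, in seconds
            'type': int(attrs[1]),       # danmaku type
            'size': int(attrs[2]),       # font size
            'color': int(attrs[3]),      # color as a decimal integer
            'timestamp': int(attrs[4]),  # send time
            'content': m[1]
        }
        result.append(item)
    return result
```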
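For downstream analysis it can help to convert the raw fields into friendlier types. The sketch below is one possible variant: it parses with `xml.etree.ElementTree`, renders the color as a hex string, and turns the send time into a `datetime`. It assumes the color is a decimal RGB value and the send time is a Unix timestamp in seconds. The `Danmaku` class, the function name, and the sample data are invented for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List

@dataclass
class Danmaku:
    time: float        # position in the video, in seconds
    mode: int          # danmaku type
    size: int          # font size
    color: str         # "#rrggbb", assuming a decimal RGB value
    sent_at: datetime  # assuming a Unix timestamp in seconds
    content: str

def parse_danmaku_xml(xml_text: str) -> List[Danmaku]:
    items = []
    for d in ET.fromstring(xml_text).iter("d"):
        fields = d.get("p", "").split(",")
        if len(fields) < 5:
            continue  # skip malformed entries instead of raising
        items.append(Danmaku(
            time=float(fields[0]),
            mode=int(fields[1]),
            size=int(fields[2]),
            color=f"#{int(fields[3]):06x}",
            sent_at=datetime.fromtimestamp(int(fields[4]), tz=timezone.utc),
            content=d.text or "",
        ))
    return items

# Made-up sample: one valid entry and one malformed entry.
sample = ('<i><d p="12.5,1,25,16777215,1600000000,0,abcd1234,1">hello</d>'
          '<d p="bad">skipped</d></i>')
for item in parse_danmaku_xml(sample):
    print(item)
```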
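## 3. Storing the data
### 3.1 Saving as a text file

```python
def save_as_txt(danmaku_list, filename):
    # One danmaku per line, UTF-8 encoded
    with open(filename, 'w', encoding='utf-8') as f:
        for danmu in danmaku_list:
            f.write(danmu + '\n')

save_as_txt(danmaku_list, 'danmaku.txt')
```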
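### 3.2 Saving as CSV

```python
import pandas as pd

def save_as_csv(parsed_danmaku, filename):
    df = pd.DataFrame(parsed_danmaku)
    # utf-8-sig helps Excel detect the encoding of Chinese text
    df.to_csv(filename, index=False, encoding='utf-8-sig')

# parse_danmaku() needs the raw XML, not the text-only danmaku_list
response = requests.get(f"https://api.bilibili.com/x/v1/dm/list.so?oid={cid}")
parsed_danmaku = parse_danmaku(response.content.decode('utf-8'))
save_as_csv(parsed_danmaku, 'danmaku.csv')
```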
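If pulling in `pandas` only for export feels heavy, the standard library's `csv` module can write the same rows. The sketch below assumes each row is a dict with the keys produced by the parsing step; the function name `write_danmaku_csv` and the sample row are invented for illustration.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

FIELDS = ["time", "type", "size", "color", "timestamp", "content"]

def write_danmaku_csv(rows: Iterable[Dict], out: TextIO) -> None:
    """Write parsed danmaku dicts as CSV, ignoring any extra keys."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)

# Write to an in-memory buffer so the example leaves no file behind.
buffer = io.StringIO()
write_danmaku_csv(
    [{"time": 1.5, "type": 1, "size": 25, "color": 16777215,
      "timestamp": 1600000000, "content": "hello, world"}],
    buffer,
)
print(buffer.getvalue())
```

To write a real file, open it with `open(filename, "w", newline="", encoding="utf-8-sig")` and pass the file object as `out`. The `csv` module quotes commas inside the danmaku text.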
### 3.3 Saving to MySQL

```python
import pymysql

def save_to_mysql(parsed_danmaku):
    # Example connection settings; the danmaku table must already exist
    conn = pymysql.connect(host='localhost',
                           user='root',
                           password='password',
                           database='bilibili',
                           charset='utf8mb4')
    try:
        with conn.cursor() as cursor:
            sql = """INSERT INTO danmaku
                     (`time`, `type`, `size`, `color`, `timestamp`, `content`)
                     VALUES (%s, %s, %s, %s, %s, %s)"""
            rows = [(item['time'], item['type'],
                     item['size'], item['color'],
                     item['timestamp'], item['content'])
                    for item in parsed_danmaku]
            cursor.executemany(sql, rows)
        conn.commit()
    finally:
        # Close the connection even if the insert fails
        conn.close()
```
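To try the same storage pattern without a MySQL server, the sketch below swaps in SQLite from the standard library. It is a different database, so it illustrates the pattern rather than replacing the MySQL code. The table layout mirrors the columns used in the `INSERT` statement; the function name `save_to_sqlite` and the sample rows are invented for illustration.

```python
import sqlite3
from typing import Dict, Iterable

def save_to_sqlite(conn: sqlite3.Connection, rows: Iterable[Dict]) -> int:
    """Insert parsed danmaku dicts and return the total row count."""
    conn.execute("""CREATE TABLE IF NOT EXISTS danmaku (
        time REAL, type INTEGER, size INTEGER,
        color INTEGER, timestamp INTEGER, content TEXT)""")
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO danmaku VALUES "
            "(:time, :type, :size, :color, :timestamp, :content)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM danmaku").fetchone()[0]

# Made-up rows in the shape produced by the parsing step.
sample_rows = [
    {"time": 1.0, "type": 1, "size": 25, "color": 16777215,
     "timestamp": 1600000000, "content": "hello"},
    {"time": 2.5, "type": 1, "size": 25, "color": 16777215,
     "timestamp": 1600000001, "content": "world"},
]
connection = sqlite3.connect(":memory:")
print(save_to_sqlite(connection, sample_rows))
```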
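## 4. Advanced techniques
### 4.1 Handling multi-part videos
For multi-part (multi-P) videos, iterate over the CIDs of all parts:

```python
import requests

def get_all_cids(bvid):
    url = f"https://api.bilibili.com/x/player/pagelist?bvid={bvid}"
    response = requests.get(url, timeout=10)
    # One entry per part
    return [item['cid'] for item in response.json()['data']]
```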
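Once the list of CIDs is available, each part's danmaku can be fetched in turn. The sketch below shows one way to do that with a pause between requests. `fetch_one` stands for any function that returns the danmaku for one cid, and the one-second default delay is an arbitrary assumption, not a documented limit. The fake data exists only so the example runs offline.

```python
import time
from typing import Callable, Dict, Iterable, List

def collect_all_parts(cids: Iterable[int],
                      fetch_one: Callable[[int], List[str]],
                      delay: float = 1.0) -> Dict[int, List[str]]:
    """Fetch danmaku for every cid, pausing between consecutive requests."""
    results: Dict[int, List[str]] = {}
    for index, cid in enumerate(cids):
        if index > 0:
            time.sleep(delay)  # pause between requests to stay polite
        results[cid] = fetch_one(cid)
    return results

# Fake per-cid data standing in for real network calls.
fake_data = {101: ["a", "b"], 102: ["c"]}
print(collect_all_parts([101, 102], fake_data.__getitem__, delay=0))
```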
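### 4.2 Asynchronous requests
Use `aiohttp` to speed up requests:

```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup

async def async_get_danmaku(cid):
    async with aiohttp.ClientSession() as session:
        url = f"https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"
        async with session.get(url) as response:
            text = await response.text(encoding='utf-8')
            soup = BeautifulSoup(text, 'lxml')
            return [d.text for d in soup.find_all('d')]

# Usage: danmaku_list = asyncio.run(async_get_danmaku(cid))
```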
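Asynchronous code pays off when several CIDs are fetched at once, but unbounded concurrency can overwhelm the server. The sketch below caps the number of simultaneous requests with `asyncio.Semaphore`. `fetch_one` stands for any coroutine function that returns the danmaku for one cid, the default limit of 3 is an arbitrary assumption, and `fake_fetch` replaces real network access so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable, List

async def gather_limited(cids: List[int],
                         fetch_one: Callable[[int], Awaitable[List[str]]],
                         max_concurrency: int = 3) -> List[List[str]]:
    """Fetch all cids concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(cid: int) -> List[str]:
        async with semaphore:  # waits here while the limit is reached
            return await fetch_one(cid)

    # gather() returns results in the same order as the input cids.
    return await asyncio.gather(*(guarded(cid) for cid in cids))

async def fake_fetch(cid: int) -> List[str]:
    """Stand-in for a real request; yields control once, then returns."""
    await asyncio.sleep(0)
    return [f"danmaku-{cid}"]

print(asyncio.run(gather_limited([1, 2, 3, 4], fake_fetch, max_concurrency=2)))
```

When doing this with `aiohttp`, creating one `ClientSession` and sharing it across all requests is generally preferable to opening a new session per cid.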
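### 4.3 Request headers and proxies

```python
# Browser-like headers to send with each request
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.bilibili.com/'
}
# Example proxy address; the API URLs are HTTPS, so an 'https' entry is needed as well
proxies = {'http': 'http://127.0.0.1:1080',
           'https': 'http://127.0.0.1:1080'}
# Usage: requests.get(url, headers=headers, proxies=proxies)
```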
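## 5. Data analysis examples
### 5.1 Word cloud

```python
import jieba
from wordcloud import WordCloud

text = ' '.join(danmaku_list)
# Segment the Chinese text into words, separated by spaces
wordlist = ' '.join(jieba.cut(text))
# font_path must point to a font that contains Chinese glyphs
wc = WordCloud(font_path='simhei.ttf').generate(wordlist)
wc.to_file('danmaku_cloud.png')
```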
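A word cloud gives a visual impression; for exact numbers, a simple frequency count of whole danmaku messages is a useful companion. The sketch below counts identical messages with `collections.Counter` and does no word segmentation. The function name `top_danmaku` and the sample list are invented for illustration.

```python
from collections import Counter
from typing import List, Tuple

def top_danmaku(danmaku_list: List[str], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n most frequent messages, ignoring blanks and outer whitespace."""
    counts = Counter(text.strip() for text in danmaku_list if text.strip())
    return counts.most_common(n)

# Made-up sample; " 666 " and "666" count as the same message.
sample = ["哈哈哈", "666", "哈哈哈", " 666 ", "前方高能", "哈哈哈", ""]
print(top_danmaku(sample, n=2))
```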
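### 5.2 Danmaku distribution over time

```python
import matplotlib.pyplot as plt

# parsed_danmaku is the list of dicts returned by parse_danmaku()
times = [d['time'] for d in parsed_danmaku]
plt.hist(times, bins=50)
plt.xlabel('Video Time (s)')
plt.ylabel('Danmaku Count')
plt.show()
```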
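The same distribution can be computed as plain numbers, which makes it easy to locate the busiest moment of a video programmatically. The sketch below groups danmaku times into fixed-width buckets; the 30-second width is an arbitrary choice, and the function names and sample times are invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

def bucket_counts(times: Iterable[float], bucket_seconds: int = 30) -> Dict[int, int]:
    """Map each bucket's start second to the number of danmaku in it."""
    counts = Counter(int(t // bucket_seconds) * bucket_seconds for t in times)
    return dict(sorted(counts.items()))

def busiest_bucket(times: Iterable[float], bucket_seconds: int = 30) -> Tuple[int, int]:
    """Return (start second, count) of the fullest bucket; times must be non-empty."""
    counts = bucket_counts(times, bucket_seconds)
    start = max(counts, key=counts.get)
    return start, counts[start]

# Made-up danmaku positions, in seconds.
sample_times = [1.0, 12.5, 29.9, 30.0, 31.2, 45.0, 59.9, 61.0]
print(bucket_counts(sample_times))
print(busiest_bucket(sample_times))
```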
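## 6. Notes
Bilibili's `robots.txt` restricts some paths.

## Summary
This article covered the complete process of scraping danmaku from a Bilibili video, including:

- Two ways to get the video's CID
- Requesting and parsing the danmaku XML data
- Several storage options
- Advanced techniques and data analysis examples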
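Scraping Bilibili danmaku with Python helps in understanding the structure of Bilibili's API and lays the groundwork for later data analysis. Hopefully this tutorial is useful to you!

Note: all code examples in this article are for learning and reference only; when using them in practice, comply with Bilibili's rules.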