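How to Use a Python Crawler to Analyze Douban Reviews of Wolf Warrior 2 (戰狼2)

Published: 2021-08-12 12:42:58  Source: 億速云 (Yisu Cloud)  Views: 184  Author: 小新  Column: Development Technology

This article explains how to use a Python crawler to analyze the Douban reviews of Wolf Warrior 2. It may be a useful reference for interested readers; the walkthrough follows.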

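Overview

The project does three things:

Fetch the web page data
Clean the data
Display the result as a word cloud

The Python version used is 3.5.

1. Fetching the web page data

The first step is to request the page, which in Python is done with the urllib library. The code is as follows: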

from urllib import request
resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
html_data = resp.read().decode('utf-8')

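Here https://movie.douban.com/nowp… is Douban's page of movies now showing; you can open it in a browser to take a look.

html_data is a string variable that holds the page's HTML source. Run print(html_data) to inspect it, as shown in the figure below:

[Figure: output of print(html_data)]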
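Some sites reject requests that carry urllib's default User-Agent, and Douban may be one of them today (this is an assumption; the article does not say so). The following is a minimal sketch of sending a browser-like header and a timeout with urllib. The helper names build_request and fetch_html and the header value are hypothetical, introduced only for illustration.

```python
from urllib import request

# Hypothetical browser-like header value; any realistic User-Agent string would do.
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64)'

def build_request(url):
    """Wrap the URL in a Request that carries a User-Agent header."""
    return request.Request(url, headers={'User-Agent': USER_AGENT})

def fetch_html(url, timeout=10):
    """Download a page and decode it as UTF-8 (this performs a network call)."""
    with request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode('utf-8')

# Building the request does not touch the network, so it is safe to inspect.
req = build_request('https://movie.douban.com/nowplaying/hangzhou/')
print(req.full_url)
```

Calling fetch_html(url) would then stand in for the request.urlopen(...).read().decode('utf-8') sequence used above.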

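The second step is to parse the HTML and extract the data we need. In Python, the BeautifulSoup library handles HTML parsing. (Note: if the library is not installed, install it with pip install beautifulsoup4; the package imported in the code below is bs4.) BeautifulSoup is called like this: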

BeautifulSoup(html,"html.parser")

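The first argument is the HTML to extract data from, and the second names the parser. You can then call find_all() to read the contents of HTML tags.

But with so many tags in the HTML, which ones should we read? The simplest approach is to open the HTML source of the page being crawled, find which tags contain the data we need, and read those. See the figure below:

[Figure: HTML source of the now-playing page]

The figure shows that the data we want begins at the div id="nowplaying" tag, which contains each movie's title, rating, lead actors, and so on. The corresponding code is: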

from bs4 import BeautifulSoup as bs
soup = bs(html_data, 'html.parser') 
nowplaying_movie = soup.find_all('div', id='nowplaying')
nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')

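nowplaying_movie_list is a list; use print(nowplaying_movie_list[0]) to view its contents, as shown in the figure below:

[Figure: output of print(nowplaying_movie_list[0])]

The figure shows that the data-subject attribute holds the movie's id and the alt attribute of the img tag holds the movie's title, so we use these two attributes to obtain each movie's id and name. (Note: the id is needed later to open the movie's short-review page, which is why we extract it.) The code is: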

nowplaying_list = []
for item in nowplaying_movie_list:
    # data-subject holds the movie id; the poster image's alt text holds the title
    tag_img_item = item.find('img')
    if tag_img_item is not None:
        # append once per movie, even if the item contains several img tags
        nowplaying_list.append({'id': item['data-subject'], 'name': tag_img_item['alt']})

The list nowplaying_list now holds the ids and names of the movies now showing; use print(nowplaying_list) to view it, as shown in the figure below:

[Figure: output of print(nowplaying_list)]
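To try the parsing logic without hitting the network, the same steps can be run against a small inline fragment. This is only a sketch: SAMPLE_HTML is a hypothetical fragment shaped like the structure described above (not real Douban markup), the ids and titles in it are made up, and parse_now_playing is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mirroring div#nowplaying > li.list-item > img; not real Douban markup.
SAMPLE_HTML = '''
<div id="nowplaying">
  <ul>
    <li class="list-item" data-subject="1000001"><img alt="Movie A" src="a.jpg"></li>
    <li class="list-item" data-subject="1000002"><img alt="Movie B" src="b.jpg"></li>
  </ul>
</div>
'''

def parse_now_playing(html):
    """Return [{'id': ..., 'name': ...}] for every li.list-item under div#nowplaying."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find('div', id='nowplaying')
    if container is None:  # layout changed, or the page was not the one expected
        return []
    movies = []
    for item in container.find_all('li', class_='list-item'):
        img = item.find('img')
        if img is not None:
            movies.append({'id': item['data-subject'], 'name': img['alt']})
    return movies

movies = parse_now_playing(SAMPLE_HTML)
print(movies)
```

Returning an empty list when the container is missing avoids the IndexError that indexing an empty find_all() result with [0] would raise.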

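The printed list matches what the Douban site shows, so we now have the information for the latest movies. Next we analyze the short reviews of one of them. For example, the short-review URL for Wolf Warrior 2 (《戰狼2》) is: https://movie.douban.com/subject/26363254/comments?start=0&limit=20

Here 26363254 is the movie's id, start=0 means the listing starts from comment number 0, and limit=20 asks for 20 comments per page.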
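The relationship between a page number and the start parameter can be sketched as below. The function name comments_url and the constant PAGE_SIZE are hypothetical; the URL pattern and the step of 20 come from the example URL above.

```python
from urllib.parse import urlencode

PAGE_SIZE = 20  # matches limit=20 in the example URL

def comments_url(movie_id, page):
    """Build the short-review URL for a 1-based page number."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    # page 1 -> start=0, page 2 -> start=20, and so on
    query = urlencode({'start': (page - 1) * PAGE_SIZE, 'limit': PAGE_SIZE})
    return 'https://movie.douban.com/subject/{}/comments?{}'.format(movie_id, query)

print(comments_url('26363254', 1))
print(comments_url('26363254', 3))
```

Using urlencode instead of string concatenation keeps the query string correct if more parameters are added later.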

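Next we parse this URL. Opening the HTML source of the short-review page, we find that the review data sits under div tags whose class is comment, as shown in the figure below:

[Figure: HTML source of the short-review page]

So we parse these tags with the following code: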

requrl = 'https://movie.douban.com/subject/' + nowplaying_list[0]['id'] + '/comments' +'?' +'start=0' + '&limit=20' 
resp = request.urlopen(requrl) 
html_data = resp.read().decode('utf-8') 
soup = bs(html_data, 'html.parser') 
comment_div_lits = soup.find_all('div', class_='comment')

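comment_div_lits now holds the HTML of every div tag whose class is comment. The page source also shows that each user's review text is stored in a p tag, as shown in the figure below:

[Figure: p tag containing a review]

So we parse the HTML in comment_div_lits further: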

eachCommentList = []
for item in comment_div_lits:
    # .string is None when the p tag contains nested tags, so skip those
    comment = item.find_all('p')[0].string
    if comment is not None:
        eachCommentList.append(comment)

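Use print(eachCommentList) to view the list; it contains the reviews we want, as shown in the figure below:

[Figure: output of print(eachCommentList)]

At this point we have crawled the review data for a movie now showing on Douban. Next we clean the data and display it as a word cloud.

2. Cleaning the data

To make cleaning easier, we join the items of the list into a single string: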

# strip each review and concatenate them into one string
comments = ''.join(str(comment).strip() for comment in eachCommentList)

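Use print(comments) to view it, as shown in the figure below:

[Figure: output of print(comments)]

All the reviews are now one string, but it still contains a lot of punctuation and other symbols. These are useless for word-frequency counting, so we remove them with a regular expression, which Python provides through the re module. The code is: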

import re
# keep only runs of Chinese characters (the range \u4e00-\u9fa5)
pattern = re.compile(r'[\u4e00-\u9fa5]+')
filterdata = re.findall(pattern, comments)
cleaned_comments = ''.join(filterdata)

Use print(cleaned_comments) to check again, as shown in the figure below:

[Figure: output of print(cleaned_comments)]

The punctuation is gone and the review data is much cleaner.
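The effect of the character class is easy to check on a short sample. In this sketch the sample string and the helper name keep_chinese are made up for illustration. Note that the backslashes in the pattern matter: without them the class would match ASCII digits and letters instead of Chinese characters.

```python
import re

# \u4e00-\u9fa5 covers the common CJK ideographs; punctuation, digits,
# and Latin letters fall outside the range and are dropped.
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')

def keep_chinese(text):
    """Concatenate every run of Chinese characters found in text."""
    return ''.join(CHINESE_RUN.findall(text))

sample = '很好看！Great, 10/10。值得一看'
print(CHINESE_RUN.findall(sample))  # the two runs of Chinese characters
print(keep_chinese(sample))
```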

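To count word frequencies we first need to segment the Chinese text into words. Here I use jieba (結巴分詞). If it is not installed, run pip install jieba in a console. (Note: pip list shows which libraries are installed.) The code is: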

import jieba  # Chinese word segmentation package
import pandas as pd
segment = jieba.lcut(cleaned_comments)
words_df = pd.DataFrame({'segment': segment})

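pandas is loaded here to hold the segmented words in a DataFrame for the steps that follow. Use words_df.head() to view the segmentation result, as shown in the figure below:

[Figure: output of words_df.head() after segmentation]

The figure shows that the data contains function words (stop words) such as “看”, “太”, and “的”. These words are frequent in any context and carry no real meaning, so we remove them.

I keep the stop words in a file named stopwords.txt and compare our data against it. (Note: searching Baidu for stopwords.txt will turn up such a file to download.) The stop-word removal code is: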

# quoting=3 is csv.QUOTE_NONE, so quote characters are read literally
stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]

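Use words_df.head() again to check the result; as the figure below shows, the stop words have been removed.

[Figure: output of words_df.head() after stop-word removal]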
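If the stop-word file simply lists one word per line (an assumption about the downloaded file's format), it can also be read with plain Python into a set. The helper names load_stopwords and remove_stopwords are hypothetical.

```python
def load_stopwords(path, encoding='utf-8'):
    """Read one stop word per line into a set, skipping blank lines."""
    with open(path, encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}

def remove_stopwords(words, stopwords):
    """Keep the words that are not in the stop-word set, preserving order."""
    return [w for w in words if w not in stopwords]

# Tiny in-memory demonstration; real use would call load_stopwords('stopwords.txt').
demo_stopwords = {'看', '太', '的'}
print(remove_stopwords(['電影', '的', '太', '好看'], demo_stopwords))
```

A set gives constant-time membership tests, which plays the same role here as the isin() filter on the DataFrame.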

Next we count word frequencies:

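# size() counts the rows in each group, i.e. how often each word occurs.
# The dict form agg({"計數": numpy.size}) is rejected by pandas 1.0 and later.
words_stat = words_df.groupby('segment').size().reset_index(name='count')
words_stat = words_stat.sort_values(by='count', ascending=False)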

Use words_stat.head() to view the result:

[Figure: output of words_stat.head()]
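For comparison, the same counting step can be sketched with the standard library's collections.Counter instead of pandas. This is a swapped-in technique, not the article's method, and the token list is made-up sample data.

```python
from collections import Counter

# Hypothetical segmented, stop-word-free tokens.
tokens = ['電影', '好看', '電影', '精彩', '好看', '電影']

freq = Counter(tokens)        # maps each word to its number of occurrences
print(freq.most_common(2))    # the most frequent words, highest count first
```

Counter is a dict subclass, so freq can be used anywhere a word-to-count mapping is needed.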

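Because we crawled only the first page of reviews above, the data is rather sparse. The complete code at the end crawls 10 pages of reviews, so its results are more meaningful.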
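Collecting several pages amounts to calling a page fetcher in a loop and flattening the results. A minimal sketch follows, assuming a fetcher callable that takes (movie_id, page) like getCommentsById in the complete code. The name collect_comments, the delay between requests, and the offline stand-in fake_fetch are assumptions added for illustration.

```python
import time

def collect_comments(fetch_page, movie_id, pages=10, delay=1.0):
    """Call fetch_page(movie_id, page) for pages 1..pages and flatten the results."""
    all_comments = []
    for page in range(1, pages + 1):
        # extend, not append, so the result stays one flat list of comments
        all_comments.extend(fetch_page(movie_id, page))
        if page < pages:
            time.sleep(delay)  # pause between requests to go easy on the site
    return all_comments

# Stand-in fetcher so the sketch runs offline; a real one would download and parse a page.
def fake_fetch(movie_id, page):
    return ['comment {}-{}'.format(page, i) for i in range(2)]

print(collect_comments(fake_fetch, '26363254', pages=3, delay=0))
```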

3. Displaying the word cloud

The code is:

import matplotlib.pyplot as plt
# Jupyter notebook magic; delete the next line when running as a plain script
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud  # word cloud package
# set the font file, background color, and maximum font size
wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)
# map each of the top 1000 words to its count
word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
# current wordcloud releases expect a dict of word -> frequency here
wordcloud = wordcloud.fit_words(word_frequence)
plt.imshow(wordcloud)

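simhei.ttf specifies the font. You can find it by searching Baidu for simhei.ttf; put the downloaded file in the program's root directory. The resulting image is:

[Figure: word cloud of the reviews]

That concludes the walkthrough of the project. I am still a beginner who has not been using Python for long, so the code is not well written. This is also my first technical blog post, so the writing is somewhat wordy; please bear with me and point out anything that is wrong. I plan to write up my future small projects on this blog in the same way so that we can all exchange ideas. The complete code follows.

Complete code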

#coding:utf-8
__author__ = 'hang'
import warnings
warnings.filterwarnings("ignore")
import jieba  # Chinese word segmentation package
import re
import pandas as pd
import matplotlib.pyplot as plt
from urllib import request
from bs4 import BeautifulSoup as bs
# Jupyter notebook magic; delete the next line when running as a plain script
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud  # word cloud package

# Parse the now-playing page into a list of {'id', 'name'} dicts
def getNowPlayingMovie_list():
    resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
    nowplaying_list = []
    for item in nowplaying_movie_list:
        tag_img_item = item.find('img')
        if tag_img_item is not None:
            nowplaying_list.append({'id': item['data-subject'], 'name': tag_img_item['alt']})
    return nowplaying_list

# Crawl one page (20 reviews) of short reviews for a movie
def getCommentsById(movieId, pageNum):
    eachCommentList = []
    if pageNum <= 0:
        return eachCommentList  # invalid page: return an empty list
    start = (pageNum - 1) * 20
    requrl = 'https://movie.douban.com/subject/' + movieId + '/comments' + '?' + 'start=' + str(start) + '&limit=20'
    print(requrl)
    resp = request.urlopen(requrl)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    comment_div_lits = soup.find_all('div', class_='comment')
    for item in comment_div_lits:
        # .string is None when the p tag contains nested tags, so skip those
        comment = item.find_all('p')[0].string
        if comment is not None:
            eachCommentList.append(comment)
    return eachCommentList

def main():
    # Fetch the first 10 pages of reviews for the first movie
    commentList = []
    NowPlayingMovie_list = getNowPlayingMovie_list()
    for i in range(10):
        num = i + 1
        commentList_temp = getCommentsById(NowPlayingMovie_list[0]['id'], num)
        commentList.extend(commentList_temp)  # extend keeps one flat list of reviews
    # Join the list into a single string
    comments = ''.join(str(comment).strip() for comment in commentList)
    # Use a regular expression to keep only Chinese characters
    pattern = re.compile(r'[\u4e00-\u9fa5]+')
    filterdata = re.findall(pattern, comments)
    cleaned_comments = ''.join(filterdata)
    # Segment the Chinese text with jieba
    segment = jieba.lcut(cleaned_comments)
    words_df = pd.DataFrame({'segment': segment})
    # Remove stop words; quoting=3 is csv.QUOTE_NONE
    stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
    # Count word frequencies (the dict form of agg() is rejected by pandas 1.0+)
    words_stat = words_df.groupby('segment').size().reset_index(name='count')
    words_stat = words_stat.sort_values(by='count', ascending=False)
    # Display the word cloud
    wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
    # current wordcloud releases expect a dict of word -> frequency here
    wordcloud = wordcloud.fit_words(word_frequence)
    plt.imshow(wordcloud)

# Entry point
main()

The result is shown below:

[Figure: word cloud produced by the complete code]
