Using the Beautiful Soup Module in Python

Published: 2021-09-04 10:11:36  Source: Yisu Cloud (億速云)  Views: 181  Author: chen  Category: Programming Languages

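This article introduces how to use the Beautiful Soup module in Python. Many people run into questions when they first work with it, so the editor has gathered material from various sources and condensed it into a set of simple, practical steps. Let's walk through them together.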

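1. Introduction to the Beautiful Soup module

Beautiful Soup is a Python library for extracting data from HTML and XML files. Put simply, it parses an HTML document into a tree structure, which makes it easy to look up a given tag and read its attributes, and also convenient for crawling and parsing the content of a whole site.
Beautiful Soup supports the HTML parser in the Python standard library (html.parser) as well as several third-party parsers. If no third-party parser is installed, the standard-library parser is used.
lxml is a Python parsing library that supports both HTML and XML. The html5lib parser parses a page the way a web browser does and produces valid HTML5. Install Beautiful Soup and the two optional parsers with pip: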

pip install beautifulsoup4
pip install html5lib
pip install lxml
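
To make the parser choice concrete, here is a minimal sketch that tries the parsers named above in order and falls back to the built-in html.parser. The helper name make_soup and its preference order are assumptions made for this example; they are not part of Beautiful Soup itself.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is missing.
            continue
    raise RuntimeError("no usable parser found")


# The markup is deliberately left unclosed; every parser repairs it.
soup, parser_used = make_soup("<p>Hello <b>world")
print(parser_used)        # depends on which parsers are installed
print(soup.b.get_text())  # world
```

Naming the parser explicitly, as the article does with 'lxml', keeps results consistent across machines, because different parsers can repair broken markup in different ways.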

2. Parsing an HTML document with Beautiful Soup

Suppose we have the following incomplete piece of HTML and want to parse it with Beautiful Soup:

data = '''                                         
<html><head><title>The Dormouse's story</title></he
<body>                                             
<p class="title"><b id="title">The Dormouse's story</b></p>   
<p class="story">Once upon a time there were three 
<a href="http://example.com/elsie" class="sister" i
<a href="http://example.com/lacie" class="sister" i
<a href="http://example.com/tillie" class="sister" 
and they lived at the bottom of a well.</p>        
<p class="story">...</p>                           
'''

First import BeautifulSoup, then create a BeautifulSoup object:

from bs4 import BeautifulSoup           
soup = BeautifulSoup(data,'lxml')

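The methods that BeautifulSoup provides then give access to the HTML elements, attributes, links, text and so on. Beautiful Soup also turns an incomplete HTML document into a complete one. For example, print(soup.prettify()) produces the following output: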

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b id="title">
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three
   <a a="" and="" at="" bottom="" class="sister" href="http://example.com/elsie" i="" lived="" of="" the="" they="" well.="">
    <p class="story">
     ...
    </p>
   </a>
  </p>
 </body>
</html>

Get a tag, such as the title tag or the a tag. Accessing a tag by name returns the first tag with that name:

print('title = {}'.format(soup.title))
# Output: title = <title>The Dormouse's story</title>
print('a={}'.format(soup.a))

Get the name of a tag, such as the title tag or the body tag:

print('title_name = {}'.format(soup.title.name))
# Output: title_name = title
print('body_name = {}'.format(soup.body.name))
# Output: body_name = body

Get the text content of a tag, such as the title tag:

print('title_string = {}'.format(soup.title.string))
# Output: title_string = The Dormouse's story

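To get the parent of a tag, use parent (append .name if only the parent's name is needed). For the title tag this returns the head tag, and the incomplete closing tag is completed automatically:

print('title_parent = {}'.format(soup.title.parent))
# Output: title_parent = <head><title>The Dormouse's story</title>
# </head>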

Get the first p tag:

print('p = {}'.format(soup.p))
# Output: p = <p class="title"><b id="title">The Dormouse's story</b></p>

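Get the class value of the first p tag and of the first a tag. Because class can hold several values, it is returned as a list:

print('p_class = {}'.format(soup.p["class"]))
# Output: p_class = ['title']
print('a_class = {}'.format(soup.a["class"]))
# Output: a_class = ['sister']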
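
The attribute lookups above can be sketched on a small, well-formed tag. The markup below is a stand-in written for this example (it is not the article's truncated data), and html.parser is used only so that the snippet runs without extra installs.

```python
from bs4 import BeautifulSoup

# Illustrative markup, assumed for this example.
html_doc = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.a

print(link["href"])       # http://example.com/elsie
print(link["class"])      # ['sister', 'big'] (class is multi-valued, so a list)
print(link["id"])         # link1 (single-valued attributes stay strings)
print(link.get("title"))  # None (get() avoids a KeyError for missing attributes)
print(link.attrs)         # all attributes as a dict
```

Using link.get(...) rather than link[...] is the safer choice when an attribute may be absent, since indexing a missing attribute raises KeyError.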

Get all tags with a given name:

# Get all a tags
print('a = {}'.format(soup.find_all('a')))
# Get all p tags
print('p = {}'.format(soup.find_all('p')))

Get the tag whose id is title:

print('b_title = {}'.format(soup.find(id='title')))
# Output: b_title = <b id="title">The Dormouse's story</b>
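
As a rough sketch of how find_all() and find(id=...) are typically combined, the snippet below collects the text and href of every link and then looks one link up by id. The link markup (including the names, full href values and the ids link1 to link3) is assumed for illustration, since the article's own sample is truncated.

```python
from bs4 import BeautifulSoup

# Well-formed stand-in markup, assumed for this example.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find_all() returns every match, so it can feed a list comprehension.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]
for text, href in links:
    print(text, href)

# find() with a keyword argument filters on that attribute.
third = soup.find(id="link3")
print(third["href"])  # http://example.com/tillie
```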

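3. Objects in BeautifulSoup

Beautiful Soup represents a parsed document with four kinds of objects: Tag (an HTML or XML tag), NavigableString (the text inside a tag), BeautifulSoup (the document as a whole, which behaves like the root tag), and Comment (a special NavigableString that holds an HTML comment).

Syntax:

soup.<tag name>: get the first tag with that name in the HTML;
soup.<tag name>.name: get the name of the tag;
soup.<tag name>.attrs: get all attributes of the tag as a dictionary;
soup.<tag name>.string: get the text content of the tag;
soup.<tag name>.parent: get the parent of the tag;
prettify(): format the Beautiful Soup document tree and return it as a Unicode string, with each XML/HTML tag on its own line.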
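
The four object types can be checked directly with isinstance. This is a small sketch on markup invented for the example; html.parser is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# A paragraph holding one text node and one HTML comment (illustrative markup).
soup = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")
text_node, comment_node = soup.p.contents

print(type(soup).__name__)          # BeautifulSoup
print(type(soup.p).__name__)        # Tag
print(type(text_node).__name__)     # NavigableString
print(type(comment_node).__name__)  # Comment

# Comment is a subclass of NavigableString, so check for it first
# when comments must be told apart from ordinary text.
print(isinstance(comment_node, NavigableString))  # True
```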
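
The accessors listed above can be tried on a one-line document. One detail worth illustrating is that .string returns None when a tag has more than one child, in which case get_text() is the usual alternative. The markup is assumed for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="intro">Hello <b>world</b></p>', "html.parser")
p = soup.p

print(p.name)           # p
print(p.attrs)          # {'class': ['intro']}
print(p.b.string)       # world
print(p.b.parent.name)  # p

# <p> has two children (a string and <b>), so .string is ambiguous.
print(p.string)         # None
print(p.get_text())     # Hello world
```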

4. Traversing the document

contents: get all direct child nodes as a list, which can be indexed:

soup = BeautifulSoup(data,"lxml")
# Returns a list
print(soup.p.contents)
# Get the first child node
print(soup.p.contents[0])

children: returns an iterator over the direct child nodes:

for tag in soup.p.children:
    print(tag)
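
To show how contents and children relate, the sketch below compares them on a small paragraph and adds descendants, which also walks into nested tags. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>One <b>two</b> three</p>", "html.parser")
p = soup.p

# contents is a list of direct children: a string, the <b> tag, a string.
print(len(p.contents))     # 3
print(p.contents[1].name)  # b

# children yields the same nodes lazily, one at a time.
child_list = list(p.children)

# descendants also visits the string inside <b>, so it has one more node.
descendant_list = list(p.descendants)
print(len(descendant_list))  # 4
```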

soup.strings: get the text of every node in the document, including whitespace-only strings:

soup = BeautifulSoup(data,"lxml")
for content in soup.strings:
    print(repr(content))

soup.stripped_strings: get the text of every node with surrounding whitespace stripped, skipping strings that contain only whitespace:

soup = BeautifulSoup(data,"lxml")
for text in soup.stripped_strings:
    print(repr(text))
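
The difference between the two generators is easiest to see side by side. This sketch uses a short document invented for the example and html.parser, which keeps the whitespace between tags as separate strings.

```python
from bs4 import BeautifulSoup

html_doc = "<div>\n <p> Hello </p>\n <p>World</p>\n</div>"
soup = BeautifulSoup(html_doc, "html.parser")

# strings keeps every text node, including the newlines between tags.
raw = list(soup.strings)
print(raw)

# stripped_strings trims each string and drops the whitespace-only ones.
clean = list(soup.stripped_strings)
print(clean)  # ['Hello', 'World']
```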

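5. Searching for tags

find_all(): searches all descendants of the current tag for nodes that match the given filter and returns them as a list. The filter can be a tag name, a list of tag names (to search for several tags at once), or a compiled regular expression that is matched against tag names:

import re

soup = BeautifulSoup(data,"lxml")
print(soup.find_all('a'))
print(soup.find_all(['a','p']))
print(soup.find_all(re.compile('^a')))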
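
Beyond tag names, find_all() also accepts attribute filters. The following is a sketch of a few commonly used ones (class_, limit, and a regular expression on an attribute value); the markup, class names and URLs are assumptions made for the example.

```python
import re

from bs4 import BeautifulSoup

# Illustrative markup, assumed for this example.
html_doc = """
<p class="title"><b>Story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="other" href="http://example.org/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class.
sisters = soup.find_all("a", class_="sister")

# limit stops the search after the given number of matches.
first_only = soup.find_all("a", limit=1)

# A compiled regex as a keyword argument is matched against the attribute value.
com_links = soup.find_all("a", href=re.compile(r"example\.com"))

# A compiled regex as the first argument is matched against tag names.
b_tags = soup.find_all(re.compile("^b"))

print(len(sisters), len(first_only), len(com_links), len(b_tags))  # 2 1 2 1
```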

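find(): works like find_all(), but where find_all() returns a list of every match (an empty list if there is none), find() returns only the first matching tag itself, or None if nothing matches:

soup = BeautifulSoup(data,"lxml")
print(soup.find('a'))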
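
Because find() can return None, code that reads attributes from its result usually guards against that case. A minimal sketch, with markup invented for the example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>no links here</p>", "html.parser")

missing = soup.find("a")    # no match: None
empty = soup.find_all("a")  # no match: an empty list
print(missing, len(empty))  # None 0

# Guard before indexing, otherwise None["href"] raises TypeError.
link = soup.find("a")
href = link["href"] if link is not None else None
print(href)  # None
```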

That concludes this introduction to using the Beautiful Soup module in Python. Combining theory with practice is the best way to learn, so try the examples for yourself.
