Python Tutorial: Chinese Idiom Lookup Tool - Data Acquisition

Published: 2020-08-19 02:55:38 | Source: ITPUB Blog | Views: 204 | Author: 千鋒Python唐小強 | Category: Programming Languages


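We get the content we need from this site (http://chengyu.kxue.com/, as used in the code below). There is no need to work through its many sections; retrieving the idioms by letter index is enough.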


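Open the page for each letter, fetch its data, and loop over its pages. Note that the pages contain quite a few duplicate entries, so remember to deduplicate.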


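1. Fetching a page

The usual routine. Because XPath is needed later, the function simply returns the HTML as a string. The data contains a large number of Traditional Chinese characters, so the character encoding is set to gbk.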

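import requests

# Assumption: the original does not show how `headers` is defined; a browser-like User-Agent is used here.
headers = {'User-Agent': 'Mozilla/5.0'}
def get_html(url):
    r = requests.get(url, headers=headers)
    r.encoding = 'gbk'  # decode the response body as gbk
    return r.text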
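
The reason for picking gbk rather than the narrower gb2312 can be checked without touching the network. The following is a minimal sketch and not part of the original scraper: the helper name `can_encode` and the sample character are assumptions chosen for illustration. It checks whether a Traditional Chinese character can be represented in each encoding.

```python
def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative sample (an assumption, not data taken from the site):
# the Traditional Chinese character for "page".
sample = "頁"

# Encode to gbk bytes and decode them again.
round_trip = sample.encode("gbk").decode("gbk")

gbk_ok = can_encode(sample, "gbk")
gb2312_ok = can_encode(sample, "gb2312")
print(gbk_ok, gb2312_ok)
```

If a page were decoded with an encoding that lacks these characters, the affected idioms would come out garbled, which is why the encoding is set explicitly before reading `r.text`.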

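2. Fetching the current page's data

The idioms and their explanations on a page are stored in a list, so it is enough to iterate over the list items (current page only). Duplicates need to be cleaned out. Here the anonymous function lambda z: dict([(x, y) for y, x in z.items()]) is applied twice: it swaps the dictionary's keys and values and then swaps them back, so entries that share the same explanation collapse into one.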

from lxml import etree

def get_curr(url):
    html = etree.HTML(get_html(url))
    lis = html.xpath('//li[@class="licontent"]')
    context = {}
    for li in lis:
        idioms = li.xpath('./span[@class="hz"]/a/text()')
        interpretations = li.xpath('./span[@class="js"]/text()')
        # Skip list items that lack either the idiom or its explanation
        if idioms and interpretations:
            context[idioms[0]] = interpretations[0]
    # Invert the dict twice so entries with identical explanations collapse
    func = lambda z: dict([(x, y) for y, x in z.items()])
    idiom_dict = func(func(context))
    return idiom_dict
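
To see what the double inversion actually removes, here is a small sketch on made-up data. The `invert` helper and the sample entries are hypothetical; the logic is equivalent to the lambda used above, written as a dict comprehension.

```python
def invert(mapping):
    """Swap keys and values; later entries overwrite earlier ones."""
    return {value: key for key, value in mapping.items()}


# Hypothetical sample data: "idiom A" and "idiom C" share one explanation.
page = {
    "idiom A": "explanation 1",
    "idiom B": "explanation 2",
    "idiom C": "explanation 1",
}

# First inversion collapses identical explanations into one key;
# second inversion restores the idiom -> explanation direction.
deduplicated = invert(invert(page))
print(deduplicated)
```

Because a dict already has unique keys, the only entries this step removes are idioms whose explanation text is identical to another entry's. Of each such group, the last one seen is kept.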

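3. Looping over pages

The bottom of each page has pagination labels, including the total page count, current page, last page, next page, and so on. If a letter has only one page, however, nothing is shown at all, and once you reach the last page the pagination labels disappear as well (odd, isn't it?). All we need here is the total page count and the current letter index. The write_data and print calls in this function are for inspecting the data of each letter index. The final run writes everything into a single file, so uncomment them if you want to see the idioms for each letter.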

import re

def run(url, context):
    html = etree.HTML(get_html(url))
    # Defaults for a letter with a single page (no pagination links)
    letter = url.split('/')[-1][0]
    total = 1
    last_page = html.xpath('//a[contains(text(), "末頁")]/@href')
    if last_page:
        text = last_page[0]
        # re.search returns None when nothing matches, so test before calling group()
        letter_match = re.search(r'\w', text)
        total_match = re.search(r'\d+', text)
        if letter_match:
            letter = letter_match.group(0)
        if total_match:
            total = int(total_match.group(0))
    for num in range(1, total + 1):
        page_context = get_curr('http://chengyu.kxue.com/pinyin/' + letter + '_' + str(num) + '.html')
        context.update(page_context)
        print("Finished adding {}, {} page(s) in total".format(letter + '_' + str(num), total))
    # write_data('grandSon/' + url.split('/')[-1][0] + '.json', context)
    # print("Finished writing {}".format(url.split('/')[-1][0]))
    return context
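
The parsing step can be tried in isolation. The sketch below is not the author's code: it assumes the last-page link ends in a file name of the form `<letter>_<number>.html` (mirroring the URLs that `run` builds), uses a stricter regular expression than the one above, and the names `parse_last_page` and `page_urls` are hypothetical.

```python
import re


def parse_last_page(href, fallback_letter):
    """Extract (letter, total_pages) from a last-page href.

    Falls back to (fallback_letter, 1) when the href does not match,
    which covers letters that have no pagination links at all.
    """
    match = re.search(r"([a-z])_(\d+)\.html", href)
    if match is None:
        return fallback_letter, 1
    return match.group(1), int(match.group(2))


def page_urls(base, letter, total):
    """Build the URL of every page for one letter index."""
    return ["{}{}_{}.html".format(base, letter, num) for num in range(1, total + 1)]


BASE = "http://chengyu.kxue.com/pinyin/"

# Hypothetical href, for illustration only.
letter, total = parse_last_page("a_3.html", "a")
urls = page_urls(BASE, letter, total)
print(urls)
```

Keeping the parsing in a pure function makes the single-page fallback easy to check without fetching anything.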

4. Writing the data

Convert the dictionary straight to JSON and write it to a file. The formatting can be adjusted as needed.

import json

def write_data(file, context):
    with open(file, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps the Chinese text readable in the output file
        f.write(json.dumps(context, indent=2, ensure_ascii=False))
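
The effect of `ensure_ascii=False` is easy to see on a tiny example. This is a sketch with a made-up entry and a temporary file, not the scraper's real output.

```python
import json
import os
import tempfile

sample = {"成語": "釋義"}  # hypothetical entry, for illustration only

escaped = json.dumps(sample)                       # default: non-ASCII becomes \uXXXX escapes
readable = json.dumps(sample, ensure_ascii=False)  # characters are kept as-is

# Write with the same options as write_data, then read the file back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(sample, indent=2, ensure_ascii=False))
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)

print(escaped)
print(readable)
```

Both forms load back to the same dictionary. The difference only matters when a person opens the file, which is also why the file is opened with `encoding='utf-8'`.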

5. Iterating over all letters

Go to the site's home page, collect the link for every letter, and call the functions above on each link.

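if __name__ == '__main__':
    url = "http://chengyu.kxue.com/"
    html = etree.HTML(get_html(url))
    file = 'idiom.json'
    context = {}
    # Links to each letter's index page, relative to the site root
    urls = html.xpath('//div[@class="content letter"]/li/a/@href')
    for href in urls:
        context.update(run("http://chengyu.kxue.com" + href, {}))
    # Write the merged result once, after every letter has been processed
    write_data(file, context)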
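
Plain string concatenation works as long as every href starts with a slash. As a hedged alternative, the standard library's `urljoin` also handles relative and already-absolute links. The hrefs below are hypothetical examples, not values scraped from the site.

```python
from urllib.parse import urljoin

HOME = "http://chengyu.kxue.com/"

# Hypothetical hrefs, for illustration only: root-relative, relative, absolute.
hrefs = [
    "/pinyin/a.html",
    "pinyin/b.html",
    "http://chengyu.kxue.com/pinyin/c.html",
]

# urljoin resolves each href against the home page URL.
absolute = [urljoin(HOME, href) for href in hrefs]
for link in absolute:
    print(link)
```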

If anything is unclear, feel free to leave a comment. More hands-on Python projects and tutorials will follow.
