This article walks through a practical Python crawler: how to scrape Taobao product listings and export them to an Excel spreadsheet. The method is simple, quick, and practical. Let's get started.
I. Parsing the Taobao search URL
1. Our first requirement: given a product name, return the matching listings.

So we pick an arbitrary product to observe its URL. Here we search for a backpack (书包); opening the results page shows this URL:
https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306
The URL alone may not tell us much, but comparing it with the screenshot reveals a clue:

the parameter after q is the name of the product we searched for (here %E4%B9%A6%E5%8C%85, the URL-encoded form of 书包).
2. Our second requirement: take a number as input and crawl that many result pages.

So let's look at how the URLs of the following pages are composed.

Comparing them shows that pagination is controlled by the trailing s parameter: s = 44 × (page − 1), i.e. 44 items per page.
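This offset rule can be sketched as a tiny helper (the function name is illustrative, not from the original code):

```python
def page_offset(page):
    """Return the value of Taobao's `s` query parameter for a 1-indexed page.

    Each results page holds 44 items, so the offset is 44 * (page - 1).
    """
    return 44 * (page - 1)

# Offsets for the first three pages
for page in (1, 2, 3):
    print(page, page_offset(page))
```

So page 1 starts at s=0, page 2 at s=44, page 3 at s=88, matching the URLs observed above.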
II. Viewing the page source and extracting fields with the re library

1. Viewing the source

The page source embeds the fields we need (raw_title, view_price, item_loc, view_sales) as JSON.
2. Extracting them with re

a = re.findall(r'"raw_title":"(.*?)"', html)
b = re.findall(r'"view_price":"(.*?)"', html)
c = re.findall(r'"item_loc":"(.*?)"', html)
d = re.findall(r'"view_sales":"(.*?)"', html)
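Each pattern uses a non-greedy group (.*?) to capture whatever sits between the quotes after the field name. A minimal, self-contained check against a fabricated fragment in the same shape as Taobao's embedded JSON (the sample string and its values are invented for illustration):

```python
import re

# A made-up snippet shaped like the JSON embedded in the page source.
html = ('{"raw_title":"student backpack","view_price":"59.00",'
        '"item_loc":"Guangdong","view_sales":"1000+ paid"}')

titles = re.findall(r'"raw_title":"(.*?)"', html)
prices = re.findall(r'"view_price":"(.*?)"', html)
print(titles, prices)
```

The non-greedy `?` matters: a greedy `(.*)` would run past the closing quote and swallow the rest of the line.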
III. Writing the functions

I wrote three functions here. The first fetches the HTML page:

def GetHtml(url):
    r = requests.get(url, headers=headers)  # headers is defined in the main block
    r.raise_for_status()                    # raise an exception on HTTP errors
    r.encoding = r.apparent_encoding        # guess the correct text encoding
    return r
The second builds the list of page URLs to crawl:

def Geturls(q, x):
    url = "https://s.taobao.com/search?q=" + q + "&imgfile=&commend=all&ssid=s5-e" \
          "&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1" \
          "&ie=utf8&initiative_id=tbindexz_20170306"
    urls = []
    urls.append(url)
    if x == 1:
        return urls
    for i in range(1, x):
        url = "https://s.taobao.com/search?q=" + q + "&commend=all&ssid=s5-e&search_type=item" \
              "&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8" \
              "&initiative_id=tbindexz_20170306&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48" \
              "&s=" + str(i * 44)
        urls.append(url)
    return urls
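One detail worth noting: a non-ASCII keyword such as 书包 reaches the server URL-encoded (the %E4%B9%A6%E5%8C%85 seen earlier). requests handles this when you pass a params dict, but since Geturls builds the URL by string concatenation, you can encode the keyword explicitly with the standard library. A small sketch:

```python
from urllib.parse import quote

q = "书包"  # the search keyword, "backpack"
encoded = quote(q)  # percent-encode the UTF-8 bytes
url = "https://s.taobao.com/search?q=" + encoded
print(encoded)
```

quote("书包") yields "%E4%B9%A6%E5%8C%85", exactly the form seen in the browser's address bar.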
The third extracts the product information and writes it to the Excel sheet:

def GetxxintoExcel(html):
    global count  # global row counter so each page appends below the previous one
    a = re.findall(r'"raw_title":"(.*?)"', html)   # (.*?) is a non-greedy match
    b = re.findall(r'"view_price":"(.*?)"', html)
    c = re.findall(r'"item_loc":"(.*?)"', html)
    d = re.findall(r'"view_sales":"(.*?)"', html)
    x = []
    for i in range(len(a)):
        try:
            x.append((a[i], b[i], c[i], d[i]))  # collect one row per product
        except IndexError:
            break
    for i in range(len(x)):
        # worksheet.write(row, column, data); row 0 holds the header
        worksheet.write(count + i + 1, 0, x[i][0])
        worksheet.write(count + i + 1, 1, x[i][1])
        worksheet.write(count + i + 1, 2, x[i][2])
        worksheet.write(count + i + 1, 3, x[i][3])
    count = count + len(x)  # next page's rows start after the ones written so far
    print("已完成")
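The try/except above guards against the four lists having different lengths (e.g. a listing missing one field). Python's built-in zip does the same truncation implicitly, stopping at the shortest list, so the row-building loop could equivalently be written like this (the sample data is invented for illustration):

```python
# Fabricated field lists; b is deliberately one element short.
a = ["bag A", "bag B", "bag C"]
b = ["59.00", "79.00"]
c = ["Guangdong", "Zhejiang", "Shanghai"]
d = ["1000+", "500+", "200+"]

rows = list(zip(a, b, c, d))  # zip stops at the shortest list
print(rows)
```

Only two complete rows come out, just as the try/except version would break on the missing third price.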
IV. The main block

if __name__ == "__main__":
    count = 0
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
        # The cookie is unique to each account. Because of Taobao's anti-crawling
        # measures, crawling too fast may force you to refresh your cookie.
        "cookie": ""
    }
    q = input("Product to search for: ")
    x = int(input("How many pages to crawl: "))
    urls = Geturls(q, x)
    workbook = xlsxwriter.Workbook(q + ".xlsx")
    worksheet = workbook.add_worksheet()
    worksheet.set_column('A:A', 70)
    worksheet.set_column('B:B', 20)
    worksheet.set_column('C:C', 20)
    worksheet.set_column('D:D', 20)
    worksheet.write('A1', '名稱')      # name
    worksheet.write('B1', '價格')      # price
    worksheet.write('C1', '地區')      # region
    worksheet.write('D1', '付款人數')  # number of buyers
    for url in urls:
        html = GetHtml(url)
        GetxxintoExcel(html.text)
        time.sleep(5)  # throttle requests
    workbook.close()  # the .xlsx is saved in the current directory; don't open it before the program finishes
V. Complete code

import re
import time

import requests
import xlsxwriter


def GetxxintoExcel(html):
    global count
    a = re.findall(r'"raw_title":"(.*?)"', html)
    b = re.findall(r'"view_price":"(.*?)"', html)
    c = re.findall(r'"item_loc":"(.*?)"', html)
    d = re.findall(r'"view_sales":"(.*?)"', html)
    x = []
    for i in range(len(a)):
        try:
            x.append((a[i], b[i], c[i], d[i]))
        except IndexError:
            break
    for i in range(len(x)):
        worksheet.write(count + i + 1, 0, x[i][0])
        worksheet.write(count + i + 1, 1, x[i][1])
        worksheet.write(count + i + 1, 2, x[i][2])
        worksheet.write(count + i + 1, 3, x[i][3])
    count = count + len(x)
    print("已完成")


def Geturls(q, x):
    url = "https://s.taobao.com/search?q=" + q + "&imgfile=&commend=all&ssid=s5-e" \
          "&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1" \
          "&ie=utf8&initiative_id=tbindexz_20170306"
    urls = []
    urls.append(url)
    if x == 1:
        return urls
    for i in range(1, x):
        url = "https://s.taobao.com/search?q=" + q + "&commend=all&ssid=s5-e&search_type=item" \
              "&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8" \
              "&initiative_id=tbindexz_20170306&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48" \
              "&s=" + str(i * 44)
        urls.append(url)
    return urls


def GetHtml(url):
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    return r


if __name__ == "__main__":
    count = 0
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
        "cookie": ""
    }
    q = input("Product to search for: ")
    x = int(input("How many pages to crawl: "))
    urls = Geturls(q, x)
    workbook = xlsxwriter.Workbook(q + ".xlsx")
    worksheet = workbook.add_worksheet()
    worksheet.set_column('A:A', 70)
    worksheet.set_column('B:B', 20)
    worksheet.set_column('C:C', 20)
    worksheet.set_column('D:D', 20)
    worksheet.write('A1', '名稱')
    worksheet.write('B1', '價格')
    worksheet.write('C1', '地區')
    worksheet.write('D1', '付款人數')
    for url in urls:
        html = GetHtml(url)
        GetxxintoExcel(html.text)
        time.sleep(5)
    workbook.close()
That covers scraping Taobao product listings with a Python crawler and exporting them to an Excel spreadsheet. Now try it out for yourself!