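How to scrape cascading data (for example, the latest province, city, and county list)

Published: 2020-07-05 14:26:29  Source: Internet  Views: 363  Author: dataman100  Category: Big Data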

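Overview

Scraping cascading data is not a common task, but when it is needed it brings some extra trouble, for example when scraping product category hierarchies. This article explains how to collect cascading data of arbitrary depth, using GoldData to scrape the latest (2019) three-level province, city, and county data as an example.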

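Create the dataset

In Dataset Management, add a dataset named area, as shown in the figure below:

[Figure: adding the area dataset]

A dataset is comparable to a database table, except that its fields are flexible and can be added or changed as needed.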

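Create the rule

In Rule Management, add a rule named arearule and set http://xzqh.mca.gov.cn/map as the entry URL for scraping.

Analysis of the site shows that province-level data can be obtained from http://xzqh.mca.gov.cn/map. City-level data is then obtained by requesting http://xzqh.mca.gov.cn/selectJson with the province name, and county-level data by requesting http://xzqh.mca.gov.cn/selectJson again with the city-level name.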
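
The three-step cascade described above can be sketched in plain JavaScript. This is only an illustration under stated assumptions, not GoldData's implementation: `fetchChildren` is a hypothetical stand-in for the POST to selectJson, `FAKE_RESPONSES` holds a few example values, and the response field names (`shengji`, `diji`, `xianji`, `quHuaDaiMa`) are taken from the rules at the end of the article.

```javascript
// Hypothetical in-memory stand-in for the selectJson endpoint.
// Keys are the POST bodies; values mimic the response shape assumed from the rules.
const FAKE_RESPONSES = {
  "shengji=北京市(京)": [{ diji: "北京市", quHuaDaiMa: "110000" }],
  "shengji=北京市(京)&diji=北京市": [
    { xianji: "东城区", quHuaDaiMa: "110101" },
    { xianji: "西城区", quHuaDaiMa: "110102" },
  ],
};

// Assumption: returns the child regions for a given POST body, or an empty list.
function fetchChildren(postBody) {
  return FAKE_RESPONSES[postBody] || [];
}

// Walk province -> city -> county, recording each region with its parent's code.
function crawl(provinces) {
  const records = [];
  for (const p of provinces) {
    records.push({ name: p.shengji, code: p.quHuaDaiMa, parent_code: null });
    for (const c of fetchChildren("shengji=" + p.shengji)) {
      records.push({ name: c.diji, code: c.quHuaDaiMa, parent_code: p.quHuaDaiMa });
      const body = "shengji=" + p.shengji + "&diji=" + c.diji;
      for (const x of fetchChildren(body)) {
        records.push({ name: x.xianji, code: x.quHuaDaiMa, parent_code: c.quHuaDaiMa });
      }
    }
  }
  return records;
}

const records = crawl([{ shengji: "北京市(京)", quHuaDaiMa: "110000" }]);
console.log(records);
```

In GoldData the nested loops are split across separate rules, each level handing its parameters to the next through a URL, but the parent/child bookkeeping is the same.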

Requests to http://xzqh.mca.gov.cn/selectJson must be sent as POST. We therefore prefix the URL with fake: and issue the actual request from JavaScript inside the rule.
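
To make the fake: idea concrete, here is a small sketch of how such a placeholder URL can carry the POST parameters to the next rule and be decoded there. The helper names `buildFakeUrl` and `parseFakeUrl` are hypothetical; only the fake: prefix, the endpoint, and the `shengji`/`code` parameters come from the rules in this article.

```javascript
const ENDPOINT = "http://xzqh.mca.gov.cn/selectJson";

// Pack the values the next rule needs into a placeholder URL.
function buildFakeUrl(shengji, code) {
  return "fake:" + ENDPOINT + "?shengji=" + shengji + "&code=" + code;
}

// Recover the values with the same kind of regular expression the rules use,
// and derive the body of the POST request that is actually sent.
function parseFakeUrl(url) {
  const re = /^fake:http:\/\/xzqh\.mca\.gov\.cn\/selectJson\?shengji=(.+)&code=(\d+)$/;
  const m = re.exec(url);
  if (m === null) return null;
  return { shengji: m[1], code: m[2], postBody: "shengji=" + m[1] };
}

const fakeUrl = buildFakeUrl("北京市(京)", "110000");
const parsed = parseFakeUrl(fakeUrl);
console.log(fakeUrl);
console.log(parsed);
```

The point of the design is that the URL serves as a message between rules: the matching pattern of the next rule selects it, and the JavaScript in that rule turns it back into a POST.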

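The area dataset we define here has the following fields:

Name         Description
sn           Unique key of the record; uses the region code
name         Region name
code         Region code
abbr         Abbreviation of the province name
parent_code  Code of the parent region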
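
As a rough illustration of how these fields link the levels together, the sketch below groups a few records by `parent_code`. The sample values are examples only, and `groupByParent` is a hypothetical helper, not part of GoldData.

```javascript
// Illustrative records shaped like the area dataset (example values only).
const areas = [
  { sn: "130000", name: "河北省", code: "130000", abbr: "冀", parent_code: null },
  { sn: "130100", name: "石家庄市", code: "130100", abbr: null, parent_code: "130000" },
  { sn: "130102", name: "长安区", code: "130102", abbr: null, parent_code: "130100" },
];

// Group region names by parent_code so children can be looked up by the parent's code.
function groupByParent(rows) {
  const index = new Map();
  for (const row of rows) {
    if (!index.has(row.parent_code)) index.set(row.parent_code, []);
    index.get(row.parent_code).push(row.name);
  }
  return index;
}

const byParent = groupByParent(areas);
console.log(byParent.get(null));     // top-level regions (no parent)
console.log(byParent.get("130000")); // children of the province
console.log(byParent.get("130100")); // children of the city
```

Because every record stores only its parent's code, the same structure works for any number of levels.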

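The rule is therefore written as follows:

[Figure: screenshot of the rule configuration]

(Note: the full rule content is given at the end of the article.)

Once the rule is written, we can start the crawler to run the scrape.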

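View the data

Open Data Management and select the area dataset to view the results, as shown in the figure below:

[Figure: scraped records in the area dataset]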

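Export the data

Go back to Data Management, choose the filter conditions and the fields to export, and run the export. Because there is a fairly large amount of data here, GoldData packages it as an Excel file and compresses it into a zip file for download. Unzip it locally and open the Excel file to see the scraped data, as shown in the figure below:

[Figure: exported data opened in Excel]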

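Conclusion

This section showed how to scrape cascading data with GoldData. The next question is how to import the data into a self-referencing table; the next article will explain how to merge cascading data into a self-referencing database table.

Appendix:

(Scraping rules)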

[
  {
    __sample: http://xzqh.mca.gov.cn/map
    match0: http\:\/\/xzqh\.mca\.gov\.cn\/map
    fields0:
    {
      __model: true
      __node: js
      __js:
        '''
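        // Extract the province JSON array embedded in the page script
        var exp11=/json\s=\s(.+)\s+\$\(doc/
        var ret=exp11.exec(html)
        var ss=eval(ret[1])
        for(var i=0;i<ss.length;i++){
           var ele=ss[i]
           // ele.shengji looks like "北京市(京)": the name, then the abbreviation in parentheses
           var abbrRet=/\((.+)\)/.exec(ele.shengji)
           var nameRet=/(.+)\(/.exec(ele.shengji)
           out.add({sn:ele.quHuaDaiMa,name:nameRet[1],code:ele.quHuaDaiMa,abbr:abbrRet[1],parent_code:null})
        }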

        '''
      name:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
      code:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
      abbr:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
      sn:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
      __dataset: area
      parent_code:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
    }
    fields1:
    {
      __node: js
      __js:
        '''
        // Re-read the province list and queue one placeholder URL per province
        var exp11=/json\s=\s(.+)\s+\$\(doc/
        var ret=exp11.exec(html)
        var ss=eval(ret[1])
        for(var i=0;i<ss.length;i++){
           var ele=ss[i]
           // The fake: URL carries the province name and code to the next rule
           var url='fake:http://xzqh.mca.gov.cn/selectJson?shengji='+ele.shengji+"&code="+ele.quHuaDaiMa
           out.add({href:url})
        }

        '''
      href:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
    }
  }
  {
    __sample: fake:http://xzqh.mca.gov.cn/selectJson?shengji=北京市(京)&code=110000
    match0: fake\:http\:\/\/xzqh\.mca\.gov\.cn\/selectJson\?shengji=.+\&code=\d+
    fields0:
    {
      __model: true
      __dataset: area
      __node: js
      __js:
        '''
        // Recover the province name and code from the fake: URL built by the previous rule
        var urlRet= /http\:\/\/xzqh\.mca\.gov\.cn\/selectJson\?shengji=(.+)\&code=(\d+)/.exec(baseUri);
        var shengji=urlRet[1]
        var sjCode=urlRet[2]

        // The real request is a POST carrying the province name
        var url='http://xzqh.mca.gov.cn/selectJson'
        var content=$ajax(url,{__method:'POST',data:'shengji='+shengji}).content

        var arr=eval(content);
        for(var i=0;i<arr.length;i++){
           var dj=arr[i];
           // City-level record; its parent is the province
           var area={
              name:dj.diji,
              code:dj.quHuaDaiMa,
              parent_code:sjCode,
              sn:dj.quHuaDaiMa
           }
           out.add(area);
        }

        '''
      name:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
      code:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
      abbr:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
      sn:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
      parent_code:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
    }
    fields1:
    {
      __node: js
      __js:
        '''
        // Recover the province name and code from the fake: URL built by the previous rule
        var urlRet= /http\:\/\/xzqh\.mca\.gov\.cn\/selectJson\?shengji=(.+)\&code=(\d+)/.exec(baseUri);
        var shengji=urlRet[1]
        var sjCode=urlRet[2]

        var url='http://xzqh.mca.gov.cn/selectJson'
        var content=$ajax(url,{__method:'POST',data:'shengji='+shengji}).content

        var arr=eval(content);
        for(var i=0;i<arr.length;i++){
           var dj=arr[i];
           // Queue one placeholder URL per city, carrying both province and city details
           var href='fake:http://xzqh.mca.gov.cn/selectJson?shengji='+shengji+"&sjcode="+sjCode+"&diji="+dj.diji+"&djcode="+dj.quHuaDaiMa
           out.add({href:href})
        }

        '''
      href:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
    }
  }
  {
    __sample: fake:http://xzqh.mca.gov.cn/selectJson?shengji=北京市(京)&sjcode=110000&diji=北京市&djcode=110000
    match0: fake\:http\:\/\/xzqh\.mca\.gov\.cn\/selectJson\?shengji=([^&]+)\&sjcode=([^&]+)\&diji=([^&]+)\&djcode=([^&]+)
    fields0:
    {
      __model: true
      __dataset: area
      __node: js
      __js:
        '''
        // Recover province and city details from the fake: URL built by the previous rule
        var urlRet=/fake\:http\:\/\/xzqh\.mca\.gov\.cn\/selectJson\?shengji=([^\&]+)\&sjcode=([^\&]+)\&diji=([^\&]+)\&djcode=([^\&]+)/.exec(baseUri)

        var shengji=urlRet[1]
        var sjCode=urlRet[2]
        var diji=urlRet[3]
        var djCode=urlRet[4]

        // The real request is a POST carrying the province and city names
        var url='http://xzqh.mca.gov.cn/selectJson'
        var content=$ajax(url,{__method:'POST',data:'shengji='+shengji+'&diji='+diji}).content

        var arr=eval(content);
        for(var i=0;i<arr.length;i++){
           var xj=arr[i];
           // County-level record; its parent is the city
           var area={
              name:xj.xianji,
              code:xj.quHuaDaiMa,
              parent_code:djCode,
              sn:xj.quHuaDaiMa
           }
           out.add(area);
        }

        '''
      name:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
      code:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
      abbr:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
      sn:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
      parent_code:
      {
        expr: ""
        attr: ""
        js: ""
        __label: ""
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
    }
  }
]
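
The province step in the first rule splits strings such as 北京市(京) into a name and an abbreviation. The sketch below repeats that extraction as a standalone function so it can be tried outside GoldData. `splitProvince` is a hypothetical helper name, and the fallback for input without parentheses is an addition that is not in the original rule.

```javascript
// Split "name(abbr)" into its two parts, as the first rule does with two regular expressions.
function splitProvince(shengji) {
  const m = /^(.+)\((.+)\)$/.exec(shengji);
  // Assumption: when there is no parenthesised abbreviation, keep the whole string as the name.
  if (m === null) return { name: shengji, abbr: null };
  return { name: m[1], abbr: m[2] };
}

const withAbbr = splitProvince("北京市(京)");
const withoutAbbr = splitProvince("北京市");
console.log(withAbbr, withoutAbbr);
```

The original rule reads index 1 of each match directly, so a province string without parentheses would raise an error there; checking for a null match is one way to guard against that.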