# How to Scrape New-Development Listings from Fang.com with Python

Published: 2021-11-25 14:32:44 | Source: 億速云 (Yisu Cloud) | Views: 240 | Author: 小新 | Category: Big Data

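## Preface

In real-estate data analysis, collecting information on new housing developments is an important part of market research. Fang.com (房天下), one of China's leading real-estate portals, aggregates a large volume of new-development data. This article walks through how to scrape new-development listings from Fang.com with Python, covering key fields such as project name, price, floor plan, and developer.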

---

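## 1. Preparation

### 1.1 Technology choices
- **Language**: Python 3.8+
- **Core libraries**:
  - `requests`: send HTTP requests
  - `BeautifulSoup`/`lxml`: parse HTML
  - `pandas`: store and analyze data
  - `time`: add delays between requests
- **Optional tools**:
  - Selenium (for dynamically rendered pages)
  - ProxyPool (a proxy IP pool)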
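### 1.2 Installing the environment
```bash
pip install requests beautifulsoup4 pandas lxml
```

### 1.3 Analyzing the target

Open the Fang.com new-homes page (for example, https://newhouse.fang.com) and use the browser developer tools (F12) to check:
- Page structure (HTML tags)
- How the data is loaded (static or dynamic)
- Pagination logic (a URL pattern or AJAX requests)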
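
One quick way to tell static from dynamic loading is to check whether the class names seen in the developer tools also appear in the raw HTML returned by a plain HTTP request. The sketch below uses only the standard library; `ClassCounter`, `count_class`, and the sample HTML are hypothetical and stand in for a real downloaded page.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in the raw HTML carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical snippet standing in for a downloaded page.
sample_html = '<div class="nlcd_name"><a href="#">Sample Estate</a></div>'
print(count_class(sample_html, "nlcd_name"))     # 1: present in the static HTML
print(count_class(sample_html, "nhouse_price"))  # 0: absent, possibly loaded by JavaScript
```

If a class that is visible in the browser has a count of zero in the raw response, the data is probably injected by JavaScript, and the AJAX or Selenium route is the one to investigate.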

## 2. Implementing the scraper

### 2.1 Fetching a single page

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_one_page(url):
    """Fetch one page and return its HTML text, or None on failure."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        print(f"Unexpected status code: {response.status_code}")
    except requests.RequestException as e:
        print(f"Request failed: {e}")
    return None
```
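
Because a single request can fail for transient reasons, a small retry wrapper is often useful. The following is a minimal sketch, not part of the original article: `fetch_with_retry` and its parameters are hypothetical, and `fetch` can be any function that returns the page text or `None`, such as `get_one_page`.

```python
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `retries` times, doubling the wait after each failure.

    `fetch` is any callable that returns the page text, or None on failure.
    `sleep` is injectable so the function can be tested without real waiting.
    """
    for attempt in range(retries):
        html = fetch(url)
        if html is not None:
            return html
        if attempt < retries - 1:
            sleep(base_delay * 2 ** attempt)  # waits of 1s, 2s, 4s, ...
    return None
```

A typical call would be `html = fetch_with_retry(get_one_page, url)`. The growing delay gives a temporarily overloaded or rate-limiting server time to recover.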

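### 2.2 Parsing the page content

```python
def parse_html(html):
    """Extract listing fields from one page of HTML and return a list of dicts."""
    if not html:  # the download failed, so there is nothing to parse
        return []
    soup = BeautifulSoup(html, 'lxml')
    houses = soup.select('.nlcd_name a')  # adjust the selector to the actual page structure

    data = []
    for house in houses:
        try:
            item = {
                'name': house.get_text().strip(),
                'price': house.find_next('div', class_='nhouse_price').text.strip(),
                'address': house.find_next('div', class_='address').text.strip(),
                'developer': house.find_next('div', class_='fangyuan').text.strip()
            }
            data.append(item)
        except AttributeError:
            # find_next() returned None: a field is missing, so skip this listing
            continue
    return data
```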
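
The `price` field comes back as free text, so it usually needs cleaning before any numeric analysis. The sketch below assumes the text contains one number plus a unit (for example `52000元/㎡`) or a phrase with no number when the price is unannounced. That format and the `parse_price` helper are assumptions and should be checked against real scraped values.

```python
import re


def parse_price(text):
    """Extract the first number from a price string, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


# Hypothetical examples of scraped price text.
print(parse_price("52000元/㎡"))  # 52000.0
print(parse_price("价格待定"))     # None
```

The function returns only the number. If a page mixes units (such as price per square metre and total price per unit), keep the original text in a separate column so the unit is not lost.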

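### 2.3 Handling pagination

Fang.com typically paginates in one of two ways:
1. Static pagination: the URL carries a page-number segment (for example `/house/s31/b92/`)
2. Dynamic loading: data is fetched through AJAX requests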

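#### Option 1: Static pagination

```python
base_url = "https://newhouse.fang.com/house/s31/b9{}/"
data = []
for page in range(1, 11):  # scrape the first 10 pages
    url = base_url.format(page)
    html = get_one_page(url)
    if html:  # skip pages that failed to download
        data.extend(parse_html(html))  # accumulate rows instead of overwriting them
```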
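
A fixed page count can overshoot the real number of pages. One option is to stop as soon as a page yields no listings. The sketch below is hypothetical: `crawl_pages` takes a callable `fetch_page_data(page)` that returns the parsed rows for a page, so the stopping logic stays separate from downloading and parsing.

```python
def crawl_pages(fetch_page_data, max_pages=10):
    """Collect rows page by page, stopping early at the first empty page.

    `fetch_page_data(page)` is a hypothetical callable that returns a list
    of row dicts for the given page number (empty when nothing is found).
    """
    all_rows = []
    for page in range(1, max_pages + 1):
        rows = fetch_page_data(page)
        if not rows:
            break  # no listings: assume we ran past the last page
        all_rows.extend(rows)
    return all_rows
```

With the functions from sections 2.1 and 2.2 it could be called as `crawl_pages(lambda p: parse_html(get_one_page(base_url.format(p)) or ""))`. Note that an empty page can also mean a blocked request or a changed selector, so an unexpectedly short result is worth inspecting.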

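#### Option 2: Dynamic pagination (requires analyzing the XHR requests)

```python
# Example endpoint and parameters: confirm the real ones in the browser's Network panel
api_url = "https://newhouse.fang.com/house/ajaxrequest/houseList.php"
page = 1  # page number to request
params = {
    'page': page,
    'city': '北京'
}
response = requests.post(api_url, data=params, timeout=10)
```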

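## 3. Dealing with anti-scraping measures

### 3.1 Common anti-scraping measures

- **User-Agent checks**: rotate the User-Agent at random
- **IP limits**: use proxy IPs
- **CAPTCHAs**: once triggered, solve them manually or with OCR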

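### 3.2 Improved request code

```python
from fake_useragent import UserAgent  # requires: pip install fake-useragent
import random
import time

import requests

ua = UserAgent()  # create once and reuse across requests

def get_random_ua():
    return ua.random

def safe_request(url):
    headers = {'User-Agent': get_random_ua()}
    # Placeholders: replace proxy_ip:port with a real proxy address
    proxies = {
        'http': 'http://proxy_ip:port',
        'https': 'https://proxy_ip:port'
    }
    time.sleep(random.uniform(1, 3))  # random delay between requests
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)
```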
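
If adding a third-party dependency for User-Agent rotation is undesirable, a hand-maintained pool works as well. This is a dependency-free sketch: the strings in `USER_AGENTS` are illustrative values rather than a vetted list, and `pick_headers` is a hypothetical helper.

```python
import random

# Illustrative User-Agent strings (assumed values; swap in current ones).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def pick_headers(rng=random):
    """Return request headers with a User-Agent chosen at random from the pool."""
    return {"User-Agent": rng.choice(USER_AGENTS)}
```

The result can be passed directly as the `headers` argument of `requests.get`.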

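## 4. Storing and analyzing the data

### 4.1 Saving to CSV

```python
df = pd.DataFrame(data)
# utf_8_sig writes a BOM so that Excel displays the Chinese text correctly
df.to_csv('fangtianxia_new_houses.csv', index=False, encoding='utf_8_sig')
```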
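
For small jobs, the same file can be written without pandas. The sketch below uses the standard-library `csv` module with the `utf-8-sig` encoding, which prepends a byte-order mark so that Excel detects UTF-8. `save_rows` is a hypothetical helper, not part of the original article.

```python
import csv


def save_rows(rows, path, fieldnames):
    """Write a list of dicts to CSV, with a BOM so Excel detects UTF-8."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        # extrasaction="ignore" drops keys that are not listed in fieldnames
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```

For example: `save_rows(data, 'fangtianxia_new_houses.csv', ['name', 'price', 'address', 'developer'])`.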

### 4.2 A simple analysis example

```python
# Count the number of developments per district
district_counts = df['address'].str.extract(r'\[(.*?)\]')[0].value_counts()
print(district_counts.head(10))
```
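
The regular expression above captures whatever sits between the first pair of square brackets in each address. The standard-library sketch below shows the same extraction on hypothetical sample addresses. It assumes the district appears in brackets at the start of the address, which should be verified against real data.

```python
import re
from collections import Counter


def count_districts(addresses):
    """Count addresses per district, where the district is the text in [brackets]."""
    counts = Counter()
    for address in addresses:
        match = re.search(r"\[(.*?)\]", address)
        if match:  # addresses without brackets are skipped
            counts[match.group(1)] += 1
    return counts


# Hypothetical addresses for illustration.
sample = ["[朝阳] 示例路1号", "[海淀] 示例路2号", "[朝阳] 示例路3号", "无区域信息"]
print(count_districts(sample).most_common(2))  # [('朝阳', 2), ('海淀', 1)]
```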

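## 5. Complete code example

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

class FangSpider:
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        self.base_url = "https://newhouse.fang.com/house/s31/b9{}/"

    def get_data(self, max_pages=10):
        """Scrape up to max_pages pages and return all rows as a list of dicts."""
        all_data = []
        for page in range(1, max_pages + 1):
            url = self.base_url.format(page)
            html = self.get_html(url)
            if html:
                all_data.extend(self.parse_html(html))
            time.sleep(random.uniform(1, 3))  # random delay between pages
        return all_data

    def get_html(self, url):
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            return response.text if response.ok else None
        except requests.RequestException as e:
            print(f"Error fetching {url}: {e}")
            return None

    def parse_html(self, html):
        # Same parsing logic as section 2.2; adjust the selectors to the live page
        soup = BeautifulSoup(html, 'lxml')
        data = []
        for house in soup.select('.nlcd_name a'):
            try:
                data.append({
                    'name': house.get_text().strip(),
                    'price': house.find_next('div', class_='nhouse_price').text.strip(),
                    'address': house.find_next('div', class_='address').text.strip(),
                    'developer': house.find_next('div', class_='fangyuan').text.strip()
                })
            except AttributeError:
                continue  # a field is missing: skip this listing
        return data

if __name__ == '__main__':
    spider = FangSpider()
    data = spider.get_data()
    pd.DataFrame(data).to_csv('new_houses.csv', index=False, encoding='utf_8_sig')
```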

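## 6. Notes

- **Respect robots.txt**: check Fang.com's crawler policy before scraping
- **Limit the request rate**: avoid putting load on the server
- **Data usage**: use the data for learning and research only, not for commercial purposes
- **Dynamic content**: some data may require JavaScript rendering, in which case Selenium can help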
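
The robots.txt check can be automated with the standard-library `urllib.robotparser` module. The sketch below parses rules from a string so that it runs offline. The sample rules are hypothetical and are not Fang.com's actual robots.txt, and `is_allowed` is an illustrative helper.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration; they are not Fang.com's actual robots.txt.
sample_rules = """
User-agent: *
Disallow: /private/
"""
print(is_allowed(sample_rules, "my-spider", "https://example.com/house/"))     # True
print(is_allowed(sample_rules, "my-spider", "https://example.com/private/x"))  # False
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` downloads the real file instead of parsing a string.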

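## Conclusion

With the approach described in this article, you can collect new-development data from Fang.com. In practice the selectors will need adjusting whenever the site is redesigned, so the scraper should be maintained regularly. For more advanced needs, such as automatic updates or error monitoring, consider building a more robust crawler with the Scrapy framework.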
