# How to Crawl and Parse Data with Jsoup in Java
## 1. Introduction to Jsoup
Jsoup is an HTML parser written in Java that can fetch and parse content directly from a URL or from an HTML string. Compared with regular expressions, Jsoup offers a far more intuitive, DOM-based way to work with documents. Its main capabilities are:
1. Scrape and parse HTML from a URL, file, or string
2. Find and extract data using DOM traversal or CSS selectors
3. Manipulate HTML elements, attributes, and text
4. Clean user-submitted content against a safe-list (to prevent XSS attacks)
### Core Advantages
- jQuery-like API design
- HTML5-compliant parsing
- Automatic character-encoding handling
- Extensive documentation and an active community
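
To get a quick feel for the jQuery-like API, here is a minimal, self-contained sketch (the HTML snippets are made up for illustration) that parses a string, extracts text with a CSS selector, and sanitizes untrusted markup with `Jsoup.clean`:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

public class JsoupTaste {
    public static void main(String[] args) {
        String html = "<html><body><p class='intro'>Hello, <b>Jsoup</b>!</p></body></html>";

        // Parse the string into a Document and query it with a CSS selector
        Document doc = Jsoup.parse(html);
        System.out.println(doc.select("p.intro").text()); // Hello, Jsoup!

        // Strip unsafe markup from user input (XSS protection)
        String dirty = "<p>Hi<script>alert('xss')</script></p>";
        System.out.println(Jsoup.clean(dirty, Safelist.basic())); // <p>Hi</p>
    }
}
```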
## 2. Environment Setup
### 1. Add the Maven Dependency
```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.16.1</version> <!-- use the latest version -->
</dependency>
```
### 2. Or Add the Gradle Dependency
```groovy
implementation 'org.jsoup:jsoup:1.16.1'
```
### 3. Or Download the JAR
Alternatively, download the jar directly from the official Jsoup website (https://jsoup.org).
## 3. Basic Usage

### 1. Fetch a Page with GET
```java
Document doc = Jsoup.connect("https://example.com").get();
String title = doc.title();
```
### 2. Simulate a POST Request with Parameters
```java
Document doc = Jsoup.connect("https://example.com/search")
        .data("q", "java")       // form/query parameter
        .userAgent("Mozilla")    // set the User-Agent header
        .cookie("auth", "token") // send a cookie
        .timeout(3000)           // timeout in milliseconds
        .post();
```
### 3. Work with the Raw Response
```java
Connection.Response response = Jsoup.connect(url)
        .ignoreContentType(true) // don't reject non-HTML content types
        .execute();
if (response.statusCode() == 200) {
    Document doc = response.parse();
}
```
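
Because `execute()` hands you the raw response, it is also the way to download binary resources such as images. A small sketch, assuming `imageUrl` points at an image file, using only documented Jsoup calls (`ignoreContentType`, `maxBodySize`, `bodyAsBytes`):

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ImageDownloader {
    public static void main(String[] args) throws IOException {
        String imageUrl = "https://example.com/logo.png"; // hypothetical URL

        Connection.Response res = Jsoup.connect(imageUrl)
                .ignoreContentType(true) // Jsoup normally only accepts HTML
                .maxBodySize(0)          // 0 = no limit on the response body
                .execute();

        // Write the raw bytes to disk
        Files.write(Paths.get("logo.png"), res.bodyAsBytes());
    }
}
```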
### 4. Use a Proxy
```java
Connection conn = Jsoup.connect(url)
        .proxy("127.0.0.1", 1080); // HTTP proxy host and port
```
## 4. Extracting Data

### 1. DOM-Style Traversal
```java
// Get all links
Elements links = doc.select("a[href]");
for (Element link : links) {
    String href = link.attr("href");
    String text = link.text();
}

// Get an element by ID
Element content = doc.getElementById("content");

// Get elements by class name
Elements news = doc.getElementsByClass("news-item");
```
### 2. CSS Selectors
```java
// Selector examples
Elements products = doc.select("div.product"); // <div> with class "product"
Elements prices = doc.select("span.price");    // <span> with class "price"
Elements imgs = doc.select("img[src~=(?i)\\.(png|jpe?g)]"); // regex match on src
```
### 3. Attributes and Text
```java
// Get attribute values (img, div, p are Elements selected earlier)
String src = img.attr("src");
String absSrc = img.attr("abs:src"); // resolve to an absolute URL

// Get the element's full outer HTML
String outerHtml = div.outerHtml();

// Get the plain text
String text = p.text();
```
### 4. Combined and Pseudo-Selectors
```java
// Combined selector
Elements items = doc.select("div.list > ul > li");

// Pseudo-selectors
Element firstItem = doc.select("li:first-child").first();
Element lastPara = doc.select("p:last-of-type").last();
```
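
These selectors compose naturally. As one more illustrative sketch (the table markup here is made up), this is how you might flatten an HTML table into rows of cell text:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableExtractor {
    public static void main(String[] args) {
        String html = "<table>"
                + "<tr><th>Name</th><th>Price</th></tr>"
                + "<tr><td>Phone</td><td>999</td></tr>"
                + "<tr><td>Laptop</td><td>1999</td></tr>"
                + "</table>";
        Document doc = Jsoup.parse(html);

        // Each row becomes a list of its cells' text
        for (Element row : doc.select("table tr")) {
            System.out.println(row.select("th, td").eachText());
        }
    }
}
```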
## 5. Practical Examples

### 1. News Crawler
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class NewsCrawler {
    public static void main(String[] args) throws IOException {
        String url = "https://news.sina.com.cn/";
        Document doc = Jsoup.connect(url).get();
        Elements newsList = doc.select(".news-item");
        for (Element news : newsList) {
            String title = news.select("h2").text();
            String link = news.select("a").attr("abs:href");
            String time = news.select(".time").text();
            System.out.println("Title: " + title);
            System.out.println("Link: " + link);
            System.out.println("Time: " + time);
            System.out.println("------------------");
        }
    }
}
```
### 2. E-Commerce Product Crawler
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class ProductCrawler {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://www.jd.com/search?q=手機")
                    .timeout(10000)
                    .userAgent("Mozilla/5.0")
                    .get();
            Elements products = doc.select(".gl-item");
            for (Element product : products) {
                String name = product.select(".p-name em").text();
                String price = product.select(".p-price strong").text();
                String shop = product.select(".p-shop").text();
                System.out.println("Product: " + name);
                System.out.println("Price: " + price);
                System.out.println("Shop: " + shop);
                System.out.println("------------------");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
## 6. Advanced Techniques

### 1. JavaScript-Rendered Pages
For content rendered by JavaScript, combine Jsoup with Selenium: let a real browser execute the scripts, then hand the rendered HTML to Jsoup for parsing.
```java
// Requires the selenium-java dependency and a matching ChromeDriver binary
WebDriver driver = new ChromeDriver();
driver.get("https://example.com");
String html = driver.getPageSource();
Document doc = Jsoup.parse(html);
```
### 2. Persisting the Results
```java
// Store to CSV (CSVWriter is from the OpenCSV library;
// NewsItem is a simple data class holding title/url/time)
try (CSVWriter writer = new CSVWriter(new FileWriter("data.csv"))) {
    writer.writeNext(new String[]{"title", "url", "time"});
    for (NewsItem item : newsList) {
        writer.writeNext(new String[]{item.title, item.url, item.time});
    }
}

// Store to a database (conn is a java.sql.Connection obtained elsewhere)
String sql = "INSERT INTO news (title, url) VALUES (?, ?)";
try (PreparedStatement stmt = conn.prepareStatement(sql)) {
    stmt.setString(1, title);
    stmt.setString(2, url);
    stmt.executeUpdate();
}
```
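
For completeness, here is a minimal sketch of obtaining that `conn` via plain JDBC. The connection URL, credentials, and table are hypothetical; any JDBC driver works the same way:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class NewsDao {
    public static void save(String title, String url) throws SQLException {
        // Hypothetical MySQL database named "crawler" with a "news" table
        String jdbcUrl = "jdbc:mysql://localhost:3306/crawler";
        try (Connection conn = DriverManager.getConnection(jdbcUrl, "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO news (title, url) VALUES (?, ?)")) {
            stmt.setString(1, title);
            stmt.setString(2, url);
            stmt.executeUpdate();
        }
    }
}
```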
### 3. Evading Anti-Crawling Measures
```java
// Rotate the User-Agent at random
String[] userAgents = { /* ... a pool of User-Agent strings ... */ };
String ua = userAgents[new Random().nextInt(userAgents.length)];

// Add a random delay between requests (1-3 seconds)
Thread.sleep(1000 + new Random().nextInt(2000));

// Reuse session cookies
Map<String, String> cookies = new HashMap<>();
// ... logic to obtain cookies ...
Connection conn = Jsoup.connect(url).cookies(cookies);
```
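
One common way to obtain those cookies is to log in first and reuse the session. A sketch, assuming a hypothetical login endpoint and form-field names:

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.Map;

public class LoginSession {
    public static void main(String[] args) throws IOException {
        // Hypothetical login form posting "username" and "password"
        Connection.Response loginRes = Jsoup.connect("https://example.com/login")
                .data("username", "alice")
                .data("password", "secret")
                .method(Connection.Method.POST)
                .execute();

        // Reuse the session cookies on subsequent requests
        Map<String, String> cookies = loginRes.cookies();
        Document profile = Jsoup.connect("https://example.com/profile")
                .cookies(cookies)
                .get();
        System.out.println(profile.title());
    }
}
```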
## 7. Common Issues

### 1. Garbled Characters (Encoding)
If a page uses a non-UTF-8 encoding, tell Jsoup which charset to parse with:
```java
Document doc = Jsoup.parse(new URL(url).openStream(), "GBK", url);
// or
Document doc = Jsoup.connect(url)
        .header("Accept-Charset", "utf-8")
        .get();
```
### 2. SSL Certificate Errors
```java
// SSLSocketClient is a custom helper (not part of Jsoup) that
// returns an SSLSocketFactory configured to trust the server
Connection conn = Jsoup.connect(url)
        .sslSocketFactory(SSLSocketClient.getSSLSocketFactory());
```
### 3. Request Timeouts
```java
try {
    Document doc = Jsoup.connect(url)
            .timeout(5000) // 5-second timeout
            .get();
} catch (SocketTimeoutException e) {
    System.out.println("Request timed out");
}
```
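
When a request times out, it is often worth retrying a few times with an increasing delay before giving up. A minimal retry helper, sketched with only standard Java and the Jsoup calls shown above:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class RetryFetcher {
    /** Fetch a URL, retrying up to maxRetries times with linear backoff. */
    public static Document fetchWithRetry(String url, int maxRetries) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                return Jsoup.connect(url).timeout(5000).get();
            } catch (IOException e) { // SocketTimeoutException is an IOException
                last = e;
                if (attempt == maxRetries) break;
                try {
                    Thread.sleep(1000L * attempt); // back off a little longer each time
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IOException("Interrupted during retry", ie);
                }
            }
        }
        throw last; // all attempts failed
    }
}
```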
### 4. Crawling Multiple URLs Concurrently
Example code:
```java
// Use a thread pool to fetch several pages in parallel
ExecutorService executor = Executors.newFixedThreadPool(5);
List<Future<Document>> futures = new ArrayList<>();
for (String url : urls) {
    futures.add(executor.submit(() ->
        Jsoup.connect(url).get()
    ));
}
for (Future<Document> future : futures) {
    Document doc = future.get(); // may throw InterruptedException/ExecutionException
    // process the document
}
executor.shutdown();
```
## 8. Conclusion
With the material covered here, you should have a solid grasp of the core techniques for crawling and parsing data with Jsoup. In real projects, test your crawling logic on a small scale first, and only scale up once it is verified. Happy crawling!