Published: 2022-01-26 15:22:58 | Source: Yisu Cloud (億速云) | Views: 178 | Author: iii | Category: Development

# How to Use jsoup in Java

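## I. Introduction to jsoup

jsoup is a Java library for working with real-world HTML. It offers a convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. Its main features are:

1. Parsing HTML from a URL, file, or string
2. Finding and extracting data using DOM traversal or CSS selectors
3. Manipulating HTML elements, attributes, and text
4. Cleaning user-submitted content to prevent XSS attacks
5. Outputting tidy HTML

jsoup is well suited to:
- Web scraping and data extraction
- Parsing and cleaning HTML
- Analyzing and processing web page content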

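## II. Environment Setup

### 1. Add the jsoup dependency

For a Maven project, add the following to pom.xml:
```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.16.1</version> <!-- use the latest version -->
</dependency>
```

For a Gradle project:

```groovy
implementation 'org.jsoup:jsoup:1.16.1'
```

### 2. Manual download

Alternatively, download the jar file from the jsoup website and add it to the project manually.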

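## III. Basic Usage

### 1. Parsing an HTML document

#### From a string

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
```

#### From a URL

```java
// Fetches the page over HTTP; get() throws IOException on network errors
Document doc = Jsoup.connect("https://example.com/").get();
String title = doc.title();
```

#### From a file

```java
File input = new File("/path/to/input.html");
// The third argument is the base URI used to resolve relative links
Document doc = Jsoup.parse(input, "UTF-8", "https://example.com/");
```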
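As a minimal, self-contained sketch of the string-parsing path above, the following class parses an HTML string and reads back the title and body text. The class name `ParseFromString` and the `summarize` helper are made up for illustration; only `Jsoup.parse`, `Document.title()`, `Document.body()`, and `Element.text()` come from jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; not part of jsoup.
public class ParseFromString {
    // Parses an HTML string and returns "<title> | <body text>".
    static String summarize(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title() + " | " + doc.body().text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        // prints: First parse | Parsed HTML into a doc.
        System.out.println(summarize(html));
    }
}
```

Because no network or file access is involved, this form is convenient for unit tests.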

### 2. Extracting data

#### Using DOM methods

```java
Document doc = Jsoup.connect("https://example.com").get();

// Get the page title
String title = doc.title();

// Get the element with a specific id (null if no such element exists)
Element content = doc.getElementById("content");

// Get all links
Elements links = doc.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}
```

#### Using CSS selectors

```java
// Select a elements that have an href attribute
Elements links = doc.select("a[href]");

// Select div elements with the class masthead
Elements masthead = doc.select("div.masthead");

// Select a elements that are direct children of h3.r
Elements resultLinks = doc.select("h3.r > a");
```
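To make the `a[href]` selector concrete, here is a small sketch that runs it against an inline HTML string instead of a live page. The class name `LinkExtractor`, the helper `extractLinks`, and the sample markup are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; not part of jsoup.
public class LinkExtractor {
    // Returns "text -> href" for every <a> element that has an href attribute.
    static List<String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            result.add(link.text() + " -> " + link.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div class='masthead'><a href='/home'>Home</a>"
                + "<a name='anchor'>No href</a>"
                + "<a href='/about'>About</a></div>";
        // The anchor without an href is skipped by the selector.
        extractLinks(html).forEach(System.out::println);
    }
}
```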

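### 3. Modifying data

```java
Document doc = Jsoup.parse("<div><p>Lorem ipsum.</p></div>");

// Set an attribute
Element div = doc.select("div").first();
div.attr("class", "newClass");

// Add a class
div.addClass("anotherClass");

// Replace the element's content with plain text
div.text("New text content");

// Replace the element's inner HTML
div.html("<p>New <b>HTML</b> content</p>");

// Append content as the last child
div.append("<p>Appended paragraph</p>");

// Prepend content as the first child (inside the element, not before it)
div.prepend("<p>Prepended paragraph</p>");
```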
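The order in which `html`, `append`, and `prepend` take effect is easier to see by inspecting the result. The sketch below is illustrative only: the class name `ModifyExample`, the helper `buildDiv`, and the paragraph texts are assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; not part of jsoup.
public class ModifyExample {
    // Rewrites a div and returns it so the caller can inspect the result.
    static Element buildDiv() {
        Document doc = Jsoup.parse("<div><p>Lorem ipsum.</p></div>");
        Element div = doc.selectFirst("div");
        div.addClass("note");           // the div now has class="note"
        div.html("<p>Middle</p>");      // replaces the original paragraph
        div.append("<p>Last</p>");      // added after "Middle"
        div.prepend("<p>First</p>");    // added before "Middle", still inside the div
        return div;
    }

    public static void main(String[] args) {
        Element div = buildDiv();
        System.out.println(div.className());
        for (Element child : div.children()) {
            System.out.println(child.text());
        }
    }
}
```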

## IV. Advanced Features

### 1. Handling forms

```java
// Fetch the login page and locate the form (useful for checking the field names)
Document doc = Jsoup.connect("http://example.com/login").get();
Element loginForm = doc.selectFirst("form#login");

// Submit the form data with a POST request
Connection.Response res = Jsoup.connect("http://example.com/login")
        .data("username", "myUser")
        .data("password", "myPass")
        .method(Connection.Method.POST)
        .execute();

// Get the session cookies set after login
Map<String, String> cookies = res.cookies();

// Use the cookies to access a protected page
Document protectedPage = Jsoup.connect("http://example.com/protected")
        .cookies(cookies)
        .get();
```
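Real login forms often carry hidden fields (for example a CSRF token) that must be sent along with the credentials. As a hedged sketch, jsoup's `FormElement.formData()` can list the fields a form would submit; the class name `FormFields`, the helper `fieldsOf`, and the sample form (including the `csrf` field) are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

// Hypothetical demo class; not part of jsoup.
public class FormFields {
    // Collects the name/value pairs that the first form in the HTML would submit.
    static Map<String, String> fieldsOf(String html) {
        Document doc = Jsoup.parse(html);
        // jsoup's HTML parser creates FormElement instances for <form> tags
        FormElement form = (FormElement) doc.selectFirst("form");
        Map<String, String> fields = new LinkedHashMap<>();
        for (Connection.KeyVal kv : form.formData()) {
            fields.put(kv.key(), kv.value());
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<form id='login' action='/login' method='post'>"
                + "<input type='hidden' name='csrf' value='abc123'>"
                + "<input type='text' name='username' value='myUser'>"
                + "<input type='password' name='password'>"
                + "</form>";
        fieldsOf(html).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The resulting pairs can then be passed to `Connection.data(...)` together with the user-supplied values.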

### 2. Handling relative paths

```java
Document doc = Jsoup.connect("https://example.com/news").get();

// Resolve each link to an absolute URL
Elements links = doc.select("a[href]");
for (Element link : links) {
    String absUrl = link.attr("abs:href"); // equivalent to link.absUrl("href")
    System.out.println(absUrl);
}
```
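Absolute URLs are resolved against the document's base URI. When parsing a string rather than fetching a URL, the base URI has to be supplied explicitly, as in this sketch (the class name `AbsoluteUrls`, the helper `resolve`, and the sample links are assumptions):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; not part of jsoup.
public class AbsoluteUrls {
    // Parses the HTML with the given base URI and resolves the first link's href.
    static String resolve(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link.absUrl("href");
    }

    public static void main(String[] args) {
        // prints: https://example.com/news/42.html
        System.out.println(resolve("<a href='/news/42.html'>Story</a>",
                "https://example.com/news"));
        // prints: https://example.com/news/item.html
        System.out.println(resolve("<a href='item.html'>Item</a>",
                "https://example.com/news/"));
    }
}
```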

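### 3. Cleaning HTML

```java
String unsafeHtml = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";

// Clean using a safelist. Safelist replaces the older Whitelist class,
// which is no longer available in jsoup 1.16.1.
String safeHtml = Jsoup.clean(unsafeHtml,
        Safelist.basic()
        .addTags("p")
        .addAttributes("a", "href"));

System.out.println(safeHtml);
// Output: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
```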
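The choice of safelist determines how much markup survives. The following sketch contrasts `Safelist.none()` (text only) with `Safelist.basic()`; the class name `CleanExample`, its helper methods, and the sample inputs are assumptions for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Hypothetical demo class; not part of jsoup.
public class CleanExample {
    // Removes all tags, keeping only the text.
    static String stripAll(String html) {
        return Jsoup.clean(html, Safelist.none());
    }

    // Keeps simple text formatting and links; drops scripts and event handlers.
    static String basic(String html) {
        return Jsoup.clean(html, Safelist.basic());
    }

    public static void main(String[] args) {
        // prints: Hi
        System.out.println(stripAll("<b>Hi</b>"));
        System.out.println(basic("<p onclick='x()'>Hi<script>alert(1)</script></p>"));
    }
}
```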

### 4. Proxy settings

```java
Document doc = Jsoup.connect("https://example.com")
        .proxy("proxy.example.com", 8080) // set the proxy host and port
        .userAgent("Mozilla/5.0") // set the User-Agent header
        .timeout(10000) // set the timeout, in milliseconds
        .get();
```

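## V. Practical Examples

### Example 1: Scraping news headlines and links

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class NewsCrawler {
    public static void main(String[] args) throws IOException {
        String url = "https://news.example.com";
        Document doc = Jsoup.connect(url).get();

        // Links inside the h3 of each element with class news-item
        Elements newsHeadlines = doc.select(".news-item h3 a");

        for (Element headline : newsHeadlines) {
            String title = headline.text();
            String link = headline.attr("abs:href");

            System.out.println("Title: " + title);
            System.out.println("Link: " + link);
            System.out.println("------------------");
        }
    }
}
```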

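### Example 2: Extracting table data

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TableExtractor {
    public static void main(String[] args) throws IOException {
        String url = "https://example.com/data-table";
        Document doc = Jsoup.connect(url).get();

        // selectFirst returns null when nothing matches, so check before use
        Element table = doc.selectFirst("table.data");
        if (table == null) {
            System.err.println("No table.data element found");
            return;
        }

        for (Element row : table.select("tr")) {
            // Include header cells (th) as well as data cells (td)
            Elements cols = row.select("th, td");
            for (Element col : cols) {
                System.out.print(col.text() + "\t");
            }
            System.out.println();
        }
    }
}
```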

### Example 3: Building an HTML document

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HtmlBuilder {
    public static void main(String[] args) {
        // Create an empty document skeleton (html, head, body) with no base URI
        Document doc = Document.createShell("");
        doc.title("Generated Page");

        Element body = doc.body();
        body.appendElement("h1").text("Welcome to my page");

        Element div = body.appendElement("div")
                .attr("class", "content");

        div.appendElement("p")
                .text("This is a paragraph.")
                .addClass("highlight");

        System.out.println(doc);
    }
}
```

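## VI. Performance Optimization

1. **Cache parsed results**: for frequently accessed pages, consider caching the Document object
2. **Narrow the selection scope**: narrow the scope first, then apply the finer selector. This pays off mainly when the scoped element is reused for several queries.
   ```java
   // Not recommended
   doc.select("div.content p.small");

   // Recommended (note: this searches only within the first div.content)
   Element content = doc.selectFirst("div.content");
   content.select("p.small");
   ```
3. **Set reasonable timeouts**: adjust the connection timeout to suit network conditions
4. **Use a connection pool**: for a large number of requests, consider using a connection pool
5. **Process in parallel**: independent tasks can be run on multiple threads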
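As one possible illustration of the parallel-processing tip, independent pages can be parsed concurrently because each `Jsoup.parse` call builds its own `Document`. This is a sketch using a parallel stream; the class name `ParallelTitles`, the helper `titles`, and the sample pages are assumptions, and a real crawler would more likely use a bounded thread pool.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import org.jsoup.Jsoup;

// Hypothetical demo class; not part of jsoup.
public class ParallelTitles {
    // Parses each page in parallel and returns the titles in input order.
    static List<String> titles(List<String> pages) {
        return pages.parallelStream()
                .map(html -> Jsoup.parse(html).title())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
                "<title>Page one</title><p>a</p>",
                "<title>Page two</title><p>b</p>",
                "<title>Page three</title><p>c</p>");
        titles(pages).forEach(System.out::println);
    }
}
```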

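## VII. Troubleshooting Common Problems

### 1. Handling SSL certificate problems

```java
// Skip SSL certificate validation (not recommended in production).
// SSLSocketClient is a helper class you write yourself; it is not part of jsoup.
Connection connection = Jsoup.connect("https://example.com");
connection.sslSocketFactory(SSLSocketClient.getSSLSocketFactory());
Document doc = connection.get();
```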
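The article does not show `SSLSocketClient`. One way such a helper is commonly written, using only the standard JSSE API, is sketched below. This is an assumption about what the helper looks like, not the author's code, and it deliberately accepts every certificate, so it should be limited to local testing.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper matching the name used above; not part of jsoup.
public class SSLSocketClient {
    // Builds a socket factory whose trust manager accepts every certificate chain.
    // INSECURE: this disables certificate validation.
    public static SSLSocketFactory getSSLSocketFactory() {
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, new TrustManager[] { new TrustAllManager() }, new SecureRandom());
            return context.getSocketFactory();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not create SSL context", e);
        }
    }

    // Trust manager that performs no checks at all.
    private static final class TrustAllManager implements X509TrustManager {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) { }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) { }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }
    }
}
```

A safer alternative in production is to import the server's certificate into a trust store rather than disabling validation.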

### 2. Handling redirects

```java
Document doc = Jsoup.connect("https://example.com")
        .followRedirects(true) // follow redirects (this is jsoup's default)
        .get();
```

### 3. Handling 403 Forbidden

```java
// Some servers reject requests that lack browser-like headers
Document doc = Jsoup.connect("https://example.com")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
        .referrer("http://www.google.com")
        .header("Accept-Language", "en-US")
        .get();
```

### 4. Handling large files

```java
// Let jsoup read from the stream directly instead of first copying the file
// into a String line by line (which also drops the line breaks).
// Note that jsoup still builds the whole DOM in memory.
Document doc;
try (InputStream in = new FileInputStream(new File("large.html"))) {
    // null charset: detect from a BOM or <meta> tag, falling back to UTF-8
    doc = Jsoup.parse(in, null, "");
}
```

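## VIII. Best Practices

1. **Respect robots.txt**: check the target site's robots.txt file
2. **Set a reasonable crawl interval**: avoid putting excessive load on the server
3. **Handle exceptions**: deal properly with network and parsing errors
4. **Comply with laws and regulations**: make sure your crawling is legal
5. **Use logging**: record important information during crawling
6. **Release resources**: close connections and free resources promptly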
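Several of these practices (exception handling, a pause between requests, logging) can be combined in a small retry helper. The sketch below uses only the standard library; the class name `PoliteRetry`, the helper `withRetry`, and its parameters are assumptions, and `System.err` stands in for a real logger.

```java
import java.util.concurrent.Callable;

// Hypothetical helper; not part of jsoup.
public class PoliteRetry {
    // Runs the task, retrying up to maxAttempts times and sleeping delayMillis between attempts.
    static <T> T withRetry(Callable<T> task, int maxAttempts, long delayMillis) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis); // wait before the next attempt
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
            }
        }
        throw new IllegalStateException("All " + maxAttempts + " attempts failed", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulates a fetch that fails twice before succeeding.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("simulated timeout");
            }
            return "ok after " + calls[0] + " attempts";
        }, 5, 100);
        // prints: ok after 3 attempts
        System.out.println(result);
    }
}
```

In a crawler, the task would typically be something like `() -> Jsoup.connect(url).get()`.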

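## IX. Comparison with Other Libraries

| Feature | jsoup | HtmlUnit | Selenium |
|---|---|---|---|
| Executes JavaScript | No | Yes | Yes |
| Footprint | Light | Medium | Heavy |
| Learning curve | Low | Medium | Medium |
| Typical use | Simple HTML parsing | Complex page interaction | Browser automation testing |
| Performance | High | Medium | Low |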

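## X. Summary

jsoup is a powerful, easy-to-use HTML parsing library, particularly well suited to Java developers who need to extract and manipulate web page content. Having read this article, you should now be familiar with:

- The basic usage of jsoup
- The main data extraction techniques
- Advanced features and practical examples
- Performance optimization and troubleshooting tips

In a real project, choose the tool that fits your requirements. For simple HTML parsing and content extraction, jsoup is one of the best choices; for complex pages that depend on JavaScript, consider tools such as HtmlUnit or Selenium.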

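## XI. Recommended Resources

- Official documentation
- GitHub repository
- Java爬蟲開發實戰 (Java Crawler Development in Practice)
- HTML/CSS selector reference

I hope this article helps you get up to speed with jsoup quickly and work more efficiently in real projects!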
