Web Scraping Using HtmlUnit In Java
How To Use HtmlUnit In Java To Scrape Data?
HtmlUnit is a headless browser for Java. With it, you can load a web page and parse its HTML directly, and it can also process the page's CSS and JavaScript. It makes it easy to interact with web pages from code: reading the content of a page, clicking buttons and filling in forms. This makes HtmlUnit a practical tool for scraping data from a website. Sometimes we need a large amount of data, and copying and pasting it by hand is not feasible. Because HtmlUnit can execute JavaScript, it can also be used to scrape data from dynamic websites.
To use HtmlUnit in Java, a basic understanding of HTML elements and CSS is required.
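How To Set Up HtmlUnit In A Java Project?
There are two ways to set up HtmlUnit in a Java project:
1). Add the HtmlUnit jar file, together with all of its dependencies, to your Java project. The jar can be downloaded from the link below (choose a recent 2.x release, since the code in this article uses asNormalizedText(), which older releases do not have):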
→ https://jar-download.com/artifacts/net.sourceforge.htmlunit/htmlunit/2.9/source-code
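2). Add the HtmlUnit Maven dependency to the pom.xml file of your Java project. Use a recent 2.x release rather than 2.9, because the code below calls asNormalizedText(), which is not available in 2.9:
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.70.0</version>
</dependency>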
Once HtmlUnit is set up in your Java project, let's start coding:
HtmlUnit uses the WebClient class to fetch web pages. You can also pass a specific browser version to emulate, such as Chrome:
→ WebClient webClient = new WebClient(BrowserVersion.CHROME);
If you want to disable CSS and JavaScript processing, set the corresponding options to false:
→ webClient.getOptions().setCssEnabled(false);
→ webClient.getOptions().setJavaScriptEnabled(false);
To fetch the content of a web page, pass its URL to getPage():
→ HtmlPage page = webClient.getPage("https://xyz.org/market-watch/live_quotes");
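The snippets above turn JavaScript off, which is fine for static pages. For a dynamic page, the script engine has to stay on and the client needs a moment to let background scripts finish. The following is a minimal sketch of that setup, not a definitive configuration: the class name FetchSketch, the helper newScrapingClient() and the timeout values are assumptions made for illustration, and the URL is the placeholder used in this article.

```java
import java.io.IOException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

final class FetchSketch {

    // Hypothetical helper: builds a client tuned for JavaScript-driven pages.
    static WebClient newScrapingClient() {
        WebClient client = new WebClient(BrowserVersion.CHROME);
        client.getOptions().setCssEnabled(false);                  // styles are not needed for scraping
        client.getOptions().setJavaScriptEnabled(true);            // keep scripts on for dynamic content
        client.getOptions().setThrowExceptionOnScriptError(false); // tolerate broken page scripts
        client.getOptions().setTimeout(15_000);                    // connection timeout in ms (assumed value)
        return client;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL from the article; pass a real one as the first argument.
        String url = args.length > 0 ? args[0] : "https://xyz.org/market-watch/live_quotes";
        // WebClient is AutoCloseable, so try-with-resources releases its windows and threads.
        try (WebClient client = newScrapingClient()) {
            HtmlPage page = client.getPage(url);
            // Give background scripts up to 5 seconds (assumed value) to finish.
            client.waitForBackgroundJavaScript(5_000);
            System.out.println(page.getTitleText());
        }
    }
}
```

Keeping the configuration in one helper makes it easy to reuse the same settings for every page you fetch.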
In HtmlUnit, we can scrape a page in three ways:
1. Using DOM methods: getElementById(), getElementsByName()
getElementById() returns a single DomElement object, while getElementsByName() returns a List of DomElement objects.
→ DomElement firstHeading = page.getElementById("firstHeading");
→ System.out.print(firstHeading.asNormalizedText());
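The DOM methods can be tried out without a network connection by parsing an HTML string. The sketch below assumes a recent 2.x release in which WebClient.loadHtmlCodeIntoCurrentWindow() is available; the markup, the class name DomLookupSketch and the helpers parse() and textById() are invented for illustration. It also guards against getElementById() returning null when no element has the requested id.

```java
import java.io.IOException;
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

final class DomLookupSketch {

    // Hypothetical markup, invented for this example.
    static final String SAMPLE_HTML =
        "<html><body>"
        + "<h1 id='firstHeading'>Live Quotes</h1>"
        + "<input name='symbol' value='GOLD'>"
        + "<input name='symbol' value='SILVER'>"
        + "</body></html>";

    // Parses an HTML string into a page without any network access.
    static HtmlPage parse(WebClient client, String html) throws IOException {
        client.getOptions().setJavaScriptEnabled(false);
        return client.loadHtmlCodeIntoCurrentWindow(html);
    }

    // Returns the text of the element with the given id, or "" when it is absent.
    static String textById(HtmlPage page, String id) {
        DomElement element = page.getElementById(id); // null when nothing matches
        return element == null ? "" : element.asNormalizedText();
    }

    public static void main(String[] args) throws IOException {
        try (WebClient client = new WebClient()) {
            HtmlPage page = parse(client, SAMPLE_HTML);
            System.out.println(textById(page, "firstHeading"));

            // getElementsByName() returns every element sharing the name.
            List<DomElement> inputs = page.getElementsByName("symbol");
            for (DomElement input : inputs) {
                System.out.println(input.getAttribute("value"));
            }
        }
    }
}
```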
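2. Using XPath: getByXPath(), getFirstByXPath()
getByXPath() returns a list of all matching nodes, and getFirstByXPath() returns the first match, or null if nothing matches. Here the result is assigned to an HtmlElement object rather than a DomElement.
→ HtmlElement price = page.getFirstByXPath("//div[@class=\"content-wrap clearfix\"]/h1");
→ System.out.print(price.asNormalizedText());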
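One thing to watch with the expression above is that @class="..." compares the whole attribute value, so it stops matching as soon as the site adds another class to the element. The sketch below illustrates this with XPath alone. Note that it does not use HtmlUnit: it runs the expressions through the JDK's built-in javax.xml.xpath engine on a small well-formed snippet, which is an assumption made to keep the example dependency-free. The markup, the values in it and the class name XPathPredicateSketch are invented for illustration.

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

final class XPathPredicateSketch {

    // Hypothetical, well-formed markup; the second div carries an extra class.
    static final String SAMPLE =
        "<html><body>"
        + "<div class=\"content-wrap clearfix\"><h1>Gold 61250</h1></div>"
        + "<div class=\"content-wrap clearfix wide\"><h1>Silver 72400</h1></div>"
        + "</body></html>";

    // Exact comparison of the whole class attribute, as in the article.
    static final String EXACT = "//div[@class='content-wrap clearfix']/h1";
    // Substring test, which tolerates additional classes.
    static final String TOLERANT = "//div[contains(@class, 'content-wrap')]/h1";

    // Returns the text of the first node matched by the expression, or "" if none.
    static String firstText(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    // Counts the nodes matched by the expression.
    static int count(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Double n = (Double) xpath.evaluate("count(" + expression + ")",
                    new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
            return n.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstText(SAMPLE, EXACT)); // prints "Gold 61250"
        System.out.println(count(SAMPLE, EXACT));     // prints 1
        System.out.println(count(SAMPLE, TOLERANT));  // prints 2
    }
}
```

The same expressions can be passed to getByXPath() and getFirstByXPath(); unlike the JDK parser used here, HtmlUnit also accepts HTML that is not well-formed XML.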
3. Using CSS selectors: querySelector(), querySelectorAll()
querySelector() returns a single DomNode object, and querySelectorAll() returns a DomNodeList<DomNode> object.
→ String selector = ".price-download tbody tr";
→ DomNodeList<DomNode> rows = page.querySelectorAll(selector);
→ for (DomNode row : rows) {
→     String price = row.querySelector("td:nth-child(2) a").asNormalizedText();
→     String commodityName = row.querySelector("td:nth-child(3) a").asNormalizedText();
→     String quantity = row.querySelector("td:nth-child(4)").asNormalizedText();
→     System.out.println(price + "\t " + commodityName + "\t " + quantity);
→ }
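The loop above calls asNormalizedText() directly on the result of querySelector(), which returns null when a row has no matching cell, so a single irregular row would end the run with a NullPointerException. A minimal sketch of a null-safe variant follows. The table markup, the class name RowScrapeSketch and the helpers cellText() and scrape() are invented for illustration, and it assumes a recent 2.x release in which WebClient.loadHtmlCodeIntoCurrentWindow() is available so that it can run offline.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

final class RowScrapeSketch {

    // Hypothetical table, invented for this example; real markup will differ.
    // The second row has no link in its price cell.
    static final String SAMPLE_HTML =
        "<html><body><table class='price-download'><tbody>"
        + "<tr><td>1</td><td><a href='#'>61250</a></td><td><a href='#'>Gold</a></td><td>10 g</td></tr>"
        + "<tr><td>2</td><td>n/a</td><td><a href='#'>Silver</a></td><td>1 kg</td></tr>"
        + "</tbody></table></body></html>";

    // Returns the text of the first node under row that matches selector, or "" if none.
    static String cellText(DomNode row, String selector) {
        DomNode cell = row.querySelector(selector); // null when nothing matches
        return cell == null ? "" : cell.asNormalizedText();
    }

    // Builds one tab-separated line per table row.
    static List<String> scrape(HtmlPage page) {
        List<String> lines = new ArrayList<>();
        for (DomNode row : page.querySelectorAll(".price-download tbody tr")) {
            String price = cellText(row, "td:nth-child(2) a");
            String commodityName = cellText(row, "td:nth-child(3) a");
            String quantity = cellText(row, "td:nth-child(4)");
            lines.add(price + "\t" + commodityName + "\t" + quantity);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(false);
            HtmlPage page = client.loadHtmlCodeIntoCurrentWindow(SAMPLE_HTML);
            scrape(page).forEach(System.out::println);
        }
    }
}
```

Returning an empty string for a missing cell keeps the columns aligned; depending on the data, skipping the row or logging it may be a better choice.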
Below is the complete Java code for scraping the content of a web page:
import java.io.IOException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class MarketPrice {
    public static void main(String[] args) throws IOException {
        // initialize a headless browser; try-with-resources closes it when done
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // configuring options
            // accepting any SSL certificate is only appropriate for testing
            webClient.getOptions().setUseInsecureSSL(true);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setJavaScriptEnabled(false);
            // do not throw on HTTP error codes or on script errors
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            // fetching the web page
            HtmlPage page = webClient.getPage("https://xyz.org/market-watch/live_quotes");

            // getFirstByXPath() returns null when nothing matches
            HtmlElement price = page.getFirstByXPath("//div[@class=\"content-wrap clearfix\"]/h1");
            if (price != null) {
                System.out.println(price.asNormalizedText());
            } else {
                System.out.println("Price element not found");
            }
        }
    }
}