Thursday, 28 September 2017

Reading a web page in Java

Reading a web page in Java

 Reading a web page in Java is a tutorial that presents several ways to to read a web page in Java. It contains six examples of downloading an HTTP source from a tiny web page.

Java reading web page tools

Java has built-in tools and third-party libraries for reading/downloading web pages. In the examples, we use URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.
In the following examples, we download HTML source from the something.com tiny web page.

Reading a web page with URL

URL represents a Uniform Resource Locator, a pointer to a resource on the World Wide Web.
ReadWebPageEx.java
package com.zetcode;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public class ReadWebPageEx {

    public static void main(String[] args) throws MalformedURLException, IOException {

        BufferedReader br = null;

        try {

            URL url = new URL("http://www.something.com");
            br = new BufferedReader(new InputStreamReader(url.openStream()));

            String line;

            StringBuilder sb = new StringBuilder();

            while ((line = br.readLine()) != null) {

                sb.append(line);
                sb.append(System.lineSeparator());
            }

            System.out.println(sb);
        } finally {

            if (br != null) {
                br.close();
            }
        }
    }
}
The code example reads the contents of a web page.
br = new BufferedReader(new InputStreamReader(url.openStream()));
The openStream() method opens a connection to the specified url and returns an InputStream for reading from that connection. The InputStreamReader is a bridge from byte streams to character streams. It reads bytes and decodes them into characters using a specified charset. In addition, BufferedReader is used for better performance.
StringBuilder sb = new StringBuilder();

while ((line = br.readLine()) != null) {

    sb.append(line);
    sb.append(System.lineSeparator());
}
The HTML data is read line by line with the readLine() method. The source is appended to the StringBuilder.
System.out.println(sb);
In the end, the contents of the StringBuilder are printed to the terminal.

Reading a web page with JSoup

JSoup is a popular Java HTML parser.
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version>
</dependency>
We have used this Maven dependency.
ReadWebPageEx2.java
package com.zetcode;

import java.io.IOException;
import org.jsoup.Jsoup;

public class ReadWebPageEx2 {

    public static void main(String[] args) throws IOException {

        String webPage = "http://www.something.com";
        
        String html = Jsoup.connect(webPage).get().html();
        
        System.out.println(html);
    }
}
The code example uses JSoup to download and print a tiny web page.
String html = Jsoup.connect(webPage).get().html();
The connect() method connects to the specified web page. The get() method issues a GET request. Finally, the html() method retrieves the HTML source.

Reading a web page with HtmlCleaner

HtmlCleaner is an open source HTML parser written in Java.
<dependency>
    <groupId>net.sourceforge.htmlcleaner</groupId>
    <artifactId>htmlcleaner</artifactId>
    <version>2.16</version>
</dependency>
For this example, we use the htmlcleaner Maven dependency.
ReadWebPageEx3.java
package com.zetcode;

import java.io.IOException;
import java.net.URL;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.SimpleHtmlSerializer;
import org.htmlcleaner.TagNode;

public class ReadWebPageEx3 {

    public static void main(String[] args) throws IOException {

        URL url = new URL("http://www.something.com");

        CleanerProperties props = new CleanerProperties();
        props.setOmitXmlDeclaration(true);
        
        HtmlCleaner cleaner = new HtmlCleaner(props);
        TagNode node = cleaner.clean(url);

        SimpleHtmlSerializer htmlSerializer = new SimpleHtmlSerializer(props);
        htmlSerializer.writeToStream(node, System.out);        
    }
}
The example uses HtmlCleaner to download a web page.
CleanerProperties props = new CleanerProperties();
props.setOmitXmlDeclaration(true);
In the properties, we set to omit the XML declaration.
SimpleHtmlSerializer htmlSerializer = new SimpleHtmlSerializer(props);
htmlSerializer.writeToStream(node, System.out);    
SimpleHtmlSerializer creates the resulting HTML without any indenting and/or compacting.

Reading a web page with Apache HttpClient

Apache HttpClient is a HTTP/1.1 compliant HTTP agent implementation. It can scrape a web page using the request and response process. An HTTP client implements the client side of the HTTP and HTTPS protocols.
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
We use this Maven dependency for the Apache HTTP client.
ReadWebPageEx4.java
package com.zetcode;

import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;

public class ReadWebPageEx4 {

    public static void main(String[] args) throws IOException {

        HttpGet request = null;

        try {

            String url = "http://www.something.com";
            HttpClient client = HttpClientBuilder.create().build();
            request = new HttpGet(url);

            request.addHeader("User-Agent", "Apache HTTPClient");
            HttpResponse response = client.execute(request);

            HttpEntity entity = response.getEntity();
            String content = EntityUtils.toString(entity);
            System.out.println(content);

        } finally {

            if (request != null) {

                request.releaseConnection();
            }
        }
    }
}
In the code example, we send a GET HTTP request to the specified web page and receive an HTTP response. From the response, we read the HTML source.
HttpClient client = HttpClientBuilder.create().build();
An HttpClient is built.
request = new HttpGet(url);
HttpGet is a class for the HTTP GET method.
request.addHeader("User-Agent", "Apache HTTPClient");
HttpResponse response = client.execute(request);
A GET method is executed and an HttpResponse is received.
HttpEntity entity = response.getEntity();
String content = EntityUtils.toString(entity);
System.out.println(content);
From the response, we get the content of the web page.

Reading a web page with Jetty HttpClient

Jetty project has an HTTP client as well.
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-client</artifactId>
    <version>9.4.0.M1</version>
</dependency>
This is a Maven dependency for the Jetty HTTP client.
ReadWebPageEx5.java
package com.zetcode;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class ReadWebPageEx5 {

    public static void main(String[] args) throws Exception {

        HttpClient client = null;

        try {

            client = new HttpClient();
            client.start();
            
            String url = "http://www.something.com";

            ContentResponse res = client.GET(url);

            System.out.println(res.getContentAsString());

        } finally {

            if (client != null) {

                client.stop();
            }
        }
    }
}
In the example, we get the HTML source of a web page with the Jetty HTTP client.
client = new HttpClient();
client.start();
An HttpClient is created and started.
ContentResponse res = client.GET(url);
A GET request is issued to the specified URL.
System.out.println(res.getContentAsString());
The content is retrieved from the response with the getContentAsString() method.

Reading a web page with HtmlUnit

HtmlUnit is a Java unit testing framework for testing Web based applications.
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.23</version>
</dependency>
We use this Maven dependency.
ReadWebPageEx6.java
package com.zetcode;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.IOException;

public class ReadWebPageEx6 {

    public static void main(String[] args) throws IOException {
        
        try (WebClient webClient = new WebClient()) {
            
            String url = "http://www.something.com";
            
            HtmlPage page = webClient.getPage(url);
            WebResponse response = page.getWebResponse();
            String content = response.getContentAsString();
            
            System.out.println(content);
        }
    }
}
The example downloads a web page and prints it using the HtmlUnit library.
In this article, we have scraped a web page in Java using various tools, including URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.





No comments:

Post a Comment