Reading a web page in Java
Reading a web page in Java is a tutorial that presents several ways to read a web page in Java. It contains six examples of downloading the HTML source of a tiny web page.
Java tools for reading web pages
Java has built-in tools and third-party libraries for reading/downloading web pages. In the examples, we use URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.
In the following examples, we download the HTML source of the tiny something.com web page.
Reading a web page with URL
URL represents a Uniform Resource Locator, a pointer to a resource on the World Wide Web.
ReadWebPageEx.java
package com.zetcode;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public class ReadWebPageEx {

    public static void main(String[] args) throws MalformedURLException, IOException {

        BufferedReader br = null;

        try {
            URL url = new URL("http://www.something.com");

            // open a stream to the URL and wrap it for buffered character reading
            br = new BufferedReader(new InputStreamReader(url.openStream()));

            String line;
            StringBuilder sb = new StringBuilder();

            while ((line = br.readLine()) != null) {
                sb.append(line);
                sb.append(System.lineSeparator());
            }

            System.out.println(sb);

        } finally {
            if (br != null) {
                br.close();
            }
        }
    }
}
The code example reads the contents of a web page.
br = new BufferedReader(new InputStreamReader(url.openStream()));
The openStream() method opens a connection to the specified URL and returns an InputStream for reading from that connection. InputStreamReader is a bridge from byte streams to character streams: it reads bytes and decodes them into characters using a specified charset. In addition, BufferedReader is used for better performance.

StringBuilder sb = new StringBuilder();

while ((line = br.readLine()) != null) {
    sb.append(line);
    sb.append(System.lineSeparator());
}
The HTML data is read line by line with the readLine() method. The source is appended to the StringBuilder.

System.out.println(sb);
In the end, the contents of the StringBuilder are printed to the terminal.
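Note that the InputStreamReader above falls back to the platform's default charset. When the page's encoding is known, it is safer to name it explicitly; the following sketch assumes the page is served as UTF-8.

// requires: import java.nio.charset.StandardCharsets;
br = new BufferedReader(
        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));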
Reading a web page with JSoup
JSoup is a popular Java HTML parser.
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version>
</dependency>
We use this Maven dependency.
ReadWebPageEx2.java
package com.zetcode;

import java.io.IOException;

import org.jsoup.Jsoup;

public class ReadWebPageEx2 {

    public static void main(String[] args) throws IOException {

        String webPage = "http://www.something.com";

        // connect to the page, issue a GET request, and retrieve its HTML
        String html = Jsoup.connect(webPage).get().html();

        System.out.println(html);
    }
}
The code example uses JSoup to download and print a tiny web page.
String html = Jsoup.connect(webPage).get().html();
The connect() method connects to the specified web page. The get() method issues a GET request. Finally, the html() method retrieves the HTML source.
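Since JSoup parses the document rather than just fetching it, we are not limited to the raw source. As a small sketch of what the parsed form allows, the page title and paragraph texts could be extracted as follows:

// requires: import org.jsoup.nodes.Document;
//           import org.jsoup.nodes.Element;
Document doc = Jsoup.connect(webPage).get();
System.out.println(doc.title());

for (Element p : doc.select("p")) {
    System.out.println(p.text());
}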
Reading a web page with HtmlCleaner
HtmlCleaner is an open source HTML parser written in Java.
<dependency>
    <groupId>net.sourceforge.htmlcleaner</groupId>
    <artifactId>htmlcleaner</artifactId>
    <version>2.16</version>
</dependency>
For this example, we use the htmlcleaner Maven dependency.
ReadWebPageEx3.java
package com.zetcode;

import java.io.IOException;
import java.net.URL;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.SimpleHtmlSerializer;
import org.htmlcleaner.TagNode;

public class ReadWebPageEx3 {

    public static void main(String[] args) throws IOException {

        URL url = new URL("http://www.something.com");

        CleanerProperties props = new CleanerProperties();
        props.setOmitXmlDeclaration(true);

        // clean the page into a tree of TagNodes
        HtmlCleaner cleaner = new HtmlCleaner(props);
        TagNode node = cleaner.clean(url);

        // serialize the tree back to HTML on standard output
        SimpleHtmlSerializer htmlSerializer = new SimpleHtmlSerializer(props);
        htmlSerializer.writeToStream(node, System.out);
    }
}
The example uses HtmlCleaner to download a web page.

CleanerProperties props = new CleanerProperties();
props.setOmitXmlDeclaration(true);
In the cleaner properties, we instruct the serializer to omit the XML declaration.
SimpleHtmlSerializer htmlSerializer = new SimpleHtmlSerializer(props);
htmlSerializer.writeToStream(node, System.out);
A SimpleHtmlSerializer creates the resulting HTML without any indentation or compacting.
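If a String is more convenient than writing to a stream, the serializer can also return one. A minimal sketch:

// getAsString() serializes the node tree into a String
String html = htmlSerializer.getAsString(node);
System.out.println(html);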
Reading a web page with Apache HttpClient
Apache HttpClient is an HTTP/1.1 compliant HTTP agent implementation. It implements the client side of the HTTP and HTTPS protocols and can fetch a web page through a simple request/response exchange.
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
We use this Maven dependency for the Apache HTTP client.
ReadWebPageEx4.java
package com.zetcode;

import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;

public class ReadWebPageEx4 {

    public static void main(String[] args) throws IOException {

        HttpGet request = null;

        try {
            String url = "http://www.something.com";

            HttpClient client = HttpClientBuilder.create().build();

            request = new HttpGet(url);
            request.addHeader("User-Agent", "Apache HTTPClient");

            HttpResponse response = client.execute(request);

            HttpEntity entity = response.getEntity();
            String content = EntityUtils.toString(entity);

            System.out.println(content);

        } finally {
            // release the connection even if the request fails
            if (request != null) {
                request.releaseConnection();
            }
        }
    }
}
In the code example, we send a GET HTTP request to the specified web page and receive an HTTP response. From the response, we read the HTML source.
HttpClient client = HttpClientBuilder.create().build();
An HttpClient is built.

request = new HttpGet(url);

HttpGet is a class for the HTTP GET method.

request.addHeader("User-Agent", "Apache HTTPClient");
HttpResponse response = client.execute(request);

A GET request is executed and an HttpResponse is received.

HttpEntity entity = response.getEntity();
String content = EntityUtils.toString(entity);
System.out.println(content);
From the response, we get the content of the web page.
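Since version 4.3, the library also offers a closeable client that releases its resources automatically in a try-with-resources statement. A minimal sketch of the same download:

// requires: import org.apache.http.impl.client.CloseableHttpClient;
//           import org.apache.http.impl.client.HttpClients;
try (CloseableHttpClient client = HttpClients.createDefault()) {

    HttpGet request = new HttpGet("http://www.something.com");
    HttpResponse response = client.execute(request);

    System.out.println(EntityUtils.toString(response.getEntity()));
}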
Reading a web page with Jetty HttpClient
The Jetty project provides an HTTP client as well.
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-client</artifactId>
    <version>9.4.0.M1</version>
</dependency>
This is a Maven dependency for the Jetty HTTP client.
ReadWebPageEx5.java
package com.zetcode;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class ReadWebPageEx5 {

    public static void main(String[] args) throws Exception {

        HttpClient client = null;

        try {
            client = new HttpClient();
            client.start();

            String url = "http://www.something.com";

            // issue a blocking GET request and print the response body
            ContentResponse res = client.GET(url);
            System.out.println(res.getContentAsString());

        } finally {
            if (client != null) {
                client.stop();
            }
        }
    }
}
In the example, we get the HTML source of a web page with the Jetty HTTP client.
client = new HttpClient();
client.start();
An HttpClient is created and started.

ContentResponse res = client.GET(url);
A GET request is issued to the specified URL.
System.out.println(res.getContentAsString());
The content is retrieved from the response with the getContentAsString() method.
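The GET() method blocks until the response arrives. For more control, such as a request timeout, the fluent request API can be used instead; a sketch, assuming Jetty 9:

// requires: import java.util.concurrent.TimeUnit;
// send() throws checked exceptions, which the enclosing main already declares
ContentResponse res = client.newRequest(url)
        .timeout(5, TimeUnit.SECONDS)
        .send();
System.out.println(res.getContentAsString());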
Reading a web page with HtmlUnit
HtmlUnit is a Java unit testing framework for testing web-based applications.
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.23</version>
</dependency>
We use this Maven dependency.
ReadWebPageEx6.java
package com.zetcode;

import java.io.IOException;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class ReadWebPageEx6 {

    public static void main(String[] args) throws IOException {

        // WebClient is AutoCloseable, so try-with-resources disposes of it
        try (WebClient webClient = new WebClient()) {

            String url = "http://www.something.com";

            HtmlPage page = webClient.getPage(url);
            WebResponse response = page.getWebResponse();
            String content = response.getContentAsString();

            System.out.println(content);
        }
    }
}
The example downloads a web page and prints it using the HtmlUnit library.
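HtmlUnit simulates a full browser, including JavaScript and CSS processing, which is unnecessary overhead for a plain download. If desired, both can be switched off through the client options; a small sketch:

// the options object controls the simulated browser's behaviour
webClient.getOptions().setJavaScriptEnabled(false);
webClient.getOptions().setCssEnabled(false);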
In this article, we have scraped a web page in Java using various tools, including URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.