HTML Parsing (again)

I previously posted a blog entry on HTML parsing. The approach I took there was to use the SAX-style parser from Swing's HTMLEditorKit.
The overall objective was to extract specific data from an HTML page (for example, all the links on the page).

The problem with the previous method
Recently I had to parse HTML files for a similar task. I felt that the SAX parsing method was too much work for a simple task and, furthermore, that the code lacked flexibility. If I needed to, say, search for specific <a> tags that are child elements of a <div> tag with class="links", I'd have to change a whole lot of code to get there.

Solution
I figured out that this sort of task can be done easily with XML and XPath tools. In the case mentioned above, I could find the elements with a simple XPath expression: //div[@class='links']/a.
But I can't use XML tools on HTML files unless I can first convert the HTML into well-formed XML documents... and that was the light-bulb moment!
After searching a bit and trying several different tools, I found HTMLCleaner to be a perfect fit for the task. It's quite simple to use: all we need to do is make a system call that executes the jar with the appropriate command-line options (read the documentation for the options), and we get the HTML back in XML format.

I hope the idea is clear. We convert HTML to XML using HTMLCleaner and use XPath to find the required elements.
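
To see why the conversion step matters, here is a minimal sketch that feeds the same markup to the standard XML parser twice: once with an unclosed <br>, as is common in HTML, and once with <br/>. The class name WellFormedDemo, the helper countMatches, and the inline markup are made up for this illustration and are not part of the original project.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WellFormedDemo {
    // Hypothetical helper: returns the number of nodes matched by the XPath
    // expression, or -1 if the XML parser rejects the markup.
    static int countMatches(String markup, String xpathExpr) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpathExpr, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (SAXException ex) {
            return -1; // not well-formed XML
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String expr = "//div[@class='links']/a";
        // Typical HTML: the <br> tag is never closed
        String html = "<div class=\"links\"><a href=\"a.html\">A</a><br><a href=\"b.html\">B</a></div>";
        // The same markup as well-formed XML
        String xml = "<div class=\"links\"><a href=\"a.html\">A</a><br/><a href=\"b.html\">B</a></div>";
        System.out.println("Raw HTML: " + countMatches(html, expr));        // prints -1
        System.out.println("Well-formed XML: " + countMatches(xml, expr));  // prints 2
    }
}
```

The parser may also print a "[Fatal Error]" line to stderr for the first input; that is just its default error reporting.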

Before we continue, place the htmlcleaner.jar file in a 'lib' folder in the working directory of the Java project (create the folder if it doesn't exist). Also, for the sake of this mini-tutorial, I have created a test HTML file named test.html and placed it in the working directory:
<!DOCTYPE html>
<html>
	<head>
		<style type="text/css">
			.links a{
				text-decoration: none;
				color: #80b080;
			}
		</style>
	</head>
	<body>
		<div class="links">
			<a href="http://en.wikipedia.org/wiki/Nanotechnology">Nanotechnology</a><br/>
			<a href="http://en.wikipedia.org/wiki/Java_%28programming_language%29">Java</a>
		</div>
	</body>
</html>



Here is the Main.java file. I hope the comments are sufficient to get a basic idea of what I am doing here:
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class Main {
    public static final String fileSep = System.getProperty("file.separator");
    public static final String newLineSeq = System.getProperty("line.separator");
    //Dependency
    public static final String HTMLCLEANER = "lib" + fileSep + "htmlcleaner.jar";
    //Example HTML file
    public static final String SRC_HTML = "test.html";
    public static final String DEST_XML = "test.xml";

    public static void main(String[] args) {
        //Debug message: The current working directory
        System.out.println("Current Directory: " + System.getProperty("user.dir"));
        
        try {
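            //Convert the HTML file to a well-formed XML file
            Process p = Runtime.getRuntime().exec("java -jar " + HTMLCLEANER + " outputtype=pretty src=" + SRC_HTML + " dest=" + DEST_XML);
            //waitFor() returns the exit code; by convention, non-zero means the command failed
            if (p.waitFor() != 0) {
                System.err.println("HTMLCleaner exited with an error; " + DEST_XML + " may be missing or stale");
                return;
            }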

            //Build DOM out of the XML file
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); //never forget this!
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(DEST_XML);

            //Use Xpath to find the elements you want
            XPathFactory xpathFactory = XPathFactory.newInstance();
            XPath xpath = xpathFactory.newXPath();
            XPathExpression expr=xpath.compile("//div[@class='links']/a"); //Here's the XPath magic

            //Retrieve the elements and iterate through them
            Object result = expr.evaluate(doc, XPathConstants.NODESET);
            NodeList nodes = (NodeList) result;
            for (int i = 0; i < nodes.getLength(); i++) {
                System.out.println("Node name: " + nodes.item(i).getNodeName());
                System.out.println("Text Content : " + nodes.item(i).getTextContent());
                System.out.println("Link: " + nodes.item(i).getAttributes().getNamedItem("href").getNodeValue() + newLineSeq);
            }
        } catch (SAXException ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        } catch (XPathExpressionException ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        } catch (ParserConfigurationException ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        } catch (InterruptedException ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}



Output:
Current Directory: K:\Users\Mogral\Documents\NetBeansProjects\Dom_Parser
Node name: a
Text Content : Nanotechnology
Link: http://en.wikipedia.org/wiki/Nanotechnology

Node name: a
Text Content : Java
Link: http://en.wikipedia.org/wiki/Java_%28programming_language%29
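
As a variation on the loop in Main.java, XPath can also select the href attribute nodes directly, so the Java side only has to read their values. The sketch below works on an inline XML string instead of test.xml to stay self-contained; the class name HrefDemo and the helper hrefs are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class HrefDemo {
    // Hypothetical helper: returns the href values of all <a> children
    // of <div class="links">, in document order.
    static List<String> hrefs(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // The trailing /@href selects attribute nodes rather than elements
            NodeList attrs = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//div[@class='links']/a/@href", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                result.add(attrs.item(i).getNodeValue());
            }
            return result;
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><body><div class=\"links\">"
                + "<a href=\"http://en.wikipedia.org/wiki/Nanotechnology\">Nanotechnology</a><br />"
                + "<a href=\"http://en.wikipedia.org/wiki/Java_%28programming_language%29\">Java</a>"
                + "</div></body></html>";
        for (String href : hrefs(xml)) {
            System.out.println("Link: " + href);
        }
    }
}
```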




And the XML file which HTMLCleaner generates:
<?xml version="1.0" encoding="Cp1252"?>
<html>
	<head>
		<style type="text/css"><![CDATA[
			.links a{
			text-decoration: none;
			color: #80b080;
			}
		]]></style>
	</head>
	<body>
		<div class="links">
			<a href="http://en.wikipedia.org/wiki/Nanotechnology">Nanotechnology</a>
			<br />
			<a href="http://en.wikipedia.org/wiki/Java_%28programming_language%29">Java</a>
		</div>
	</body>
</html>
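
Note that the XML above carries no xmlns attribute, which is why the un-prefixed expression //div[@class='links']/a matches. If a converter were to emit XHTML with the default namespace http://www.w3.org/1999/xhtml (an assumption for illustration, not something HTMLCleaner did here), the same expression would match nothing with a namespace-aware parser. A minimal sketch of that behaviour and one possible workaround using local-name(); the class name NamespaceDemo and the helper count are hypothetical:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceDemo {
    // Hypothetical helper: parses the XML with a namespace-aware parser
    // and returns how many nodes the XPath expression selects.
    static int count(String xml, String expr) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Assumed input: XHTML with a default namespace declaration
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div class=\"links\"><a href=\"a.html\">A</a></div></body></html>";
        // Un-prefixed names in XPath 1.0 only match elements in no namespace
        System.out.println(count(xhtml, "//div[@class='links']/a")); // prints 0
        // local-name() ignores the namespace and compares the bare tag name
        System.out.println(count(xhtml,
                "//*[local-name()='div'][@class='links']/*[local-name()='a']")); // prints 1
    }
}
```

Registering a NamespaceContext on the XPath object and using prefixed names is the other common option; local-name() is simply the shorter one to show.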



EDIT:
Additional notes
To make things more flexible, it's better to avoid hard-coding the XPath expression.
I would save the expression in a property/configuration file instead.
Create config.ini in the working directory:
#Configuration File
xpath_expr=//div[@class\='links']/a



...
import java.io.File;
import java.io.FileReader;
import java.util.Properties;

class Main {
    ...
    public static final String CONFIG_FILE = "config.ini";
    //static, so that the static main method can use it
    private static final Properties configFile = new Properties();

    public static void main(String[] args) {
        try {
            File fr = new File(CONFIG_FILE);
            if (fr.exists()) {
                //try-with-resources closes the reader once the properties are loaded
                try (FileReader reader = new FileReader(fr)) {
                    configFile.load(reader);
                }
            } else {
                System.out.println("Property file doesn't exist");
            }
        } catch (IOException ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        }
        ...
        ...
        //Fall back to the original expression if the property is missing
        XPathExpression expr = xpath.compile(configFile.getProperty("xpath_expr", "//div[@class='links']/a"));
    }
}
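
One detail of the config file worth spelling out: the \= in config.ini is a properties-file escape, and Properties.load turns it back into a plain = when reading the value. The sketch below checks this by loading the same two lines from a string instead of from disk; the class name ConfigDemo and the helper readXPath are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class ConfigDemo {
    // Hypothetical helper: loads properties from the given text and returns
    // the xpath_expr value, or the fallback if the key is absent.
    static String readXPath(String configText, String fallback) {
        Properties config = new Properties();
        try (StringReader reader = new StringReader(configText)) {
            config.load(reader);
        } catch (IOException ex) {
            throw new IllegalStateException(ex);
        }
        return config.getProperty("xpath_expr", fallback);
    }

    public static void main(String[] args) {
        // Same content as config.ini; the backslash is doubled only because
        // this is a Java string literal
        String ini = "#Configuration File\n"
                + "xpath_expr=//div[@class\\='links']/a\n";
        System.out.println(readXPath(ini, "//a")); // prints //div[@class='links']/a
        System.out.println(readXPath("", "//a"));  // prints //a
    }
}
```

Only the first unescaped = on a line separates the key from the value, so the escape inside the value is optional, but it does no harm.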

2 Comments On This Entry

Programmist

14 March 2011 - 04:09 AM
A few years ago I had to do something similar for a large media outlet in order to get their content into a feed for the Associated Press and Yahoo News. The problem was that they let their authors (non-techies) add HTML and JS to the content. Very, very bad. HtmlCleaner was among the many cleaners I tried from this fairly extensive list:
http://java-source.n...ce/html-parsers

JTidy was another good one.

Munawwar

14 March 2011 - 04:24 AM
My first choice was HTML Tidy, but I had to drop it since the XML it generated didn't have the XML declaration as its first line and I couldn't find an option to add it.
I have forgotten why the line was important... hmm... I think XPath wasn't working without it.
I could have just added the line to the file myself, but I preferred HTMLCleaner, since it's just one jar file and cross-platform.

The first line on JTidy's website reads "JTidy is a Java port of HTML Tidy...".
So I didn't try JTidy, since I had the impression that it shares the same problem as HTML Tidy. Maybe I am wrong. I should give it a try.