I previously posted a blog entry on HTML parsing. The approach I took there was to use the SAX-style callback parser that comes with Swing's HTMLEditorKit.
The overall objective was to extract specific data from an HTML page (for example, all the links on the page).
The problem with the previous method
Recently I had to parse HTML files again for a similar task. I felt that the SAX-style parsing method is too much work for a simple task and, furthermore, that the code lacks flexibility. If I needed to, say, search for the <a> tags that are child elements of a <div> tag with class="links", I'd have to change a whole lot of code to get there.
Solution
I figured out that this sort of task can be done easily with XML and XPath tools. In the case mentioned above, I could find the elements with a simple XPath expression: //div[@class='links']/a.
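To see what that expression selects, here is a minimal, self-contained sketch. It runs the same XPath against a small in-memory document. The sample markup, the example.org URLs, and the names XPathSketch and findLinks are made up for illustration and are not part of the project in this post.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {

    // Hypothetical sample: a tiny well-formed document, not real HTMLCleaner output.
    static final String SAMPLE_XML =
            "<html><body>"
            + "<div class=\"links\"><a href=\"http://example.org/one\">One</a></div>"
            + "<div class=\"other\"><a href=\"http://example.org/two\">Two</a></div>"
            + "</body></html>";

    // Returns the href of every <a> that is a direct child of <div class="links">.
    static List<String> findLinks(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//div[@class='links']/a", doc, XPathConstants.NODESET);
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // The expression only matches elements, so the cast is safe here.
            hrefs.add(((Element) nodes.item(i)).getAttribute("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(findLinks(SAMPLE_XML)); // prints [http://example.org/one]
    }
}
```

Only the link inside the div with class="links" comes back; the <a> in the other div is skipped, which is exactly the kind of filtering that took a lot of code with the callback parser.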
But I can't use XML tools on HTML files unless I first convert the HTML into well-formed XML documents.
After searching a bit and trying many different tools, I found HTMLCleaner to be a perfect fit for the task. It's quite simple to use: run the jar through a system call with the appropriate command-line options (read the documentation for the options), and it writes the HTML out in XML format.
I hope the idea is clear. We convert HTML to XML using HTMLCleaner and use XPath to find the required elements.
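The system call can also be made through ProcessBuilder with an argument list instead of one concatenated string, which avoids trouble with spaces in paths (Runtime.exec(String) splits its argument on whitespace). This is only a sketch: the option names (outputtype, src, dest) are the ones used in the command in Main.java in this post, the class and method names are hypothetical, and the jar is assumed to sit at lib/htmlcleaner.jar.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class HtmlToXmlSketch {

    // Builds the argument list for the HTMLCleaner command line.
    // The option names are copied from the command used in this post;
    // check the HTMLCleaner documentation for your version.
    static List<String> buildCommand(String jarPath, String srcHtml, String destXml) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("outputtype=pretty");
        cmd.add("src=" + srcHtml);
        cmd.add("dest=" + destXml);
        return cmd;
    }

    // Runs the command, waits for it to finish, and returns its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Forward the child's stdout/stderr to ours so its output buffers cannot fill up and block it.
        pb.inheritIO();
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Assumed locations: jar in lib/, test.html in the working directory.
        List<String> cmd = buildCommand("lib/htmlcleaner.jar", "test.html", "test.xml");
        int exitCode = run(cmd);
        if (exitCode != 0) {
            System.err.println("HTMLCleaner exited with code " + exitCode);
        }
    }
}
```

Checking the exit code before parsing the generated file makes a failed conversion show up early, instead of as a confusing parser error later on.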
Before we continue, place the htmlcleaner.jar file in a 'lib' folder in the working directory of the Java project (create the folder if it doesn't exist). For the sake of this mini-tutorial, I have also created a test HTML file named test.html in the working directory.
<!DOCTYPE html>
<html>
<head>
<style type="text/css">
.links a{
text-decoration: none;
color: #80b080;
}
</style>
</head>
<body>
<div class="links">
<a href="http://en.wikipedia.org/wiki/Nanotechnology">Nanotechnology</a><br/>
<a href="http://en.wikipedia.org/wiki/Java_%28programming_language%29">Java</a>
</div>
</body>
</html>
Here is the Main.java file. I hope the comments are sufficient to get a basic idea of what I am doing here:
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
public class Main {
    public static final String fileSep = System.getProperty("file.separator");
    public static final String newLineSeq = System.getProperty("line.separator");
    //Dependency
    public static final String HTMLCLEANER = "lib" + fileSep + "htmlcleaner.jar";
    //Example HTML file
    public static final String SRC_HTML = "test.html";
    public static final String DEST_XML = "test.xml";

    public static void main(String[] args) {
        //Debug message: The current working directory
        System.out.println("Current Directory: " + System.getProperty("user.dir"));
        try {
            //Convert HTML files to well-formed XML files
            Process p = Runtime.getRuntime().exec("java -jar " + HTMLCLEANER + " outputtype=pretty src=" + SRC_HTML + " dest=" + DEST_XML);
            p.waitFor();
            //Build DOM out of the XML file
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); //never forget this!
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(DEST_XML);
            //Use XPath to find the elements you want
            XPathFactory xpathFactory = XPathFactory.newInstance();
            XPath xpath = xpathFactory.newXPath();
            XPathExpression expr = xpath.compile("//div[@class='links']/a"); //Here's the XPath magic
            //Retrieve the elements and iterate through them
            Object result = expr.evaluate(doc, XPathConstants.NODESET);
            NodeList nodes = (NodeList) result;
            for (int i = 0; i < nodes.getLength(); i++) {
                System.out.println("Node name: " + nodes.item(i).getNodeName());
                System.out.println("Text Content : " + nodes.item(i).getTextContent());
                System.out.println("Link: " + nodes.item(i).getAttributes().getNamedItem("href").getNodeValue() + newLineSeq);
            }
        } catch (SAXException ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        } catch (XPathExpressionException ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        } catch (ParserConfigurationException ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        } catch (InterruptedException ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
Output:
Current Directory: K:\Users\Mogral\Documents\NetBeansProjects\Dom_Parser
Node name: a
Text Content : Nanotechnology
Link: http://en.wikipedia.org/wiki/Nanotechnology

Node name: a
Text Content : Java
Link: http://en.wikipedia.org/wiki/Java_%28programming_language%29
And the XML file which HTMLCleaner generates:
<?xml version="1.0" encoding="Cp1252"?>
<html>
<head>
<style type="text/css"><![CDATA[
.links a{
text-decoration: none;
color: #80b080;
}
]]></style>
</head>
<body>
<div class="links">
<a href="http://en.wikipedia.org/wiki/Nanotechnology">Nanotechnology</a>
<br />
<a href="http://en.wikipedia.org/wiki/Java_%28programming_language%29">Java</a>
</div>
</body>
</html>
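Given XML like the above, the link targets can also be selected directly by adding an attribute step to the XPath expression. That sidesteps the null check that getNamedItem("href") would otherwise need for anchors without an href. A hedged sketch: the embedded XML is a trimmed copy of the output above with one hypothetical href-less <a> added, and HrefSketch and hrefs are illustrative names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class HrefSketch {

    // Trimmed copy of the generated XML, plus one made-up anchor without an href.
    static final String CLEANED_XML =
            "<html><head><style type=\"text/css\"><![CDATA[\n"
            + ".links a{ text-decoration: none; color: #80b080; }\n"
            + "]]></style></head>"
            + "<body><div class=\"links\">"
            + "<a href=\"http://en.wikipedia.org/wiki/Nanotechnology\">Nanotechnology</a>"
            + "<br />"
            + "<a href=\"http://en.wikipedia.org/wiki/Java_%28programming_language%29\">Java</a>"
            + "<a>no target</a>" // hypothetical, not in the original test file
            + "</div></body></html>";

    // Selects the href attribute nodes themselves; anchors without an href simply do not match.
    static List<String> hrefs(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList attrs = (NodeList) xpath.evaluate(
                "//div[@class='links']/a/@href", doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            values.add(attrs.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        for (String href : hrefs(CLEANED_XML)) {
            System.out.println("Link: " + href);
        }
    }
}
```

The <br /> element and the href-less anchor are ignored by the expression, so the loop needs no special cases.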
EDIT:
Additional notes
To make things more flexible, it's better to avoid hard-coding the XPath expression.
I would save the expression in a property/configuration file instead.
Create config.ini within the working directory:
#Configuration File
xpath_expr=//div[@class\='links']/a
...
import java.io.File;
import java.io.FileReader;
import java.util.Properties;
public class Main {
    ...
    public static final String CONFIG_FILE = "config.ini";
    //Must be static because it is used from the static main method
    private static final Properties configFile = new Properties();

    public static void main(String[] args) {
        try {
            File fr = new File(CONFIG_FILE);
            if (fr.exists()) {
                configFile.load(new FileReader(fr));
            } else {
                System.out.println("Property file doesn't exist");
            }
        } catch (IOException ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        }
        ...
        ...
        //Read the XPath expression from the configuration instead of hard-coding it
        XPathExpression expr = xpath.compile(configFile.getProperty("xpath_expr"));
    }
}
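As a quick check of how the config line is read, the sketch below loads the same property from a string instead of a file. java.util.Properties treats \= as an escaped =, so the value comes back as plain XPath. The fallback expression //a and the names ConfigSketch and readXPath are assumptions of mine, not part of the original code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class ConfigSketch {

    // Hypothetical fallback, used when the key is missing from the configuration.
    static final String DEFAULT_XPATH = "//a";

    // Parses properties-format text and returns the configured XPath expression.
    static String readXPath(String configText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(configText));
        return props.getProperty("xpath_expr", DEFAULT_XPATH);
    }

    public static void main(String[] args) throws IOException {
        // Same two lines as config.ini; the backslash is doubled only for the Java string literal.
        String config = "#Configuration File\n"
                + "xpath_expr=//div[@class\\='links']/a\n";
        System.out.println(readXPath(config)); // prints //div[@class='links']/a
    }
}
```

Supplying a default through getProperty(key, default) also avoids passing null to xpath.compile when the file or the key is missing.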
2 Comments On This Entry
Programmist
14 March 2011 - 04:09 AM
A few years ago I had to do something similar for a large media outlet in order to get their content into a feed to the Associated Press and Yahoo News. The problem was that they let their authors (non-techies) add HTML and JS to the content. Very, very bad. HtmlCleaner was among the many cleaners I tried from this fairly extensive list:
http://java-source.n...ce/html-parsers
JTidy was another good one.
HTML Parsing (again) on Mar 11 2011 04:00 AM