Munawwar's Lab
-----

HTML Parsing with Java

HTML Parsing Using the Swing HTML Parser
Skeleton Program
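import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class Main {
    public static void main(String[] args) throws IOException {
       //Open a BufferedReader to download the content of a website
       URL url = new URL("http://www.google.com");
       URLConnection conn = url.openConnection();
       BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()));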
 
       //Create a callback object and register it with the parser
       ParserCallBack callback = new ParserCallBack();
       ParserDelegator delegator = new ParserDelegator();
       delegator.parse(br, callback, true); //Third argument: true = ignore any character set declared in the document
    }
}
class ParserCallBack extends HTMLEditorKit.ParserCallback {
    @Override
    public void handleText(char[] data, int pos){
        
    }
    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        
    }
    @Override
    public void handleEndTag(HTML.Tag t, int pos){
        
    }
    //Simple tags are br, img, meta, ... - in general, all tags that don't have a separate end tag. Example: <br />
    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos){
        
    }
}


There are more methods that can be overridden (handleComment and handleError, for example), but the example below mainly relies on handleStartTag and handleText.
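To make the handleSimpleTag callback from the skeleton a bit more concrete, here is a minimal sketch that collects the src attribute of every img tag. The class name ImageSources and its find method are hypothetical names made up for this illustration, and the HTML is read from a string rather than from a website so it can be tried offline.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of the original program.
class ImageSources {

    // Returns the src attribute of every <img> tag, in document order.
    static List<String> find(String html) {
        List<String> sources = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                // img has no separate end tag, so it arrives here
                // instead of in handleStartTag
                if (t == HTML.Tag.IMG) {
                    Object src = a.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        sources.add(src.toString());
                    }
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return sources;
    }

    public static void main(String[] args) {
        System.out.println(find("<p>Logo: <img src=\"logo.png\"><br>Done</p>"));
    }
}
```

The same pattern should work for other simple tags such as meta: compare the tag against the matching HTML.Tag constant and read the attribute you need.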

The HTML parser works like an XML SAX parser (don't worry if you have never heard of it). That is, it doesn't store the entire DOM in memory; rather, each tag is parsed, passed to the callback, and forgotten. Once the callback method returns, there is no way of getting back what was parsed before.
So to remember the tags, we'll need a data structure. I will use a stack, since I only need the current tag in the handleText method.
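To see this event-by-event behaviour for yourself, the following rough sketch records every callback in the order the parser fires it. EventTrace and trace are hypothetical names used only for this illustration, and the input is a small string instead of a downloaded page.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of the original program.
class EventTrace {

    // Parses the given HTML and returns one entry per callback, in firing order.
    static List<String> trace(String html) {
        List<String> events = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                events.add("start:" + t);
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                events.add("end:" + t);
            }

            @Override
            public void handleText(char[] data, int pos) {
                events.add("text:" + new String(data));
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return events;
    }

    public static void main(String[] args) {
        // Print the events one per line
        for (String event : trace("<p>Hello <b>world</b></p>")) {
            System.out.println(event);
        }
    }
}
```

Note that handleText receives only the characters, with no information about the enclosing tag. That is why the callback has to keep track of the open tags itself.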

Here's an example that prints all the links found in a web page. It also stores all the text found in the page in a string ('pageText').
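import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.Stack;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class Main {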
    public static void main(String[] args) {
        try {
            URL webURL = new URL("http://www.google.com");
            URLConnection conn = webURL.openConnection();
            BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()));

            CallBack callback = new CallBack();
            ParserDelegator delegator = new ParserDelegator();
            delegator.parse(br, callback, true);

            System.out.println(callback.pageText);
        } catch (IOException ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}

class CallBack extends HTMLEditorKit.ParserCallback {

    Stack<HTML.Tag> stack = new Stack<>();
    public String pageText = "";

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
        //Get a tag and push it onto a stack
        stack.push(tag);
        if (tag == HTML.Tag.A) {
            String link = (String) a.getAttribute(HTML.Attribute.HREF);
            if (link != null && link.length() > 0) {
                System.out.println(link);
            }
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        //The element is closed, so remove its tag from the stack
        if (!stack.empty()) {
            stack.pop();
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
    }

    @Override
    public void handleText(char[] data, int pos) {
        //Peek at the stack to get the tag that encloses this text
        if (!stack.empty()) {
            HTML.Tag currentTag = stack.peek(); //available if you want to filter the text by tag
            String strData = new String(data);
            pageText += strData + " ";
        }
    }
}
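
As a variation on the example above, here is a sketch of why knowing the current tag inside handleText is useful: it pairs each link with the text inside it. The class name LinkTexts, its collect method, and the sample HTML are all made up for this illustration, and the page is read from a string so no network access is needed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback for illustration; not part of the original program.
class LinkTexts extends HTMLEditorKit.ParserCallback {

    private final Deque<HTML.Tag> openTags = new ArrayDeque<>();
    private final Map<String, String> links = new LinkedHashMap<>();
    private String currentHref; // href of the <a> being read, or null

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
        openTags.push(tag);
        if (tag == HTML.Tag.A) {
            Object href = a.getAttribute(HTML.Attribute.HREF);
            currentHref = (href == null) ? null : href.toString();
            if (currentHref != null) {
                links.put(currentHref, "");
            }
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (!openTags.isEmpty()) {
            openTags.pop();
        }
        if (tag == HTML.Tag.A) {
            currentHref = null;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Record only text whose innermost open tag is <a>
        if (currentHref != null && openTags.peek() == HTML.Tag.A) {
            String text = new String(data).trim();
            links.merge(currentHref, text, (old, add) -> (old + " " + add).trim());
        }
    }

    // Parses the HTML and returns a map from href to link text.
    static Map<String, String> collect(String html) {
        LinkTexts callback = new LinkTexts();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return callback.links;
    }

    public static void main(String[] args) {
        String html = "<p>See <a href=\"http://example.com/\">the example</a> now.</p>";
        System.out.println(collect(html));
    }
}
```

This sketch only records text that sits directly inside the a tag. Text wrapped in a nested tag such as b would be skipped, so treat it as a starting point rather than a complete solution.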
