Java - Efficient XML Parsing (Thousands of XMLs from Web)

25 Replies - 2251 Views - Last Post: 08 October 2012 - 03:43 AM

#1 xZhongCheng

Posted 04 October 2012 - 09:17 AM

Hello guys. I am having trouble finding an efficient way to parse XML files. I need to check about 8000-9000 different URLs to see whether they are XML and, if they are, retrieve information from them. I tried to implement VTD-XML, but with no luck on XML served from web pages. Below is the code showing how I am currently doing it, but it takes well over 30 minutes to get through all those pages. I found examples of people using VTD-XML and other parsers, but they all use files already on the computer, whereas I am trying to get them from a URL.

Thanks

try {
    byte[] data = new byte[16384];
    byte[] xmlByte;
    String xmlString;
    this.dataFromURL = new ArrayList<String>();
    this.list = new BusStopList();
    URL url = null;
    float percentage;
    int longitude;
    int latitude;

    for(int i = 1000; i < 10000; i++)
    {
        url = new URL(params[0] + i + ".xml");
        URLConnection ucon = url.openConnection();
        InputStream is = ucon.getInputStream();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        int nRead;

        while((nRead = is.read(data, 0, data.length)) != -1)
        {
            buffer.write(data, 0, nRead);
        }

        buffer.flush();

        xmlByte = buffer.toByteArray();
        xmlString = new String(xmlByte);

        buffer.close();
        is.close();

        if(!xmlString.contains("invalid"))
        {
            xmlString = xmlString.replaceAll("\\<.*?>", "");
            xmlString = xmlString.replaceAll("  ", "");
            xmlString = xmlString.replaceAll("\n\n\n", "\n\n");
            xmlString = xmlString.replaceAll("amp;", "");
            xmlString = xmlString.replaceAll("Street", "St");
            xmlString = xmlString.replaceAll("Avenue", "Ave");

            Collections.addAll(dataFromURL, xmlString.split("\n"));

            this.dataFromURL.remove(0);
            for(int j = 0; i < this.dataFromURL.size(); i++)
            {
                if(this.dataFromURL.get(j).isEmpty())
                    this.dataFromURL.remove(j);
            }

            longitude = (int)(Double.parseDouble(this.dataFromURL.get(2)) * 1E6);
            latitude = (int)(Double.parseDouble(this.dataFromURL.get(1)) * 1E6);

            list.addMapPoint(new MapPoint(longitude, latitude,
                    this.dataFromURL.get(dataFromURL.size() - 1),
                    this.dataFromURL.get(3)));
        }

        this.dataFromURL.clear();
        percentage = ((float)(i-1000) / (float)8999) * 100;
        publishProgress(Float.valueOf(percentage).intValue());
        // Escape early if cancel() is called
        if (isCancelled()) break;
    }

} catch (MalformedURLException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}



Replies To: Java - Efficient XML Parsing (Thousands of XMLs from Web)

#2 g00se

Posted 04 October 2012 - 09:21 AM

And if they are xml - do they have an xml declaration?

#3 jon.kiparsky

Posted 04 October 2012 - 09:36 AM

Have you considered using an XML parser? There are many to choose from, and they're going to do a better job of parsing than anything home-rolled.

#4 xZhongCheng

Posted 04 October 2012 - 09:41 AM

g00se, on 04 October 2012 - 10:21 AM, said:

And if they are xml - do they have an xml declaration?


This is what the XML looks like:

<hash>
  <stop-lat type="float">53.5463</stop-lat>
  <stop-lon type="float">-113.506</stop-lon>
  <stop-id>1989</stop-id>
  <trips-departing type="array">
    <trips-departing>
      <bus>8</bus>
      <time>10:38:00</time>
    </trips-departing>
    <trips-departing>
      <bus>2</bus>
      <time>10:38:00</time>
    </trips-departing>
    <trips-departing>
      <bus>111</bus>
      <time>10:41:00</time>
    </trips-departing>
    <trips-departing>
      <bus>15</bus>
      <time>10:48:00</time>
    </trips-departing>
    <trips-departing>
      <bus>2</bus>
      <time>10:52:00</time>
    </trips-departing>
    <trips-departing>
      <bus>8</bus>
      <time>10:53:00</time>
    </trips-departing>
  </trips-departing>
  <stop-name>108 Street &amp; 104 Avenue nearside</stop-name>
</hash>


jon.kiparsky, on 04 October 2012 - 10:36 AM, said:

Have you considered using an XML parser? There are many to choose from, and they're going to do a better job of parsing than anything home-rolled.


I have tried VTD-XML but it didn't work.

#5 jon.kiparsky

Posted 04 October 2012 - 10:10 AM

There are a lot of good XML parsing libraries. I don't know VTD-XML, but I've had good luck with JDOM. For your use case, I think a SAX parser is what you want. My logic for that is pretty simple: a DOM parser creates an in-memory representation of your document, allowing you to navigate it at whim, while a SAX parser goes through it piece by piece and does its work on the fly. For a large data set, I'd think you want SAX.
I'm not an XML expert, though; I've just had to do this a few times. Someone more knowledgeable than I might have better advice.
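
For contrast, here is a minimal sketch of the DOM side of that trade-off, using the JDK's built-in javax.xml.parsers API rather than JDOM. It assumes a document shaped like the sample in post #4. The class and method names (DomStopDemo, firstText) are made up for illustration and do not come from the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class DomStopDemo {
    /**
     * Builds the whole document tree in memory, then looks up the first
     * element with the given tag name. Returns null if there is none.
     */
    static String firstText(InputStream in, String tag) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(in); // the entire tree exists after this call
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Cut-down version of the sample document, kept in memory for the demo
        String xml = "<hash><stop-lat type=\"float\">53.5463</stop-lat>"
                   + "<stop-id>1989</stop-id></hash>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(firstText(in, "stop-id"));
    }
}
```

Random access like this is convenient, but every node of every document is allocated even when only a few fields are wanted, which is the cost SAX avoids.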

#6 xZhongCheng

Posted 04 October 2012 - 11:20 AM

jon.kiparsky, on 04 October 2012 - 11:10 AM, said:

There are a lot of good XML parsing libraries. I don't know VTD-XML, but I've had good luck with JDOM. For your use case, I think a SAX parser is what you want. My logic for that is pretty simple: a DOM parser creates an in-memory representation of your document, allowing you to navigate it at whim, while a SAX parser goes through it piece by piece and does its work on the fly. For a large data set, I'd think you want SAX.
I'm not an XML expert, though; I've just had to do this a few times. Someone more knowledgeable than I might have better advice.


I just implemented SAX and it is just as slow. I don't know if it's my internet connection or just the method. I followed this tutorial:

http://www.dreaminco...ava-part-1-sax/

and my doInBackground method is now:

    this.list = new BusStopList();
    MapPoint m = null;
    String xml = null;
    float percentage;
    for(int i = 1000; i < 10000; i++)
    {
        xml = getXmlFromUrl(params[0] + i + ".xml");
        if(!xml.contains("invalid"))
        {
            InputStream stream = new ByteArrayInputStream(xml.getBytes());
            m = SAXXMLParser.parse(stream);
            list.addMapPoint(m);
            // Escape early if cancel() is called
            if (isCancelled()) break;
            //System.out.println(list.getList().get(list.getSize()-1).getBusStopNumber());
        }
        percentage = ((float)(i-1000) / (float)8999) * 100;
        publishProgress(Float.valueOf(percentage).intValue(), (i-1000));
    }
    return this.list;
}


#7 baavgai

Posted 04 October 2012 - 12:03 PM

It looks like you are downloading the entire thing before you even check:
String xml = getXmlFromUrl(params[0] + i + ".xml");

It really doesn't matter that you're using SAX for its incremental parsing if you've already downloaded everything. Your InputStream should be the URL download itself.
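
A rough sketch of that idea, assuming the element names from the sample in post #4: the SAX parser reads straight from the connection's InputStream, with no intermediate String. StreamingSaxDemo, StopHandler and parseUrl are hypothetical names; the thread's own SAXXMLParser class is not shown, so this is not a reconstruction of it.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class StreamingSaxDemo {
    /** Collects the text of <stop-id> and <stop-name>; other elements are ignored. */
    static class StopHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        String stopId;
        String stopName;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0); // start collecting text for the new element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("stop-id".equals(qName)) {
                stopId = text.toString().trim();
            } else if ("stop-name".equals(qName)) {
                stopName = text.toString().trim();
            }
        }
    }

    /** Parses directly from any stream; nothing is buffered into a String first. */
    static StopHandler parse(InputStream in) throws Exception {
        StopHandler handler = new StopHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler;
    }

    /** Hands the live network stream to the parser (address is a full URL string). */
    static StopHandler parseUrl(String address) throws Exception {
        try (InputStream in = URI.create(address).toURL().openStream()) {
            return parse(in);
        }
    }

    public static void main(String[] args) throws Exception {
        // In-memory stand-in for a network stream so the demo runs offline
        String xml = "<hash><stop-id>1989</stop-id>"
                   + "<stop-name>108 Street &amp; 104 Avenue nearside</stop-name></hash>";
        StopHandler h = parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(h.stopId + ": " + h.stopName);
    }
}
```

In the loop from post #6, a call like parseUrl(params[0] + i + ".xml") would take the place of the getXmlFromUrl and ByteArrayInputStream steps. How the "invalid" pages behave when fed to a parser is not shown in the thread, so that case would still need its own handling.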

#8 g00se

Posted 04 October 2012 - 12:45 PM

Is it possible to post an actual url?

#9 xZhongCheng

Posted 04 October 2012 - 01:12 PM

g00se, on 04 October 2012 - 01:45 PM, said:

Is it possible to post an actual url?

http://etstext.black...m/stop/1989.xml

#10 g00se

Posted 04 October 2012 - 01:36 PM

Ok and what do you want to do with it when you've got a genuine xml document?

#11 xZhongCheng

Posted 04 October 2012 - 01:38 PM

g00se, on 04 October 2012 - 02:36 PM, said:

Ok and what do you want to do with it when you've got a genuine xml document?

Just grab the stop ID, longitude, latitude, and stop name. Then put those in an object and add the object to a list.

#12 g00se

Posted 04 October 2012 - 01:47 PM

If the valid files are pretty short, like the one you just posted, I'd be tempted to read them into a String first.

#13 xZhongCheng

Posted 04 October 2012 - 01:57 PM

g00se, on 04 October 2012 - 02:47 PM, said:

If the valid files are pretty short, like the one you just posted, I'd be tempted to read them into a String first.

I am already doing that. It's a matter of whether I can do this 8000 times efficiently.
My current method will take roughly 2 hours.

This post has been edited by xZhongCheng: 04 October 2012 - 01:59 PM


#14 blackcompe

Posted 04 October 2012 - 02:36 PM

There's a Java tutorial showing you how to create a validating SAXParser. You'll be informed of any parse errors in your error handler.
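
A sketch of the error-handler idea, with one substitution stated plainly: validation needs a DTD, and the sample in post #4 does not appear to reference one, so this version only checks well-formedness with a non-validating parser. WellFormedCheck and isWellFormed are illustrative names. What the server actually returns for a non-existent stop is not shown in the thread, so the inputs below are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    /** Returns true if the stream parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // The handler doubles as the ErrorHandler: rethrow anything the parser reports
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void error(SAXParseException e) throws SAXException {
                throw e;
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                throw e;
            }
        };
        try {
            parser.parse(in, handler);
            return true;
        } catch (SAXException e) {
            return false; // not well-formed; I/O failures still propagate
        }
    }

    private static InputStream stream(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isWellFormed(stream("<hash><stop-id>1989</stop-id></hash>")));
        System.out.println(isWellFormed(stream("invalid stop"))); // hypothetical error page
    }
}
```

A check like this could replace the contains("invalid") string test, since it rejects any response that is not XML regardless of its wording.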

#15 g00se

Posted 04 October 2012 - 03:19 PM

You won't get much leaner or more efficient than pull parsing. Try something like:

import java.io.*;
import java.net.*;
import java.util.*;
import javax.xml.stream.*;

public class StaxRead {
    public static void main(String[] args) {
        try {
            Object o = new StaxRead().read(new URL(args[0]));
            System.out.println(o);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // StopInfo is not shown here; it is assumed to be a plain bean
    // with setters for latitude, longitude, id and name.
    public StopInfo read(URL u) throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        StopInfo stopInfo = new StopInfo();

        // Close the URL stream when done; XMLStreamReader.close() does not close it
        try (InputStream in = u.openStream()) {
            XMLStreamReader sr = factory.createXMLStreamReader(in);

            while (sr.hasNext()) {
                sr.next();

                if (sr.getEventType() == XMLStreamConstants.START_ELEMENT) {
                    String elementName = sr.getLocalName();

                    if ("stop-lat".equals(elementName)) {
                        stopInfo.setLatitude(Double.parseDouble(sr.getElementText()));
                    } else if ("stop-lon".equals(elementName)) {
                        stopInfo.setLongitude(Double.parseDouble(
                                sr.getElementText()));
                    } else if ("stop-id".equals(elementName)) {
                        stopInfo.setId(Integer.parseInt(sr.getElementText()));
                    } else if ("stop-name".equals(elementName)) {
                        stopInfo.setName(sr.getElementText());
                    }
                }
            }
        }
        return stopInfo;
    }
}


You can optimize that further by stopping the pull as soon as you have filled all object fields
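
One way that early stop might look, as a sketch only. The StopInfo holder here is hypothetical (the original class is not shown in the thread) and uses nullable fields so that "all fields filled" can be tested directly.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxEarlyExit {
    /** Hypothetical holder; null means "not seen yet". */
    static class StopInfo {
        Double latitude;
        Double longitude;
        Integer id;
        String name;

        boolean isComplete() {
            return latitude != null && longitude != null && id != null && name != null;
        }
    }

    static StopInfo read(InputStream in) throws XMLStreamException {
        XMLStreamReader sr = XMLInputFactory.newInstance().createXMLStreamReader(in);
        StopInfo info = new StopInfo();
        try {
            // Stop pulling events as soon as every field has a value
            while (!info.isComplete() && sr.hasNext()) {
                if (sr.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String element = sr.getLocalName();
                if ("stop-lat".equals(element)) {
                    info.latitude = Double.valueOf(sr.getElementText());
                } else if ("stop-lon".equals(element)) {
                    info.longitude = Double.valueOf(sr.getElementText());
                } else if ("stop-id".equals(element)) {
                    info.id = Integer.valueOf(sr.getElementText());
                } else if ("stop-name".equals(element)) {
                    info.name = sr.getElementText();
                }
            }
        } finally {
            sr.close(); // releases the reader; the caller still owns the stream
        }
        return info;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<hash><stop-lat type=\"float\">53.5463</stop-lat>"
                   + "<stop-lon type=\"float\">-113.506</stop-lon>"
                   + "<stop-id>1989</stop-id>"
                   + "<stop-name>108 Street &amp; 104 Avenue nearside</stop-name></hash>";
        StopInfo info = read(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(info.id + " " + info.name);
    }
}
```

In the sample from post #4, stop-name is the last element, so the saving for that particular layout is small; the early exit pays off more when the wanted fields come before a long list of departures.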

This post has been edited by g00se: 04 October 2012 - 03:21 PM
Reason for edit:: optimize
