Bad XML Parsing on DIC homepage

Well, I have run into several problems parsing the homepage. In a discussion that I had with PsychoCoder and KYA, we established that the problem is the same for C# and VB.NET. There are actually two issues. The first is the bare "&" character that shows up in the XML. By doing a search and replace that swaps each bare "&" for "&amp;", you fix the issue in .NET.

The second issue shows up in Java: even with the "&" fixed, the file still doesn't parse, and you get an error that states "Failed parsing - Invalid byte 1 of 1-byte UTF-8 sequence."
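
A decoding error like this is what you would expect when a byte stream contains bytes that are not legal UTF-8. As a minimal sketch (the class and method names here are hypothetical and not part of the original fix), the following shows a strict UTF-8 decoder rejecting a byte that is perfectly valid in ISO-8859-1:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {

    /** Returns true if the bytes decode cleanly as strict UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // A fresh decoder reports malformed input by default instead
            // of silently substituting a replacement character
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the final byte is 0xE9,
        // which cannot stand alone in UTF-8
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(latin1)); // prints "false"
        System.out.println(isValidUtf8(utf8));   // prints "true"
    }
}
```

An XML parser that trusts a "UTF-8" declaration hits the same kind of failure as soon as it reaches the first such byte.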

KYA suggested that the error stemmed from the fact that the XML file said it was in UTF-8 but it really wasn't; it's in ISO-8859-1 encoding. To fix this, simply replace the "UTF-8" in the XML declaration with "ISO-8859-1". Here is the code to fix both problems in Java:
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.HttpURLConnection;
    import java.net.MalformedURLException;
    import java.net.URL;

    /**
     * This method downloads a website file line-by-line, escapes stray
     * "&" characters, corrects the declared encoding, and then prints
     * each line into the given File.
     * @param from The URL from which to download (String)
     * @param to The File to stick the lines into
     */
    public static void downloadToFile(String from, File to) {
        URL u;

        // Load a URL from the String name
        try {
            u = new URL(from);
        } catch (MalformedURLException ex) {
            System.err.println("Problem with the URL");
            return;
        }

        // Copy the XML document to a local file
        try {
            HttpURLConnection conn = (HttpURLConnection) u.openConnection();
            conn.connect();

            // The content is really ISO-8859-1, so read and write it with
            // that charset instead of relying on the platform default.
            // try-with-resources closes both streams, even on error.
            try (BufferedReader read = new BufferedReader(
                         new InputStreamReader(conn.getInputStream(), "ISO-8859-1"));
                 PrintWriter write = new PrintWriter(to, "ISO-8859-1")) {

                String line;
                while ((line = read.readLine()) != null) {
                    // Calls a fix for bad XML. Stray "&" must be escaped
                    line = checkAndRemoveAmp(line);

                    // Only the XML declaration should name the encoding
                    if (line.contains("<?xml")) {
                        line = line.replace("UTF-8", "ISO-8859-1");
                    }
                    write.println(line);
                }
            }
        } catch (IOException e) {
            System.err.println("Error downloading or writing file");
        }
    }

    /**
     * This method serves to escape any stray "&" characters in the
     * XML file. At each "&" it checks whether an "&amp;" entity starts
     * at that same index; if so, it is valid, otherwise the bare "&" is
     * replaced with an "&amp;" entity. Other entities (such as "&lt;")
     * are not recognized and would be escaped as well.
     * @param line The line of text to check for stray "&"'s (may be null)
     * @return The line with every stray "&" escaped, or null if line is null
     */
    public static String checkAndRemoveAmp(String line) {
        if (line == null) {
            return null;
        }

        int amp = line.indexOf('&');
        while (amp != -1) {

            if (!line.startsWith("&amp;", amp)) {
                String before = line.substring(0, amp);
                String after = line.substring(amp + 1);
                line = before + "&amp;" + after;
            }

            // Either way, an "&amp;" (5 characters) now sits at amp
            amp = line.indexOf('&', amp + 5);
        }

        return line;
    }
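
The method above treats only "&amp;" as a valid entity. If the file also contains other entities such as "&lt;" or numeric references such as "&#8217;", a regular expression with a negative lookahead is one way to leave those alone. This is only a sketch of an alternative, not the approach used above; the class name AmpEscaper and the method name escapeStrayAmps are made up for illustration:

```java
import java.util.regex.Pattern;

class AmpEscaper {

    // Matches an "&" that does NOT begin one of the five predefined XML
    // entities or a decimal/hexadecimal character reference
    private static final Pattern STRAY_AMP = Pattern.compile(
            "&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);)");

    /** Escapes stray "&" characters; returns null if line is null. */
    static String escapeStrayAmps(String line) {
        if (line == null) {
            return null;
        }
        return STRAY_AMP.matcher(line).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // prints "Tom &amp; Jerry &lt;3 &amp; more"
        System.out.println(escapeStrayAmps("Tom & Jerry &lt;3 &amp; more"));
    }
}
```

Because already-escaped text is never matched, running this over a line twice gives the same result as running it once.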



If anyone wants me to post the solution in other languages, let me know in a PM and I'll add it.