Well, there are many problems that I have encountered with parsing the homepage. In a discussion that I had with PsychoCoder and KYA, we established that this problem is the same for C# and VB.NET. There are actually two issues. The first is the stray "&" character that shows up. Doing a search and replace that turns each "&" into "&amp;" fixes the issue in .NET.
If you use Java it still doesn't parse and you get an error that states "Failed parsing - Invalid byte 1 of 1-byte UTF-8 sequence."
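This kind of error can be reproduced in isolation. The following is only a minimal sketch, not the parser's own code: the class name, the method name, and the sample word "café" are all made up for illustration. It assumes the offending input contains a byte such as 0xE9 ('é' in ISO-8859-1), which is not a complete UTF-8 sequence by itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original post.
public class EncodingMismatchDemo {

    /** Returns true if the bytes decode cleanly as strict UTF-8. */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown when a byte sequence is not legal UTF-8
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed sample: "café" encoded as ISO-8859-1, so 'é' is the single byte 0xE9
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Strict UTF-8 decoding rejects the lone 0xE9 byte
        System.out.println("Valid UTF-8? " + isValidUtf8(latin1));

        // Decoding with the charset the bytes were actually written in succeeds
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

A check like this can help confirm whether a downloaded file really matches the encoding it declares before it is handed to an XML parser.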
KYA suggested that the error stemmed from the fact that the XML file said it was in UTF-8 but it really wasn't; it's in iso8859-1 encoding. To fix this, simply replace the declared encoding "UTF-8" with "iso8859-1". Here is the code to fix both problems in Java:
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.HttpURLConnection;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    /**
     * This method downloads a website file line-by-line, applies a fix,
     * and then prints each line into the given File.
     * @param from The URL from which to download (String)
     * @param to The File to stick the lines into
     */
    public static void downloadToFile(String from, File to) {
        URL u;
        // Load a URL from the String name
        try {
            u = new URL(from);
        } catch (MalformedURLException ex) {
            System.err.println("Problem with the URL");
            return;
        }

        // Copy the XML document to a local file
        try {
            HttpURLConnection conn = (HttpURLConnection) u.openConnection();
            conn.connect();
            // The bytes are really ISO-8859-1, so read and write them as such
            // instead of relying on the platform default charset
            try (BufferedReader read = new BufferedReader(
                     new InputStreamReader(conn.getInputStream(),
                                           StandardCharsets.ISO_8859_1));
                 PrintWriter write = new PrintWriter(to, "ISO-8859-1")) {
                String line;
                while ((line = read.readLine()) != null) {
                    // Calls a fix for bad XML. Stray "&" must become "&amp;"
                    line = checkAndRemoveAmp(line);
                    // Correct the declared encoding (plain text replace, no regex)
                    line = line.replace("UTF-8", "iso8859-1");
                    write.println(line);
                }
            }
        } catch (IOException e) {
            System.err.println("Error downloading or writing file");
        }
    }

    /**
     * This method serves to escape any stray "&" characters in the
     * XML file. It checks whether each "&" already starts an "&amp;"
     * entity; if so, it is valid, otherwise it is replaced with an
     * "&amp;" entity. Note that only "&amp;" is recognized, so other
     * entities such as "&lt;" are escaped too.
     * @param line The line of text to check for stray "&"'s
     * @return A string free of stray "&"'s
     */
    public static String checkAndRemoveAmp(String line) {
        if (line != null) {
            int amp = line.indexOf('&');
            while (amp != -1) {
                if (!line.startsWith("&amp;", amp)) {
                    String before = line.substring(0, amp);
                    String after = line.substring(amp + 1);
                    line = before + "&amp;" + after;
                }
                // Continue searching after the "&" that was just handled
                amp = line.indexOf('&', amp + 1);
            }
        }
        return line;
    }
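As an alternative to the hand-written loop, the same check can be sketched with a regular expression. This is only an illustration under a few assumptions: the class and method names are invented, and unlike checkAndRemoveAmp it also leaves the other predefined XML entities ("&lt;", "&gt;", "&quot;", "&apos;") and numeric character references alone.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the original post.
public class AmpersandEscaper {

    // Matches an "&" that does NOT begin a predefined XML entity
    // or a numeric character reference such as "&#169;" or "&#xA9;"
    private static final Pattern STRAY_AMP =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);)");

    /** Replaces every stray "&" in the line with "&amp;". */
    public static String escapeStrayAmpersands(String line) {
        if (line == null) {
            return null;
        }
        return STRAY_AMP.matcher(line).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // prints: Tom &amp; Jerry &amp; friends &lt;3
        System.out.println(escapeStrayAmpersands("Tom & Jerry &amp; friends &lt;3"));
    }
}
```

The negative lookahead means text that is already escaped is not escaped twice, so running the method over the same line again should leave it unchanged.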
If anyone wants me to post the solution to other languages, let me know in a PM and I'll add it.