Well, I have run into many problems parsing the homepage. In a discussion with PsychoCoder and KYA, we established that the problem is the same for C# and VB.NET. There are actually two issues. The first is the stray "&" character that shows up. A search and replace that swaps each "&" for "&amp;" fixes the issue in .NET.
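As a rough illustration of that search and replace, here is a minimal Java sketch. The class and method names (AmpEscapeSketch, naive, escapeStray) and the sample line are made up for this example. It assumes the only valid entity in the file is "&amp;", so a plain replace would double-escape any "&" that is already correct, while a negative lookahead leaves those alone.

```java
import java.util.regex.Pattern;

final class AmpEscapeSketch {
    // Matches an "&" that is not already the start of "&amp;"
    private static final Pattern STRAY_AMP = Pattern.compile("&(?!amp;)");

    /** Plain search and replace: also rewrites the "&" inside "&amp;". */
    static String naive(String line) {
        return line.replace("&", "&amp;");
    }

    /** Escapes only the stray "&" characters. */
    static String escapeStray(String line) {
        return STRAY_AMP.matcher(line).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // Hypothetical sample line with one valid entity and one stray "&"
        String line = "<title>Tips &amp; Tricks & More</title>";
        System.out.println(naive(line));
        System.out.println(escapeStray(line));
    }
}
```

Note that other entities such as "&lt;" would still be escaped by this pattern, so it only suits files where "&amp;" is the one entity in use.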
If you use Java, it still doesn't parse, and you get an error that states "Failed parsing - Invalid byte 1 of 1-byte UTF-8 sequence."
KYA suggested that the error stems from the fact that the XML file declares itself as UTF-8 when it is really encoded in ISO-8859-1. To fix this, simply replace "UTF-8" in the file with the real encoding. Here is the code to fix both problems in Java:
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

/**
 * This method downloads a website file line-by-line, applies both
 * fixes, and then prints each line into the given File.
 * @param from The URL from which to download (String)
 * @param to The File to write the lines into
 */
public static void downloadToFile(String from, File to) {
    URL u;
    // Load a URL from the String name
    try {
        u = new URL(from);
    } catch (MalformedURLException ex) {
        System.err.println("Problem with the URL");
        return;
    }
    // Copy the XML document to a local file. The file is really
    // ISO-8859-1, so read and write it with that charset.
    try {
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        conn.connect();
        // try-with-resources closes both streams, even on an error
        try (BufferedReader read = new BufferedReader(
                 new InputStreamReader(conn.getInputStream(), "ISO-8859-1"));
             PrintWriter write = new PrintWriter(to, "ISO-8859-1")) {
            String line;
            while ((line = read.readLine()) != null) {
                // Calls a fix for bad XML. Stray "&" must be escaped
                line = checkAndRemoveAmp(line);
                // Make the declared encoding match the real one
                line = line.replace("UTF-8", "ISO-8859-1");
                write.println(line);
            }
        }
    } catch (IOException e) {
        System.err.println("Error writing file");
    }
}
/**
 * This method serves to escape any stray "&" characters in the
 * XML file. It checks to see if each "&" already begins an "&amp;"
 * entity and if so, then it is valid, otherwise, it replaces it with
 * an "&amp;" entity. Other entities such as "&lt;" are escaped too.
 * @param line The line of text to check for stray "&"'s
 * @return The line with every stray "&" escaped (null if line is null)
 */
public static String checkAndRemoveAmp(String line) {
    if (line != null && line.contains("&")) {
        int amp = line.indexOf("&");
        while (amp != -1) {
            // Only escape when this "&" does not already begin "&amp;"
            if (!line.startsWith("&amp;", amp)) {
                String before = line.substring(0, amp);
                String after = line.substring(amp + 1);
                line = before + "&amp;" + after;
            }
            // Skip past the "&amp;" that now sits at this position
            amp = line.indexOf("&", amp + 5);
        }
    }
    return line;
}
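If you want to convince yourself that both fixes are needed, a small self-contained check along these lines may help. It is only a sketch: FeedFixCheck, parses, fix, and the sample document are invented for illustration, the regex in fix is a compact stand-in for checkAndRemoveAmp, and it relies on the XML parser that ships with the JDK (javax.xml.parsers). The sample declares UTF-8 but is encoded as ISO-8859-1, so the "é" becomes a single byte that is not valid UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

final class FeedFixCheck {

    /** Returns true if the bytes parse as well-formed XML. */
    static boolean parses(byte[] xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(xml));
            return true;
        } catch (Exception e) {
            // Covers both malformed XML and invalid byte sequences
            return false;
        }
    }

    /** Applies both fixes to a document held in memory as a String. */
    static String fix(String xml) {
        return xml.replaceAll("&(?!amp;)", "&amp;")   // escape stray "&"
                  .replace("UTF-8", "ISO-8859-1");    // correct the declaration
    }

    public static void main(String[] args) {
        // Hypothetical sample: declares UTF-8, has a stray "&" and an "é"
        String sample = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<title>Caf\u00e9 & Code</title>";
        // Encode as ISO-8859-1, the way the real file is served
        byte[] broken = sample.getBytes(StandardCharsets.ISO_8859_1);
        byte[] repaired = fix(sample).getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("before fixes: " + parses(broken));
        System.out.println("after fixes:  " + parses(repaired));
    }
}
```

The parser may still print its own fatal-error message to standard error for the broken input; the boolean result is what matters here.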
If anyone wants me to post the solution for other languages, let me know in a PM and I'll add it.