I am looking for some help doing the following in C#. I have saved an HTML file (a web page), and I want to read the file (named by user input), search it for a string (also user input), and print the string if it is found. Since this is an HTML file with borders, columns, etc., how can I go about tackling this problem?
It was recently stated that I cannot use regular expressions to solve this problem, so I guess I am thinking of plan B. Also, I am looking at the HTML file's source code and loading that into my program, so it's a text file rather than the rendered HTML page.
Now my question: since C# processes white space, should I concatenate all the strings in the file and then search for the header under which I want to find a particular string? (This website has columns, headers, etc.; it's basically a table I have to search to find the string the user enters.)
Please let me know if I am taking the right approach in tackling this problem.
Well, first of all, you can do this using regular expressions. That is what regular expressions are for: finding complex textual patterns in a haystack full of text. Whitespace isn't an issue if you set up the pattern correctly to take whitespace characters into account. Using regular expressions is actually the ideal solution for this. It is fast, it can find complex patterns, and the matches are easily collected.
Perhaps you can give us an example of some text and the typical search string that someone would want to search for and perhaps someone here will be kind enough to build the pattern for you.
This is a snip from the source file. I want to search a whole column for a name and return the text in the column next to it (make sense?). I can't use regular expressions, but I was thinking of maybe searching for the column header using the string search methods. I'm still trying to plan how to tackle this problem without regular expressions, and suggestions will be appreciated!
Actually, I think it would have been better if you had just given the URL... Is this the one? Border wait times
So, say you search for "Stephen", and you want "No delay"? If so, try the following code:
csharp
// sHtml contains the HTML code
string sSearch = "Stephen";
MatchCollection mc = Regex.Matches(sHtml, string.Format("<td.*?>.*?{0}.*?</td>\\s*<td.*?>.*?</td>", sSearch), RegexOptions.IgnoreCase);
if (mc.Count > 0)
{
    Match m = mc[0];
    string sRequired = Regex.Replace(m.ToString(), "<td.*?>.*?</td>\\s*<td.*?>(?<req>.*?)</td>", "${req}", RegexOptions.IgnoreCase);
    Console.WriteLine(sRequired);
}
else
{
    Console.WriteLine("No results");
}
As Martyr2 mentioned, provide the example source, a typical search string and the desired result. It's very hard to know if we got it right if we've got nothing to compare it with...
The regular expressions you need depend heavily on the structure of the HTML. So if the source HTML changes, the regular expressions will need to be modified.
Using the website above, should I read the file into one long string before I start breaking it down to find the crossing the user wants and check whether there is a delay?
string sHtml = string.Empty;
StreamReader sr = new StreamReader("yoursource.html");
sHtml = sr.ReadToEnd();
sr.Close();

// sHtml contains the HTML code
string sSearch = "Stephen";
MatchCollection mc = Regex.Matches(sHtml, string.Format("<td.*?>.*?{0}.*?</td>\\s*<td.*?>.*?</td>", sSearch), RegexOptions.IgnoreCase);
if (mc.Count > 0)
{
    Match m = mc[0];
    string sRequired = Regex.Replace(m.ToString(), "<td.*?>.*?</td>\\s*<td.*?>(?<req>.*?)</td>", "${req}", RegexOptions.IgnoreCase);
    Console.WriteLine(sRequired);
}
else
{
    Console.WriteLine("No results");
}
A couple of lines of code, and you've got the entire HTML in one string. Then you parse the HTML. You'll have to correctly assign whatever you want to search into sSearch of course.
You'd probably want to separate the reading-in-HTML part and the search part, then you won't read in the entire HTML every time you search. There's something called the variable for storing stuff...
I don't think there's much of a performance decrease if you parse the entire HTML every time you search, if that's what you mean. So just store everything in one string; there's no need to break down each row in the table.
This will work well if searches are infrequent. If you need to search frequently, a better way might be to parse the entire HTML structure once and store the resulting name-to-value pairs somewhere. You'll then have to re-parse once every hour, based on the update frequency of the source site.
HINT: The number of Match objects in the MatchCollection is the number of rows in the HTML table. Use a foreach to loop through the MatchCollection to get all the relation results.
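To make the hint concrete, here is a rough sketch of the parse-once idea. It assumes every row of interest is two adjacent td cells (name, then status), which may not hold for the real page. `WaitTimeCache`, `Build`, and the group names `name` and `status` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WaitTimeCache
{
    // Assumption: a name cell is immediately followed by a status cell.
    // Singleline lets "." match newlines inside a cell.
    private static readonly Regex CellPair = new Regex(
        "<td[^>]*>(?<name>.*?)</td>\\s*<td[^>]*>(?<status>.*?)</td>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Parses the HTML once and returns a case-insensitive name -> status lookup.
    public static Dictionary<string, string> Build(string html)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in CellPair.Matches(html))
        {
            string name = m.Groups["name"].Value.Trim();
            string status = m.Groups["status"].Value.Trim();
            result[name] = status; // a later duplicate name overwrites an earlier one
        }
        return result;
    }
}
```

Build the dictionary once after reading the file, then answer each search with `TryGetValue`. Note that this looks up the full cell text, so a partial term such as "Stephen" would need a loop over the keys instead.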
I see you used regular expressions? I was told not to use regular expressions, that is why I am running into dead ends programming this. Thanks for the help though.
Since you can't use RegEx, simply read the file into a string and use the String methods to look for the tags, then retrieve the text of each one using the IndexOf and Substring methods.
It will take some work, but you can do it. Give it a try, and if you have problems then post your questions here, or if it's unrelated then create a new topic.
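As a starting point, the String-method approach might look roughly like the sketch below. It assumes the value you want sits in the td cell right after the cell containing the search term, and that the term does not appear earlier in the page (for example in the title). `TableSearch` and `FindNextCellText` are hypothetical names.

```csharp
using System;

public static class TableSearch
{
    // Returns the trimmed contents of the <td> that follows the first cell
    // containing searchTerm, or null if any step of the search fails.
    public static string FindNextCellText(string html, string searchTerm)
    {
        // Locate the search term, ignoring case.
        int termPos = html.IndexOf(searchTerm, StringComparison.OrdinalIgnoreCase);
        if (termPos < 0) return null;

        // Find the end of the cell that holds the term.
        int cellEnd = html.IndexOf("</td>", termPos, StringComparison.OrdinalIgnoreCase);
        if (cellEnd < 0) return null;

        // Find the next opening <td ...> tag and skip past its closing '>'.
        int nextOpen = html.IndexOf("<td", cellEnd, StringComparison.OrdinalIgnoreCase);
        if (nextOpen < 0) return null;
        int contentStart = html.IndexOf('>', nextOpen);
        if (contentStart < 0) return null;
        contentStart++;

        // The cell text runs up to the next closing tag.
        int contentEnd = html.IndexOf("</td>", contentStart, StringComparison.OrdinalIgnoreCase);
        if (contentEnd < 0) return null;

        return html.Substring(contentStart, contentEnd - contentStart).Trim();
    }
}
```

Calling `TableSearch.FindNextCellText(sHtml, "Stephen")` would then take the place of the regular expression search. If the term is in the last column of a row, this picks up the first cell of the next row, so check the result against the real page.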
I am having trouble with the following code. It gives me a null reference error. I believe it's this piece of code that is throwing the error: " if (line.Contains("Canada Border Services Agency"))". I am not sure what is wrong with it. I have been trying to figure it out, but still no luck. It's probably something simple being overlooked.
csharp
while (line != "</html>")
{
    //Console.WriteLine("yes");
    if (line.Contains("Canada Border Services Agency"))
    {
        Console.WriteLine("This is the Canadian website.");
    }
    line = reader.ReadLine();
}
reader.Close();
This post has been edited by PsychoCoder: 10 Mar, 2008 - 05:15 AM
// if (!File.Exists(inputFile))
// {
//     Console.WriteLine("Input file does not exist.");
//     return;
// }
StreamReader reader = new StreamReader(inputFile);
String line = reader.ReadLine();
String wholeFile = reader.ReadToEnd();
while (line != "</html>")
{
    //Console.WriteLine("yes");
    if (line.Contains("Canada Border Services Agency"))
    {
        Console.WriteLine("This is the Canadian website.");
    }
    line = reader.ReadLine();
}
reader.Close();
This post has been edited by PsychoCoder: 10 Mar, 2008 - 12:59 PM
Try commenting out the following line and then running your code. You also should be using the Peek method to make sure you haven't reached the EOF.
CODE
// comment this line out in your code
//String wholeFile = reader.ReadToEnd();
I think the problem here is that you perform a ReadLine and then a ReadToEnd, which moves the position that the StreamReader class uses to keep track of where it is in the file to the end of the file. The next ReadLine call then returns null, and calling line.Contains on that null value throws the null reference error.
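For what it's worth, a loop that tests for null instead of comparing against "</html>" avoids the error even when the file has no exact "</html>" line. This is only a sketch; `SiteCheck` and `ContainsMarker` are made-up names.

```csharp
using System;
using System.IO;

public static class SiteCheck
{
    // Returns true if any line read from the reader contains marker.
    public static bool ContainsMarker(TextReader reader, string marker)
    {
        string line;
        // ReadLine returns null once the end of the input is reached,
        // so the loop stops there instead of dereferencing null.
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Contains(marker))
            {
                return true;
            }
        }
        return false;
    }
}
```

You would call it with something like `using (StreamReader reader = new StreamReader(inputFile)) { ... SiteCheck.ContainsMarker(reader, "Canada Border Services Agency") ... }`, and the using block closes the reader for you.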
I want to do the same for this site as for the previous one. I am not sure what is wrong with my code or why it is not working properly. Any help would be much appreciated! http://apps.cbp.gov/bwt/
This post has been edited by PsychoCoder: 20 Mar, 2008 - 05:23 AM
Hmm... actually someone (from DIC) contacted me a week ago about the exact problem you asked. Is there a class project involving border wait times that I don't know about?
The new site you gave has a very different HTML structure from the one you originally gave, which is why the regular expressions you used before failed.
Anyway, this is what I gave that fellow:
csharp
string sHtml = string.Empty;
StreamReader sr = new StreamReader("bordertext.txt");
sHtml = sr.ReadToEnd();
sr.Close();

// sHtml contains the HTML code
string sBorder = "Ferry";
string sRequiredColumnHeader = "pv stdpv";
// grab rows first
MatchCollection mc = Regex.Matches(sHtml, "<tr.*?>.*?</tr>", RegexOptions.IgnoreCase);
if (mc.Count > 0)
{
    string sRequired = string.Empty;
    string sMatch = string.Empty;
    foreach (Match m in mc)
    {
        sMatch = m.ToString();
        // check if the row contains your search term. Note that only the first column
        // should contain it (you're searching based on the "border" text right?)
        // Hence this part <tr.*?>\\s*<td.*?>.*?{0}.*?</td>
        if (Regex.IsMatch(sMatch, string.Format("<tr.*?>\\s*<td.*?>.*?{0}.*?</td>.*?</tr>", sBorder)))
        {
            // Here, we don't care about the first column already.
            // As long as one of the columns contain the required column header
            // (you gave "pv stdpv") where the column contains your result, it's fine.
            sRequired = Regex.Replace(sMatch, string.Format("<tr.*?>.*?<td.*?{0}.*?>(?<required>.*?)</td>.*?</tr>", sRequiredColumnHeader), "${required}", RegexOptions.IgnoreCase);
            break;
        }
    }
    Console.WriteLine(sRequired);
}
else
{
    Console.WriteLine("No results");
}
The HTML is more complex, so there's a two-step process. First get the table row (by its tr tag), then grab the data from the specific column.
If you dive into the HTML of that new site, you'll find there are td headers. Assuming you're given a search term based on border name, and you want to extract the data in the "STANDARD" column for passenger vehicles, then the td header is "pv stdpv". I'll leave it to you to strip away the br tag once you have the result...
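If you get stuck on the br part, one possible way is sketched below. It assumes the breaks appear as plain `<br>`, `<br/>` or `<br />` tags with no attributes, which I haven't verified against the live page. `CellCleanup` and `StripBreaks` are illustrative names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CellCleanup
{
    // Replaces <br>, <br/> and <br /> (in any letter case) with a single space,
    // then trims leading and trailing whitespace from the result.
    public static string StripBreaks(string cellHtml)
    {
        string withoutBreaks = Regex.Replace(cellHtml, "<br\\s*/?>", " ", RegexOptions.IgnoreCase);
        return withoutBreaks.Trim();
    }
}
```

You would pass sRequired through `CellCleanup.StripBreaks` before printing it.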
To the fellow that I first gave the code to: I'm afraid the code I sent you is no longer exclusive... If it's any consolation, at least you've got a week's headstart.
Hopefully, this isn't a class project. Otherwise, pray that both of you don't have the same instructor...