Regular Expressions?

C#

21 Replies - 8397 Views - Last Post: 26 December 2008 - 10:00 AM

#1 reCoded

  • D.I.C Regular

Reputation: 6
  • Posts: 282
  • Joined: 25-February 08

Posted 25 February 2008 - 06:42 PM

Hey everyone,

I am looking for some help doing the following in C#. I have saved an HTML file (a website), and I want to read a file (user input), search it for a string (also user input), and print out the string if it is found. Since this is an HTML file with borders, columns, etc., how can I go about tackling this problem?

Thanks,
reCoded

#2 PsychoCoder

  • Google.Sucks.Init(true);

Reputation: 1663
  • Posts: 19,853
  • Joined: 26-July 07

Posted 25 February 2008 - 07:51 PM

Have a look at this snippet, Strip HTML from string with Regular Expressions; it will help with your problem :)

#3 reCoded

Posted 28 February 2008 - 09:07 PM

Thanks for the reply!

It was recently stated that I cannot use regular expressions to solve this problem, so I guess I need a plan B. Also, I am looking at the HTML file's source code and loading that into my program, so it's a text file rather than the rendered HTML page.

Now my question is: since C# processes whitespace, should I concatenate all the strings in the file and then look for the header under which I want to search for a particular string? (This website has columns, headers, etc.; it's basically a table that I have to search to find the string the user wants to find on the site.)

Please let me know if I am taking the right approach in tackling this problem.

Thanks,
recoded

#4 Martyr2

  • Programming Theoretician

Reputation: 5576
  • Posts: 14,619
  • Joined: 18-April 07

Posted 28 February 2008 - 11:02 PM

Well, first of all, you can do this using regular expressions. That is what regular expressions are for: finding complex textual patterns in a haystack full of text. Whitespace isn't an issue if you set up the pattern correctly to take whitespace characters into account. Using regular expressions is actually the ideal solution for this. It is fast, it can find complex patterns, and the matches are easily collected.
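To illustrate the whitespace point, here is a minimal sketch (not from the original posts). The class and method names are made up for the example, and the sample HTML is an assumption about what the table might look like. The `\s*` between the two cells absorbs spaces, tabs, and line breaks, and `RegexOptions.Singleline` lets `.` cross newlines inside a cell.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceTolerantMatch
{
    // Hypothetical helper: returns the text of the cell that follows
    // the cell containing `name`, or null if nothing matches.
    public static string NextCellText(string html, string name)
    {
        // (?:(?!</td>).)*? matches any characters without running past the end of the cell.
        // Regex.Escape keeps characters such as '.' in the search term literal.
        string pattern =
            "<td[^>]*>(?:(?!</td>).)*?" + Regex.Escape(name) +
            "(?:(?!</td>).)*?</td>\\s*<td[^>]*>(?<next>.*?)</td>";

        Match m = Regex.Match(html, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups["next"].Value.Trim() : null;
    }

    public static void Main()
    {
        // Assumed sample: the two cells are separated by a line break and a tab.
        string html = "<tr><td><b>St. Stephen</b></td>\r\n\t <td>No delay</td></tr>";
        Console.WriteLine(NextCellText(html, "Stephen") ?? "No results"); // prints "No delay"
    }
}
```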

Perhaps you can give us an example of some text and the typical search string that someone would want to search for and perhaps someone here will be kind enough to build the pattern for you.

:)

#5 reCoded

Posted 02 March 2008 - 07:57 PM

</tr>
	<tr>
		<td><sup>All times local.</sup></td>
		<th id="ComCanada" headers="Com">Canada<br /> bound</th>
		<th id="ComUS" headers="Com">U.S.<br /> bound</th>
		<th id="TravCanada" headers="Trav">Canada<br /> bound</th>
		<th id="TravUS" headers="Trav">U.S.<br /> bound</th>
	</tr><tr><td headers="Office"><b>St. Stephen</b><br />St. Stephen, NB / Calais, ME<br />Last updated:<br /><b>2008-03-02&nbsp;&nbsp;13:09 AST</b></td><td headers="Com ComCanada">No delay</td><td headers="Com ComUS">No delay</td><td headers="Trav TravCanada">No delay</td><td headers="Trav TravUS">No delay</td></tr>
<tr><td headers="Office"><b>Woodstock Road</b><br />Belleville, NB / Houlton, ME<br />Last updated:<br /><b>2008-03-02&nbsp;&nbsp;12:44 AST</b></td><td headers="Com ComCanada">No delay</td><td headers="Com ComUS">No delay</td><td headers="Trav TravCanada">No delay</td><td headers="Trav TravUS">No delay</td></tr>


This is a snippet from the source file. I want to search one column for a name and return the text in the cell next to it (make sense?). I can't use regular expressions, but I was thinking of finding the column's header and then using some kind of string search from there. I'm still trying to plan how to tackle this problem without regular expressions, and any suggestions will be appreciated!

Thanks,
recoded

#6 orcasquall

  • D.I.C Head

Reputation: 13
  • Posts: 158
  • Joined: 14-September 07

Posted 05 March 2008 - 06:54 AM

Actually, I think it would have been better if you had just given the URL... Is this the one?
Border wait times

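So, say you search for "Stephen", and you want "No delay"? If so, try the following code:
// Requires: using System; using System.Text.RegularExpressions;
// sHtml contains the HTML code
string sSearch = "Stephen";
// Regex.Escape keeps characters such as '.' in the search term (e.g. "St. Stephen") literal
MatchCollection mc = Regex.Matches(sHtml, string.Format("<td.*?>.*?{0}.*?</td>\\s*<td.*?>.*?</td>", Regex.Escape(sSearch)), RegexOptions.IgnoreCase);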
if (mc.Count > 0)
{
    Match m = mc[0];
    string sRequired = Regex.Replace(m.ToString(), "<td.*?>.*?</td>\\s*<td.*?>(?<req>.*?)</td>", "${req}", RegexOptions.IgnoreCase);
    Console.WriteLine(sRequired);
}
else
{
    Console.WriteLine("No results");
}



As Martyr2 mentioned, provide the example source, a typical search string and the desired result. It's very hard to know if we got it right if we've got nothing to compare it with...

The regular expressions you need depend heavily on the structure of the HTML, so if the source HTML changes, the regular expressions will need to be modified.

Hope this helps!

#7 reCoded

Posted 05 March 2008 - 10:08 AM

Thank you for the reply!

Yes that helped out a lot. I appreciate all the responses.

-recoded

#8 reCoded

Posted 06 March 2008 - 12:53 AM

Using the website above, should I go through the file and make it one long string before I start breaking it down, finding the crossing that the user wants, and checking whether there is a delay?

#9 orcasquall

Posted 06 March 2008 - 05:22 AM

This was how I did it (in full)...
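// Requires: using System; using System.IO; using System.Text.RegularExpressions;
string sHtml;
// The using block closes the reader even if ReadToEnd throws
using (StreamReader sr = new StreamReader("yoursource.html"))
{
    sHtml = sr.ReadToEnd();
}

// sHtml contains the HTML code
string sSearch = "Stephen";
// Regex.Escape keeps characters such as '.' in the search term literal
MatchCollection mc = Regex.Matches(sHtml, string.Format("<td.*?>.*?{0}.*?</td>\\s*<td.*?>.*?</td>", Regex.Escape(sSearch)), RegexOptions.IgnoreCase);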
if (mc.Count > 0)
{
    Match m = mc[0];
    string sRequired = Regex.Replace(m.ToString(), "<td.*?>.*?</td>\\s*<td.*?>(?<req>.*?)</td>", "${req}", RegexOptions.IgnoreCase);
    Console.WriteLine(sRequired);
}
else
{
    Console.WriteLine("No results");
}


A couple of lines of code, and you've got the entire HTML in one string. Then you parse the HTML. You'll have to correctly assign whatever you want to search into sSearch of course.

You'd probably want to separate the reading-in-HTML part and the search part, then you won't read in the entire HTML every time you search. There's something called the variable for storing stuff... :)

I don't think there's much of a performance decrease if you parse the entire HTML every time you search, if that's what you mean. So just store everything in one string; there's no need to break down each row in the table.

This will work well when searches are infrequent. If you need to search frequently, a better way might be to parse the entire HTML structure once and store the resulting name/status pairs somewhere. You'll have to re-parse once every hour then, based on the update frequency of the source site.

HINT: Each Match object in the MatchCollection corresponds to one matching row in the HTML table. Use a foreach to loop through the MatchCollection to get all the related results.
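
The foreach hint could look something like the sketch below. This is only an illustration: the class name, the method name, and the pattern are assumptions based on the table snippet posted earlier in the thread (each row starting with a `<td headers="Office"><b>NAME</b>` cell), and the "10 minutes" value in the demo is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BorderTable
{
    // Hypothetical helper: maps each office name to the text of the cell that follows it.
    public static Dictionary<string, string> ParseRows(string html)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        // Assumes the first cell of every row looks like <td headers="Office"><b>NAME</b>...</td>.
        string pattern =
            "<td headers=\"Office\"><b>(?<name>.*?)</b>.*?</td>\\s*<td[^>]*>(?<status>.*?)</td>";

        MatchCollection mc = Regex.Matches(html, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // One Match per table row that fits the pattern.
        foreach (Match m in mc)
        {
            result[m.Groups["name"].Value] = m.Groups["status"].Value;
        }
        return result;
    }

    public static void Main()
    {
        // Shortened sample rows; the second status is invented for the demo.
        string html =
            "<tr><td headers=\"Office\"><b>St. Stephen</b><br />St. Stephen, NB / Calais, ME</td>" +
            "<td headers=\"Com ComCanada\">No delay</td></tr>\n" +
            "<tr><td headers=\"Office\"><b>Woodstock Road</b><br />Belleville, NB / Houlton, ME</td>" +
            "<td headers=\"Com ComCanada\">10 minutes</td></tr>";

        foreach (KeyValuePair<string, string> pair in ParseRows(html))
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
    }
}
```

Parsing once into a dictionary like this means later lookups do not need to touch the HTML again until the source page is refreshed.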

#10 reCoded

Posted 06 March 2008 - 05:52 PM

I see you used regular expressions? I was told not to use regular expressions, which is why I am running into dead ends programming this. Thanks for the help though.

#11 Jayman

  • Student of Life

Reputation: 423
  • Posts: 9,532
  • Joined: 26-December 05

Posted 06 March 2008 - 08:02 PM

Since you can't use RegEx, simply read the file into a string and use the String methods to look for the tags, then retrieve the text of each one using the IndexOf method.

It will take some work, but you can do it. Give it a try, and if you have problems, post your questions here, or if it's unrelated, create a new topic.
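
As a rough sketch of that string-method approach (the class and method names are invented for the example, and it assumes the name and the value sit in adjacent `<td>` cells, as in the snippet posted earlier):

```csharp
using System;

public static class PlainStringSearch
{
    // Hypothetical helper: finds `name`, then returns the contents of the
    // next <td> cell after the one containing it, or null if not found.
    public static string NextCellText(string html, string name)
    {
        int namePos = html.IndexOf(name, StringComparison.OrdinalIgnoreCase);
        if (namePos < 0) return null;

        // End of the cell that contains the name.
        int cellEnd = html.IndexOf("</td>", namePos, StringComparison.OrdinalIgnoreCase);
        if (cellEnd < 0) return null;

        // Opening tag of the following cell, then the '>' that closes that tag.
        int nextOpen = html.IndexOf("<td", cellEnd, StringComparison.OrdinalIgnoreCase);
        if (nextOpen < 0) return null;
        int contentStart = html.IndexOf('>', nextOpen);
        if (contentStart < 0) return null;
        contentStart++;

        int contentEnd = html.IndexOf("</td>", contentStart, StringComparison.OrdinalIgnoreCase);
        if (contentEnd < 0) return null;

        return html.Substring(contentStart, contentEnd - contentStart).Trim();
    }

    public static void Main()
    {
        string html =
            "<tr><td headers=\"Office\"><b>St. Stephen</b><br />St. Stephen, NB / Calais, ME</td>" +
            "<td headers=\"Com ComCanada\">No delay</td></tr>";
        Console.WriteLine(NextCellText(html, "Stephen") ?? "No results"); // prints "No delay"
    }
}
```

One caveat: the first IndexOf call finds the first occurrence of the name anywhere in the file, so if the name also appears outside the table (a page title, for instance), the search would need to start from the table's position instead.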

#12 reCoded

Posted 06 March 2008 - 11:55 PM

Yeah I am working on it now. Thanks a lot everyone for all the replies, really helped!


Thanks,
reCoded

#13 reCoded

Posted 10 March 2008 - 06:04 AM

Hey guys,

I am having trouble with the following code. It gives me a null reference error. I believe it's this line that is throwing the error: if (line.Contains("Canada Border Services Agency")). I am not sure what is wrong with it. I have been trying to figure it out, but still no luck. It's probably something simple being overlooked.

while (line != "</html>")
{
    //Console.WriteLine("yes");
    if (line.Contains("Canada Border Services Agency"))
    {
        Console.WriteLine("This is the Canadian website.");
    }

    line = reader.ReadLine();
}

reader.Close();


This post has been edited by PsychoCoder: 10 March 2008 - 06:15 AM

#14 PsychoCoder

Posted 10 March 2008 - 06:16 AM

I think in order to help, we're going to need to see the code before this section, where you're setting the value for line :)

#15 reCoded

Posted 10 March 2008 - 12:25 PM

String inputFile = args[0];
String borderCross = args[1];

// if (!File.Exists(inputFile))
// {
//     Console.WriteLine("Input file does not exist.");
//     return;
// }

StreamReader reader = new StreamReader(inputFile);
String line = reader.ReadLine();
String wholeFile = reader.ReadToEnd();

while (line != "</html>")
{
    //Console.WriteLine("yes");
    if (line.Contains("Canada Border Services Agency"))
    {
        Console.WriteLine("This is the Canadian website.");
    }

    line = reader.ReadLine();
}

reader.Close();
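
Judging only from the code above, one likely cause of the null reference: ReadToEnd() consumes the rest of the stream, so the next ReadLine() returns null, and null never equals "</html>", so the loop calls Contains on a null line. A minimal sketch of a null-safe loop follows; the class and method names are hypothetical, and a StringReader stands in for the file so the example runs on its own.

```csharp
using System;
using System.IO;

public static class LineScanner
{
    // Hypothetical helper: returns true if any line read from `reader` contains `marker`.
    public static bool ContainsLine(TextReader reader, string marker)
    {
        string line;
        // ReadLine returns null at end of input, so test for null
        // rather than waiting for a line equal to "</html>".
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Contains(marker)) return true;
        }
        return false;
    }

    public static void Main()
    {
        // Stand-in for the saved HTML file; a StreamReader could be passed the same way.
        string sample = "<html>\n<title>Canada Border Services Agency</title>\n</html>";
        using (StringReader reader = new StringReader(sample))
        {
            Console.WriteLine(ContainsLine(reader, "Canada Border Services Agency")
                ? "This is the Canadian website."
                : "Marker not found.");
        }
    }
}
```

If the whole file is also needed as one string, reading it once with ReadToEnd() and searching that string avoids mixing the two reading styles on the same reader.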


This post has been edited by PsychoCoder: 10 March 2008 - 01:59 PM
