Deleting duplicates

18 Replies - 802 Views - Last Post: 27 April 2013 - 08:57 PM

#1 codejunky

Posted 26 April 2013 - 11:39 PM

I'm trying to show only unique emails but can't figure out how to do it.
I have tried many ways, but everything I've tried gives me issues.

Can someone please look at this code and explain how I can display only unique emails in the results?
try
{
    connectMysql.Open();
    datareader = cmdMysql.ExecuteReader();
    while (datareader.Read())
    {
        string urls = datareader.GetString("temp_page");
        string[] pages = urls.Split('\n');
        foreach (string page in pages)
        {
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(page);
            HttpWebResponse res = (HttpWebResponse)req.GetResponse();
            StreamReader sread = new StreamReader(res.GetResponseStream());

            string Pattern = (@"\b[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)\b");

            System.Text.RegularExpressions.MatchCollection matches = System.Text.RegularExpressions.Regex.Matches(sread.ReadToEnd(), Pattern, System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            string[] MatchList = new string[matches.Count];

            int m = 0;
            foreach (System.Text.RegularExpressions.Match match in matches)
            {
                MatchList[m] = match.Groups[page].Value;
                m++;

                displayBox.Text += match + Environment.NewLine;
            }
        }
    }
}
catch (MySqlException ex)
{
    MessageBox.Show(ex.Message);
}

If you can't figure out what's going on you likely can't help...lol j/k.

Anyway, I am scraping pages for emails, and when the results are displayed in displayBox.Text
I am currently getting duplicate emails within my results:

foreach (System.Text.RegularExpressions.Match match in matches)
{
    MatchList[m] = match.Groups[page].Value;
    m++;

    displayBox.Text += match + Environment.NewLine;
}

However, I only want unique emails to be displayed...

How would I write displayBox.Text += match + Environment.NewLine;
so that it only displays unique emails?

Thanks in advance!!


Replies To: Deleting duplicates

#2 Momerath

Posted 27 April 2013 - 02:42 AM

Change this code
foreach (System.Text.RegularExpressions.Match match in matches) {
    MatchList[m] = match.Groups[page].Value;
    m++;

    displayBox.Text += match + Environment.NewLine;
}


To this code
foreach (Match match in matches) {
    MatchList[m++] = match.Groups[page].Value;
}

displayBox.Text = String.Join(Environment.NewLine, MatchList.Distinct());

#3 codejunky

Posted 27 April 2013 - 03:16 AM

Thanks, but unfortunately that doesn't work.

The array is actually stored in match; however, I can't use Split with match because it's a regular expression.

What's happening is that I am getting a field from the database that contains a string of URLs.
Then I am converting the URLs into an array
 string[] pages = urls.Split('\n'); 
and then scraping each page for emails using regular expressions.

Once I get the scraped emails, I display them in a textbox with displayBox.Text += match + Environment.NewLine;

I have tried a bunch of stuff to get it to work, but I am really new to C#, so I'm having a hard time.

displayBox.Text = String.Join(Environment.NewLine, MatchList.Distinct());
displayBox.Text = String.Join(match + Environment.NewLine, MatchList.Distinct());


soooooooooooooo confused.....


#4 codejunky

Posted 27 April 2013 - 03:31 AM

Correction to my previous post: the array is actually stored in match; however, I can't use Distinct() (not Split) with match because it's a regular expression.

#5 codejunky

Posted 27 April 2013 - 03:40 AM

With
displayBox.Text += match + Environment.NewLine;
I get duplicates, and with
displayBox.Text = String.Join(Environment.NewLine, MatchList.Distinct()); 
I get no results at all.

#6 codejunky

Posted 27 April 2013 - 04:21 AM

actually not that confused...

displayBox.Text = String.Join(Environment.NewLine, MatchList.Distinct());

doesn't work because MatchList seems to be getting only the number of matches and not the actual matches.

The actual matches are stored in match, but I can't write it like
displayBox.Text = String.Join(Environment.NewLine, match.Distinct());
because it triggers an error: System.Text.RegularExpressions.Match does not contain a definition for 'Distinct'...
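
A likely explanation for the empty MatchList, assuming standard .NET regex behavior: match.Groups[page] looks up a capture group whose name is the URL string held in page. The pattern defines no such group, so the lookup returns an empty group and Value is an empty string. The full matched text is available as match.Value. The sketch below illustrates the difference; the class name, method names, and the simplified pattern are hypothetical stand-ins, not code from this thread.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class GroupLookupSketch
{
    // Simplified stand-in for the email pattern used in the thread (assumption).
    private const string Pattern = @"[\w.]+@[\w.]+\.\w+";

    // Looks up a capture group by name. The pattern has no named groups,
    // so any name (for example a page URL) yields an empty string.
    public static string ValueByGroupName(string input, string groupName)
    {
        Match match = Regex.Match(input, Pattern);
        return match.Groups[groupName].Value;
    }

    // Returns the full text of the first match.
    public static string WholeMatch(string input)
    {
        return Regex.Match(input, Pattern).Value;
    }
}
```

If that is what is happening here, MatchList would hold one empty string per match, and Distinct() would collapse them into a single empty entry, which looks like no results at all.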

#7 Skydiver

Posted 27 April 2013 - 08:10 AM

*sigh* This is what happens when you are coding by Intellisense instead of taking a few seconds to look at the documentation.

Here is some pseudo code that may help:
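var matchStringList = new List<string>();
// Declare the loop variable as Match: with var it is typed as object when
// MatchCollection only exposes the non-generic IEnumerable, and .Value would not compile.
foreach (Match match in scrapeMatches)
    matchStringList.Add(match.Value);

displayTextBox.Text = String.Join(Environment.NewLine, matchStringList.Distinct());



Or you could do something LINQ-ish:
// Cast<Match>() is needed where MatchCollection only implements the non-generic IEnumerable.
displayTextBox.Text = String.Join(Environment.NewLine,
                                  matches.Cast<Match>()
                                         .Select(m => m.Value)
                                         .Distinct());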


#8 codejunky

Posted 27 April 2013 - 11:43 AM

*sigh*

I have tried both... and am doing everything from the manual.
First project with regular expressions or even c# for that matter.


I think I'm just going to submit the emails to the database and use Distinct() that way or
submit them to the database and create another query to delete duplicates...

I was really just looking for a way to use Distinct() together with MatchCollection,
or even a way to write
string Pattern = (@"\b[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)\b");

System.Text.RegularExpressions.MatchCollection matches = System.Text.RegularExpressions.Regex.Matches(sread.ReadToEnd(), Pattern, System.Text.RegularExpressions.RegexOptions.IgnoreCase);

string[] MatchList = new string[matches.Count];

int c = 0;
foreach (System.Text.RegularExpressions.Match match in matches)
{
    MatchList[c] = match.Groups[page].Value;
    c++;
    displayBox.Text += match + Environment.NewLine;
}

as a list<string[]>, but no matter how I have tried it I always end up with an error of some kind lol

#9 andrewsw

Posted 27 April 2013 - 12:32 PM

The following simple example works, using advice already given. I've made the list itself distinct, as well as the values that are displayed.

            string sTest = "red green blue red orange green";
            string sPattern = @"\b\w+\b";
            System.Text.RegularExpressions.MatchCollection matches =
                System.Text.RegularExpressions.Regex.Matches(sTest, sPattern);

            List<string> sMatchList = new List<string>();
            foreach (var match in matches)
            {
                sMatchList.Add(match.ToString());
            }
            List<string> sUniqueMatches = sMatchList.Distinct().ToList();

            Console.WriteLine(String.Join(Environment.NewLine, sUniqueMatches));
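
One caveat, offered as a sketch rather than a definitive fix: Distinct() compares strings case-sensitively by default, so Bob@Example.com and bob@example.com would both be kept even when the regex itself ignores case. Passing StringComparer.OrdinalIgnoreCase treats them as one address. UniqueMatches is a hypothetical helper name, and Cast<Match>() is there for .NET versions where MatchCollection only implements the non-generic IEnumerable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class DistinctSketch
{
    // Returns the distinct matched strings, ignoring differences in letter case.
    public static List<string> UniqueMatches(string text, string pattern)
    {
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase)
                    .Cast<Match>()                               // non-generic collection to IEnumerable<Match>
                    .Select(m => m.Value)                        // full matched text
                    .Distinct(StringComparer.OrdinalIgnoreCase)  // "RED" and "red" count as duplicates
                    .ToList();
    }
}
```

Called with the test string "red Green blue RED orange green" and the pattern @"\b\w+\b", this keeps four entries, one per colour.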

The only other advice I would offer is that, as this is your first project, start with something simpler. (Seems a strange choice for a first project anyway, particularly taking on regex and C# at the same time.)

This post has been edited by andrewsw: 27 April 2013 - 12:34 PM

#10 codejunky

Posted 27 April 2013 - 12:47 PM

Thanks! I will try your example.

I have a ton of experience with php/MySQL so c# isn't so bad... this is just a small part of the application and I have done most of it without any issues up to now. =)



#11 codejunky

Posted 27 April 2013 - 01:03 PM

Your solution works great. However, now I have a new issue.
The results come out like:

email@email.comemail@email.comemail@email.comemail@email.comemail@email.comemail@email.comemail@email.comemail@email.com

no spaces between the emails...

I assume I will just have to use Replace() and do something like Replace(".com",".com\n"); but not sure if that's the best way to go about it...



#12 codejunky

Posted 27 April 2013 - 01:13 PM

Yeah Replace() seems to do the trick...

Thanks for the help!!

#13 andrewsw

Posted 27 April 2013 - 01:46 PM

codejunky, on 27 April 2013 - 08:13 PM, said:

Yeah Replace() seems to do the trick...

Thanks for the help!!

Why should you need to do this? Surely the emails are initially separated? But, hey, if you're happy..

#14 codejunky

Posted 27 April 2013 - 01:55 PM

displayBox.Text += (String.Join(Environment.NewLine, sUniqueMatches));

gives me email@email.comemail@email.comemail@email.comemail@email.comemail@email.comemail@email.comemail@email.comemail@email.comemail@email.comemail@email.comemail@email.comemail@email.comemail@email.comemail@email.comemail@email.comemail@email.comemail@email.com


I'm confused because I thought
Environment.NewLine would give a new line.

Also, Replace() doesn't work, because if the domain is something other than .com then no new line is created...



#15 andrewsw

Posted 27 April 2013 - 02:01 PM

You don't still have displayBox.Text += .. within a loop do you? It only needs to occur once, without the + sign.
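
A sketch of that advice, assuming the page texts have already been downloaded: each += inside the page loop appends one joined block with no separator after it, which is a likely cause of the addresses running together. Collecting every match from every page into a single HashSet<string> and building the text once after the loops avoids both the duplicates and the missing separators. JoinUniqueEmails and the simplified pattern are hypothetical stand-ins, not code from this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class UniqueEmailSketch
{
    // Simplified stand-in for the email pattern used in the thread (assumption).
    private const string Pattern = @"[\w.+-]+@[\w-]+(\.[\w-]+)+";

    // Returns every distinct address found across all pages, one per line,
    // in the order each address was first seen.
    public static string JoinUniqueEmails(IEnumerable<string> pageTexts)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var ordered = new List<string>();

        foreach (string text in pageTexts)
        {
            foreach (Match match in Regex.Matches(text, Pattern))
            {
                // Add returns false when the address is already in the set.
                if (seen.Add(match.Value))
                {
                    ordered.Add(match.Value);
                }
            }
        }

        // Build the output once, after all pages have been processed.
        return string.Join(Environment.NewLine, ordered);
    }
}
```

In the form, the returned string would be assigned a single time after the loops, as in displayBox.Text = UniqueEmailSketch.JoinUniqueEmails(pageTexts);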