Count number of differences in two equal length strings

  • (2 Pages)
  • +
  • 1
  • 2

17 Replies - 2904 Views - Last Post: 05 June 2013 - 01:53 PM Rate Topic: -----

#1 C.Andrews   User is offline

  • D.I.C Head
  • member icon

Reputation: 15
  • View blog
  • Posts: 169
  • Joined: 18-October 12

Count number of differences in two equal length strings

Posted 05 June 2013 - 07:45 AM

I'm working on a problem where I need to compare two strings of equal length and count how many times two characters in the same position are not the same, i.e.:

String 1: AACGGTCGAAGC
String 2: AACTGTCGAATC

There are 2 differences in the above string. I need to be able to do it fast, because I'm making many many comparisons. What I'm using now is:

For i As Integer = 0 To aryAnimalIDs.Count - 1
            For j = (i + 1) To aryAnimalIDs.Count - 1
                Dim Currentsr As StringReader = New StringReader(aryGenome(i))
                Dim Contrastsr As StringReader = New StringReader(aryGenome(j))
                For k As Integer = 0 To Len(aryGenome(i)) - 1
                    If ChrW(Currentsr.Read()) <> ChrW(Contrastsr.Read()) Then intDiffCount += 1
                Next
                tblResults.Rows.Add(aryAnimalIDs(i), aryAnimalIDs(j), intDiffCount)
                intDiffCount = 0
            Next
        Next



...but it's pretty slow, takes about 1.14 milliseconds per pairwise comparison (need to do 26.532 million comparisons). Any ideas on a faster way to get the results? Also, would it be possible to parallelize my for loops to speed things along? I can never get Parallel.For to work.

Also, the reason it takes 1.14 milliseconds per comparison is that my actual strings aren't 12 characters long, they're 33,381 characters long.

Is This A Good Question/Topic? 0
  • +

Replies To: Count number of differences in two equal length strings

#2 vks.gautam1   User is offline

  • D.I.C Regular

Reputation: 17
  • View blog
  • Posts: 325
  • Joined: 21-March 08

Re: Count number of differences in two equal length strings

Posted 05 June 2013 - 08:00 AM

well i was just trying to understand your code. i have few question sorry don't get offended
1)(aryGenome(i)) and looks same to me aryGenome(j).i means you have two string values. i feel it should be in two different array string.
2) this is two string values
String 1: AACGGTCGAAGC
String 2: AACTGTCGAATC
now you are doing this
For i As Integer = 0 To aryAnimalIDs.Count - 1
 For j = (i + 1) To aryAnimalIDs.Count - 1


means in first string u are comparing with 0 index and second you are starting from 1 index more than first string.this doesn't look like same position comparison
.
or i failed to understand what u r asking for.

This post has been edited by vks.gautam1: 05 June 2013 - 08:02 AM

Was This Post Helpful? 0
  • +
  • -

#3 modi123_1   User is online

  • Suitor #2
  • member icon



Reputation: 14146
  • View blog
  • Posts: 56,699
  • Joined: 12-June 08

Re: Count number of differences in two equal length strings

Posted 05 June 2013 - 08:13 AM

A few things...

Why all the for loops?

Two string compares should be pretty darn easy.
        Dim foo As String = "ABC"
        Dim bar As String = "BBC"
        Dim total As Int32 = 0
        For i As Int32 = 0 To foo.Length - 1
            If foo(i) <> bar(i) Then total += 1
        Next



Right?

Next - why are you using StringReader?

Quote

.but it's pretty slow, takes about 1.14 milliseconds per pairwise comparison (need to do 26.532 million comparisons).
...
Also, the reason it takes 1.14 milliseconds per comparison is that my actual strings aren't 12 characters long, they're 33,381 characters long.

Well yeah.. that tends to happen.. and 1.14 ms is a fairly decent time.


Quote

Also, would it be possible to parallelize my for loops to speed things along? I can never get Parallel.For to work.

Yeah - if the results do not depend on each other throw that through some parallel processing.

Of course I have no idea what "I can never get paralle.for to work" shakes down as - so read up!
http://www.dreaminco...arallelfor-101/
Was This Post Helpful? 1
  • +
  • -

#4 C.Andrews   User is offline

  • D.I.C Head
  • member icon

Reputation: 15
  • View blog
  • Posts: 169
  • Joined: 18-October 12

Re: Count number of differences in two equal length strings

Posted 05 June 2013 - 08:22 AM

View Postvks.gautam1, on 05 June 2013 - 03:00 PM, said:

well i was just trying to understand your code. i have few question sorry don't get offended
1)(aryGenome(i)) and looks same to me aryGenome(j).i means you have two string values. i feel it should be in two different array string.
2) this is two string values
String 1: AACGGTCGAAGC
String 2: AACTGTCGAATC
now you are doing this
For i As Integer = 0 To aryAnimalIDs.Count - 1
 For j = (i + 1) To aryAnimalIDs.Count - 1


means in first string u are comparing with 0 index and second you are starting from 1 index more than first string.this doesn't look like same position comparison
.
or i failed to understand what u r asking for.


So, I'm comparing each item in the array to every other item in the array. There are 5,151 genomes in my array, so the outer For loop takes, for example, index 0 and runs my comparison for index 1 through 5,151. The next time it executes, it compares Index 1 to index 2 through 5,151. I don't need to compare 1 to 0 again because I did that in the first go-around.

Each index represents a difference animal, so when I use the stringreader to read through two different indexes, I'm comparing the same position (in this case a gene) in two different animals.

My output, then, will be a list of all 26,532,801 comparisons and the number of differences found for each.

View Postmodi123_1, on 05 June 2013 - 03:13 PM, said:

A few things...

Why all the for loops?

Two string compares should be pretty darn easy.
        Dim foo As String = "ABC"
        Dim bar As String = "BBC"
        Dim total As Int32 = 0
        For i As Int32 = 0 To foo.Length - 1
            If foo(i) <> bar(i) Then total += 1
        Next



Right?

Next - why are you using StringReader?

Quote

.but it's pretty slow, takes about 1.14 milliseconds per pairwise comparison (need to do 26.532 million comparisons).
...
Also, the reason it takes 1.14 milliseconds per comparison is that my actual strings aren't 12 characters long, they're 33,381 characters long.

Well yeah.. that tends to happen.. and 1.14 ms is a fairly decent time.


Quote

Also, would it be possible to parallelize my for loops to speed things along? I can never get Parallel.For to work.

Yeah - if the results do not depend on each other throw that through some parallel processing.

Of course I have no idea what "I can never get paralle.for to work" shakes down as - so read up!
http://www.dreaminco...arallelfor-101/


I should have been more explicit, what I meant was I'm having trouble getting my Parallel.For loop to exit when I need it to; normally I would just use Exit For when my conditions arise, but there doesn't seem to be an Exit For for a parallel loop.

I did get rid of my string reader and use your much simpler method for string comparisons; I didn't know I can refer to a character in a string by its index. Thanks for that.

This post has been edited by C.Andrews: 05 June 2013 - 08:30 AM

Was This Post Helpful? 0
  • +
  • -

#5 modi123_1   User is online

  • Suitor #2
  • member icon



Reputation: 14146
  • View blog
  • Posts: 56,699
  • Joined: 12-June 08

Re: Count number of differences in two equal length strings

Posted 05 June 2013 - 08:28 AM

Mmmm... so why would you want to break out of your parallel for loop? You should be going until completion for all instances, right?
Was This Post Helpful? 0
  • +
  • -

#6 C.Andrews   User is offline

  • D.I.C Head
  • member icon

Reputation: 15
  • View blog
  • Posts: 169
  • Joined: 18-October 12

Re: Count number of differences in two equal length strings

Posted 05 June 2013 - 08:37 AM

View Postmodi123_1, on 05 June 2013 - 03:28 PM, said:

Mmmm... so why would you want to break out of your parallel for loop? You should be going until completion for all instances, right?


Only for testing purposes. I only want to run through the first 100,000 or so while I'm optimizing so I don't spend an eternity on each change.

Edit: Getting rid of the reader cut my time per comparison from 1.14 to .7 ms. Not bad.

Edit again: God, I'm retarded. I can just use a smaller input file for testing and I won't have to exit my loop.

This post has been edited by C.Andrews: 05 June 2013 - 08:38 AM

Was This Post Helpful? 0
  • +
  • -

#7 C.Andrews   User is offline

  • D.I.C Head
  • member icon

Reputation: 15
  • View blog
  • Posts: 169
  • Joined: 18-October 12

Re: Count number of differences in two equal length strings

Posted 05 June 2013 - 09:01 AM

Ok, now that I've gotten my loop more or less optimal, I'm working on Parallelizing, and I'm having some trouble. I'm guessing it's because I'm trying to add rows to a datatable inside my parallel loop and the threads aren't playing nicely. Is there a safe way to do this, or a better method of storing the results from each parallel thread? What I've done is convert:

        For i As Integer = 0 To aryAnimalIDs.Count - 1
            For j = (i + 1) To aryAnimalIDs.Count - 1
                For k As Integer = 0 To Len(aryGenome(i)) - 1
                    strCurrent = aryGenome(i)
                    strContrast = aryGenome(j)
                    If strCurrent(k) <> strContrast(k) Then intDiffCount += 1
                Next
                tblResults.Rows.Add(aryAnimalIDs(i), aryAnimalIDs(j), intDiffCount)
                intDiffCount = 0
            Next
        Next



in to:

Parallel.For(0, (aryAnimalIDs.Count - 2), Sub(i As Integer)
                                                      For j = (i + 1) To aryAnimalIDs.Count - 1
                                                          For k As Integer = 0 To Len(aryGenome(i)) - 1
                                                              strCurrent = aryGenome(i)
                                                              strContrast = aryGenome(j)
                                                              If strCurrent(k) <> strContrast(k) Then intDiffCount += 1
                                                          Next
                                                          tblResults.Rows.Add(aryAnimalIDs(i), aryAnimalIDs(j), intDiffCount)
                                                          intDiffCount = 0
                                                      Next
                                                  End Sub)



My parallel loop takes longer and also doesn't return the correct number of results, probably for the reason I mentioned above.

Edit:

I also tried getting rid of my table and just using an array, but it doesn't seem to save any processing time. I also tried parallelizing my inner loop, but that takes longer yet, although I do get the correct output.

For i As Integer = 0 To AnimalCount - 2
            For j = (i + 1) To AnimalCount - 1
                Parallel.For(0, (GeneCount - 1), Sub(k As Integer)
                                                     strCurrent = aryGenome(i)
                                                     strContrast = aryGenome(j)
                                                     If strCurrent(k) <> strContrast(k) Then intDiffCount += 1
                                                 End Sub)
                Dim sb As New System.Text.StringBuilder()
                aryResults.Add(sb.Append(aryAnimalIDs(i)).Append(" ").Append(aryAnimalIDs(j)).Append(" ").Append(intDiffCount).ToString())
                intDiffCount = 0
            Next
        Next


This post has been edited by C.Andrews: 05 June 2013 - 09:45 AM

Was This Post Helpful? 0
  • +
  • -

#8 modi123_1   User is online

  • Suitor #2
  • member icon



Reputation: 14146
  • View blog
  • Posts: 56,699
  • Joined: 12-June 08

Re: Count number of differences in two equal length strings

Posted 05 June 2013 - 10:37 AM

Side note - I don't think a parallel for loop is where you would want to go with this, right? Why not just find an interesting way of chunking the array of strings in to smaller sections and shove those into Tasks (better than average threads)? Then all the upfront work can be done and the tasks manage themselves until finished.
Was This Post Helpful? 1
  • +
  • -

#9 C.Andrews   User is offline

  • D.I.C Head
  • member icon

Reputation: 15
  • View blog
  • Posts: 169
  • Joined: 18-October 12

Re: Count number of differences in two equal length strings

Posted 05 June 2013 - 10:46 AM

View Postmodi123_1, on 05 June 2013 - 05:37 PM, said:

Side note - I don't think a parallel for loop is where you would want to go with this, right? Why not just find an interesting way of chunking the array of strings in to smaller sections and shove those into Tasks (better than average threads)? Then all the upfront work can be done and the tasks manage themselves until finished.


Ha, that's exactly what I'm doing right now. I've broken my data such that I have 8 strings ~5000 characters wide instead of 1 string 40,000 characters wide, then I'm doing the comparisons for each subset of data in it's own thread and summing the results back up at the end.
Was This Post Helpful? 0
  • +
  • -

#10 macosxnerd101   User is online

  • Games, Graphs, and Auctions
  • member icon




Reputation: 12324
  • View blog
  • Posts: 45,424
  • Joined: 27-December 08

Re: Count number of differences in two equal length strings

Posted 05 June 2013 - 10:50 AM

I might be inclined to use a bit approach here. Represent the Strings as bits, and add them together. The number of 1's is the number of differences.
Was This Post Helpful? 0
  • +
  • -

#11 C.Andrews   User is offline

  • D.I.C Head
  • member icon

Reputation: 15
  • View blog
  • Posts: 169
  • Joined: 18-October 12

Re: Count number of differences in two equal length strings

Posted 05 June 2013 - 12:41 PM

View Postmacosxnerd101, on 05 June 2013 - 05:50 PM, said:

I might be inclined to use a bit approach here. Represent the Strings as bits, and add them together. The number of 1's is the number of differences.


I'm not sure I follow. Scratch that, I definitely don't follow.
Was This Post Helpful? 0
  • +
  • -

#12 macosxnerd101   User is online

  • Games, Graphs, and Auctions
  • member icon




Reputation: 12324
  • View blog
  • Posts: 45,424
  • Joined: 27-December 08

Re: Count number of differences in two equal length strings

Posted 05 June 2013 - 12:49 PM

Strings are all just bits and bytes. Each character has an int value as defined in the ASCII table. Thus, each String can be treated as a number, representable in binary (bits). Consider the Strings in bits:
  1100011
+ 1010011
----------
  0110000



Notice there are two 1's, which also mark the differences. There should be some way to get the bytes of the Strings and go from there. I'm not really a VB.NET guy, so I'm only talking theory here.

I hope this clears up the approach some, and I hope this helps! Sorry for any confusion on my part!
Was This Post Helpful? 1
  • +
  • -

#13 C.Andrews   User is offline

  • D.I.C Head
  • member icon

Reputation: 15
  • View blog
  • Posts: 169
  • Joined: 18-October 12

Re: Count number of differences in two equal length strings

Posted 05 June 2013 - 01:02 PM

View Postmacosxnerd101, on 05 June 2013 - 07:49 PM, said:

Strings are all just bits and bytes. Each character has an int value as defined in the ASCII table. Thus, each String can be treated as a number, representable in binary (bits). Consider the Strings in bits:
  1100011
+ 1010011
----------
  0110000



Notice there are two 1's, which also mark the differences. There should be some way to get the bytes of the Strings and go from there. I'm not really a VB.NET guy, so I'm only talking theory here.

I hope this clears up the approach some, and I hope this helps! Sorry for any confusion on my part!


That makes good sense, actually, and I think you're probably on to something. If I do my conversions ahead of time I may be able to trim a lot of overhead when it comes time to compare.
Was This Post Helpful? 0
  • +
  • -

#14 macosxnerd101   User is online

  • Games, Graphs, and Auctions
  • member icon




Reputation: 12324
  • View blog
  • Posts: 45,424
  • Joined: 27-December 08

Re: Count number of differences in two equal length strings

Posted 05 June 2013 - 01:04 PM

I just realized I forgot to carry my one's, but those would be extraneous characters still. You could also XOR the Strings for the same result. That would probably be a better approach.

Glad I could help some!
Was This Post Helpful? 0
  • +
  • -

#15 AdamSpeight2008   User is offline

  • MrCupOfT
  • member icon

Reputation: 2298
  • View blog
  • Posts: 9,535
  • Joined: 29-May 08

Re: Count number of differences in two equal length strings

Posted 05 June 2013 - 01:25 PM

Why you using Len? In .net String know their length just use.Length
Was This Post Helpful? 1
  • +
  • -

  • (2 Pages)
  • +
  • 1
  • 2