1 Replies - 794 Views - Last Post: 13 February 2015 - 05:27 AM Rate Topic: -----

#1 red_riding_hoos   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 3
  • Joined: 03-January 15

Efficient counting of duplicate rows in a numy array

Posted 13 February 2015 - 12:26 AM

Hello, Python Masters!

This time I am quite desperately looking for a very efficient method to get rid of those rows of a numpy array
which are duplicate. What is more, I have to keep the number of copies each row had in the original array, because after obtaining
unique array I am willing to insert an extra column with this info to this newly-built array. As mentioned, I reapeat that this extra column would inform upon no. of duplicates of a particular row in the original array.
Let me give you a baybish exmaple of what I am willing to do:

ORIGINAL ARRAY (my matrices are binary so that's a habit for me to use 1s and 0s..
[
[1,0,0,0,0],
[0,1,0,0,0],
[1,0,0,0,0],
[1,0,0,0,0],
[1,0,0,0,0],
[1,1,1,1,1],
[0,0,0,0,1],
[0,0,0,0,1],
]

ARRAY AFTER TRANSAFORMATION (with extra col):
[
[1,0,0,0,0, 4],
[0,1,0,0,0, 1],
[1,1,1,1,1, 1],
[0,0,0,0,1, 2 ],

]


I've tried this command:
duplicate_counter=(my_array[:,np.newaxis,:] == my_array).all(axis=2).sum(axis=1)


Since my numpy array has 170 cols and 10 000 rows these operation seem to be neverending (after 15 mins of waiting I've given up) -

Works fine for baby mini arrays but for such a huge one it's a horror...
What I am really intending to do is to find the fastest way to obtain array with no duplicates and an extra column keeping record of duplicates in the oroginal array.
Probably duplicate counter step is not necessary...
And can you help me with deleting duplicate rows??

Your help would be very much appreciated!
Cheers,

Red_Riding_Hood

Is This A Good Question/Topic? 0
  • +

Replies To: Efficient counting of duplicate rows in a numy array

#2 baavgai   User is offline

  • Dreaming Coder
  • member icon


Reputation: 7505
  • View blog
  • Posts: 15,553
  • Joined: 16-October 07

Re: Efficient counting of duplicate rows in a numy array

Posted 13 February 2015 - 05:27 AM

Use a dictionary.
>>> a = [
...     [1,0,0,0,0],
...     [0,1,0,0,0],
...     [1,0,0,0,0],
...     [1,0,0,0,0],
...     [1,0,0,0,0],
...     [1,1,1,1,1],
...     [0,0,0,0,1],
...     [0,0,0,0,1],
...     ]
>>> 
>>> d = {}
>>> for x in a: d.setdefault(str(x), [x,0])[1] += 1
... 
>>> d.values()
[[[0, 1, 0, 0, 0], 1], [[1, 0, 0, 0, 0], 4], [[0, 0, 0, 0, 1], 2], [[1, 1, 1, 1, 1], 1]]
>>> 



Hmm, I feel bad about that one liner: let me expand it a little:
>>> d = {}
>>> for x in a:
...     y = str(x) # this puppy is hashable, the list ain't
...     if not y in d:
...         # put a value in the dictionary 
...         # if it doesn't exist yet
...         # helpfully, also store the original object
...         d[y] = [x, 0] # put a value in the dictionary 
...     # we now know for certain that object y is a key
...     # increment the second value, our count
...     d[y][1] += 1
... 
>>> # this is our dictionary, with keys and values
... d
{'[0, 1, 0, 0, 0]': [[0, 1, 0, 0, 0], 1], '[1, 0, 0, 0, 0]': [[1, 0, 0, 0, 0], 4], '[0, 0, 0, 0, 1]': [[0, 0, 0, 0, 1], 2], '[1, 1, 1, 1, 1]': [[1, 1, 1, 1, 1], 1]}
>>> # the result we want is just the values
... d.values()
[[[0, 1, 0, 0, 0], 1], [[1, 0, 0, 0, 0], 4], [[0, 0, 0, 0, 1], 2], [[1, 1, 1, 1, 1], 1]]
>>> 



Another way to skin this cat, though I don't know if it will be faster than above, would be to sort it and deplete it. While a list isn't hashable, it will do equality. And, having fewer values in memory could help with very large lists.

e.g.
>>> a
[[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1]]
>>> a.sort()
>>> a
[[0, 0, 0, 0, 1], [0, 0, 0, 0, 1], [0, 1, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 1, 1, 1, 1]]
>>> a2 = [ [ a.pop(), 1 ] ] # seed the list, first value, count 1
>>> while not a==[]:
...     v = a.pop() # removed the next value of the ORDERED list
...     if a2[-1][0]==v: 
...         a2[-1][1] += 1 # if equal to the tail, increment
...     else:
...         a2.append([v, 1]) # otherwise, add the next entry
... 
>>> a # this is now empty
[]
>>> a2 # and this is loaded
[[[1, 1, 1, 1, 1], 1], [[1, 0, 0, 0, 0], 4], [[0, 1, 0, 0, 0], 1], [[0, 0, 0, 0, 1], 2]]
>>> 


Was This Post Helpful? 1
  • +
  • -

Page 1 of 1