9 Replies - 1097 Views - Last Post: 17 September 2011 - 01:29 AM

#1 Nekroze

  • D.I.C Head

Reputation: 14
  • View blog
  • Posts: 170
  • Joined: 08-May 11

More efficient multi string splitting?

Posted 16 September 2011 - 03:49 AM

Heyo DIC's,

I am working on something and I need to split a string into a list based on multiple tokens.

The Input string I want to split will be formatted as such:
"Field1^Field2^Otherpart1@Differentpart1#Differentpart2"


I need a list containing all the different fields without their tokens.

Currently I use:
temp1 = Input.split("^")  # splits ID and IP of caller from input
temp2 = temp1[2].split("@")  # separates Otherpart1 from the rest
temp3 = temp2[1].split("#")  # list of the remaining parts
commands = (temp1[0], temp1[1], temp2[0], temp3)


However, I feel this is very inefficient. Has anyone got a better way?

thanks guys,
Nekroze


Replies To: More efficient multi string splitting?

#2 Nekroze

Re: More efficient multi string splitting?

Posted 16 September 2011 - 04:57 AM

To be clear, this code will be run very often and needs to be as efficient as possible, and coming from a C background I know to be wary of heavy string operations.

Just the number of variables here and the amount of copying between them seems very inefficient.

Just had to clarify, and sorry for the double post, but I don't know how to edit the OP on DIC.

This post has been edited by Nekroze: 16 September 2011 - 05:08 AM


#3 baavgai

  • Dreaming Coder

Reputation: 5909
  • View blog
  • Posts: 12,815
  • Joined: 16-October 07

Re: More efficient multi string splitting?

Posted 16 September 2011 - 05:42 AM

My first thought was that this was going to be simple. From the docs:

Quote

The sep argument may consist of multiple characters (for example, '1<>2<>3'.split('<>') returns ['1', '2', '3']). Splitting an empty string with a specified separator returns [''].
-- http://docs.python.o....html#str.split


However, that is not the same as splitting on a set of characters. The whole separator string has to match:
>>> "Field1^Field2^Otherpart1@Differentpart1#Differentpart2".split("^@#")
['Field1^Field2^Otherpart1@Differentpart1#Differentpart2']
>>> 
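
For what it's worth, the docs are describing a separator that is several characters long and matched as a whole, not a set of alternative characters. A minimal sketch of the difference, plus one way to stay with plain str.split by first rewriting every separator to a single one. Note that split_any is a hypothetical helper name used only for illustration, and it assumes seps is a non-empty string of single-character separators:

```python
record = "Field1^Field2^Otherpart1@Differentpart1#Differentpart2"

# A multi-character sep is matched as one whole substring.
print("1<>2<>3".split("<>"))  # ['1', '2', '3']
# The literal text "^@#" never occurs in record, so nothing is split.
print(record.split("^@#"))


def split_any(s, seps):
    """Split s on any single character in seps (hypothetical helper)."""
    first = seps[0]
    for sep in seps[1:]:
        s = s.replace(sep, first)  # normalise every separator to the first one
    return s.split(first)


print(split_any(record, "^@#"))
```

Whether the extra replace passes are cheaper than the alternatives is something only a timing run on real data can settle.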



For a one-liner, regular expressions never fail to impress:
>>> import re
>>> re.split(r'[\^@#]', "Field1^Field2^Otherpart1@Differentpart1#Differentpart2")
['Field1', 'Field2', 'Otherpart1', 'Differentpart1', 'Differentpart2']
>>> 
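
If the separator characters are not known in advance, the character class can be built from a string rather than typed by hand. This is only a sketch: make_splitter is a hypothetical helper name, and it assumes symbols is a non-empty string. re.escape is used so that characters with a special meaning inside a class (such as ^, ] or -) are taken literally:

```python
import re


def make_splitter(symbols):
    """Return a compiled pattern matching any one character in symbols.

    Hypothetical helper; symbols must be a non-empty string.
    """
    # re.escape protects characters such as ^, ] or - that would otherwise
    # change the meaning of the character class.
    return re.compile("[" + re.escape(symbols) + "]")


splitter = make_splitter("^@#")
print(splitter.split("Field1^Field2^Otherpart1@Differentpart1#Differentpart2"))
```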


You could compile that for faster repetition.

It's so very simple, though...

This is a little kludgy, but will do it:
def mySplit(s, symbols):
    # Split on each symbol in turn, re-splitting every piece found so far.
    result = [s]
    for ch in symbols:
        newResult = []
        for i in result:
            newResult.extend(i.split(ch))
        result = newResult
    return result

print(mySplit("Field1^Field2^Otherpart1@Differentpart1#Differentpart2", "^@#"))


This post has been edited by baavgai: 16 September 2011 - 05:43 AM


#4 Nekroze

Re: More efficient multi string splitting?

Posted 16 September 2011 - 05:49 AM

Ah, I didn't think to do that. By "compile that", do you mean write the function myself, like you do in the last example? I am still a bit new to Python, but thanks a heap mate, you're a legend.

#5 baavgai

Re: More efficient multi string splitting?

Posted 16 September 2011 - 05:52 AM

Nekroze, on 16 September 2011 - 07:57 AM, said:

To be clear, this code will be run very often and needs to be as efficient as possible, and coming from a C background I know to be wary of heavy string operations.


Ooo, new challenge! Perhaps a more C-like approach?

def mySplit(s, symbols):
    result = []
    last = 0  # start index of the field currently being scanned
    for i, ch in enumerate(s):
        if ch in symbols:
            result.append(s[last:i])
            last = i + 1
    result.append(s[last:])  # whatever follows the final separator
    return result
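
Before timing anything, it may be worth checking that the single-pass loop and the regular expression agree on awkward inputs such as leading, trailing, or doubled separators. A small sanity check, with the loop restated under the hypothetical name my_split so the snippet runs on its own (the sample strings are made up for illustration):

```python
import re


def my_split(s, symbols):
    # Same single-pass idea as the function above, restated so this runs alone.
    result, last = [], 0
    for i, ch in enumerate(s):
        if ch in symbols:
            result.append(s[last:i])
            last = i + 1
    result.append(s[last:])
    return result


# Made-up edge cases: empty fields, no separators, empty input.
samples = ["a^b@c#d", "^leading", "trailing#", "a^^b", "", "nosep"]
for sample in samples:
    assert my_split(sample, "^@#") == re.split(r"[\^@#]", sample)
print("both approaches agree on all samples")
```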



#6 baavgai

Re: More efficient multi string splitting?

Posted 16 September 2011 - 05:59 AM

Nekroze, on 16 September 2011 - 08:49 AM, said:

by "compile that" do you mean ...


Well, compile the expression. Essentially, remove the overhead of interpreting the expression on every call.

>>> import re
>>> 
>>> parser = re.compile(r'[\^@#]')
>>> parser.split("Field1^Field2^Otherpart1@Differentpart1#Differentpart2")
['Field1', 'Field2', 'Otherpart1', 'Differentpart1', 'Differentpart2']
>>> 
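
The flat list is not quite the shape built in the first post, where the trailing parts were kept together in their own list. A minimal sketch of getting that shape back from a pattern compiled once at import time. Here parse_command and _SEPARATORS are hypothetical names, and the code assumes every input has at least three fields in the Field1^Field2^Otherpart1@... layout:

```python
import re

# Compiled once, so repeated calls skip re-parsing the expression.
_SEPARATORS = re.compile(r"[\^@#]")


def parse_command(line):
    """Return (field1, field2, otherpart, [remaining parts]).

    Hypothetical helper; raises IndexError if fewer than three fields exist.
    """
    fields = _SEPARATORS.split(line)
    return (fields[0], fields[1], fields[2], fields[3:])


print(parse_command("Field1^Field2^Otherpart1@Differentpart1#Differentpart2"))
# ('Field1', 'Field2', 'Otherpart1', ['Differentpart1', 'Differentpart2'])
```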



Really, take some kind and just time test it. It's the only way to know for sure. We have some regex time tests lurking around here somewhere.

#7 Motoma

  • D.I.C Addict
Reputation: 452
  • View blog
  • Posts: 797
  • Joined: 08-June 10

Re: More efficient multi string splitting?

Posted 16 September 2011 - 06:43 AM

baavgai, on 16 September 2011 - 08:59 AM, said:

Really, take some kind and just time test it. It's the only way to know for sure.


This point can't be emphasized enough. Experimentation yields the greatest understanding.

Set up a testing harness that executes a method and times its execution. Measure how long it takes to run each algorithm a million times; you'll know soon enough which technique works the best with your data.
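
One possible shape for such a harness, as a sketch only. It restates the three approaches from this thread under hypothetical names (split_regex, split_loop, split_chained), uses the sample string from the first post as test data, and takes the best of a few timeit.repeat runs. The repetition count defaults to 100000 to keep the run short; raise it to a million if that suits your data better:

```python
import re
import timeit

RECORD = "Field1^Field2^Otherpart1@Differentpart1#Differentpart2"
SYMBOLS = "^@#"
PATTERN = re.compile(r"[\^@#]")


def split_regex(s=RECORD):
    return PATTERN.split(s)


def split_loop(s=RECORD):
    result, last = [], 0
    for i, ch in enumerate(s):
        if ch in SYMBOLS:
            result.append(s[last:i])
            last = i + 1
    result.append(s[last:])
    return result


def split_chained(s=RECORD):
    result = [s]
    for ch in SYMBOLS:
        pieces = []
        for part in result:
            pieces.extend(part.split(ch))
        result = pieces
    return result


def benchmark(funcs, number=100000, repeat=3):
    """Return {function name: best total seconds for `number` calls}."""
    timings = {}
    for func in funcs:
        # timeit accepts a zero-argument callable as the statement.
        runs = timeit.repeat(func, number=number, repeat=repeat)
        timings[func.__name__] = min(runs)
    return timings


if __name__ == "__main__":
    candidates = [split_regex, split_loop, split_chained]
    results = benchmark(candidates)
    for name in sorted(results, key=results.get):
        print("%-14s %.3f s per 100000 calls" % (name, results[name]))
```

Taking the minimum of several runs is a common way to reduce noise from other processes; the absolute numbers will differ from machine to machine.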

#8 Nekroze

Re: More efficient multi string splitting?

Posted 16 September 2011 - 03:19 PM

I am trying to get the timeit module to work to test these things, but I always get this error. Note: I am following multiple guides on the timeit module exactly, yet I still get this.

>>> import test
>>> test.runtest()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "test.py", line 26, in runtest
    test1.timeit()
  File "C:\Python26\lib\timeit.py", line 193, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner


Then it says that the function I passed to timeit raises a NameError ("global name is not defined"), even though I can call it perfectly well outside of timeit. It confuses me a touch, because I am doing what all the documentation says to.

#9 baavgai

Re: More efficient multi string splitting?

Posted 16 September 2011 - 05:19 PM

I'd never seen the timeit module. As cute as it is, I'm unimpressed. You have to send the "statement" as text. This means it's hard to give it a local context.

Here's my quick test:
import timeit

def testFunc():
	if hasattr(int, '__nonzero__'):
		pass

t = timeit.Timer(stmt="if hasattr(int, '__nonzero__'): pass")
print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)

t = timeit.Timer(stmt="testFunc()")
print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)



Result:
1.02 usec/pass
Traceback (most recent call last):
  File "timings.py", line 11, in <module>
    print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
  File "/usr/lib/python2.6/timeit.py", line 194, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
NameError: global name 'testFunc' is not defined



Scope in Python can be loose, but not that loose. That first one is from the doc page, the second is how I'd expect to call it. Obviously, the second is no good. This is probably what you're doing.
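
For reference, the NameError comes from timeit running the statement string in its own namespace, which does not contain testFunc. A minimal sketch of two ways around that, assuming Python 2.6 or later: hand timeit the function object itself, or import the name in the setup string. The helper time_by_name is a hypothetical name for illustration; for a script run directly, the module name would be "__main__", and for the test.py case above it would be "test":

```python
import timeit


def testFunc():
    if hasattr(int, '__nonzero__'):
        pass


NUMBER = 100000

# Option 1: pass the callable instead of a string, so no name lookup is needed.
t = timeit.Timer(testFunc)
print("%.2f usec/pass" % (1000000 * t.timeit(number=NUMBER) / NUMBER))


def time_by_name(func_name, module_name, number=NUMBER):
    """Option 2 (hypothetical helper): keep the string statement, but import
    the function in the setup code so the name exists when it runs."""
    timer = timeit.Timer("%s()" % func_name,
                         setup="from %s import %s" % (module_name, func_name))
    return timer.timeit(number=number)
```

From a script that defines testFunc at top level, the second form would be called as time_by_name("testFunc", "__main__").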

Here's the kind of code I'd use for timing:
import time

def timeProcess(func, repeat=100000):
	start_time = time.time()
	for i in xrange(repeat):
		result = func()
	elapsed = time.time() - start_time
	usec = 1000000.0 * elapsed/float(repeat) 
	return ("%.2f usec/pass" % (usec), elapsed, result, repeat)


def testFunc():
	if hasattr(int, '__nonzero__'):
		pass

print timeProcess(testFunc)



Result:
('1.77 usec/pass', 0.17707586288452148, None, 100000)


#10 Nekroze

Re: More efficient multi string splitting?

Posted 17 September 2011 - 01:29 AM

That is a much simpler testing method. Thank you for clearing this all up, guys!

All should be fine now.
