14 Replies - 8863 Views - Last Post: 29 June 2011 - 04:48 PM

#1 tootypegs

How can you remove everything except a-z from a string?

Posted 28 June 2011 - 03:18 PM

Is there a way to remove everything from a string except the letters a-z? Some of the strings I get when I process my file contain symbols, Unicode characters, and numbers, but I want to remove all of these and keep just the letters. Is there a way to accomplish this?

cheers guys

Replies To: How can you remove everything except a-z from a string?

#2 Simown

Posted 28 June 2011 - 03:48 PM

Just lowercase a-z, and spaces too?

One solution is:

>>> string = "!s*imo*w!?!?!@@~n"
>>> newString = ''
>>> validLetters = "abcdefghijklmnopqrstuvwxyz"
>>> for char in string:
...     if char in validLetters:
...         newString += char
...
>>> print(newString)
simown



Or, more concisely and more Pythonically :):

>>> newString = ''.join([char for char in string if char in validLetters])
>>> print(newString)
simown



The logic behind both methods is the same; use whichever is clearer to you.

This post has been edited by Simown: 28 June 2011 - 03:53 PM

#3 sepp2k

Posted 28 June 2011 - 04:57 PM

If you choose Simown's second solution you may want to remove the brackets (i.e. use a generator expression instead of a list comprehension) to avoid creating a temporary list.
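To make the difference concrete, here is a minimal sketch. The function names and the VALID_LETTERS constant are made up for this example and do not come from the posts above. Both functions return the same string; only the argument passed to join differs. How much memory this actually saves depends on the interpreter, so treat it mainly as a matter of style.

```python
VALID_LETTERS = "abcdefghijklmnopqrstuvwxyz"

def keep_letters_list(text):
    # List comprehension: builds a complete temporary list, then joins it.
    return ''.join([char for char in text if char in VALID_LETTERS])

def keep_letters_gen(text):
    # Generator expression: no brackets, characters are produced on demand.
    return ''.join(char for char in text if char in VALID_LETTERS)

print(keep_letters_list("!s*imo*w!?!?!@@~n"))
print(keep_letters_gen("!s*imo*w!?!?!@@~n"))
```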

Note that with either solution, if you want to keep uppercase letters as well as lowercase ones, you'll have to include them in validLetters too.

Or if you don't want to type out all the letters, you can just use a regex like this:

import re
new_string = re.sub("[^a-zA-Z]","", old_string)
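If the same cleanup runs on many strings, one possible variation is to compile the pattern once and wrap it in a function. This is only a sketch: NON_LETTERS and strip_non_letters are names invented for the example.

```python
import re

# Compile once so repeated calls reuse the same pattern object.
NON_LETTERS = re.compile(r"[^a-zA-Z]")

def strip_non_letters(old_string):
    # Replace every character that is not an ASCII letter with nothing.
    return NON_LETTERS.sub("", old_string)

print(strip_non_letters("Hello, World! 123"))  # HelloWorld
```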


This post has been edited by sepp2k: 29 June 2011 - 07:44 AM

#4 baavgai

Posted 29 June 2011 - 04:32 AM

I'd go with a join generator solution. Regex is cute, but far more overhead than the job calls for. For maximum flexibility, I'd have a validation function.

e.g.
def sanitize(s):
	def isValid(c): 
		return c in "abcdefghijklmnopqrstuvwxyz"
	return ''.join(c for c in s if isValid(c))


#5 sepp2k

Posted 29 June 2011 - 05:03 AM

baavgai, on 29 June 2011 - 01:32 PM, said:

I'd go with a join generator solution. Regex is cute, but far more overhead than the job calls for.


Actually, I'm pretty sure that scanning validLetters for every character costs much more than using a regex. And sure enough: according to a quick benchmark the regex solution is faster than the generator solution by about 25% (which again is faster than the for loop by about the same amount).

And much more importantly writing a-zA-Z is much less error prone than typing out all 26 letters in the alphabet manually - twice.
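For the solutions that do need the alphabet as a string, the standard library's string module already provides it, so nothing has to be typed by hand. A small sketch follows; keep_chars, LOWER and LETTERS are names made up for this example.

```python
import string

# Predefined alphabets from the standard library.
LOWER = string.ascii_lowercase    # 'abcdefghijklmnopqrstuvwxyz'
LETTERS = string.ascii_letters    # lowercase followed by uppercase

def keep_chars(text, allowed=LETTERS):
    # Keep only the characters that appear in `allowed`.
    return ''.join(c for c in text if c in allowed)

print(keep_chars("a1B2c3"))         # aBc
print(keep_chars("a1B2c3", LOWER))  # ac
```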

#6 atraub

Posted 29 June 2011 - 06:18 AM

When in doubt, keep it simple... and use built-ins since they're implemented in C ;)

def removeNonLetters(input_string):
	retString = ""
	for eachLetter in input_string:
		if eachLetter.isalpha():
			retString += eachLetter
	return retString

#7 sepp2k

Posted 29 June 2011 - 07:02 AM

atraub, on 29 June 2011 - 03:18 PM, said:

When in doubt, keep it simple... and use built-ins since they're implemented in C ;)


Interestingly, using isalpha is actually slower than using validLetters. On the other hand, it correctly handles non-ASCII letters and is definitely more readable and less error-prone than typing out the whole alphabet yourself. That said, I'd still recommend join over += in a loop because a) it's faster and b) it's more idiomatic (and arguably simpler because it hides the loop).

If correct handling of non-ASCII letters is not necessary, though, I'd still prefer the regex solution because it most closely matches the problem being solved. If I look at the for loop, I read: "Go through the original string and add each character that is a letter to the new string". If I look at the solution using join, I read: "Build a string containing all the characters in the original string that are letters". And if I look at the regex, I read: "Replace all characters that are not letters with nothing". So in my eyes that's the simplest solution, since it requires the fewest steps from what the code says to what the problem description says (which is "Remove everything that is not a letter from the string"). Plus it's still the fastest solution.
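To show what "handling non-ASCII letters" means in practice, here is a small sketch. It assumes Python 3, where str is Unicode, and the function names are invented for the example. The sample text is written with escape sequences so it does not depend on the file encoding.

```python
import re

def letters_any_alphabet(text):
    # str.isalpha() is True for any Unicode letter, not just a-z and A-Z.
    return ''.join(c for c in text if c.isalpha())

def letters_ascii_only(text):
    # The character class only admits the 52 ASCII letters.
    return re.sub(r"[^a-zA-Z]", "", text)

sample = "na\u00efve caf\u00e9 123"   # "naive cafe 123" with accented i and e
print(letters_any_alphabet(sample))   # accented letters are kept
print(letters_ascii_only(sample))     # accented letters are dropped: navecaf
```

Which behaviour is right depends on whether accented letters should count as letters for the file being processed.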

#8 atraub

Posted 29 June 2011 - 07:35 AM

Keep in mind that calling code "Clever" isn't actually a compliment in the Python world. We strive for readability and simplicity. There are plenty of good ways to optimize your code (for example, by using Cython) if speed is your primary focus.

#9 sepp2k

Posted 29 June 2011 - 07:37 AM

atraub, on 29 June 2011 - 04:35 PM, said:

Keep in mind that calling code "Clever" isn't actually a compliment in the Python world.

Who called my code clever?

Quote

We strive for readability and simplicity.

Exactly.

#10 atraub

Posted 29 June 2011 - 07:41 AM

sepp2k, on 29 June 2011 - 10:37 AM, said:

atraub, on 29 June 2011 - 04:35 PM, said:

Keep in mind that calling code "Clever" isn't actually a compliment in the Python world.

Who called my code clever?

I never said anyone did, don't get defensive friend :) No one's attacking you or your regex.

To be honest, I didn't notice your code because you forgot to use code tags ;)

If you really want to hide the loop, and readability is less of a factor, you could just use a list comprehension to get the job done.

def removeNonLetters(input_string):
	return ''.join([eachLetter for eachLetter in input_string if eachLetter.isalpha()])


For a personal project, I might do something like this because I can easily read it. However, it's not the sort of thing I would give to someone asking for help because of the complexity of it.

This post has been edited by atraub: 29 June 2011 - 07:50 AM

#11 sepp2k

Posted 29 June 2011 - 07:51 AM

atraub, on 29 June 2011 - 04:41 PM, said:

I never said anyone did, don't get defensive friend :) No one's attacking you or your regex.


Sorry that I misunderstood you. When I said "I think the regex is the most readable solution" and you said "Clever isn't a compliment", I just took that to imply that you disagreed and found it too clever.

Quote

To be honest, I didn't notice your code because you forgot to use code tags ;)


Oops, my bad.

Quote

If you really want to hide the loop, and readability is less of a factor, you could just use a list comprehension to get the job done.

def removeNonLetters(input_string):
	return ''.join([eachLetter for eachLetter in input_string if eachLetter.isalpha()])


Yes, that's exactly the solution I had in mind when I said "I'd still recommend join over += in a loop" (minus the brackets - as I remarked in my first post in this thread, I'd prefer generator expressions over list comprehensions wherever applicable).

Quote

However, it's not the sort of thing I would give to someone asking for help because of the complexity of it.

And I just can't understand that. How does abstracting away the loop (and cutting down on the number of imperatively updated variables) not reduce the complexity?

This post has been edited by sepp2k: 29 June 2011 - 07:54 AM

#12 atraub

Posted 29 June 2011 - 08:21 AM

Just try to read that sucker. It's much harder to parse, especially if you've never seen a comprehension before... which most newer Pythoners haven't.

#13 baavgai

Posted 29 June 2011 - 08:58 AM

sepp2k, on 29 June 2011 - 08:03 AM, said:

And sure enough: according to a quick benchmark the regex solution is faster than the generator solution by about 25% (which again is faster than the for loop by about the same amount).


You have the code? We love tests here!

Right, here's my test. First, some nice timers:
import time

def timeProcess(prompt, func):
	# Time a single call to func using wall-clock time.
	start = time.time()
	result = func()
	elapsed = time.time() - start
	print("   {:f}s : {}".format(elapsed, prompt))
	return result

def timeGroup(name, items):
	# Run and time each (prompt, func) pair under a common heading.
	print(name)
	for (prompt, func) in items:
		timeProcess(prompt, func)
	print('')



Now some test code:
import re

def testIn(data):
	return ''.join(c for c in data if c in "abcdefghijklmnopqrstuvwxyz")

def testReg(data):
	return re.sub("[^a-z]","", data)



And, fire:
def timedTest(tests):
	data = '123dff4ad4sdfas2'
	timeGroup('Simple String', ((n, lambda : f(data)) for (n,f) in tests))

	with open('/etc/passwd','r') as fh:
		data = fh.read()
	timeGroup('File', ((n, lambda : f(data)) for (n,f) in tests))

if __name__ == "__main__":
	testReg('foo') # this compiles the regex, for a fair timing
	timedTest( (('In', testIn),('Regex', testReg),) )



Results:
Simple String
   0.000044s : In
   0.000035s : Regex

File
   0.001019s : In
   0.000702s : Regex



Hmm, regex stomped the other. You are correct! I wonder how badly we'll get beaten with all alpha?

def testAlpha(data):
	return ''.join(c for c in data if c.isalpha())

def testInAll(data):
	return ''.join(c for c in data if c in "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")

def testRegAll(data):
	return re.sub("[^A-Za-z]","", data)

if __name__ == "__main__":
	testRegAll('foo')
	timedTest( (('In', testInAll),('Regex', testRegAll),('Isalpha', testAlpha)) )



Results:
Simple String
   0.000029s : In
   0.000031s : Regex
   0.000030s : Isalpha

File
   0.001238s : In
   0.000657s : Regex
   0.001297s : Isalpha



Now, that is unexpected and kind of neat. Regex clearly scales better. However, for most applications, a nice pythony "in" is no cause for concern.
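Since each figure above comes from a single call, the numbers can be noisy. As a hedged alternative, the same comparison could be run through the standard timeit module, which repeats each call many times. This is only a sketch: with_in, with_regex and best_time are names made up here, the input is an arbitrary repeated string rather than /etc/passwd, and no results are claimed.

```python
import re
import timeit

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def with_in(data):
    return ''.join(c for c in data if c in LETTERS)

def with_regex(data):
    return re.sub("[^a-z]", "", data)

def best_time(func, data, repeat=5, number=1000):
    # Run `number` calls per trial and keep the fastest of `repeat` trials.
    return min(timeit.repeat(lambda: func(data), repeat=repeat, number=number))

if __name__ == "__main__":
    data = '123dff4ad4sdfas2' * 100
    for name, func in (("In", with_in), ("Regex", with_regex)):
        print("{:10s} {:f}s".format(name, best_time(func, data)))
```

Taking the minimum over several trials is a common way to reduce the influence of other activity on the machine.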

#14 Nallo

Posted 29 June 2011 - 04:24 PM

Not too surprising. testInAll has to loop over "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" for each character to perform the in test. But what about an object with a faster "in" test, like a set? Would you mind timing the following against the other ones?
def modified_testInAll(data):
    charset = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
    return ''.join(c for c in data if c in charset)
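One possible refinement, offered only as a sketch: the function above rebuilds the set on every call, so the set could instead be built once at module level. LETTER_SET and keep_letters_set are names invented for this example, and string.ascii_letters stands in for the hand-typed alphabet.

```python
import string

# Built once at import time instead of on every call.
LETTER_SET = frozenset(string.ascii_letters)

def keep_letters_set(data):
    # Set membership is a hash lookup rather than a scan of 52 characters.
    return ''.join(c for c in data if c in LETTER_SET)

print(keep_letters_set("a1B2c3!"))  # aBc
```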


#15 baavgai

Posted 29 June 2011 - 04:48 PM

Good call. Timings on a different computer:
Simple String
   0.000039s : In
   0.000023s : InSet
   0.000044s : Regex
   0.000027s : Isalpha

File
   0.000587s : In
   0.000446s : InSet
   0.000471s : Regex
   0.000553s : Isalpha

