Regular expression to remove non Ascii Characters

Remove non-ASCII characters but keep tab, LF and CR

11 Replies - 62118 Views - Last Post: 09 January 2009 - 12:44 PM

#1 xesecre

Posted 24 December 2008 - 10:36 AM

Hello and HAPPY HOLIDAYS to all


I am looking for a regular expression to remove all non-ASCII characters, but I want to keep tabs, line feeds and carriage returns.

The expression I have been using doesn't seem to catch all of them: from time to time new characters appear in our text files that more or less show up as a square.

Thanks, any help is appreciated. I searched the internet for 2 days looking for a solution
and tried many combinations without success.
Thanks again,
John
Here is my code for this:

 $filtered =~ s/[^!-~\s]//g;



Replies To: Regular expression to remove non Ascii Characters

#2 GWatt

Posted 24 December 2008 - 11:20 AM

What do you mean by non-ASCII characters? AFAIK, ASCII characters are the characters represented by the numbers 0-127.

#6 mocker

Posted 24 December 2008 - 11:35 AM

Your regex works on the Unicode values I tested. Note that \s already matches tabs, linefeeds and carriage returns, so they are kept; listing \t\r\n explicitly just makes that intent obvious. Can you view the characters that are getting through in a program that supports UTF-8, so you can see what the actual characters are?

Another way of writing the same pattern, which might work better with UTF-8, is
/[^\x{21}-\x{7E}\s\t\n\r]/
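
One possible reason characters still get through: on decoded (Unicode) text, \s can also match non-ASCII whitespace such as U+2028, so a class that relies on \s keeps it. The sketch below spells out the allowed characters instead and lists the code points that would be stripped. The helper names strip_non_ascii and report_offenders are made up for illustration, and the sketch assumes the text has already been decoded (for example, read through an :encoding(UTF-8) layer); on raw bytes each offending byte would be reported separately.

```perl
use strict;
use warnings;

# Hypothetical helper: delete everything except printable ASCII (space through ~)
# plus tab, LF and CR. Unlike \s, this class cannot match non-ASCII whitespace.
sub strip_non_ascii {
    my ($text) = @_;
    $text =~ s/[^\x20-\x7E\t\n\r]//g;
    return $text;
}

# Hypothetical helper: list the distinct code points that would be stripped,
# formatted as U+XXXX, so the "square" characters can be identified.
sub report_offenders {
    my ($text) = @_;
    my %seen;
    $seen{ ord($1) }++ while $text =~ /([^\x20-\x7E\t\n\r])/g;
    return map { sprintf 'U+%04X', $_ } sort { $a <=> $b } keys %seen;
}

my $sample = "caf\x{E9}\tok\x{2028}\r\n";
print join(' ', report_offenders($sample)), "\n";
my $clean = strip_non_ascii($sample);
print length($clean), " characters kept\n";
```

Running the report first on a problem file should show which characters are slipping past the original pattern before anything is deleted.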



#10 KevinADC

Posted 24 December 2008 - 01:40 PM

There might be better ways but this can work:

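use strict; use warnings;
# Allowed code points: tab (9), LF (10), CR (13), and space (32) through DEL (127)
my %good = map { $_ => 1 } (9, 10, 13, 32..127);
my $filtered = "this\bis a\ttest";
# The /s flag lets . match newlines too, so every character is checked
$filtered =~ s/(.)/$good{ord($1)} ? $1 : ' '/seg;
print $filtered;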



Note that \b (backspace) is taken out of the string because its ASCII number is 8, which is not in the list of "good" ASCII numbers, but the tab, which is 9, remains. Just look up an ASCII table and add or remove values as needed. You also have to decide whether to replace a removed character with nothing, a space or something else. The code above replaces each one with a single space: ' '
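
As a rough sketch of that last choice, the lookup approach can take the replacement as a parameter. The name replace_disallowed is hypothetical, and this version assumes DEL (127) should be dropped as well, so the range stops at 126.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character whose code point is not in the
# allowed set with $replacement (pass '' to delete instead of replacing).
sub replace_disallowed {
    my ($text, $replacement) = @_;
    $replacement = ' ' unless defined $replacement;
    # Assumption: tab, LF, CR and printable ASCII only; DEL (127) is left out.
    my %good = map { $_ => 1 } (9, 10, 13, 32 .. 126);
    $text =~ s/(.)/$good{ ord($1) } ? $1 : $replacement/seg;
    return $text;
}

print replace_disallowed("this\bis a\ttest", ''),  "\n";   # backspace deleted
print replace_disallowed("this\bis a\ttest", '?'), "\n";   # backspace becomes ?
```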

This post has been edited by KevinADC: 24 December 2008 - 01:44 PM


#11 KevinADC

Posted 25 December 2008 - 11:58 AM

xesecre,

it should not remove lower case characters as they are 97-122

#12 xesecre

Posted 27 December 2008 - 01:56 PM

Thanks for all your wonderful advice, it makes regular expressions clearer to me now.
One other thought about regular expressions:
Suppose I want a regular expression that keeps all the carriage return + line feed pairs but gets rid of single carriage returns in the middle of a tab-delimited row, or vice versa:
gets rid of single line feeds but keeps the line feeds that are next to a carriage return.
I have looked through a lot of information on the web and in some books and still cannot figure out how to do that, or even how to approach the concept.
Could a person use a match expression within an if statement, or a substitution within an if statement, or better yet, can a single one-line regular expression do the whole job?
Thanks, any help is appreciated.
My brother bought me Learning Perl, 5th edition, by O'Reilly for Christmas. I really like the book and plan on really checking it out.


KevinADC, on 25 Dec, 2008 - 10:58 AM, said:

xesecre,

it should not remove lower case characters as they are 97-122


#13 KevinADC

Posted 27 December 2008 - 03:16 PM

Quote

Suppose I want a regular expression that keeps all the carriage return + line feed pairs but gets rid of single carriage returns in the middle of a tab-delimited row, or vice versa:
gets rid of single line feeds but keeps the line feeds that are next to a carriage return.


Your descriptions are confusing so I am not sure what to suggest. Provide examples of what you mean and it may clear up any confusion.

#14 xesecre

Posted 28 December 2008 - 02:26 AM

Sorry Kevin,
what I am trying to find is a regular expression
that leaves a carriage return and line feed alone when they occur together, but takes out a line feed that appears on its own, without a carriage return, anywhere in the row.
I hope this makes more sense.
Thanks
John

#15 KevinADC

Posted 29 December 2008 - 10:25 AM

All you have to do is (group) the carriage return and the line feed (or newline, if that's what you really mean):

$str =~ s/(\r\n)//g;

Grouping treats what's inside the parentheses as a single unit. Note that, as written, this substitution deletes every \r\n pair.

#16 perfectly.insane

Posted 29 December 2008 - 11:07 PM

I think what you're looking for are zero-width assertions.

For example, if you want to find lone carriage returns and change them to CRLF:

s/\r(?!\n)/\r\n/g;

To change lone LF's to CR LF's:

s/(?<!\r)\n/\r\n/g;

Or you may remove them completely using a similar solution.
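
A minimal sketch of the removal variant, which seems to be what John described: the same assertions with an empty replacement. The helper names remove_lone_lf and remove_lone_cr are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: delete LFs that are not preceded by a CR.
sub remove_lone_lf {
    my ($text) = @_;
    $text =~ s/(?<!\r)\n//g;
    return $text;
}

# Hypothetical helper: delete CRs that are not followed by an LF.
sub remove_lone_cr {
    my ($text) = @_;
    $text =~ s/\r(?!\n)//g;
    return $text;
}

# A tab-delimited row with a stray LF in the middle and a real CRLF at the end.
my $row   = "name\tfirst\nline\tsecond\r\n";
my $fixed = remove_lone_lf($row);
print $fixed eq "name\tfirstline\tsecond\r\n" ? "lone LF removed\n" : "unexpected result\n";
```

Because the assertions are zero-width, the neighbouring CR or LF is only inspected, never consumed, so the CRLF pairs themselves stay untouched.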

The codes are:

(?=expr) Positive lookahead assertion
(?!expr) Negative lookahead assertion
(?<=expr) Positive lookbehind assertion
(?<!expr) Negative lookbehind assertion

Also FYI: the tr operator is sometimes a simple way of removing certain characters. For example, removing any extended ASCII character can be accomplished with $str =~ tr/\000-\177//cd; # The c flag complements the list, so any character from 0 to 177 octal is kept and everything else is deleted by the d flag.
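
To match the original request more closely, the kept list can be narrowed to tab, LF, CR and printable ASCII. This is only a sketch; the name keep_ascii_text is hypothetical. It relies on tr returning the number of characters it deleted.

```perl
use strict;
use warnings;

# Hypothetical helper: keep tab, LF, CR and printable ASCII; delete the rest.
# Returns the cleaned text and how many characters were removed.
sub keep_ascii_text {
    my ($text) = @_;
    # c complements the listed set, d deletes whatever matches the complement.
    my $removed = ($text =~ tr/\x09\x0A\x0D\x20-\x7E//cd);
    return ($text, $removed);
}

my ($clean, $count) = keep_ascii_text("a\x{E9}\x08b\tc\n");
print "removed $count character(s)\n";
```

Since tr does no regex matching, it can be a cheap option for large files, at the cost of not being able to express context such as "LF not preceded by CR".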

This post has been edited by perfectly.insane: 29 December 2008 - 11:08 PM


#17 KevinADC

Posted 30 December 2008 - 12:43 PM

Good suggestion by "insane". I think he figured out what you are trying to do, John, so I suggest you give it a try.

#18 xesecre

Posted 09 January 2009 - 12:44 PM

Thanks to Insaine and Kevin
you guys ROCK!!!!!!!