11 Replies - 2161 Views - Last Post: 07 May 2012 - 05:32 AM

#1 cupidvogel  Icon User is offline

  • D.I.C Addict

Reputation: 31
  • View blog
  • Posts: 593
  • Joined: 25-November 10

What is captured within $1 if used with alternation?

Posted 04 May 2012 - 06:59 AM

Hi, if I use something like this:

"I love fred and barney" =~ /(barney|fred)/ ? print "$1\n" : print "No.\n";



it prints out fred. It suggests that the regex engine starts scanning the string from left to right searching whether any pattern on either side of the alternation matches it, and if it does, then that is stored inside $1 and the process ends.

But if I run something like this:

"I love fredder and fred" =~ /(fred|fredder)/ ? print "$1\n" : print "No.\n";



prints out fred. Surely while scanning from left, fredder comes first, so that should be stored in $1, but it is not. Not only that, whatever configuration be the strings fred and fredder in the regex and string, i.e fredder and fred and (fred|fredder), fred and fredder and (fred|fredder), fredder and fred and (fredder|fred), fred and fredder and (fredder|fred), it always outputs fred. Can anyone explain what is going on?

Is This A Good Question/Topic? 1
  • +

Replies To: What is captured within $1 if used with alternation?

#2 dsherohman  Icon User is offline

  • Perl Parson
  • member icon

Reputation: 226
  • View blog
  • Posts: 654
  • Joined: 29-March 09

Re: What is captured within $1 if used with alternation?

Posted 05 May 2012 - 01:57 AM

View Postcupidvogel, on 04 May 2012 - 02:59 PM, said:

Hi, if I use something like this:

"I love fred and barney" =~ /(barney|fred)/ ? print "$1\n" : print "No.\n";



it prints out fred. It suggests that the regex engine starts scanning the string from left to right searching whether any pattern on either side of the alternation matches it, and if it does, then that is stored inside $1 and the process ends.


Correct. The Perl regex engine looks for the "leftmost longest" match and it stops processing as soon as it's found that. (I think this is true for all other regex engines as well. Or at least I've never run into one which behaves differently.)

View Postcupidvogel, on 04 May 2012 - 02:59 PM, said:

But if I run something like this:

"I love fredder and fred" =~ /(fred|fredder)/ ? print "$1\n" : print "No.\n";



prints out fred. Surely while scanning from left, fredder comes first, so that should be stored in $1, but it is not.


This one surprised me a little, since "fredder" is longer than "fred", but the alternation uses short-circuit evaluation - if the first pattern (fred) matches at a position in the string, then then second pattern (fredder) is never tested at that position because a match has already been found. Basically, this behavior is caused by the first pattern in the alternation being a subset of the second pattern.

View Postcupidvogel, on 04 May 2012 - 02:59 PM, said:

Not only that, whatever configuration be the strings fred and fredder in the regex and string, i.e fredder and fred and (fred|fredder), fred and fredder and (fred|fredder), fredder and fred and (fredder|fred), fred and fredder and (fredder|fred), it always outputs fred. Can anyone explain what is going on?


My results are different. If I use (fredder|fred) in the regex, it matches whichever one comes first in the string:
$ perl -E 'print "I love fred and fredder" =~ /(fredder|fred)/ ? "$1\n" : "No.\n";'
fred
$ perl -E 'print "I love fredder and fred" =~ /(fredder|fred)/ ? "$1\n" : "No.\n";'
fredder
$ perl -v
This is perl, v5.10.1 (*) built for i686-linux-gnu-thread-multi



And, on another machine:
$ perl -e 'print "I love fred and fredder" =~ /(fredder|fred)/ ? "$1\n" : "No.\n";'
fred
$ perl -e 'print "I love fredder and fred" =~ /(fredder|fred)/ ? "$1\n" : "No.\n";'
fredder
$ perl -v
This is perl, v5.8.8 built for x86_64-linux-thread-multi



If you're getting "fred" as the result when "fredder" is first in both the string and the regex, then that's probably a bug in your version of perl.
Was This Post Helpful? 2
  • +
  • -

#3 cupidvogel  Icon User is offline

  • D.I.C Addict

Reputation: 31
  • View blog
  • Posts: 593
  • Joined: 25-November 10

Re: What is captured within $1 if used with alternation?

Posted 05 May 2012 - 11:53 PM

No no, I am sorry, I am getting fredder for the example you provided. But now I have another dilemma, I think it should match fred! The two options are for matching fred and fredder, and even though fredder comes earlier in the string, the fred part of fredder should match fred before the entirety of fredder matches another option fredder, so I think $1 should store fred. But it is not doing so. How come?
Was This Post Helpful? 0
  • +
  • -

#4 dsherohman  Icon User is offline

  • Perl Parson
  • member icon

Reputation: 226
  • View blog
  • Posts: 654
  • Joined: 29-March 09

Re: What is captured within $1 if used with alternation?

Posted 06 May 2012 - 01:06 AM

View Postcupidvogel, on 06 May 2012 - 07:53 AM, said:

now I have another dilemma, I think it should match fred! The two options are for matching fred and fredder, and even though fredder comes earlier in the string, the fred part of fredder should match fred before the entirety of fredder matches another option fredder, so I think $1 should store fred. But it is not doing so. How come?


As I said in my earlier reply, this is due to short-circuit evaluation of the alternation. Just like the regular (non-regex) | and || operators, the regex alternation operator (also |) first tests whatever is to the left of it and, if that test succeeds, it immediately returns the result without testing whatever is to its right at all. The second pattern is only tested if the first pattern fails to match.

So, if you use the pattern (fred|fredder), it will never return "fredder" because you can't match "fredder" at a position unless "fred" also matches at that position and, if "fred" matches, it will be returned immediately without testing whether "fredder" might also match. On the other hand, if you use the pattern (fredder|fred), it will only check for a match on "fred" if "fredder" fails to match at that position, so "fred" will only be returned if it is not followed by "der".
$ perl -E 'say "I love fredder and fred" =~ /(fredder|fred)/ ? $1 : "No.";'
fredder
$ perl -E 'say "I love fredder and fred" =~ /(fred|fredder)/ ? $1 : "No.";'
fred


Was This Post Helpful? 1
  • +
  • -

#5 cupidvogel  Icon User is offline

  • D.I.C Addict

Reputation: 31
  • View blog
  • Posts: 593
  • Joined: 25-November 10

Re: What is captured within $1 if used with alternation?

Posted 06 May 2012 - 01:31 AM

The short-circuit stuff is what I am talking about. Suppose the regex engine is at the position just after fre of fredder, and it is yet to see a string that matches any of the options in the alternation, right? Then it sees the d, and right away it should spot that the last 4 letters form a string that matches one of the options, so it should stop and exit, storing fred in $1. Why should it wait longer to see whether the first option fredder matches any substring, when it has found a match already?
Was This Post Helpful? 0
  • +
  • -

#6 dsherohman  Icon User is offline

  • Perl Parson
  • member icon

Reputation: 226
  • View blog
  • Posts: 654
  • Joined: 29-March 09

Re: What is captured within $1 if used with alternation?

Posted 06 May 2012 - 02:09 AM

View Postcupidvogel, on 06 May 2012 - 09:31 AM, said:

The short-circuit stuff is what I am talking about. Suppose the regex engine is at the position just after fre of fredder, and it is yet to see a string that matches any of the options in the alternation, right?


Ah! Now I see where the problem is - I didn't explain what I meant by "position"!

When talking about regex matching, the "position" is where it's attempting to start a match. So, while it's doing all this, the "position" that it's checking is (just before) the "f", not between the "e" and the "d". Given the pattern (fredder|fred), the regex engine doesn't say, "I've got an 'f', does that match fredder or fred? No. I've got an 'fr'. Does that match? No. How about 'fre'? No. Maybe 'fred'? Yes! A match!" Rather, it sits before the "f" and says, "Starting from here, can I match 'fredder'?" and then, if that fails, "Starting from here, can I match 'fred'?"

View Postcupidvogel, on 06 May 2012 - 09:31 AM, said:

Then it sees the d, and right away it should spot that the last 4 letters form a string that matches one of the options, so it should stop and exit, storing fred in $1. Why should it wait longer to see whether the first option fredder matches any substring, when it has found a match already?


Actually, if it did it that way, it would always find "fredder" (regardless of the order of "fred" and "fredder" in the alternation) because, when matching a pattern, it tries find the leftmost longest match and "fredder" is longer than "fred". You can see this by reducing your alternation to a single pattern[1]:
$ perl -E 'say "I love fredder and fred" =~ /(fred(?:der)?)/ ? $1 : "No.";'
fredder


Since the "der" is optional here, it could match either "fred" or "fredder" at the position before the first "f", but it chooses the longer match.

The preference for longer matches is, incidentally, why people tend to get into trouble with .* so often... They'll want to find the first word starting with "l" and ending with "e" with the pattern l.*e, then stumble across a result like
$ perl -E 'say "I love fredder and fred" =~ /(l.*e)/ ? $1 : "No.";'
love fredder and fre

when they expected it to match "love".

[1] (fredder|fred) is actually two independent patterns, "fredder" and "fred", joined so that matching either one produces a successful match of the alternation as a whole.
Was This Post Helpful? 1
  • +
  • -

#7 cupidvogel  Icon User is offline

  • D.I.C Addict

Reputation: 31
  • View blog
  • Posts: 593
  • Joined: 25-November 10

Re: What is captured within $1 if used with alternation?

Posted 06 May 2012 - 02:26 AM

You say that the engine doesn't sit between e and d, but before f and asks can I match... Why if f special here that the engine sits before it and announces, and not between, say e and d?
Was This Post Helpful? 0
  • +
  • -

#8 dsherohman  Icon User is offline

  • Perl Parson
  • member icon

Reputation: 226
  • View blog
  • Posts: 654
  • Joined: 29-March 09

Re: What is captured within $1 if used with alternation?

Posted 06 May 2012 - 03:25 AM

View Postcupidvogel, on 06 May 2012 - 10:26 AM, said:

You say that the engine doesn't sit between e and d, but before f and asks can I match... Why if f special here that the engine sits before it and announces, and not between, say e and d?


Nothing special about it, it just made for a shorter explanation than if I went through the entire process of:
- Start at the beginning of the string.
  - Test to see if "fredder" matches beginning at the start of the string.
  - Test to see if "fred" matches beginning at the start of the string.
- Advance to between the "I" and the first space.
  - Test to see if "fredder" matches beginning before the first space.
  - Test to see if "fred" matches beginning before the first space.
- Advance to before the "l".
...



The important point is that the regex engine tests each pattern as a complete unit ("Can I match 'fredder' here? Can I match 'fred' here?") rather than doing character-by-character tests of all patterns in parallel ("Does 'f' match either 'fredder' or 'fred'? Does 'fr' match either 'fredder' or 'fred'?...")
Was This Post Helpful? 1
  • +
  • -

#9 cupidvogel  Icon User is offline

  • D.I.C Addict

Reputation: 31
  • View blog
  • Posts: 593
  • Joined: 25-November 10

Re: What is captured within $1 if used with alternation?

Posted 06 May 2012 - 03:28 AM

Wow. Thanks a lot! Dunno what would I have done in Perl without you! :bananaman:
Was This Post Helpful? 0
  • +
  • -

#10 cupidvogel  Icon User is offline

  • D.I.C Addict

Reputation: 31
  • View blog
  • Posts: 593
  • Joined: 25-November 10

Re: What is captured within $1 if used with alternation?

Posted 06 May 2012 - 08:10 AM

Ummm,I just noticed, if the engine indeed starts with the leftmost option, and advances to the next option only when there is no match for it, why should the engine advance to the next character if none of the options match? Surely if a string of 20 characters doesn't have a match for 5 options, the same string minus the first character cannot do better, right?
Was This Post Helpful? 0
  • +
  • -

#11 dsherohman  Icon User is offline

  • Perl Parson
  • member icon

Reputation: 226
  • View blog
  • Posts: 654
  • Joined: 29-March 09

Re: What is captured within $1 if used with alternation?

Posted 07 May 2012 - 02:15 AM

It's checking at each position for any matches which start at that position. "I love fredder and fred." doesn't contain a match for (fredder|fred) at the start of the string, or after the first character, etc. The first match begins from the position between the second space and the first "f".

Your idea of "if the string doesn't contain a match at the first position, it won't contain a match anywhere else" only applies if the pattern begins with .*. (Because of this, and because of the inefficiency it creates, most regex engines (including Perl's) will ignore .* as the first or last element of a pattern unless it's within capturing parentheses.)
Was This Post Helpful? 1
  • +
  • -

#12 cupidvogel  Icon User is offline

  • D.I.C Addict

Reputation: 31
  • View blog
  • Posts: 593
  • Joined: 25-November 10

Re: What is captured within $1 if used with alternation?

Posted 07 May 2012 - 05:32 AM

Ummm, I really didn't get that. Please elaborate...
Was This Post Helpful? 0
  • +
  • -

Page 1 of 1