Mixing languages. Now you have two problems.


31 Replies - 3097 Views - Last Post: 15 April 2013 - 08:59 PM

#16 xclite

Posted 11 April 2013 - 11:42 AM

andrewsw, on 11 April 2013 - 10:40 AM, said:

Regexper is useful for visualizing regex, and kinda cool!

I would be a little nervous of that composed regex approach. I think it requires quite a bit of confidence to be able to split regex like that, and to be certain that when re-combined the full expression remains valid. [Note: not every regex variant supports in-line comments.]

I prefer just to accept that regex can be complex, but also can be extremely useful. I would prefer just to precede them with a few lines of comment, describing the pattern they are trying to match, and any exceptions that I had to account for. I think it is pointless trying to describe them in detail because, if they ever needed revising, I know I will have to start from scratch anyway :)

On the topic of useful tools, I use rubular. Has support for capturing groups and allows me to see how my pattern does against various lines of text.
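
For regex variants that do support in-line comments, here is a minimal sketch in Python (whose `re.VERBOSE` flag allows them). The date pattern and the `try_lines` helper are made up for illustration; the helper just runs a pattern against several lines and reports the named capturing groups, roughly the kind of check described above.

```python
import re

# Hypothetical pattern: ISO-style dates such as 2013-04-11.
# re.VERBOSE ignores unescaped whitespace and allows # comments inside the pattern.
DATE = re.compile(
    r"""
    (?P<year>\d{4})   # four-digit year
    -
    (?P<month>\d{2})  # two-digit month, not range-checked
    -
    (?P<day>\d{2})    # two-digit day, not range-checked
    """,
    re.VERBOSE,
)


def try_lines(pattern, lines):
    """Return (line, named groups or None) for each line of sample text."""
    results = []
    for line in lines:
        match = pattern.search(line)
        results.append((line, match.groupdict() if match else None))
    return results


if __name__ == "__main__":
    for line, groups in try_lines(DATE, ["Posted 2013-04-11", "no date here"]):
        print(repr(line), "->", groups)
```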

#17 cfoley

Posted 11 April 2013 - 11:47 AM

I think I pretty much have a syntax that would work, and I think the tools would be pretty trivial to build. Looks like I have a new project! Here is some syntax that includes imports, comments and groups.

// This is a comment.
// Here is an import
use ./html.regex

// The colons are not special syntax. They are just part of the
// identifiers used for readability.

element = legalTag:anyCharsLazy:closeLegalTag:
legalTag: = ((anchorTag)|(paragraphTag))
anyCharsLazy: = .*?
closeLegalTag: = (?(2)closeAnchor|closeParagraph)
closeAnchor = </a>
closeParagraph = </p>

// Using the number for the group is pretty bad so how about some special syntax for that:

element: = legalTag:anyCharsLazy:closeLegalTag:(anchorGroup)
legalTag: [allTagGroups anchorGroup paragraphGroup] = ((anchorTag)|(paragraphTag))
closeLegalTag:(group) = (?(group)closeAnchor|closeParagraph)

// This is getting unwieldy so let's introduce some wildcards to match groups we don't care about.

element: = legalTag:anyCharsLazy:closeLegalTag:(anchor)
legalTag: [_ anchor _] = ((anchorTag)|(paragraphTag))
closeLegalTag:(group) = (?(group)closeAnchor|closeParagraph)
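
The syntax above is only a proposal, but the idea of named fragments can be approximated in an existing language. A rough sketch in Python, assuming a `{name}` placeholder convention and an `expand` helper, both invented here; the anchor and paragraph tags are reduced to bare `<a>` and `<p>`. A named group stands in for the group number in the conditional, which Python's `re` supports as `(?(name)yes|no)`.

```python
import re

# Hypothetical fragment table, loosely mirroring the definitions above.
# {name} placeholders refer to other fragments; this is not the proposed syntax.
FRAGMENTS = {
    "element": "{legalTag}{anyCharsLazy}{closeLegalTag}",
    "legalTag": "(?:(?P<anchor>{anchorTag})|(?P<paragraph>{paragraphTag}))",
    "anyCharsLazy": ".*?",
    # Conditional on a named group rather than a group number.
    "closeLegalTag": "(?(anchor){closeAnchor}|{closeParagraph})",
    "anchorTag": "<a>",
    "paragraphTag": "<p>",
    "closeAnchor": "</a>",
    "closeParagraph": "</p>",
}

# Placeholders must start with a letter or underscore, so quantifiers
# such as {2} are left alone.
PLACEHOLDER = re.compile(r"\{([A-Za-z_]\w*)\}")


def expand(name, fragments=FRAGMENTS):
    """Recursively substitute fragment definitions into one regex string."""
    return PLACEHOLDER.sub(lambda m: expand(m.group(1), fragments), fragments[name])


ELEMENT = re.compile(expand("element"))

if __name__ == "__main__":
    print(ELEMENT.pattern)
    for sample in ["<a>link</a>", "<p>text</p>", "<a>link</p>"]:
        print(sample, "->", bool(ELEMENT.fullmatch(sample)))
```

This sketch has no cycle detection, so a fragment that refers to itself would recurse until Python gives up.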


#18 AdamSpeight2008

Posted 11 April 2013 - 02:14 PM

A nightmare for RegEx writer to extract the correct tag nodes.
// <tag> 
 <tag>
   <tag>
     "<tag>"
   </tag>
   <tag>
     "</tag>"
   </tag>   
 </tag>
//</tag>



#19 cfoley

Posted 11 April 2013 - 02:53 PM

This matches the bits in blue. However, I think I have missed the point of the challenge.

export tags = open:contents:close:
	open: = quote(<tag>)
		quote x = \"x\"
	contents: = (.*?\s*?)*?
	close: = quote(</tag>)

assert-matches tags "// <tag>\n <tag>\n   <tag>\n     "<tag>"\n   </tag>\n   <tag>\n     "</tag>"\n   </tag>   \n </tag>\n//</tag>"


#20 AdamSpeight2008

Posted 11 April 2013 - 03:18 PM

A tag node has a start node <tag> and a closing node </tag>.

A tag can contain other tag nodes.


If the regex returns
// <tag> <tag> <tag> "<tag>" </tag>

Wrong: First <tag> is in comment and shouldn't match

<tag> <tag> "<tag>" </tag>
Wrong: Not the balancing pair of tags.

"<tag>" </tag>
Wrong: String quoted <tag> is not valid.

<tag> "<tag>" </tag> <tag> "</tag>" </tag>

Wrong: Not the balancing pair of tags.

The first tag that should be matched is the following.
// <tag> <tag> <tag> "<tag>" </tag> <tag> "</tag>" </tag> </tag>//</tag>
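
One way to meet these rules is to use a regex only as a tokenizer and track nesting with a counter. A minimal sketch in Python, under assumptions the thread does not state: `//` comments run to the end of the line, strings are double-quoted with no escapes, and `first_balanced_tag` is a name invented here.

```python
import re

# Tokens, in priority order: a // comment to end of line, a double-quoted
# string, a closing tag, an opening tag. Everything else is skipped.
TOKEN = re.compile(r'//[^\n]*|"[^"\n]*"|</tag>|<tag>')


def first_balanced_tag(text):
    """Return the first <tag>...</tag> span whose tags balance, or None.

    Tags inside comments or quoted strings are consumed as comment/string
    tokens and therefore never counted.
    """
    depth = 0
    start = None
    for match in TOKEN.finditer(text):
        token = match.group()
        if token == "<tag>":
            if depth == 0:
                start = match.start()
            depth += 1
        elif token == "</tag>" and depth > 0:
            depth -= 1
            if depth == 0:
                return text[start:match.end()]
    return None


SAMPLE = (
    '// <tag> \n <tag>\n   <tag>\n     "<tag>"\n   </tag>\n'
    '   <tag>\n     "</tag>"\n   </tag>   \n </tag>\n//</tag>'
)

if __name__ == "__main__":
    print(first_balanced_tag(SAMPLE))
```

The counter is what a plain regular expression lacks: it can grow without bound, so the nesting depth is not limited by the pattern.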

This post has been edited by AdamSpeight2008: 11 April 2013 - 03:22 PM


#21 jon.kiparsky

Posted 11 April 2013 - 03:59 PM

When did XML/HTML start using // comments?

#22 AdamSpeight2008

Posted 11 April 2013 - 04:02 PM

Who said it's XHTML / XML?

#23 jon.kiparsky

Posted 11 April 2013 - 04:12 PM

So this is just random tossery then?

#24 cfoley

Posted 11 April 2013 - 04:22 PM

Quote

A nightmare for RegEx writer to extract the correct tag nodes.


Ah OK. Yes. It's a nightmare because regex is the wrong tool for the job. It would be easy to write a regex to get the right bit out of this one, but tags can be nested arbitrarily deep. Does regex even have the syntax for that?

#25 jon.kiparsky

Posted 11 April 2013 - 04:33 PM

cfoley, on 11 April 2013 - 06:22 PM, said:

Quote

A nightmare for RegEx writer to extract the correct tag nodes.


Ah OK. Yes. It's a nightmare because regex is the wrong tool for the job. It would be easy to write a regex to get the right bit out of this one, but tags can be nested arbitrarily deep. Does regex even have the syntax for that?



Apparently, the answer is sometimes yes, though you might say this gets out of the bounds of regular regex.

#26 cfoley

Posted 11 April 2013 - 05:26 PM

mmm very interesting. Thanks for the link. :)

#27 Martyr2

Posted 14 April 2013 - 10:14 AM

I find that if you are developing some cryptic regex then it is due to one of two things: 1) you don't fully understand your data, or 2) your data is too complex in structure to match accurately.

The second case should be a red flag to tell you that you need to break down your data further before applying a regex to it. For instance, instead of applying one regex to match an entire email address for RFC compliance, perhaps you need to first break it down into mailbox, hostname and domain and then test each piece separately with a simpler regex. It may be less compact, but at least it will be easier to see what is being done and more readable to boot. Perhaps you have three functions: testMailBoxName(), testMailHostname() and testMailDomain().

As for the first case, well, you should probably be asking yourself about what you know of the data first and think of a proper test for it before even approaching a regex.

Now as for the topic of mixing languages, I think there is always going to be some mixing and I don't think it is a bad thing. You do want to stay in the host language as much as possible but sometimes it is unavoidable. As already mentioned, it is almost impossible to stay away from HTML when doing PHP. This is because PHP was designed initially as a templating language that actually was meant to put out HTML as a result. SQL is the language of databases so it is understandable that some of it is going to creep into any language that wants to interact with databases.

It does require the programmer to have a bit of knowledge of both languages to be effective, but that is just the nature of the beast I am afraid. Perhaps this is where we should also do more regular consulting with domain experts in the other language if we are worried that we are not doing something to the best of our ability. :)

This post has been edited by Martyr2: 14 April 2013 - 10:18 AM


#28 ishkabible

Posted 14 April 2013 - 11:07 AM

I actually think domain-specific languages and mixing the proper languages are amazing ways to tackle a vast number of problems. The issues I've seen with regex, SQL, etc. have more to do with poor integration into the host language than anything (PHP and mysqli: I'm looking at you).


Parser generators, embedded parser combinator libraries, etc. are excellent examples of using the right language for the job.

Given a certain problem domain, and domains can be nested, not every language will tackle the situation equally well, so it makes sense to nest the usage of languages. The question is how well can you nest languages? Certainly mysqli in PHP is not, on its own, very good nesting, but with a little effort you get something like LINQ and then the integration is quite nice.

side note:

Using regexes to describe structured data is a bad idea even if it works in some cases. They only describe regular languages(hence their name), not context free, context sensitive, or recursively enumerable languages. If you want to handle structured data you need a parser that can divine the structure from the text properly. Parser generators and various EDSLs like Boost.Spirit are better picks.
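
To make the parser point concrete, here is a minimal hand-written recursive-descent sketch in Python rather than a parser generator or Boost.Spirit. The grammar (nested `<tag>` elements with plain text between them) and the `parse` function are assumptions made for illustration; the regex is used only to split the input into tokens.

```python
import re

# Tokenizer only: an opening tag, a closing tag, a run of text, or a stray "<".
TOKEN = re.compile(r"<tag>|</tag>|[^<]+|<")


def parse(text):
    """Parse nested <tag> elements into nested lists of text and child lists."""
    tokens = TOKEN.findall(text)
    pos = 0

    def parse_children(inside_tag):
        nonlocal pos
        children = []
        while pos < len(tokens):
            token = tokens[pos]
            pos += 1
            if token == "<tag>":
                # Recurse: the call returns once the matching </tag> is consumed.
                children.append(parse_children(True))
            elif token == "</tag>":
                if not inside_tag:
                    raise ValueError("unexpected </tag>")
                return children
            elif token.strip():
                children.append(token.strip())
        if inside_tag:
            raise ValueError("missing </tag>")
        return children

    return parse_children(False)


if __name__ == "__main__":
    print(parse("<tag>a<tag>b</tag></tag>"))
```

The recursion depth follows the nesting depth of the input, which is the structure a regular language cannot keep track of.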

#29 sepp2k

Posted 14 April 2013 - 11:29 AM

ishkabible, on 14 April 2013 - 08:07 PM, said:

They only describe regular languages(hence their name), not context free, context sensitive, or recursively enumerable languages.


That's not always true for modern regexen anymore though. For example the language matched by (.*)\1 isn't regular - it's not even context free.

That doesn't change the fact that regexen are a bad tool to match structured data though.
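
To see what (.*)\1 does in practice, here is a small Python check. Matched against a whole string, the pattern accepts exactly the strings made of some text immediately followed by a copy of itself; the helper name `is_repeated_twice` is made up here. Note that `.` does not match a newline by default in Python.

```python
import re

# Group 1 captures any run of characters; \1 must then repeat it exactly.
DOUBLED = re.compile(r"(.*)\1")


def is_repeated_twice(text):
    """True if the whole string is some substring written twice in a row."""
    return DOUBLED.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["abcabc", "abab", "abcab", "aa", "a"]:
        print(repr(sample), "->", is_repeated_twice(sample))
```

The backreference is what takes this beyond regular languages: the second half has to be compared with a first half of unbounded length.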

#30 ishkabible

Posted 14 April 2013 - 12:49 PM

What does that regex match?
