Simple Content Formatting with Regular Expressions

#1 Tenderfoot

Posted 21 December 2012 - 03:29 PM

I decided to write this tutorial primarily for selfish reasons: I wanted to make sure that regular expressions left a permanent print on my mind. However, I do hope someone will find this helpful some day, and I intend to use this as a reference myself if/when I find myself wanting to write something similar. What this tutorial covers is how to write short regular expressions, and how to write a small markdown system. The system will by no means be a complete implementation of markdown, but it will support a few features. Markdown is one way to allow users to format their posts on websites without giving them permission to write the actual HTML. Below is the syntax we will support:

  • *text*
  • _text_
  • **text**
  • __text__
  • [descriptive text for a link](the url itself)

Needless to say, this is all plain text as it is, but the first two examples will turn words inside single asterisks (*) and single underscores (_) into their italic counterparts. So *some text here* and _some text here_ become: <em>some text here</em>. Double asterisks/double underscores will become bold, and the last example will become a hyperlink. The reason why one would want such a system is that you don't want to give a website's user base the power to insert any HTML they want, in case they screw up (intentionally or not). Sometimes you might want to allow them to format their text, and in order to do that we can allow 'markdown' and support it manually. I should add that whenever you display something a user has submitted, you should always run it through htmlspecialchars first (our function will do exactly that as its first step). And on that note, you should always use prepared statements when inserting user input into the database as well, though that isn't directly related to this tutorial in particular.

One thing I almost forgot: We will also be supporting linebreaks and paragraphs with markdown.

In our noble quest to create this system, we will be making use of regular expressions, so I will list a few of their special characters below, and write/explain a few minor regular expressions before we get started. But before I do that, I should probably explain what a regular expression is, and why we would want to use it. A regular expression is a pattern that you write to match a part of a string. You could for example use it to find a certain word in a string, or to find a word that contains certain characters and doesn't contain others. You can use it in a number of ways, with an almost infinite number of optional rules and restrictions for the pattern you want to match. Generally speaking though, when you're looking to replace a part of a string that is as simple as a word, you can just use PHP's built-in function str_replace. It works a lot faster than the RegEx counterpart, which is preg_replace. However, it is not as powerful, and it is often impossible to use when you're looking for more complex patterns. But when looking for simple patterns, you should always use str_replace, simply because it is faster and more efficient (not to mention easier). To take an example:

$myString = "This is a string with a lot of words. Find the occurrence of one word and replace it.";
$myString = str_replace("word", "sentence", $myString); //This will replace every occurrence of the word "word" in $myString with the word "sentence".

There is a problem with this, as mentioned. It's difficult (or outright impossible) to search for and replace a pattern more complex than a word or a letter. Say you want to locate each occurrence of an opening asterisk, with some text after it, that must also be followed by a closing asterisk. What could you do? That's where regular expressions come in.

The simplest form of a regular expression would be something like: /Chinese/ - this would match the word Chinese in a given text. Do note that regular expressions are case sensitive by default. Before we move on I'm going to list and explain a few of the most commonly used regular expression characters below:

/ (slash) - used to open and close regular expressions. If you actually want to search for a slash in a text, within a regular expression, you will need to escape it with a backslash (\). This is how you would accomplish that: /\//.

^ (caret) - This matches text only if it is at the start of the string it is in. For example, /^RegEx/ would match "RegEx" in the string "RegEx", but would not match anything in the string "This is RegEx", due to RegEx's location not being at the beginning of the string. ^ also has a special use when used within square brackets, but I will get to that in a bit. But since there is a special character for matching the beginning of a string, there must also be one that matches a pattern at the end of a string:

$ (dollar) - This works as the exact opposite of the caret. If you use: /RegEx$/ you will only match RegEx if it is located at the end of its string. The following: /^RegEx$/ would therefore only match a string that contains the text "RegEx" and nothing else.
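
To see the two anchors in action, here is a small sketch using PHP's preg_match, which returns 1 when the pattern matches and 0 when it doesn't. The variable names and the test strings are just ones I made up for illustration.

```php
<?php
// preg_match() returns 1 on a match and 0 on no match.
$atStart    = preg_match('/^RegEx/', 'RegEx is fun');   // 1: "RegEx" opens the string
$notAtStart = preg_match('/^RegEx/', 'This is RegEx');  // 0: "RegEx" is not at the start
$atEnd      = preg_match('/RegEx$/', 'This is RegEx');  // 1: "RegEx" closes the string
$wholeOnly  = preg_match('/^RegEx$/', 'RegEx');         // 1: the string is exactly "RegEx"
$withExtra  = preg_match('/^RegEx$/', 'RegEx rocks');   // 0: extra text after "RegEx"

var_dump($atStart, $notAtStart, $atEnd, $wholeOnly, $withExtra);
```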

. (dot) - Matches any single character (except the \n character). /.../ would therefore match any three characters.

* (asterisk) - The asterisk lets the character before it appear zero or more times. This special character is greedy, which means that if you search in, say, the string 'Regex is really great, really great', the regular expression /really.*great/ would match 'really great, really great' and not just the first 'really great'. Luckily there is a special character that can make this greedy guy snap out of it: the question mark (?). When placed after the asterisk it will force it to go for a minimal match.
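
The difference between the greedy and the minimal match is easiest to see side by side. This is just a quick sketch with preg_match; $greedy and $lazy are names I picked for the example.

```php
<?php
$sentence = 'Regex is really great, really great';

// Greedy: .* grabs as much as it can, so the match runs to the LAST "great".
preg_match('/really.*great/', $sentence, $greedy);

// Non-greedy: .*? grabs as little as it can, so the match stops at the FIRST "great".
preg_match('/really.*?great/', $sentence, $lazy);

echo $greedy[0], "\n"; // really great, really great
echo $lazy[0], "\n";   // really great
```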

+ (plus) - The plus requires the character before it to appear one or more times. This one's greedy as well, so use the question mark when necessary. An example of this would be /.+/ - this would match a run of one or more characters of any kind.

? (question mark) - When used alone (and not with a prior plus or an asterisk) it makes the character before it optional. When used after a plus or an asterisk it will cause them to be non-greedy. /RegE?x/ would match RegEx, but it would also match Regx (due to the E being optional).

- (hyphen) - Used inside square brackets (explained below) to describe a range of characters, for example a-z, matching any lowercase letter from a to z, or A-Z, or 0-9.

| (pipe) - Matches either/or. Sort of like the || or OR in if statements. It is used like this: /cat|dog/ - this will match either cat or dog.

() (round brackets) - This defines a group of characters that have to occur together. You can also refer to round bracketed parts of a regular expression later by using $1/$2/$3/$... to refer to them in order. $1 being the contents of the first round bracket.

/(something|something else)/ matches the strings 'something' and 'something else'.
/Reg(Ex)+/ would match RegEx, and RegExEx, and RegExExEx, but in RegExE it would only match the RegEx part and leave the trailing E out. The round brackets dictate that the characters 'E' and 'x' must occur together in that specific order.
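
Here is a small sketch of the pipe and the round brackets at work. The sample strings and variable names are made up for the example; the functions (preg_match and preg_replace) are PHP's own.

```php
<?php
// Alternation: the match is whichever alternative shows up first in the text.
preg_match('/cat|dog/', 'I walk my dog past the cat', $pet);
echo $pet[0], "\n"; // dog

// A repeated group: "Ex" has to repeat as a unit.
preg_match('/Reg(Ex)+/', 'RegExExE', $found);
echo $found[0], "\n"; // RegExEx (the trailing "E" is left out)
echo $found[1], "\n"; // Ex (a repeated group keeps its last repetition)

// Referring to the groups in a replacement with $1 and $2.
$swapped = preg_replace('/(\w+) (\w+)/', '$2 $1', 'hello world');
echo $swapped, "\n"; // world hello
```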

[] (square brackets) - Square brackets make up a character class that matches any one character out of the ones listed inside the brackets. It can be used to list individual characters, like [abc]. This would be the same as using (a|b|c), which matches a or b or c. It can also be used to match a range of characters (e.g. [A-Z]), which again matches any one of the uppercase letters from A to Z.

A general square bracket example: /[abcdefg]/ - this would match anything containing any one of these characters. For instance, it'd match '23a' (contains 'a'), 'hc' (contains 'c') but not h.

Within the square brackets is also where you get to see and use the caret's (^) special function. You can specify a list of characters that your regular expression will not match. This is done by opening the square brackets, placing the caret (^) first, and placing the characters you don't want to match after it. An example of this would be: /[^A]/, so this regular expression will match any single character except for 'A'.

/[a-zA-Z0-9]/ - matches anything that contains a letter or a digit.
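
The character class examples above can be tried out like this. It's only a sketch; the strings 'BANANA' and 'PHP 8, here we go!' are my own test input.

```php
<?php
// Does the string contain any one of the letters a to g?
$hasLetterAtoG = [
    '23a' => preg_match('/[abcdefg]/', '23a'), // 1 (contains 'a')
    'hc'  => preg_match('/[abcdefg]/', 'hc'),  // 1 (contains 'c')
    'h'   => preg_match('/[abcdefg]/', 'h'),   // 0
];
print_r($hasLetterAtoG);

// Negated class: replace every character that is NOT 'A' with a dash.
$masked = preg_replace('/[^A]/', '-', 'BANANA');
echo $masked, "\n"; // -A-A-A

// Negated ranges: strip everything that is not a letter or a digit.
$onlyAlnum = preg_replace('/[^a-zA-Z0-9]/', '', 'PHP 8, here we go!');
echo $onlyAlnum, "\n"; // PHP8herewego
```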

To reiterate one thing, if you're looking to search for a character that has special meaning within regular expressions (like plus, asterisk, or something else) you must always escape it. You do that by placing a backslash before it. Therefore the regular expression /\+/ will match any plus in the text you're searching.

And before I forget, there is one thing that can prove to be very useful, and that is the modifier i. You use it by placing it right after the closing regular expression delimiter (the second /). It tells the regular expression not to perform a case sensitive search (which is its default behaviour). So the following expression would match both the word "tEsT" and "test": /test/i. Another modifier that's worth noting is s. It makes the dot (.) match newlines as well. If you recall, the dot (.) matches any single character except for the newline by default. So the following: /./s matches any single character, including a newline character.
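
A quick sketch of both modifiers, again with preg_match (1 means a match, 0 means none). The variable names are only for illustration.

```php
<?php
// Without /i the search is case sensitive.
$caseSensitive   = preg_match('/test/', 'tEsT');  // 0
$caseInsensitive = preg_match('/test/i', 'tEsT'); // 1

// Without /s the dot refuses to match a newline.
$dotDefault = preg_match('/a.b/', "a\nb");  // 0
$dotAll     = preg_match('/a.b/s', "a\nb"); // 1

var_dump($caseSensitive, $caseInsensitive, $dotDefault, $dotAll);
```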

You can also search for newline characters by simply looking for \n. \n is the character that usually comes from the user when he presses enter. It's worth noting that not every computer will output it the same way. A computer running a Windows OS would for example use \r\n for a newline, whereas old Macintosh computers would use only \r. Modern Macintosh computers (OS X and later) simply use \n, as does Linux. This needs to be taken into account when you're searching for newline characters. Then there is \t for a tab character, and \s for any whitespace character (this includes spaces, tabs, and newlines). \S on the other hand matches anything that isn't a whitespace character. If you recall, we can do much the same with a pair of square brackets ([]) and a caret (^). It would go like this: /[^\n\r\t ]/. I should note that the space at the end of that character class is intentional; it will match, well, a space (' ').

A few other escape sequences:
\w - This will match any letter, digit, or underscore. You can achieve the same with the regular expression: /[0-9a-z_]/i The /i again is the modifier that will make the search case insensitive. Then there's \W, which is the exact opposite of \w, that is, it matches any character that isn't a letter, digit, or underscore. I'll do two more, then we'll start deciphering some regular expressions: \d This will match any digit, it's the same as using /[0-9]/. The capital D (\D) does the exact opposite.
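
To get a feel for these escape sequences, here is a short sketch. It uses preg_match_all, which collects every match into an array; the input strings are made up for the example.

```php
<?php
// \d+ : every run of digits in the text.
preg_match_all('/\d+/', 'Blue blazer 1. Red blazer 22.', $numbers);
print_r($numbers[0]); // the matches are "1" and "22"

// \w+ : every run of letters, digits, and underscores.
preg_match_all('/\w+/', 'snake_case, 42!', $words);
print_r($words[0]); // the matches are "snake_case" and "42"

// \s+ : collapse any run of whitespace (spaces, tabs, newlines) into one space.
$collapsed = preg_replace('/\s+/', ' ', "too   many\t\nspaces");
echo $collapsed, "\n"; // too many spaces
```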

Right on, that's it for the explanations, time for the fun part; creating and deciphering a few regular expressions. I will do a few random ones that I can think of, explain them, and then we'll start implementing our markdown. To start with, we'll use the following string:

"Blue blazer 1. Red blazer 2. Blue blazer 3. Red blazer 4. Blue BlaZer 5. ReD Blazer 6. Blue pants 7."

So let's just say that we're a store that sells only pants, or even a store that sells only red pants; but some scoundrel has gotten into our system and edited our data. He's made a few of our pants blazers, and to make matters even worse, he has made the only pants still in our system blue. How can we sort this out? Well, we could use str_ireplace (the case insensitive version of str_replace), but we will use regular expressions this time, even though it's slower and less convenient. So the first thing we need to do is to make up a regular expression that matches all these blazers, and change the word "blazer" to "pants". The first thing we notice is that not all of the blazers are in lowercase, and thus we will need to account for that when we make our expression: /blazer/i - this regular expression will match the word blazer in a non-case sensitive manner due to its /i modifier. In order to replace it however, we will need to use PHP's built-in function preg_replace. For this, we will give it 3 arguments. Arg1: The pattern to search for (our regular expression). Arg2: The string to replace it with ("pants" in this case). Arg3: The string to perform this search on, which is the one with the blazers and the pants.

So first, we will find and replace all the blazers. We will make use of preg_replace and the regular expression we made above:

$texti = "Blue blazer 1. Red blazer 2. Blue blazer 3. Red blazer 4. Blue BlaZer 5. ReD Blazer 6. Blue pants 7.";

$texti = preg_replace('/blazer/i', 'pants', $texti); 

This will change all of the "blazers" in our string to pants. The part where we change blue to red is identical; you simply replace the regex with /blue/i and 'pants' with 'red'.
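
As a side note, preg_replace also accepts arrays of patterns and replacements and applies them in order, so both fixes can be done in one call. This is just one way to sketch it; $inventory and $fixed are names I made up, and the string is the one from earlier in the tutorial.

```php
<?php
$inventory = "Blue blazer 1. Red blazer 2. Blue blazer 3. Red blazer 4. Blue BlaZer 5. ReD Blazer 6. Blue pants 7.";

// The first pattern pairs with the first replacement, the second with the second.
$fixed = preg_replace(
    ['/blazer/i', '/blue/i'],
    ['pants', 'red'],
    $inventory
);

echo $fixed, "\n";
// red pants 1. Red pants 2. red pants 3. Red pants 4. red pants 5. ReD pants 6. red pants 7.
```

Note that the replacement text is inserted exactly as written, so every "Blue" comes out as a lowercase "red", while the existing "Red" and "ReD" are left untouched.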

Now let's try something a little different. Remember that while round brackets (()) can be used to ensure that a list of characters are placed together, it also means that you can access the characters within the brackets by using the variable $1. And if there are several round brackets, they can be accessed as $1, $2, $3, and so forth. So let's try this:

Let's give ourselves the string: "The extension of our file is myfile.txt" - and now let's find the file name and extension of the file, and then replace that with the file name we just found with our regular expression, plus the file extension .pdf. So we'll effectively be renaming any file in our string to this: "originalfilename.pdf".

In order to find the file name and the extension, we can use this regular expression: /([^ ]+)\..../. I'll go through this bit by bit:

/ is the opening delimiter for the regular expression.
([^ ]+) looks for one or more characters that aren't a space. [^ ] is the part that specifies that the character can't be a space, and the + means that there can be one or more characters. The part I'm looking for with this part of the regular expression is the file name. The round brackets around it I'm using so I can access this part of the file name with the variable $1.
\. is an escaped dot, seeing as, once we've found the file name, there should be a dot following it. The remaining three dots are for any 3 single characters. This would be the file extension. We then close the regular expression with the closing delimiter /.

Now what we want to do is to replace the file extension of the file with pdf. This is the code for that:
$text = "The extension of our file is myfile.txt"; 
$text = preg_replace('/([^ ]+)\..../', '$1.pdf', $text);

I've already explained the first argument of the preg_replace, the regular expression. The second argument is the thing we want to replace the found string with. $1 refers to the characters within our round brackets (the file name) and after that we add the string '.pdf'. The result should be: "The extension of our file is myfile.pdf".
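
Here is the same replacement as a runnable sketch, along with one thing to watch out for: the three dots only ever match a three character extension. The second pattern, with [a-z0-9]+ for the extension, is my own variation and not part of the original example.

```php
<?php
$text = "The extension of our file is myfile.txt";
$renamed = preg_replace('/([^ ]+)\..../', '$1.pdf', $text);
echo $renamed, "\n"; // The extension of our file is myfile.pdf

// With a four letter extension, the three dots leave the last letter behind.
$tooShort = preg_replace('/([^ ]+)\..../', '$1.pdf', 'photo.jpeg');
echo $tooShort, "\n"; // photo.pdfg

// Hypothetical variation: accept an extension of any length.
$anyLength = preg_replace('/([^ ]+)\.[a-z0-9]+/i', '$1.pdf', 'photo.jpeg');
echo $anyLength, "\n"; // photo.pdf
```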

I believe we've covered enough to start implementing our markdown system. So what we're going to do is create a function which you can store in its own file and include where appropriate. That is, wherever you want to output any text that has markdown in it. In this function, we simply need to take in an argument for the text, and then use regular expressions to replace the markdown in the text with HTML tags. You can then return the text with the HTML you allow, and echo it where you want to. So let's get started:

function convertMarkdownToHTML($text) { /* the replacements will go here */ }

That'll be the skeleton of our function. To refresh your memory on what we'll be doing, I'll show you a small text with some markdown in it.

"This is *emphasized*, this is **bold**, this is _emphasized_, and this is also __bold__."

So what our function would need to do here, would be to replace the single asterisks and single underscores around the word "emphasized" (occurs twice) with the <em></em> tags. And it would also have to replace the double asterisks, and double underscores with <strong></strong> tags. Alright, let's start implementing this:

function convertMarkdownToHTML($text)
//Here we make sure that no HTML-tag tainted text will be a part of this
$text = htmlspecialchars($text, ENT_QUOTES, "UTF-8"); 

//We'll start by replacing double underscores with <strong></strong> tags
$text = preg_replace('/__(.+?)__/s', '<strong>$1</strong>', $text); 

To explain what we just did up there, I'll start by explaining the regular expression (/__(.+?)__/s). We open the regular expression with the first slash (/) and then we look for the first two underscores (__). Within the round brackets, we look for a minimum of one character and use the question mark to make sure the + isn't greedy. If it were greedy, it would match everything from the first double underscores to the last, even if we had hundreds in the text we were performing this on. The reason why we use the round brackets is so that we can access their content in the second argument of preg_replace with the variable $1. If you recall, anything within round brackets in a regular expression can be accessed with the variable $1, and in this case, it's whatever is between the double underscores. So we've covered this: /__(.+?). The last two underscores will match the closing double underscores, and the /s modifier will make sure that the dot (.) matches all characters (including \n).

So to sum it up: preg_replace('/__(.+?)__/s', '<strong>$1</strong>', $text); will search for and replace a pattern in the string $text. The string it replaces is everything between double underscores, and it replaces it with what's in between the underscores, surrounded by a couple of <strong> tags which will make it bold.

Now let's do the same for double asterisks, the code for that is identical except you have to escape all the asterisks you're looking for as they serve a special purpose for regular expressions. Here's the code:

$text = preg_replace('/\*\*(.+?)\*\*/s', '<strong>$1</strong>', $text);

This goes below the other line, or above, if you prefer. I'm just attempting to keep the code brief. Once we have the whole thing sorted out I will post the outcome.

Okay so, now that we've replaced all double underscores in the text, as well as double asterisks, we'll get started with the single asterisks and the single underscores.

	$text = preg_replace('/_([^_]+)_/', '<em>$1</em>', $text);
	$text = preg_replace('/\*([^\*]+)\*/', '<em>$1</em>', $text);

The regular expressions here are only slightly different to the ones above. The first one, /_([^_]+)_/ is nothing we haven't looked at thus far however. The /_ part of it opens the regular expression and looks for the opening single underscore, we then open round brackets so we can refer to the content of the regular expression with the variable $1. Inside the round brackets, there's this: [^_]+ - this means that, what we're looking for in between the single underscores is a character (one or more) that isn't an underscore. We then look for the closing underscore and close the regular expression. The second expression is identical but again, we must escape the asterisks.
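
Here are the four replacements run on the sample sentence from earlier, as a sketch you can paste into a file and try. I've left out htmlspecialchars to keep it short, and I've added a second, made-up test ($wrongOrder) to show why the double asterisks have to be handled before the single ones.

```php
<?php
$sample = 'This is *emphasized*, this is **bold**, this is _emphasized_, and this is also __bold__.';

// Doubles first, then singles, in the same order as in the tutorial.
$html = preg_replace('/__(.+?)__/s', '<strong>$1</strong>', $sample);
$html = preg_replace('/\*\*(.+?)\*\*/s', '<strong>$1</strong>', $html);
$html = preg_replace('/_([^_]+)_/', '<em>$1</em>', $html);
$html = preg_replace('/\*([^\*]+)\*/', '<em>$1</em>', $html);
echo $html, "\n";

// If the single-asterisk rule ran first, it would eat the inner pair of a bold marker.
$wrongOrder = preg_replace('/\*([^\*]+)\*/', '<em>$1</em>', '**bold**');
echo $wrongOrder, "\n"; // *<em>bold</em>*
```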

So that's it, we have a function that will change single & double asterisks/underscores to the appropriate HTML tags. Now let's move on to the next part of the markdown that we intend to support, and that is the ability for the user to turn enter/newline into <br> (a linebreak) and <p> (a paragraph). So if he presses enter once, we will make it a linebreak in the HTML, but if he presses it twice, we will make it a paragraph. So what do we have to look for here? Well, \n and \n\n, that is, one and two 'presses' of the enter button. Since what we're looking for is just simple text, we can simply use str_replace. But there is a slight catch, as I mentioned above: Some computers will use \r\n to symbolize a newline (Windows), others \r (old Macintosh machines), and then some \n. So what we can do to solve this is to replace all the \r\n (and the \r) with \n. We can do that like this:

//Change Windows' \r\n to \n
$text = str_replace("\r\n", "\n", $text);
//Convert Macintosh \r to \n
$text = str_replace("\r", "\n", $text);

So this is it, all our different forms of newlines are being symbolized with the escape sequence \n. Now all we have to do is to change all the single newlines to <br>, and the double newlines to <p>. Before I forget, there is a reason for why we put \r\n before \r, and that is because, if we replace all the \r's before we replace the \r\n, all the \r\n will become \n\n, which isn't really what we want. But let's get to converting the newlines to breaks and paragraphs:

	//Convert two \n to a <p>
	$text = '<p>' . str_replace("\n\n", '</p><p>', $text) . '</p>';
	//Convert one \n to a <br>
	$text = str_replace("\n", '<br>', $text); 

There's maybe one thing that needs to be explained here. We place a paragraph opening and closing tag around the text because, well, for starters, we will want to make it a single paragraph by default. If we didn't do that and simply added paragraphs within the text on newlines, that means some portion of the text would be outside of the paragraph, while another pops up right in the middle of it. But as you may have noticed, we'll open the paragraph, and each time there are two newlines, we close the current paragraph and open a new one. At the end we finally close the last paragraph.
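
To check the newline handling from start to finish, here is a sketch that runs the four str_replace calls on a made-up input mixing all three newline styles. Note the double quotes: PHP only turns \r and \n into real characters inside double-quoted strings.

```php
<?php
// Windows newlines (\r\n), a blank line between paragraphs, and one old Mac newline (\r).
$raw = "Line one\r\nLine two\r\n\r\nSecond paragraph\rOld Mac line";

// Normalize: \r\n first, then any leftover \r.
$normalized = str_replace("\r\n", "\n", $raw);
$normalized = str_replace("\r", "\n", $normalized);

// Two newlines end a paragraph, one newline becomes a linebreak.
$html = '<p>' . str_replace("\n\n", '</p><p>', $normalized) . '</p>';
$html = str_replace("\n", '<br>', $html);

echo $html, "\n";
// <p>Line one<br>Line two</p><p>Second paragraph<br>Old Mac line</p>
```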

Now for the last part of our system, the hyperlink support. An example of the markdown for this would be: [A great description for the url](the url itself). This should then be turned into: <a href="the url itself">A great description for the url</a>.

This is the regular expression we'll use to find the URL: /\[([^\]]+)]\(([-a-z0-9._~:\/?#@!$&'()*+,;=%]+)\)/i

This may look quite intimidating but isn't quite that complicated. I'll try to break it down bit by bit:
/\[ - Looks for the opening square bracket. We escape it due to the square bracket's special effect in regular expressions.

We then place round brackets (()) around what's supposed to be inside the square brackets ([]) so that we can use $1 to refer to it later on. Inside the round brackets we have [^\]]+ which means that there has to be one or more characters that are not a closing square bracket. After that we look for the closing square bracket. The reason why we don't have to escape the closing square bracket is that there is no unescaped opening bracket that would work with it.

After that comes \( - which looks for the opening round bracket (around the URL).
Then we place round brackets around what's in there so that we can access it later by using the variable $2.

Inside the square brackets are the characters that are allowed in a URL. Because they are inside square brackets, you do not have to escape the majority of the special characters. You do however have to escape the slash (/) or PHP will think you are ending the regular expression. You should also note that if you want a hyphen (-) to be matched literally, you should place it first (or last) in the square brackets, or escape it. Otherwise PHP will assume that you're looking for a range of things, like in a-z or 0-9. Which is another thing we're looking for: any character from a-z, 0-9, a dot, underscore, and all the other characters allowed in a URL.

Finally we add \) which looks for the closing round bracket, and the modifier /i, which tells the regular expression to perform a case insensitive search. Now that we have the regular expression, let's apply the preg_replace:

	$text = preg_replace('/\[([^\]]+)]\(([-a-z0-9._~:\/?#@!$&\'()*+,;=%]+)\)/i', '<a href="$2">$1</a>', $text); 

	return $text;

One more thing about the regular expression there: We have to escape the single quote (') due to us using single quotes around the whole thing. As for the second argument of preg_replace, we get the content of the second round brackets and place it in the URL's position, and the content of the first round brackets we place inside the link text. But this is it, this is all we set out to achieve. So here's the full thing as promised:
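
Here is the link replacement as a runnable sketch. The example URL is just a placeholder I chose. One caveat worth knowing: since the character class allows letters, colons and round brackets, the pattern also accepts things like javascript: links. The second pattern, which insists on http:// or https:// at the start, is my own hedged suggestion and not part of the original function.

```php
<?php
// The pattern from the tutorial.
$linkPattern = '/\[([^\]]+)]\(([-a-z0-9._~:\/?#@!$&\'()*+,;=%]+)\)/i';

$input = 'Read the [PHP manual](https://www.php.net/manual/en/) today.';
$html = preg_replace($linkPattern, '<a href="$2">$1</a>', $input);
echo $html, "\n";
// Read the <a href="https://www.php.net/manual/en/">PHP manual</a> today.

// The same pattern happily turns a javascript: "URL" into a link.
$sneaky = '[click](javascript:alert(1))';
echo preg_replace($linkPattern, '<a href="$2">$1</a>', $sneaky), "\n";
// <a href="javascript:alert(1)">click</a>

// Hypothetical stricter variant: the URL has to start with http:// or https://.
$strictPattern = '/\[([^\]]+)]\((https?:\/\/[-a-z0-9._~:\/?#@!$&\'()*+,;=%]+)\)/i';
$safe = preg_replace($strictPattern, '<a href="$2">$1</a>', $sneaky);
echo $safe, "\n"; // [click](javascript:alert(1)) is left as plain text
```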

function convertMarkdownToHTML($text)
{
	$text = htmlspecialchars($text, ENT_QUOTES, "UTF-8");
	$text = preg_replace('/__(.+?)__/s', '<strong>$1</strong>', $text);
	$text = preg_replace('/\*\*(.+?)\*\*/s', '<strong>$1</strong>', $text);
	$text = preg_replace('/_([^_]+)_/', '<em>$1</em>', $text);
	$text = preg_replace('/\*([^\*]+)\*/', '<em>$1</em>', $text);
	$text = str_replace("\r\n", "\n", $text);
	$text = str_replace("\r", "\n", $text);

	$text = '<p>' . str_replace("\n\n", '</p><p>', $text) . '</p>';
	$text = str_replace("\n", '<br>', $text);
	$text = preg_replace('/\[([^\]]+)]\(([-a-z0-9._~:\/?#@!$&\'()*+,;=%]+)\)/i', '<a href="$2">$1</a>', $text);
	return $text;
}

You can then apply this anywhere by including the file that contains this function, and echoing its output. Do note that there are far better markdown systems out there that are more accurate, but I believe this might be a good place to start. I hope you found this to be helpful, and do check out Kevin Yank's book on PHP & MySQL if you get the chance. This is where I picked this up myself, and I find that book to be very clear and to the point on just about everything it covers. Thank you for reading through this. :)

Replies To: Simple Content Formatting with Regular Expressions

#2 Nublet

Posted 28 December 2012 - 12:02 PM

Ready to go slay Regular Expressions :gun_bandana:
