And here we are, the final stretch of our coding adventure with regular expressions in PHP!
The theme of this chapter is simple: practice. Aside from a few bits we'll cover at the start, you already know the essentials of regex. What you're missing now is the most important part: hands-on experience!
In the second half of this chapter, we're going to build some regex patterns together, so you can see how to finally wrangle that regex into submission! Writing a tiny piece of regex like we've done so far is one thing, but crafting a full, working regex? That's a whole different beast!
A Word About Metacharacters
Before we dive in, I want to add a new word to your regex vocabulary: metacharacters.
Nope, it's not programmer slang. It simply means "special characters." These are characters that aren't interpreted literally but have a specific function in regex.
The metacharacters are escaping... why?
In PCRE (PHP's regular expression flavor), the main metacharacters to know are:
# ! ^ $ ( ) [ ] { } ? + * . \ |
You'll want to commit these to memory. Most of them, you've already seen.
For instance, the dollar sign $ is special because it denotes the end of a string. Same goes for the caret, the hash (our delimiter), parentheses, brackets, curly braces, and symbols like ? + *: we've used all of these in the previous chapter, remember?
The dot . and backslash \ are new to you, but don't worry, we'll get to them shortly.
Now, these characters all have a specific meaning, and that's exactly where the trouble starts.
The problem shows up the day you want to search for something like "What?" in a string. Your first instinct might be to write:
#What?#
Nope, absolutely not! Remember: the question mark ? means the character just before it is optional (it can appear 0 or 1 time).
In that example, it would make the "t" before the question mark optional, which is not what we want!
So how do we search for a literal question mark when ? already has a job to do? We need to escape it. That means placing a backslash \ in front of the special character. The correct regex would be:
#What\?#
Here, the backslash says: "Hey, this ? is not special, it's just a plain old question mark!"
This works for every metacharacter (# ! ^ $ ( ) [ ] { } ? + * . \ |): if you want to use one literally in your pattern, just pop a backslash in front. And yes, to use an actual backslash \, you'll need... another backslash! Like this: \\
Twisty, isn't it?
But the rule is simple enough: If you want to match a special character, escape it with a backslash. Period.
Let me give you a few examples to help this sink in:
String | Regex | Match? |
---|---|---|
I'm excited! | #excited\!# | TRUE |
I'm (very) tired | #\(very\) tired# | TRUE |
I'm sleepy... | #sleepy\.\.\.# | TRUE |
That smiley :-\ | #:-\\# | TRUE |
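To see the escaping rule in action, here is a minimal sketch. The sample strings and variable names are made up for illustration; preg_quote() is PHP's built-in helper that adds the backslashes for you, which can be handy when the text to search for comes from a variable.

```php
<?php
// A literal "?" must be escaped, otherwise it makes the preceding "t" optional.
$unescaped = preg_match('#What?#', 'Wha');    // 1: "t" is optional, so "Wha" matches
$escaped   = preg_match('#What\?#', 'Wha');   // 0: a literal "?" is now required
$literal   = preg_match('#What\?#', 'What?'); // 1

// preg_quote() escapes every metacharacter in a string for us.
// The second argument is our delimiter, so "#" would be escaped as well.
$needle  = 'I paid $5 (approx.)'; // hypothetical sample text
$pattern = '#' . preg_quote($needle, '#') . '#';
$quoted  = preg_match($pattern, 'Yesterday I paid $5 (approx.) for it'); // 1
```

Note that the patterns are written in single quotes, so PHP passes the backslash through to the regex engine untouched.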
The Case of Character Classes
Okay, maybe I'm starting to lose you here (I get it), but stick with me a little longer. If you want to actually use regex in real projects, you really do need to know this stuff.
There's one last little exception we have to talk about: character classes.
So far, you've used things like this:
#[a-z0-9]#
But as you probably guessed, you can put other characters in those brackets, like accented letters (though you'll need to list them individually): [a-zéèàêâùïüë] and so on...
That's all good. But what if you want to include special characters inside a class? Like, say, a question mark?
Here's the twist: inside brackets, metacharacters lose their powers! You don't need to escape them.
So this regex is totally valid:
#[a-z?+*{}]#
It means: "allow a letter, a question mark, a plus sign, etc."
But, of course, there are three exceptions:
- # (hash): This one still marks the end of a regex in PHP. So even inside brackets, it needs to be escaped with a backslash.
- ] (closing bracket): Normally it closes the class, so if you want to include it in the list, you must escape it.
- - (hyphen): This defines a range (like [a-z]). If you want to match a literal dash, just place it at the start or end of the class, like [a-z0-9-].
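Here is a small sketch of those class rules, using made-up sample strings. Each call returns 1 when the whole string is made only of characters allowed by the class.

```php
<?php
// Inside a class, ? + * { } are plain characters: no backslash needed.
$plain = preg_match('#^[a-z?+*{}]+$#', 'a+b*c?');        // 1

// A hyphen placed last in the class is a literal dash, not a range.
$slug = preg_match('#^[a-z0-9-]+$#', 'my-page-42');       // 1

// The delimiter (# here) and the closing bracket still need a backslash.
$escapedInClass = preg_match('#^[a-z\#\]]+$#', 'tag#]');  // 1
```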
Shorthand Regex Classes
Good news: you now have everything you need to write almost any regex you want! Bad news: I said almost.
Don't worry, this next bit won't hurt (and honestly, by now you're probably numb anyway). I just want to introduce you to something called shorthand classes, or, as I prefer to say: shortcuts.
You might not use all of them right away, but chances are you'll run into these someday. Better to know now than be blindsided later.
Here's the cheat sheet:
Shortcut | Meaning |
---|---|
\d | A digit (same as [0-9]) |
\D | Anything but a digit (same as [^0-9]) |
\w | A word character (alphanumeric + underscore). Equals [a-zA-Z0-9_] |
\W | Anything but a word character (same as [^a-zA-Z0-9_]) |
\t | A tab character |
\n | A newline |
\r | A carriage return |
\s | A whitespace character (a space, \t, \n, \r, and similar) |
\S | Anything but a whitespace character |
. | Any character at all! (Well... almost) |
These are just regular letters, but when you add a backslash in front, they take on a special meaning.
So this is actually the opposite of what we did earlier: before, we escaped special characters to make them behave normally. Now, we're giving normal characters a special meaning.
And one note about the dot (.): it matches any character except newlines (\n).
To make it truly match everything, including line breaks, you need to activate the s modifier in PCRE:
#[0-9]-.#s
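A quick sketch of the shortcuts and of the s modifier, with sample strings invented for the occasion. The last two calls use the exact pattern shown above, with and without the modifier.

```php
<?php
// \d = digit, \s = whitespace, \w = word character
$shorthand = preg_match('#^\d{2}\s\w+$#', '42 apples'); // 1

// Without the s modifier, the dot refuses to match a newline...
$withoutS = preg_match('#[0-9]-.#', "7-\n");             // 0
// ...with it, the newline counts as "any character".
$withS = preg_match('#[0-9]-.#s', "7-\n");               // 1
```

The subject string is in double quotes here so that PHP turns \n into a real newline.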
Alright, now you're finally ready: we can move on to the fun part, practice!
Building a Full-Fledged Regex
You're finally going to understand why this whole thing has been such a headache!
This time, we're diving into real, practical examples you'll definitely find useful. We're going to build a big, complete regex together so you get the method. After that, you'll be totally able to whip up your own regexes and use them in your PHP scripts!
Validating an Email Address
It'd be a shame to stay stuck in theory, especially with a topic as tricky (let's not be shy) as regular expressions.
So let me show you a real-world example that'll definitely come in handy: checking whether an email address is valid.
But before we dive in, let's make sure we're on the same page. Here's how an email address is built:
- First, there's the username (at least one character, but rarely just one). It can include lowercase letters (no uppercase!), numbers, dots, hyphens, and underscores _.
- Then comes the at sign: @
- Next is the domain name. Same rules as the username: lowercase letters, numbers, hyphens, dots, and underscores. The only catch (you might not guess this) is that it must have at least 2 characters. So something like a.com is invalid, but aa.com works.
- Finally, there's the extension (like .com). This part always starts with a dot, followed by 2 to 4 lowercase letters. Think .com, .net, but also .biz, .jp, .org, .info, and so on...
So an example of a valid email could be: siterawfan@siteraw.com
Let's build the regex step by step:
- First off, we only want the email address, so we specify the start and end of the string:
#^$#
- Next, we want one or more lowercase letters, numbers, dots, underscores, or hyphens. That's the character class [a-z0-9._-] followed by + to say "at least one":
#^[a-z0-9._-]+$#
- Now the at sign. Easy, just add the @ character:
#^[a-z0-9._-]+@$#
- Then, we want another string of lowercase letters, numbers, etc., this time with at least 2 characters. We use {2,} to say "2 or more":
#^[a-z0-9._-]+@[a-z0-9._-]{2,}$#
- After that comes the dot (like in .com). Now remember: a plain dot means "any character" in regex. But here we want a literal dot, so we escape it with a backslash:
#^[a-z0-9._-]+@[a-z0-9._-]{2,}\.$#
- Finally, we need 2 to 4 lowercase letters: no digits, no symbols, just letters:
#^[a-z0-9._-]+@[a-z0-9._-]{2,}\.[a-z]{2,4}$#
And there you have it! A complete, working regex!
#^[a-z0-9._-]+@[a-z0-9._-]{2,}\.[a-z]{2,4}$#
Starting to feel like you're reading ancient hieroglyphs yet?
Since I'm in a good mood, here's a full PHP script to test this regex:
<p>
<?php
if (isset($_POST['email'])) {
    // Neutralize any HTML the visitor might've entered
    $_POST['email'] = htmlspecialchars($_POST['email']);

    if (preg_match('#^[a-z0-9._-]+@[a-z0-9._-]{2,}\.[a-z]{2,4}$#', $_POST['email'])) {
        echo 'The address ' . $_POST['email'] . ' is <strong>valid</strong>!';
    } else {
        echo 'The address ' . $_POST['email'] . ' is not valid, try again!';
    }
}
?>
</p>
<form method="post">
<p>
<label for="email">Your email?</label> <input id="email" name="email" /><br />
<input type="submit" value="Check Email" />
</p>
</form>
Try testing addresses like:
- siterawfan@siteraw.com
- bigbossman@hotmail.com
- lol@facebook.com
- bill@microsoft.com
- ...and so on!
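If you'd rather test the regex without a form, here is a minimal sketch that wraps it in a function. The function name looks_like_email and the sample addresses are assumptions made for this example; the pattern itself is the one we just built, so it only follows this chapter's simplified rules.

```php
<?php
/**
 * Hypothetical helper (the name is ours): checks an address against
 * the chapter's simplified rules. Real-world addresses can be more varied.
 */
function looks_like_email(string $email): bool
{
    return preg_match('#^[a-z0-9._-]+@[a-z0-9._-]{2,}\.[a-z]{2,4}$#', $email) === 1;
}

$checks = [
    'siterawfan@siteraw.com' => looks_like_email('siterawfan@siteraw.com'), // true
    'bill@a.com'             => looks_like_email('bill@a.com'),             // false: domain too short
    'Bill@microsoft.com'     => looks_like_email('Bill@microsoft.com'),     // false: uppercase letter
    'no-at-sign.com'         => looks_like_email('no-at-sign.com'),         // false: no @
];

// PHP's built-in validator is a more permissive alternative.
$builtin = filter_var('siterawfan@siteraw.com', FILTER_VALIDATE_EMAIL) !== false; // true
```

For production code, filter_var() with FILTER_VALIDATE_EMAIL is often the safer choice, since it accepts valid forms (uppercase letters, longer extensions) that our teaching regex rejects.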
So, what do you think? Pretty cool, right?
I admit, regexes look like total madness when you first lay eyes on one. I can only imagine the look on your face back in the previous chapter's intro when I showed you a sample.
But look at how far we've come! Together, we've written one of those seemingly incomprehensible things, and I bet most of you never thought you'd be able to do that when reading the previous chapter.
And yet, here we are. We've successfully built a complete regex. I won't make you write another one: at this point, you've got the hang of it, and you're ready to fly solo.
I just want to show you one last cool thing before we move on to our final important concept: Capturing and Replacing.
Regex... with MySQL!
See? All that struggling to get here is finally paying off.
Yes indeed (drumroll please): MySQL supports regex!
And that, my friend, is all upside for you: you've just learned how to write regexes, and you hardly need to learn anything new to use them with MySQL.
There's just one catch: MySQL has traditionally supported only POSIX-style regexes, not PCRE (which we've been learning). Since version 8.0.4 it relies on the ICU library instead, which is closer to PCRE, but sticking to the POSIX-style subset described here keeps your queries portable.
Hey now, calm down.
I taught you PCRE because it's way faster and it's almost identical to POSIX anyway.
Here's what you need to know for POSIX regexes:
- No delimiters or options. So no # signs around the regex.
- No shorthand classes like \d, \w, etc. But you can still use the dot . for "any character."
Let's look at an example. Say you've stored visitor IP addresses in a table called visitors, and you want to find all visitors whose IP starts with 192.178:
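SELECT name FROM visitors WHERE ip REGEXP '^192\\.178(\\.[0-9]{1,3}){2}$'
What does this do? It selects all names from the visitors table where the IP starts with 192.178 and ends with two other numbers, each 1 to 3 digits long (like 192.178.163.102). The backslashes are doubled because MySQL string literals treat \ as an escape character of their own, so a single one would never reach the regex.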
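Since this particular pattern only uses syntax shared by both flavors, one way to sanity-check it before running the query is to try it in PHP first. The sketch below does that with preg_grep(); the list of IPs is invented sample data, not something read from a real visitors table.

```php
<?php
// Same pattern as in the SQL query, wrapped in PCRE delimiters for PHP.
// In a single-quoted PHP string, one backslash is enough.
$pattern = '#^192\.178(\.[0-9]{1,3}){2}$#';

// Hypothetical sample data standing in for the ip column.
$ips = ['192.178.163.102', '192.178.1.1', '192.179.0.1', '192.178.1'];

// preg_grep() keeps only the entries that match the pattern.
$matching = array_values(preg_grep($pattern, $ips));
// $matching is ['192.178.163.102', '192.178.1.1']
```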
That's the power of regex in an SQL query: super precise searches... what's not to love?
I won't go deeper here. You now have the tools to figure it out if you ever need it.
Now, onto the final major concept in regex: Capturing and Replacing!
Capturing and Replacing in PHP
At the start of these two chapters on Regex, I told you they're great not just for powerful searching (which we just saw with our email example), but also for search and replace.
That's going to let us do cool things like:
- Detect email addresses in a message left by a visitor.
- Automatically turn each email into a clickable link using <a href="mailto:someone@example.com">.
This same technique also works to make https:// links clickable too. And soon, you'll even learn how to create your own simplified markup language, like the famous bbCode used in most forums ([b][/b] to make text bold, ring a bell?).
Capturing Parentheses
Everything from here on out revolves around parentheses. You've already used them to group parts of a Regex and repeat them (like we did with the phone number). Well, that's their first use, but not the only one!
From now on, we'll be using the preg_replace function.
This is the function that lets us capture text using parentheses.
Here's the trick: every time you use parentheses, PHP automatically creates a "variable" that stores what was inside them.
Let me explain with this Regex:
#\[b\](.+)\[/b\]#
You should be able to read this pretty easily by now. It says: "Find a [b], followed by one or more characters (the dot . means 'any character'), followed by a [/b]".
Note the backslash \ before the square brackets, so PHP doesn't confuse them with character classes (like [a-z]).
Now, if you think for a second, you'll realize the parentheses aren't strictly necessary here. And you're right: if we were just searching, we wouldn't need them.
But since we want to replace, those parentheses are going to come in super handy.
Here's the key idea: every time you use parentheses, PHP creates a variable:
- $1 for the first set,
- $2 for the second, and so on...
We then use those variables in the replacement string.
In the Regex above, we've only got one pair of parentheses, right? So PHP will create one variable: $1, which will contain the text between [b] and [/b]. That's what we'll put in bold!
All right, enough theory, let's get to the fun part. Here's how you bold any text surrounded by [b][/b]:
<?php $text = preg_replace('#\[b\](.+)\[/b\]#i', '<strong>$1</strong>', $text); ?>
Let's break down how preg_replace works:
- First parameter: the Regex. Nothing special, just remember each set of parentheses becomes a variable ($1, $2, etc.).
- I added the i option to make it case-insensitive, so it'll match [B][/B] too.
- The second parameter is the replacement: '<strong>$1</strong>' (remember, <strong> makes text bold in HTML).
- Inside the HTML tags, we use $1. That means whatever was inside [b]...[/b] will now be wrapped in <strong>.
- And finally, the third parameter is the text we're searching and replacing in (you already know this part).
preg_replace returns the new version of the text, after the replacements are done.
A few rules you'll need to know.
If you have multiple parentheses, just count them left to right to get their variable numbers.
For example: #(site)ra(w)(the(best)site)raw#
This Regex has 4 capture groups: $1, $2, $3, and $4.
- $3 captures "thebestsite"
- $4 captures "best"
It's the order in which the parentheses open that determines the numbering.
You can use up to 99 capture groups in a single Regex (plenty of room, right?), so variables go up to $99.
$0 is always automatically created. It contains the entire match. In the example above, $0 would be "siterawthebestsiteraw".
Don't want a capturing group? Start it with ?: like this: #(site)ra(?:w)(the(best)site)raw#
The second group, (?:w), won't be captured. That leaves us with 3 capture variables (or 4 if you count $0):
- $0: siterawthebestsiteraw
- $1: site
- $2: thebestsite
- $3: best
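You can check this numbering yourself. The sketch below uses preg_match() with a third argument, which PHP fills with the captured groups as array entries: index 0 corresponds to $0, index 1 to $1, and so on. The variable names are ours.

```php
<?php
$subject = 'siterawthebestsiteraw';

// Four capturing groups, numbered in the order their parentheses open.
preg_match('#(site)ra(w)(the(best)site)raw#', $subject, $all);
// $all is ['siterawthebestsiteraw', 'site', 'w', 'thebestsite', 'best']

// With (?:w), the second group is skipped and the later ones shift down.
preg_match('#(site)ra(?:w)(the(best)site)raw#', $subject, $fewer);
// $fewer is ['siterawthebestsiteraw', 'site', 'thebestsite', 'best']
```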
If you've wrapped your head around this, you're crushing it!
Create Your Own bbCode
Now it's time to get practical and put capturing parentheses to work.
We're going to build something called a parser.
This parser will take a message written by a visitor (maybe on a forum, in a guestbook, or even in a mini-chat), and:
- Sanitize any dangerous HTML using htmlspecialchars
- Replace bbCode tags with proper HTML equivalents!
We won't do every possible bbCode (too much work), but these are enough to get us rolling:
- [b][/b]: bold text
- [i][/i]: italic text
- [color=red][/color]: color the text (we'll let users pick from a few colors)
We'll also auto-link any URLs that start with https://.
Let's start with [b] and [i] (same structure for both).
You've already seen the code for [b], and it's almost right. But we're missing a few options. To make it work correctly, we need to add three flags:
- i: case-insensitive matching ([B], [b], etc.)
- s: lets the dot (.) match newlines (so bold text can span multiple lines)
- U: this one's new! It stands for "Ungreedy".
Without the last one, the Regex might be too greedy and match too much.
Example: This is [b]important[/b], you must [b]understand[/b]!
- Without the U flag, the Regex would bold everything between the first [b] and the last [/b], which we don't want!
- With U, it stops at the first closing tag. Perfect.
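The difference is easy to see by running both versions on the example sentence. This is a small sketch with variable names of our choosing; the third call shows an alternative you may meet elsewhere, the lazy quantifier +?, which makes a single quantifier ungreedy instead of flipping the whole pattern with U.

```php
<?php
$text = 'This is [b]important[/b], you must [b]understand[/b]!';

// Greedy: .+ swallows as much as possible, up to the LAST [/b].
$greedy = preg_replace('#\[b\](.+)\[/b\]#is', '<strong>$1</strong>', $text);
// 'This is <strong>important[/b], you must [b]understand</strong>!'

// Ungreedy (U flag): .+ stops at the FIRST [/b] it can.
$ungreedy = preg_replace('#\[b\](.+)\[/b\]#isU', '<strong>$1</strong>', $text);
// 'This is <strong>important</strong>, you must <strong>understand</strong>!'

// Alternative: keep default greediness, make only this quantifier lazy.
$lazy = preg_replace('#\[b\](.+?)\[/b\]#is', '<strong>$1</strong>', $text);
```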
Here's the final code for bold and italic bbCode:
<?php
$text = preg_replace('#\[b\](.+)\[/b\]#isU', '<strong>$1</strong>', $text);
$text = preg_replace('#\[i\](.+)\[/i\]#isU', '<em>$1</em>', $text);
?>
As you can see, [b] and [i] work almost the same: only the HTML tag changes (<em> instead of <strong>).
Still with me? Great. Let's level up: time for [color=...].
We'll let users choose from a few color names using the | (OR) symbol, and use two capture groups:
- One to capture the color (in English, so we can use it directly in HTML)
- One to capture the text between [color=...] and [/color]
Here's the code:
<?php $text = preg_replace('#\[color=(red|green|blue|yellow|purple|olive)\](.+)\[/color\]#isU', '<span style="color:$1">$2</span>', $text); ?>
So if someone types [color=blue]text[/color], it'll show the word text in blue. Try the other colors too!
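A nice side effect of listing the colors with | is that anything outside the list is simply ignored. Here is a short sketch of that behavior; the variable names and the "pink" example are ours.

```php
<?php
$pattern     = '#\[color=(red|green|blue|yellow|purple|olive)\](.+)\[/color\]#isU';
$replacement = '<span style="color:$1">$2</span>';

$blue = preg_replace($pattern, $replacement, '[color=blue]text[/color]');
// '<span style="color:blue">text</span>'

// "pink" is not in the list, so the tag is left exactly as typed.
$pink = preg_replace($pattern, $replacement, '[color=pink]text[/color]');
// '[color=pink]text[/color]'
```

Because only the listed names can ever reach the style attribute, visitors can't slip arbitrary CSS into your page through this tag.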
Last Step: Clickable URLs
Now let's auto-convert all https:// links into clickable anchors.
Try writing the Regex yourself β you totally can!
Here's the answer:
<?php $text = preg_replace('#https://[a-z0-9._/-]+#i', '<a href="$0">$0</a>', $text); ?>
In the replacement, I used $0, which, if you remember, holds the entire matched text (in this case, the full URL).
No need for the s or U flags:
- URLs never have line breaks.
- "Ungreedy" isn't helpful here (in fact, try it: you'll see it cuts off the URL too soon!)
I kept the Regex simple. Sure, we could make it more precise, but I don't want to overload you now.
Plus, I'd rather you improve it yourself.
- For instance, the current Regex works fine for something like: https://www.siteraw.com/php/build-a-dynamic-website
- But not for URLs with parameters like: https://www.siteraw.com/search?q=siteraw
I'll leave that as your challenge.
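To see exactly where the simple Regex gives up (without spoiling the fix), here is a small sketch that runs it on both URLs. The variable names are ours.

```php
<?php
$pattern = '#https://[a-z0-9._/-]+#i';
$link    = '<a href="$0">$0</a>';

$simple = preg_replace($pattern, $link, 'See https://www.siteraw.com/php/build-a-dynamic-website');
// 'See <a href="https://www.siteraw.com/php/build-a-dynamic-website">https://www.siteraw.com/php/build-a-dynamic-website</a>'

// The "?" is not in the character class, so the match stops right before it.
$withQuery = preg_replace($pattern, $link, 'https://www.siteraw.com/search?q=siteraw');
// '<a href="https://www.siteraw.com/search">https://www.siteraw.com/search</a>?q=siteraw'
```

The second result hints at what's missing: the characters that appear in query strings are not part of the class yet.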
You know what? Maybe your head's spinning by now, but imagine how mine feels: my head hurts AND my fingers ache!
But I'll give one last push. Here's a bonus gift:
<?php
if (isset($_POST['text'])) {
    // No stripslashes() needed: since PHP 5.4, slashes are no longer added automatically
    $text = htmlspecialchars($_POST['text']); // Neutralize any HTML tags
    $text = nl2br($text); // Convert newlines to <br />

    // Run our Regex-powered parser
    $text = preg_replace('#\[b\](.+)\[/b\]#isU', '<strong>$1</strong>', $text);
    $text = preg_replace('#\[i\](.+)\[/i\]#isU', '<em>$1</em>', $text);
    $text = preg_replace('#\[color=(red|green|blue|yellow|purple|olive)\](.+)\[/color\]#isU', '<span style="color:$1">$2</span>', $text);
    $text = preg_replace('#https://[a-z0-9._/-]+#i', '<a href="$0">$0</a>', $text);

    // And display the result. Ta-da!
    echo $text . '<br /><hr />';
}
?>
<p> Welcome to the SiteRaw.com parser!<br /> We built this parser together. I hope you're proud, and you'll see how useful everything you just learned is! </p>
<p>Go ahead, try out some bbCode! For example, type:</p>
<blockquote style="font-size:0.8em"> <p> I'm a total [b]NOOB[/b], and yet I [i]learned it all[/i] on https://www.siteraw.com<br /> I [b][color=green]highly recommend[/color][/b] this site: you'll be able to learn how to do this [i][color=purple]too[/color][/i]! </p> </blockquote>
<form method="post"> <p> <label for="text">Your message?</label><br /> <textarea id="text" name="text" cols="50" rows="8"></textarea><br /> <input type="submit" value="Show me the true power of Regex!" /> </p> </form>
Phew! If this doesn't help you build a killer website... I don't know what will.
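If you want to reuse the parser in several pages, one possible approach is to wrap it in a function. This is only a sketch: the name bbcode_to_html and its signature are assumptions of ours, and it relies on the fact that preg_replace() also accepts arrays of patterns and replacements, applied one after the other.

```php
<?php
/**
 * Hypothetical wrapper (name and signature are ours) around the parser above.
 */
function bbcode_to_html(string $text): string
{
    $text = nl2br(htmlspecialchars($text)); // Neutralize HTML, keep line breaks

    // preg_replace() accepts arrays: each pattern is applied in order,
    // paired with the replacement at the same position.
    $patterns = [
        '#\[b\](.+)\[/b\]#isU',
        '#\[i\](.+)\[/i\]#isU',
        '#\[color=(red|green|blue|yellow|purple|olive)\](.+)\[/color\]#isU',
        '#https://[a-z0-9._/-]+#i',
    ];
    $replacements = [
        '<strong>$1</strong>',
        '<em>$1</em>',
        '<span style="color:$1">$2</span>',
        '<a href="$0">$0</a>',
    ];

    // Fall back to the escaped text if the regex engine reports an error.
    return preg_replace($patterns, $replacements, $text) ?? $text;
}

$html = bbcode_to_html('[b][color=green]Hi[/color][/b] <script>');
// '<strong><span style="color:green">Hi</span></strong> &lt;script&gt;'
```

Keeping htmlspecialchars() as the very first step matters: the HTML we generate afterwards must not be escaped, while anything the visitor typed must be.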
A Few Final Challenges
In case you're hungry for more, here are some ideas to improve the parser:
- I mentioned earlier: try making clickable URLs work even when they contain parameters like: https://www.siteraw.com/search?q=siteraw
- Also, try parsing email addresses and turning them into mailto: links.
- Extend the bbCode with tags like [u], [img], and more.
- Don't like square brackets? Invent your own syntax! Like {bold} instead of [b].
And if you're loving Regex, here's one last challenge...
Write a function that automatically colorizes HTML code!
It should:
- Run htmlspecialchars on the code
- Then wrap tag names in dark blue, attributes in light green, and quoted strings in orangish red using <span style="color:...">
Yeah, good luck... you'll need it!