Regular Expressions


Regular expressions, or ‘regex’ or ‘regexp’ for short, are an invaluable technique that can be used in a variety of ways. They may seem intimidating at first, especially for those without a degree in computer science. With a little practice, however, they can become one of the most useful resources in a user’s toolkit.

One of regex’s most interesting features is that once you’ve learned its syntax, you can use it in nearly all programming languages. That includes JavaScript, Java, VB, C#, C / C++, Python, Perl, Ruby, Delphi, R, Tcl, and many others. You only need to watch for small differences between engines, mostly in the more advanced features and in the syntax variants each engine supports.

Let’s start by looking at some of the regex building blocks.

Starter kit

Literal matching

The first thing you should know about regular expressions is that everything is essentially a character. Therefore, users are basically writing patterns to match a specific sequence of characters (also known as a string).

Basic tasks are quite straightforward, such as searching for text that contains a specific string or character. To do that, you would use literal character matching. Here are some examples:

  • a - Matches any string that has the character a in it

and similarly

  • abc - Matches any string that has the character sequence abc in it

NOTE! In order to be taken literally, the characters ^.[$()|*+?{ and the backslash \ itself need to be escaped with a backslash \ as they have special meaning.

  • abc\? - Matches a string that has a ? character right after the character sequence abc
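As a minimal sketch of literal matching, here is how those patterns behave in Python's built-in re module. The sample strings are made up for illustration, and re.escape is one convenient way to escape literal text without doing it by hand.

```python
import re

# re.search scans the whole string for the first place the pattern matches.
print(bool(re.search(r"abc", "xxabcxx")))        # True: the sequence abc occurs
print(bool(re.search(r"abc", "ab-c")))           # False: the sequence is broken up

# Unescaped, ? is a quantifier; escaped, it is a literal question mark.
print(bool(re.search(r"abc\?", "is it abc?")))   # True
print(bool(re.search(r"abc\?", "is it abc")))    # False

# re.escape produces an escaped version of arbitrary literal text.
escaped = re.escape("1+1=2?")
print(bool(re.search(escaped, "so 1+1=2? yes")))  # True
```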

Meta sequences

Although using literal character matching is very useful in a variety of situations, it can get messy and confusing when long and complex strings are involved. To simplify the regex patterns and make them more compact, there are various meta sequences and character classes defined. Below is a list of some of the more commonly used ones.

The most interesting is called the wildcard, which is represented by the . (dot) meta character.

  • . - Matches any single character (letter, digit, whitespace and so on); in most engines it does not match line breaks by default

NOTE! Use the . operator sparingly as it can cast a net that’s too broad in some cases. Character classes or negated character classes (which we’ll cover next) are faster and more precise.

Here are some meta sequences for the most common character classes:

  • \d - Matches a single character that is a digit

  • \w - Matches a word character (alphanumeric character plus underscore)

  • \s - Matches a whitespace character (includes tabs and line breaks)

  • \d, \w and \s also have their negations: \D, \W and \S respectively.

For example, \D will perform the inverse match with respect to that obtained with \d.

NOTE! Users can also match non-printable characters, such as tabs \t, new-lines \n or carriage returns \r.
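The meta sequences above can be tried out with a short Python sketch. The sample text is invented for the example; re.findall simply lists every match it finds.

```python
import re

text = "Room 42,\tfloor_3\n"

digits = re.findall(r"\d", text)            # every single digit
words = re.findall(r"\w+", text)            # runs of word characters
spaces = re.findall(r"\s", text)            # space, tab and newline
non_digits = re.findall(r"\D+", "a1b22c")   # runs of anything that is not a digit
dots = re.findall(r".", "a\nb")             # in Python, . skips the line break by default

print(digits)      # ['4', '2', '3']
print(words)       # ['Room', '42', 'floor_3']
print(spaces)      # [' ', '\t', '\n']
print(non_digits)  # ['a', 'b', 'c']
print(dots)        # ['a', 'b']
```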

OR operator - [] or |

Sometimes it is convenient to be able to define your own character classes. This can be achieved by using the square brackets:

  • [abc] - Matches a string that contains a, b or c

The same results are generated with the | operator: a|b|c

You can also construct groups of different sequences with |:

  • (materials|portal)\.rangeforce\.com - Matches a string that contains either materials.rangeforce.com or portal.rangeforce.com
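Here is a small Python sketch of both forms of OR. The third host name, blog.rangeforce.com, is only a made-up non-matching input for the example.

```python
import re

host_pattern = re.compile(r"(materials|portal)\.rangeforce\.com")

hosts = ["materials.rangeforce.com", "portal.rangeforce.com", "blog.rangeforce.com"]
# Keep only the hosts that the alternation accepts.
accepted = [h for h in hosts if host_pattern.search(h)]
print(accepted)  # ['materials.rangeforce.com', 'portal.rangeforce.com']

# A character class and a single-character alternation find the same characters.
with_class = re.findall(r"[abc]", "cabbage")
with_pipe = re.findall(r"a|b|c", "cabbage")
print(with_class)  # ['c', 'a', 'b', 'b', 'a']
print(with_pipe)   # ['c', 'a', 'b', 'b', 'a']
```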

Bracket expressions - []

The square brackets offer additional functionality for defining more complex character groups by using different ‘expressions’, such as ranges of characters or negation.

  • [a-c] - Matches a string that contains a, b or c

  • [a-fA-F0-9] - Matches a string that represents a single hexadecimal digit, case-insensitively

  • [a-zA-Z0-9_] - This is basically how the word character (\w) is defined.

  • [0-9]% - Matches a string that has a character from 0 to 9 before a % sign

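  • [^a-zA-Z] - Matches a single character that is not a letter from a to z or from A to Z, so it finds any string that contains at least one non-letter. In this case, the ^ is used as a negation of the expression

NOTE: Inside bracket expressions, most special characters (such as . * + ? | and the parentheses) lose their special powers, so the “escape rule” does not apply to them. The exceptions are ], ^ at the start and - between two characters; in most engines the backslash \ also keeps its meaning, while in POSIX bracket expressions it is a literal character.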
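The bracket expressions above can be checked with a few lines of Python. This is only an illustrative sketch, and the input strings are invented.

```python
import re

hex_digits = re.findall(r"[a-fA-F0-9]", "0xBeeF")    # x is not a hex digit
percents = re.findall(r"[0-9]%", "5% off, 10% tax")   # one digit, then a % sign
non_letters = re.findall(r"[^a-zA-Z]", "a1 b!")      # anything that is not a letter
literal = re.findall(r"[.+]", "1.5+2")               # . and + are plain characters here

print(hex_digits)   # ['0', 'B', 'e', 'e', 'F']
print(percents)     # ['5%', '0%']
print(non_letters)  # ['1', ' ', '!']
print(literal)      # ['.', '+']
```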

Quantifiers - * + ? and {}

You’re probably wondering what the matching of varied-length sequences looks like. Does it involve defining a pattern for each individual sequence (with its different length) and using Boolean operators such as OR to join them together? Wouldn’t the resulting pattern be so long as to be nearly incomprehensible?

Thankfully, there’s a better way. It involves a concept called quantifiers.

Let’s look at some examples of the most used quantifiers:

  • abc* - Matches a string that has ab followed by zero or more c

  • abc+ - Matches a string that has ab followed by one or more c

  • abc? - Matches a string that has ab followed by zero or one c

  • abc{2} - Matches a string that has ab followed by 2 c

  • abc{2,} - Matches a string that has ab followed by 2 or more c

  • abc{2,5} - Matches a string that has ab followed by 2 up to 5 c

  • a(bc)* - Matches a string that has a followed by zero or more copies of the sequence bc

  • a(bc){2,5} - Matches a string that has a followed by 2 up to 5 copies of the sequence bc

A quantifier by default tells the regex engine to match as many instances as possible. This behavior is called greedy. For example, if a user applies the + quantifier to a \d token (\d+), the engine matches one or more digits. If the user feeds the engine a string such as 12345, it will match as many digits as possible, i.e. 12345, despite the fact that “one or more digits” could also be 1, 12, 123 or 1234 in this case.

Although the quantifiers are greedy by default, they are also docile. Meaning, in the case of the string 12345, a pattern like \d+45 would still match the string. That’s because when the quantified token \d+ has matched the whole string and the rest of the pattern (45) can’t match because of that, the engine will start to backtrack, giving back one digit at a time, until the whole pattern matches.
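To see the quantifiers, greediness and backtracking in action, here is a short Python sketch. The helper accepted_by is a hypothetical name introduced for this example; it uses re.fullmatch so that the whole sample string has to fit the pattern.

```python
import re

samples = ["ab", "abc", "abcc", "abccc"]

def accepted_by(pattern: str) -> list[str]:
    """Return the samples that the pattern matches from start to end."""
    return [s for s in samples if re.fullmatch(pattern, s)]

for pattern in [r"abc*", r"abc+", r"abc?", r"abc{2}", r"abc{2,}"]:
    print(pattern, "->", accepted_by(pattern))

# Greedy: \d+ takes as many digits as it can.
greedy = re.search(r"\d+", "12345").group()
print(greedy)  # 12345

# Docile: \d+ gives back "45" so that the rest of the pattern can match.
backtracked = re.search(r"(\d+)45", "12345")
print(backtracked.group(1))  # 123
```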

Anchors - ^ $ and \b

You have now been introduced to most of the building blocks you’ll need to construct rather complicated regex patterns that will find strings containing a certain substring. But suppose a user wants to match a string only if it matches the pattern exactly. Or maybe the user only needs to match a string that starts with The, or ends with end. Or perhaps there’s a need to match exact separate words in text. Again, regex can handle it. There are multiple special tokens for this purpose called anchors. Here are some of the more interesting and useful ones:

  • ^The - Matches any string that starts with The

  • end$ - Matches a string that ends with end

  • ^The end$ - This is an exact string match (matches the string The end exactly)

  • \bword\b - This performs a “whole words only” search

Like ^ and $, \b matches a position rather than a character: a position where there’s a word character (\w) on one side and a non-word character, or the beginning or end of the string, on the other.

This anchor also comes with its negation, \B. This matches all positions where \b doesn’t match and could be used if we want to find a search pattern fully surrounded by word characters.
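A quick Python sketch of the anchors follows. The sentences are made-up test inputs, and cat stands in for the word placeholder used above.

```python
import re

starts_with_the = bool(re.search(r"^The", "The end"))      # True
not_at_start = bool(re.search(r"^The", "At The end"))      # False
ends_with_end = bool(re.search(r"end$", "The end"))        # True
exact = bool(re.search(r"^The end$", "The end."))          # False: the trailing . breaks it

# \b: only the standalone word; \B: only occurrences buried inside a longer word.
whole_words = re.findall(r"\bcat\b", "cat concat cat's catalog")
inner = re.findall(r"\Bcat\B", "cat concatenate")

print(whole_words)  # ['cat', 'cat']
print(inner)        # ['cat']
```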

Advanced techniques

Grouping and capturing - ()

When writing regex patterns for complicated problems, things can get out of hand very quickly. Dividing the problem into smaller chunks helps to organize the process. Users are well-advised to tackle their problems step-by-step. That means grouping different parts of the regex pattern and later joining them together to form the final solution. The values captured by those groups can be used later if need be.

  • a(bc) - Parentheses create a capturing group with value bc

  • a(?<foo>bc) - This adds a name to the group by using ?<foo>

  • a(?:bc)* - This disables the capturing group by using ?:

This grouping mechanism is very useful when a user needs to extract information from strings or data using his or her preferred programming language.

If we choose to put a name to the groups (using (?<foo>...)), we will be able to retrieve the group values using the match result like a dictionary where the keys will be the name of each group.
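Here is a minimal Python sketch of the three group types. One thing to be aware of: Python's re module spells a named group (?P<foo>...) rather than (?<foo>...), so the pattern is adapted accordingly. The input strings are invented.

```python
import re

plain = re.search(r"a(bc)", "xabcx")
print(plain.group(0))  # abc  (the whole match)
print(plain.group(1))  # bc   (the first capturing group)

# Python syntax for a named group: (?P<name>...)
named = re.search(r"a(?P<foo>bc)", "xabcx")
print(named.group("foo"))  # bc
print(named.groupdict())   # {'foo': 'bc'}

# (?:...) groups without capturing, so there is nothing to retrieve afterwards.
non_capturing = re.search(r"a(?:bc)*", "abcbcd")
print(non_capturing.group(0))  # abcbc
print(non_capturing.groups())  # ()
```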

Back-references - \1

  • ([abc])\1 - Here \1 matches the same text that was matched by the first capturing group

  • ([abc])([de])\2\1 - Here \2 (\3, \4, etc.) matches the same text that was matched by the second (third, fourth, etc.) capturing group

  • (?<foo>[abc])\k<foo> - Here we named the group foo and we reference it later with \k<foo>. The result is the same as the first regex’s

NOTE: When referencing different parts of the match outside of the regex pattern (for substitutions, or when using the captured values as variables, etc.) the syntax depends on the programming language. For example, in Perl we can use $+{group} to get the value captured by a group named group for a substitution.
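As an illustration of how the syntax shifts between languages, here is a Python sketch of the same back-references. In Python's re module a named back-reference inside the pattern is written (?P=foo), and \g<name> is used in the replacement text; the sample strings are made up.

```python
import re

# Numbered back-reference: the same character twice in a row.
doubled = [m.group(0) for m in re.finditer(r"([abc])\1", "aab bcc abc")]
print(doubled)  # ['aa', 'cc']

mirrored = bool(re.fullmatch(r"([abc])([de])\2\1", "adda"))      # True
not_mirrored = bool(re.fullmatch(r"([abc])([de])\2\1", "adea"))  # False: \2 must repeat d

# Named back-reference inside the pattern (Python spelling).
named = re.fullmatch(r"(?P<foo>[abc])(?P=foo)", "bb")
print(named.group("foo"))  # b

# Referencing groups outside the pattern, here in a substitution.
swapped = re.sub(r"(?P<first>\w+) (?P<second>\w+)", r"\g<second> \g<first>", "hello world")
print(swapped)  # world hello
```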

Flags

When learning about how to construct a regex pattern, don’t forget about the fundamental concept of flags.

A regex usually comes in this form: /abc/, where the search pattern is delimited by two slash characters (/). After the closing slash, the user can specify one or more of the following flags (they can be combined with one another):

  • g (global) - This does not return after the first match, restarting the subsequent searches from the end of the previous match

  • m (multi-line) - When enabled ^ and $ will match the start and end of a line, instead of the whole string

  • i (insensitive) - This makes the whole expression case-insensitive (for instance /aBc/i would match AbC)
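Not every language uses the /pattern/flags notation. As a rough Python equivalent (a sketch, with an invented three-line sample text): re.search returns only the first match, re.findall behaves like the g flag, and the i and m flags are passed as re.IGNORECASE and re.MULTILINE.

```python
import re

text = "Abc\nabc\nxabc"

first_only = re.search(r"abc", text).group()                    # stops at the first match
all_matches = re.findall(r"abc", text)                           # like g: keeps going
ignore_case = re.findall(r"abc", text, flags=re.IGNORECASE)      # like i
string_start = re.findall(r"^abc", text)                         # ^ is the start of the string
line_starts = re.findall(r"^abc", text, flags=re.MULTILINE)      # like m: ^ is the start of a line
combined = re.findall(r"^abc", text, flags=re.IGNORECASE | re.MULTILINE)

print(first_only)    # abc
print(all_matches)   # ['abc', 'abc']
print(ignore_case)   # ['Abc', 'abc', 'abc']
print(string_start)  # []
print(line_starts)   # ['abc']
print(combined)      # ['Abc', 'abc']
```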

You can try out different flags on regex101.com.

More advanced techniques

Next up, there are additional, advanced RegEx techniques to cover. The following concepts are more complex and tend to be a bit confusing. However, mastering them will enable users to create patterns that would otherwise be either impossible or very difficult for them to create.

NOTE: Not every regex engine supports these constructs. Lookarounds are available in most modern engines, while conditionals are supported by fewer of them.

Lookarounds

The first of these more advanced techniques is one called lookarounds. This technique lets users match characters without actually “consuming” them. This means that after the lookahead’s or lookbehind’s closing parenthesis, the regex engine is stopped at the very same spot in the string from which it started looking.

Here are the four different types of lookarounds:

  • \d+(?= dollars) - Positive lookahead. \d+ - Matches one or more digits, then (?= dollars) asserts that the string “ dollars” immediately follows the current position (the position of the engine after matching those digits).

  • \d+(?!\d| dollars) - Negative lookahead. \d+ - Matches one or more digits, then (?!\d| dollars) asserts that neither a digit nor the string “ dollars” immediately follows the current position

  • (?<=USD)\d+ - Positive lookbehind. (?<=USD) - Asserts that the string “USD” immediately precedes the current position, then if the assertion succeeds, the engine matches one or more digits with \d+

  • (?<!\d|USD)\d+ - Negative lookbehind. (?<!\d|USD) - Asserts that neither a digit nor the string “USD” immediately precedes the current position, then if the assertion succeeds, the engine matches one or more digits with \d+
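The four lookarounds can be sketched in Python as follows. One adaptation is needed: Python's re module only accepts fixed-width lookbehinds, so the negative lookbehind with alternatives of different lengths is written here as two lookbehinds in a row, which expresses the same condition. The price string is invented for the example.

```python
import re

prices = "100 dollars, 200 euros, USD300"

followed_by_dollars = re.findall(r"\d+(?= dollars)", prices)
not_followed = re.findall(r"\d+(?!\d| dollars)", prices)
after_usd = re.findall(r"(?<=USD)\d+", prices)
# Python adaptation of (?<!\d|USD)\d+: two fixed-width lookbehinds.
not_after_usd = re.findall(r"(?<!\d)(?<!USD)\d+", prices)

print(followed_by_dollars)  # ['100']
print(not_followed)         # ['200', '300']
print(after_usd)            # ['300']
print(not_after_usd)        # ['100', '200']
```

Notice that none of the results contain “ dollars” or “USD”: the lookarounds check the surrounding text without making it part of the match.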

Additionally, users can chain multiple lookarounds together. For example, suppose we have a list of strings written in CamelCase, such as “MayTheRangeForceBeWithYou”. To make it easier to read, we would like to split those strings and insert a space at each position between a lowercase and an uppercase letter.

We can use a combination of a positive lookbehind and a positive lookahead to construct a pattern that produces a zero-width match. That would look like the following:

(?<=[a-z])(?=[A-Z])

Here (?<=[a-z]) asserts that a lowercase Latin letter immediately precedes the current position. In addition, (?=[A-Z]) asserts that an uppercase Latin letter immediately follows the current position. Since the regex engine hasn’t moved on the string after processing those assertions, it is able to exactly match each position between a lowercase and an uppercase letter without matching any actual characters. Hence the name “zero-width match”. The last required step is replacing those positions with a whitespace character.
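A minimal Python sketch of that last step, assuming re.sub is used for the replacement; split_camel_case is a hypothetical helper name chosen for this example.

```python
import re

def split_camel_case(text: str) -> str:
    """Insert a space at every position between a lowercase and an uppercase letter."""
    # The pattern matches positions only, so no characters are removed.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)

print(split_camel_case("MayTheRangeForceBeWithYou"))  # May The Range Force Be With You
```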

Conditionals

Next, we’ll cover a special, yet familiar construct called if-then-else conditionals. This technique allows the user to construct conditional regular expressions, where lookahead and lookbehind can be used for the if part.

  • this is (?(?=t\w{3})true|false) - Attempts to match this is , then checks with a positive lookahead, if the following characters are t and 3 word characters. If the condition evaluates to true, the regex engine will attempt to match true literally. Otherwise, it will attempt to match false literally

It is important to remember that the lookaround does not consume any characters, thus the engine will attempt to match the then or else part from the same position where the if was checked.

As an alternative, capturing groups can also be used for the if part of the conditional expression.

  • (a)?b(?(1)c|d) - Due to the optional capturing group (a)?, the engine attempts to match a literal a. Then, it tries to match b. After that, the conditional expression (?(1)c|d) checks whether the first capturing group (a) has taken part in the match so far. If it has, the engine will attempt to match a literal c. Otherwise, it will attempt to match a literal d

It is possible to use any regular expression for the then and else part. If you want to use the | OR operator, you will have to group those parts together using parentheses:

(?(?=condition)(then1|then2|then3)|(else1|else2|else3))
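Conditionals are a good example of engine differences. Python's re module supports the capturing-group form (?(1)...) but not a lookaround as the condition, so the sketch below runs the group-based example as is and emulates the lookahead-based one with a pair of opposite lookaheads. The emulation is an assumption of this example, not the conditional syntax itself.

```python
import re

# Group-based conditional, supported directly by Python.
conditional = re.compile(r"(a)?b(?(1)c|d)")
results = {s: bool(conditional.fullmatch(s)) for s in ["abc", "bd", "abd", "bc"]}
print(results)  # {'abc': True, 'bd': True, 'abd': False, 'bc': False}

# Emulation of this is (?(?=t\w{3})true|false):
# "if the lookahead holds, match true; if it does not hold, match false".
emulated = re.compile(r"this is (?:(?=t\w{3})true|(?!t\w{3})false)")
emulated_results = {
    s: bool(emulated.fullmatch(s))
    for s in ["this is true", "this is false", "this is tall"]
}
print(emulated_results)  # {'this is true': True, 'this is false': True, 'this is tall': False}
```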

Practical examples

You survived the theory section. Great job! Here are some practical and commonly used regex examples as a reward:

Checking for double words:

\b(\w+)\s+\1\b
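A quick way to try this pattern in Python (a sketch; the sentence is made up):

```python
import re

double_word = re.compile(r"\b(\w+)\s+\1\b")

sentence = "This is is a test of the the pattern"
repeated = [m.group(1) for m in double_word.finditer(sentence)]
print(repeated)  # ['is', 'the']
```

The \b anchors matter here: without them, the is at the end of This followed by is would also count as a double word.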

Validating a time string in various formats, such as 10:22, 21:10, 08h55 and 7.15

^(?:(?:0?\d|1[012])|(?:1[3-9]|2[0-3]))[.:h]?[0-5]\d$
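The pattern can be exercised in Python like this. It is only a sketch: is_valid_time is a hypothetical helper name, and the two rejected inputs are extra test values that do not appear above.

```python
import re

TIME = re.compile(r"^(?:(?:0?\d|1[012])|(?:1[3-9]|2[0-3]))[.:h]?[0-5]\d$")

def is_valid_time(value: str) -> bool:
    """Return True if the value looks like HH:MM, HHhMM or HH.MM (24-hour clock)."""
    return TIME.match(value) is not None

for value in ["10:22", "21:10", "08h55", "7.15", "24:00", "12:60"]:
    print(value, is_valid_time(value))
```

The first four values are accepted; 24:00 and 12:60 are rejected because the hour and minute parts are out of range.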

Extracting information from an HTML anchor tag

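<a[^>]*href="([^"]+)"[^>]*>(.*?)<\/a>

The value of the href attribute is captured in the first capture group (can be accessed with \1) and everything between the opening and closing tags is captured in the second capture group (which can be accessed with \2). Using [^"]+ and the lazy .*? instead of the greedy .+ and .* keeps each group from running past its closing quote or closing tag.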
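Here is a Python sketch of the extraction. It uses [^"]+ for the href value and a lazy .*? for the link text, so that each group stops at the first possible delimiter even when the tag has several attributes. The HTML snippet and its URL are invented for the example.

```python
import re

ANCHOR = re.compile(r'<a[^>]*href="([^"]+)"[^>]*>(.*?)</a>')

html = '<p><a class="nav" href="https://example.com/docs" target="_blank">Docs</a></p>'

match = ANCHOR.search(html)
href, link_text = match.group(1), match.group(2)
print(href)       # https://example.com/docs
print(link_text)  # Docs
```

For anything beyond simple, well-formed snippets, a real HTML parser is usually the safer choice.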

Matching an IPv4 in text

\b(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
  (25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
  (25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
  (25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\b

Here (25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]) matches a number in the range 0..255 and doesn’t allow leading zeros. The pattern is wrapped over four lines for readability; join the lines before using it. Also, each number of the IP address is stored in a capturing group. Users can tap into these groups later on. For example, a user can use $& in Perl to get the entire matched IP and $+ or $4 to get the last number of the IP.
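In Python, the same pattern can be assembled from one octet sub-pattern, which avoids typing it four times. This is a sketch; OCTET, IPV4 and the sample log line are names and data made up for the example.

```python
import re

# One octet: 0..255 without leading zeros.
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
# Four octets joined by literal dots, with word boundaries on both sides.
IPV4 = re.compile(r"\b" + r"\.".join([OCTET] * 4) + r"\b")

log_line = "Hosts 192.168.0.1 and 10.0.0.255 replied; 999.1.1.1 did not."

found = [m.group(0) for m in IPV4.finditer(log_line)]
print(found)  # ['192.168.0.1', '10.0.0.255']

first = IPV4.search(log_line)
print(first.groups())  # ('192', '168', '0', '1')
print(first.group(4))  # 1
```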

References

Obviously, there are many different regex engines. Although most of the basic techniques are more or less the same across all of them, they may still differ when it comes to more advanced or unique syntax. Users should try working with different engines to see which one suits them best. For more information about the various engines and their characteristics, visit:

Here are some other valuable resources for learning regex:

Documentation for Perl regex:

Conclusion

As you’ve seen, regular expressions aren’t so regular after all. They can be applied to a range of problems, such as information scraping, data validation, string parsing, string substitution, etc.

In the early stages, learning regular expressions can seem like learning a foreign language, or going into areas in which only experienced mathematicians can excel. However, if you approach it gradually (read, look at examples and try out different patterns for yourself), you’ll surely catch on eventually.

Whether you’re just starting with regex or you’re a die-hard regex enthusiast, this blog post should provide you with some new tools and helpful tips for improving your ability to write regex patterns.

Mihkel Kruusi