Tal Brenev
May 26, 2021

Regular expressions are an incredibly powerful tool. It is no wonder that so
many people have flocked to internet forums to ask, “can I use regex to parse
HTML”? According to this legendary Stack Overflow
post, the answer is a resounding *no*. In
fact, the post implies that attempting to parse HTML with regex will summon a
Lovecraftian entity known as Z̷͎̐a̶͉̓l̵̟͛g̷̺̐ȏ̸̙, and should therefore be avoided at all
costs.

“A mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming.”

— Stack Overflow User

Numerous reasons are given in the Stack Overflow thread for why parsing HTML with regex is a no-go. Intimidating terminology such as “Chomsky Type 2 grammar” is thrown around. They claim that it is “mathematically impossible”, with one reply comparing it to dividing by zero. Looks like there’s nothing more to see here. After all, if the top replies on Stack Overflow say it’s impossible, then it must be impossible, right?

Well yes, but actually no.

Don’t get me wrong. Using regex to parse HTML is a very, very bad idea.
However, the reasoning in that Stack Overflow thread is not entirely sound in
the context of modern programming languages. In this blog post, I’ll explain
why, in some programming languages, if you really *really* wanted to, you could
write a regex to match HTML.

To clear up the confusion surrounding regexes and HTML, we must first figure out what “regular expression” actually means. Unfortunately, this term can mean two different things depending on context. On the one hand, there are “regular expressions” as they appear in modern programming languages and libraries. On the other hand, there is the formal concept of “regular expressions” as it appears in computer science, and in particular the study of languages. The latter notion is the one which causes so many people to believe that parsing HTML is not possible with regular expressions. To see why, we must dive into the formal theory of languages.

For our purposes, a *language* is simply a set of strings. The set of all email
addresses is a language. So is the set of all phone numbers. Importantly, a
language may be infinite; that is, it may contain infinitely many strings. For
example, the set
$\{\varepsilon,\>\text{A},\>\text{AA},\>\text{AAA},\>\text{AAAA},\>…\}$
is an infinite language (here, we use $\varepsilon$ to represent the empty
string). We want a simple way to completely describe a language, even if it is
extremely large or infinite. The previous example can be described informally
as “the set of strings which consist of $\text{A}$ repeated zero or more
times”. This description, which is itself finite, completely describes an
infinite set of strings.

For complex languages, however, informal descriptions are too verbose. Furthermore, we want a computer to automatically check whether or not a string belongs to a language. Since computers cannot understand natural language (yet), informal descriptions will not suffice.

Enter *regular expressions*. Just like the regular expressions used in
every-day programming, formal regular expressions are used to concisely
describe some large set of strings. In formal language theory, regular
expressions consist of:

- “Alphabet” symbols, such as $\text{A}$ or $\text{B}$
- The $+$ symbol, which means “or”
- The $\cdot$ symbol, which means “concatenate”
- The Kleene star, $*$, which means “repeat zero or more times”
- Parentheses, $($ and $)$, for grouping

(For simplicity, I am skipping over some subtler parts of the formal definition; you can find a complete definition here.)

Here are a few examples of formal regular expressions, along with the languages they describe:

Formal Regex | Language |
---|---|

$\text{A}+\text{B}$ | $\{\text{A},\>\text{B}\}$ |

$\text{A} \cdot \text{B} \cdot \text{C}$ | $\{\text{ABC}\}$ |

$\text{A} \cdot \text{B} \cdot (\text{C} + \text{D})$ | $\{\text{ABC},\> \text{ABD}\}$ |

$\text{A}^*$ | $\{\varepsilon,\>\text{A},\>\text{AA},\>\text{AAA},\>\text{AAAA},\>…\}$ |

$(\text{A} \cdot \text{B})^*$ | $\{\varepsilon,\>\text{AB},\>\text{ABAB},\>\text{ABABAB},\>…\}$ |

$\text{A} \cdot \text{B}^*$ | $\{\text{A},\>\text{AB},\>\text{ABB},\>\text{ABBB},\>…\}$ |

It is apparent that these regular expressions are not nearly as powerful as the
regular expressions built into modern programming languages. At the very least,
formal regexes are far more verbose. Whereas in a modern regex you can simply
write something like `[1-4]*`

, a formal regex would require you to write
$(1+2+3+4)^*$. Nevertheless, formal regexes are perfectly capable of
describing simple languages such as phone numbers.

If a language can be described by a formal regular expression, we call it a
*regular language*. In a perfect world, all languages would be regular. But
this is not a perfect world. Consider the following language:
$\{\varepsilon,\>\text{AB},\>\text{AABB},\>\text{AAABBB},\>…\}$.
Let’s call this language $L_1$. This language contains all strings which consist of
$\text{A}$ repeated $n$ times, followed by $\text{B}$ repeated $n$ times.
Although it is very simple to describe $L_1$ informally, it turns out
that it is not a regular language! There does not exist any regular expression
which can describe $L_1$ (we can prove this by using something called
the pumping
lemma).

A language which cannot be described by a regular expression is called
*non-regular*. In general, regular expressions fail to describe languages which
require an arbitrary amount of memory to parse. To see what I mean, let’s take
a look at the following regular expression:

This describes the set of strings consisting of the letter $\text{A}$, repeated no more than 3 times. If we wanted to check whether a string is in this language, we would need to count the number of $\text{A}$s it contains; as soon as this count exceeds 3, we can stop and conclude that the string is not in the language. This process only ever requires a fixed amount of memory: we know that our count will never need to grow past a certain limit.

On the other hand, to check if a string is in our non-regular language $L_1$, we need to count the number of $\text{A}$s and the number of $\text{B}$s, and check that those two numbers are equal. If the string starts with 100 $\text{A}$s, we would need enough memory to count up to 100; if the string starts with 1,000,000 $\text{A}$s, we would need enough memory to count up to 1,000,000. Larger numbers require more memory, so we cannot put a hard limit on the amount of memory required in advance. In other words, parsing strings in $L_1$ requires an unlimited amount of memory!

This gives us a good rule of thumb for figuring out whether a language is regular or not. Given some language, come up with an algorithm for checking whether a string is in the language. If the algorithm takes a fixed, finite amount of memory, independent of the size of the input string, then the language is regular. If you try very hard and cannot come up with such an algorithm, then it is very likely that the language is non-regular (although you would still need to use other methods to prove this with certainty).

As another example, we can look at the set of all palindromes. One way to check whether a string is a palindrome is to reverse it, and check that it is equal to its own reverse. This process requires storing a reversed copy of the string in memory. Since the string can be arbitrarily large, we require an arbitrarily large amount of memory. Any algorithm that I can come up with for determining whether a string is a palindrome runs into this same issue. By our rule of thumb, the set of palindromes is a non-regular language; this is indeed a fact which can be formally proved.

So what about HTML?

`<p>Hello World!</p>`

is valid HTML. `<p><p>Hello World!</p></p>`

is also valid
HTML. On the other hand, `<p><p>Hello World!</p>`

is not valid HTML: there are
two opening `<p>`

tags, and only one closing `</p>`

tag. Even in this very
specific case, we need to store a count in memory to check whether something is
valid HTML: we need to count the number of opening `<p>`

tags, count the number
of closing `</p>`

tags, and check that those two numbers are equal. More
generally, to parse HTML, we need to keep track of all opening tags in memory,
and later check that each one has a corresponding closing tag. There can be an
arbitrary number of tags, so we need an arbitrary amount of memory to parse
them. By this (highly informal) justification, HTML should be non-regular; and
as with the palindrome example, it is possible to formally prove that this is
indeed the case! This is why all those Stack Overflow users were so adamant
that you cannot parse HTML with regular expressions.

Since regular expressions are not very powerful, we need a better way of
describing languages. One way of doing this is by using a *context-free
grammar* (CFG). A CFG is essentially a set of instructions, called *production
rules*, which tell you how to construct strings in a language. This concept is
easier to explain by example. Recall the language $L_1$, which we looked at in
the previous section. The following is a CFG which describes $L_1$:

The symbol $\langle S \rangle$ is the *start symbol*. When we want to construct
a string in $L_1$, we start by writing down $\langle S \rangle$. Then, the
above CFG tells us that we can replace $\langle S \rangle$ with either
$\text{A} \langle S \rangle \text{B}$ or the empty string $\varepsilon$. As
long as the symbol $\langle S \rangle$ is still present, we can continue
replacing it according to these rules. For example, to generate the string
$\text{AABB}$, we would do the following:

For the first 3 steps, we replace $\langle S \rangle$ with $\text{A} \langle S \rangle \text{B}$. For step 4, we replace $\langle S \rangle$ with the empty string.

We immediately see that context-free grammars are more powerful than regular expressions: $L_1$ can be described by a CFG, but not by a regular expression.

The set of palindromes consisting of capital English letters (which, as argued above, is a non-regular language), can also be described by a CFG:

\[\begin{align*} \langle S \rangle &\>\>\>\to\>\>\> \langle X \rangle,\>\> \langle Y \rangle,\>\> \varepsilon \\ \langle X \rangle &\>\>\>\to\>\>\> \text{A} \langle S \rangle \text{A}, \>\> \text{B} \langle S \rangle \text{B},\>\> \text{C} \langle S \rangle \text{C},\>\> ...,\>\> \text{Z} \langle S \rangle \text{Z} \\ \langle Y \rangle &\>\>\>\to\>\>\> \text{A}, \>\> \text{B},\>\> \text{C},\>\> ...,\>\> \text{Z} \end{align*}\]This CFG is much more complicated, and has many production rules. The first line says that we can replace the start symbol $\langle S \rangle$ with $\langle X \rangle$, or $\langle Y \rangle$, or the empty string. In turn, we can replace $\langle X \rangle$ with $\text{A} \langle S \rangle \text{A}$, or $\text{B} \langle S \rangle \text{B}$, or $\text{C} \langle S \rangle \text{C}$, etc. The third line says that we can replace $\langle Y \rangle$ with any letter of the alphabet.

To construct the palindrome $\text{ABCCBA}$ using this CFG, we would do the following:

\[\begin{align*} 1.\>\> &\langle S \rangle \\ 2.\>\> &\langle X \rangle \\ 3.\>\> &\text{A} \langle S \rangle \text{A} \\ 4.\>\> &\text{A} \langle X \rangle \text{A} \\ 5.\>\> &\text{AB} \langle S \rangle \text{BA} \\ 6.\>\> &\text{AB} \langle X \rangle \text{BA} \\ 7.\>\> &\text{ABC} \langle S \rangle \text{CBA} \\ 8.\>\> &\text{ABCCBA} \\ \end{align*}\]To construct the palindrome $\text{TENET}$, we would do the following:

\[\begin{align*} 1.\>\> &\langle S \rangle \\ 2.\>\> &\langle X \rangle \\ 3.\>\> &\text{T} \langle S \rangle \text{T} \\ 4.\>\> &\text{T} \langle X \rangle \text{T} \\ 5.\>\> &\text{TE} \langle S \rangle \text{ET} \\ 6.\>\> &\text{TE} \langle Y \rangle \text{ET} \\ 7.\>\> &\text{TENET} \\ \end{align*}\]HTML can also be described by a context-free grammar^{*}. It is, however, extremely
complicated. One production rule for HTML might look like this:

This rule says that valid HTML can consist of an opening `<p>`

tag and closing
`</p>`

tag, with valid HTML in-between. The CFG would include a similar
production rule for every possible HTML tag. Of course, even this single
production rule is highly oversimplified, and omits many other things which may
be included in a `<p>`

tag, such as attributes (e.g. `<p style="color: red">`

).
Unfortunately, the full CFG for HTML would be too complex to write
down here.

“I'm going to make my own regular expressions! With blackjack! And hookers!”

— A modern programmer implementing a modern regex engine

Regular expressions in modern programming languages are far more powerful than the ones defined in the formal theory of languages. To illustrate, consider the following Ruby regex:

```
/\A
(?<S> (A \g<S> B)?)
\z/x
```

The above regex takes advantage of an advanced feature called *recursive
subexpression calls* to mimic the structure of a CFG. The important parts are
on the second line; let’s go through them step-by-step. The outer most part,
`(?<S> ...)`

, means “create a group called `<S>`

”. The inner contents tell Ruby
what the group `<S>`

consists of. In this case, it consists of `(A \g<S> B)?`

.
That is, it consists of the letter `A`

, followed by the group `<S>`

, followed
by `B`

. The `?`

at the end means that the whole thing is optional, or in other
words, can be replaced with the empty string. To summarize: this regex matches
strings which consist of `<S>`

, where `<S>`

is either `A<S>B`

or the empty
string. This should look familiar: it’s exactly our CFG for $L_1$, written in
a way that can be understood by Ruby’s regex engine! So, this Ruby “regex” is
actually describing a non-regular language!

Similarly, the following Ruby regex mimics the CFG for palindromes we had above:

```
/\A
(?<S> (\g<X> | \g<Y>)?)
(?<X> (?<letter>[A-Z]) \g<S> \k<letter+0>){0}
(?<Y> [A-Z]){0}
\z/x
```

This will match palindromes consisting of capital English letters.

The takeaway from all this is that Ruby regexes are powerful enough to describe
CFGs. And since HTML is a CFG^{*}, you should be able to write a Ruby regex which
matches HTML. I, however, am not bored enough to actually attempt such a thing.
Perhaps you, dear reader, are willing to undertake this task. However, if you
do so, I must warn you that some unfortunate side effects may occur, including
(but not limited to) fits of rage, loss of sanity, and the summoning of Cthulhu
himself.

^{*}Caveat: this should be true at least in theory, given HTML's tree structure. However, in practice, many browsers will be able to parse "invalid" HTML (for example, you can forget to add a closing tag, and in most cases the browser will be perfectly fine with that). Even worse, browsers don't all agree on what valid HTML actually is. Since this blog post is about regexes, and not about HTML, I decided to leave out any discussion about how to actually define HTML.