Visualization of Regular Expressions

Graphrex creates diagrams that show the structure and flow of your regular expressions. It supports all the constructs of Java regular expressions, so it can construct a visualization of any valid regular expression. This page describes the visualizations in detail.



Simple Graphrex diagrams

This diagram represents the regular expression foo. The green arrow is inserted by Graphrex to mark the beginning of the regular expression, and the red square is inserted to mark its end. Whatever is in between is a visual representation of the regular expression itself. In this example, the regular expression will match the sequence of letters f o o and Graphrex concatenates them into a string box.

Here’s a slightly more complex example, a regular expression foo|bar that matches either of the strings foo or bar.  The match that is attempted first (foo) is placed uppermost.

Visualization of a more complex regular expression

Let’s look at an example of a more complex regular expression. This is the Graphrex diagram for the regular expression
\([0-9]{3}\)|[0-9]{3}-?
(we have left out the start and finish icons).

This diagram introduces several more constructs that are described more fully in later sections.  These include:

  • Optional elements, in this case the dash;
  • Character classes (sets), in this case the class that consists of the digits from 0 to 9;
  • Repetition, specifically the requirement that three consecutive digits be matched;
  • And finally the alternative matches that have already been introduced.

The arrows show the path that the regular expression recognizer will take as it attempts to match the input. In the first (upper) alternative, it will first try to recognize a left parenthesis. If it succeeds it will next try to match a sequence of three digits. Then it will look for a right parenthesis. If the upper path doesn’t match, the recognizer will backtrack and try the lower path, seeking a sequence of three digits followed by an optional dash. The structure of the diagram directly reflects the structure of the regular expression, and the lines show the path that the matcher will use in trying to match the input.
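The two paths can also be tried outside Graphrex with java.util.regex. The following is a minimal sketch only; the class name AreaCodeDemo and the helper isAreaCode are illustrative names, not part of Graphrex.

```java
import java.util.regex.Pattern;

public class AreaCodeDemo {
    // The expression from the text: "(ddd)", or "ddd" followed by an optional dash.
    static final Pattern AREA_CODE = Pattern.compile("\\([0-9]{3}\\)|[0-9]{3}-?");

    static boolean isAreaCode(String input) {
        // matches() succeeds only if one alternative consumes the whole input.
        return AREA_CODE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAreaCode("(555)")); // upper path
        System.out.println(isAreaCode("555-"));  // lower path, dash present
        System.out.println(isAreaCode("555"));   // lower path, dash absent
        System.out.println(isAreaCode("(555"));  // neither path succeeds
    }
}
```

The last input fails on the upper path at the missing right parenthesis, and on the lower path at the left parenthesis, which is not a digit.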

Visual equivalents of regular expression elements

Graphrex provides a visual representation of all the possible elements of a regular expression. We’ll cover them in the order in which they are listed in the documentation of the Pattern class in the Java regex package, where the syntax of regular expressions is described. For more information about each one, consult the resources in our page about regular expressions.

Characters

Sequences of ordinary characters are concatenated into strings, as in the first two examples above. In the third example we showed that an escaped character \( in a string is shown without the preceding backslash. This is for simplicity. Octal and unicode characters are shown in a unicode font.

Here’s the visualization of a regular expression that matches a sequence of four unicode characters \u0445\u043b\u0435\u0431 (they spell the Russian word for “bread”). Graphrex uses the system unicode font to display the characters.

Non-printable characters such as \u0000, \n and \t are displayed using code point conversion to show a visual representation. For example, the visualization of the regular expression \tfirst class\n is shown here. The non-printable characters (horizontal tab and line feed) are represented visually. A space is also shown by a special character so that it is more easily seen.
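For reference, here is a hedged sketch of how the two expressions in this section behave in Java itself. The class and method names (CharacterDemo, isBread, isTabbedLine) are made up for the illustration.

```java
import java.util.regex.Pattern;

public class CharacterDemo {
    // Regex escapes need a doubled backslash inside a Java string literal.
    static final Pattern BREAD = Pattern.compile("\\u0445\\u043b\\u0435\\u0431");
    static final Pattern TABBED = Pattern.compile("\\tfirst class\\n");

    static boolean isBread(String input) {
        return BREAD.matcher(input).matches();
    }

    static boolean isTabbedLine(String input) {
        return TABBED.matcher(input).matches();
    }

    public static void main(String[] args) {
        // The input spells the same four letters using compiler-level escapes.
        System.out.println(isBread("\u0445\u043b\u0435\u0431"));
        // The tab and the line feed must both be present in the input.
        System.out.println(isTabbedLine("\tfirst class\n"));
        System.out.println(isTabbedLine("first class"));
    }
}
```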

Character classes

Graphrex visualizes character classes and enables the members of the class to be viewed.

Here’s how Graphrex shows the regular expression [a-z][0-9], which matches two consecutive characters, one from the class (set) of letters a through z, and the next from the set of digits 0 through 9. You can optionally open a character class box using the drop-down arrow to see a scrollable list of the members of the class and their Unicode code points.

The character class operators used in regular expressions are negation, union, intersection, and combinations of these. The operators are shown explicitly in Graphrex visualizations, and each operator is a button which, when pressed by the user, drops down a scrolling list of the membership of the character set that results from the application of the operator.

Take the negation operator as an example. This diagram shows the visualization of the character set [^cde], which matches any character except c, d and e. Here the user has pressed the button for the negate (^) symbol, and this drops down the blue scrolling list of characters that are in the class formed by negating (taking the complement of) the class [cde].

In regular expressions there is no explicit union operator for character classes. Instead, this operation is implicit in the structure of the class expression. Take the example [a-zA-Z0-9]. This means the union of the three classes [a-z], [A-Z] and [0-9]. In other words, this regular expression matches either a character from the range a-z or one from the range A-Z, or one from the range 0-9. The diagram shows how Graphrex provides an explicit representation of the union of the three classes, by bracketing them with a union operator button.

One of the three constituent character classes is shown open (it has a white background), and the result of the union operation is also shown open (blue background). As  before, by opening the class the user can see the unicode equivalent of each character, and can scroll to see all the characters in the class.

Intersection (&&) is an explicit operator in regular expressions, and is visualized by Graphrex in a similar way to the union operator.

Combinations of operators. As a final example of character class operations, consider the regular expression
[a-z0-9&&[^m-p]]

which includes union, intersection and negation operations. Graphrex visualizes this and other complex combinations of character classes.
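The membership lists that Graphrex shows can be spot-checked in Java by testing single characters against each class. This is a small sketch under that assumption; CharClassDemo and its helper in are illustrative names only.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    static final Pattern NEGATED = Pattern.compile("[^cde]");
    static final Pattern UNION = Pattern.compile("[a-zA-Z0-9]");
    static final Pattern COMBINED = Pattern.compile("[a-z0-9&&[^m-p]]");

    // True if the single character c is a member of the given character class.
    static boolean in(Pattern charClass, char c) {
        return charClass.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        System.out.println(in(NEGATED, 'a'));  // not one of c, d, e
        System.out.println(in(NEGATED, 'd'));  // excluded by the negation
        System.out.println(in(UNION, 'Q'));    // member of the A-Z range
        System.out.println(in(COMBINED, '7')); // digit, untouched by [^m-p]
        System.out.println(in(COMBINED, 'n')); // removed by the intersection
    }
}
```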

Predefined character classes

A very common predefined character class is the class of all characters, represented by a dot. For example, the regular expression
.\d
contains two predefined character classes: any character, and any digit. Its visualization in Graphrex is shown in the figure. As with other character classes, the list of characters in the class can be opened using the drop down arrow. The character class \d is also shown. The names of predefined character classes are shown in parentheses on a gray background.

By default the dot character class does not include line terminators. You can see that in the preceding figure the characters \n (\u000a) and \r (\u000d) are not included in the scrollable list. But if the DOTALL mode is set, or the embedded flag expression (?s) (discussed later) is used, then line terminators are included and the name of the class is “(Any_DOTALL)”. Similarly, if the mode UNIX_LINES or the flag expression (?d) is set, then \n is the only line terminator that is recognized, so it is the only one excluded from the class, and the name of the predefined character class will be “(Any_UNIX_LINES)”.
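The effect of the two modes on the dot can be sketched in Java as follows. The helper dotMatches and the class DotDemo are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

public class DotDemo {
    // flagPrefix is an embedded flag expression such as "(?s)", or "" for the defaults.
    static boolean dotMatches(String flagPrefix, String oneChar) {
        return Pattern.matches(flagPrefix + ".", oneChar);
    }

    public static void main(String[] args) {
        System.out.println(dotMatches("", "\n"));     // default: line feed excluded
        System.out.println(dotMatches("", "\r"));     // default: carriage return excluded
        System.out.println(dotMatches("(?s)", "\n")); // DOTALL: line terminators included
        System.out.println(dotMatches("(?d)", "\r")); // UNIX_LINES: carriage return included
        System.out.println(dotMatches("(?d)", "\n")); // UNIX_LINES: line feed still excluded
    }
}
```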

Other character classes: POSIX, java.lang.Character and Unicode

Graphrex handles the visualization of the other kinds of predefined character classes in a similar way. For each there is a character class box with the name of the class in parentheses, and a drop-down scrollable list showing all the members of the class.

Boundary matchers

Boundary matchers include ^ (the beginning of a line) and $ (the end of a line) as well as others like \A (the beginning of the input). They appear in Graphrex diagrams with an icon of a traffic cone and the name of the matcher. The figure shows an example: the regular expression ^foo$, which will match input that contains only the word “foo” (or, in multiline mode, any line that contains only that word).
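As a rough illustration of how ^ and $ behave in Java with and without multiline mode, consider the sketch below. BoundaryDemo and containsFooLine are illustrative names, not Graphrex features.

```java
import java.util.regex.Pattern;

public class BoundaryDemo {
    // flags is 0 for the defaults, or a Pattern constant such as Pattern.MULTILINE.
    static boolean containsFooLine(String input, int flags) {
        return Pattern.compile("^foo$", flags).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsFooLine("foo", 0));
        System.out.println(containsFooLine("foo bar", 0));
        // Without MULTILINE, ^ and $ anchor to the whole input, not to each line.
        System.out.println(containsFooLine("bar\nfoo\nbaz", 0));
        System.out.println(containsFooLine("bar\nfoo\nbaz", Pattern.MULTILINE));
    }
}
```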

The drop-down list on a boundary match box is reserved for additional functionality in a future version of Graphrex.

Quantifiers: greedy, reluctant and possessive

Quantifiers control how many times a given element must match. They may be greedy (the element matches as many times as possible, and gives matches back only if the rest of the expression fails), reluctant (the element matches as few times as possible, and takes more only if the rest of the expression fails), or possessive (the element matches as many times as possible and gives nothing back, even if a later match fails and the matcher has to backtrack). Graphrex has a visual representation of these cases for each of the types of quantifiers.

Let’s look in detail at the regular expression <.*?/> which will match a pair of angle brackets with any number of characters between them. The match is made reluctant by the presence of the question mark so that the .* cannot consume the closing angle bracket. Reading from the top of the diagram, and remembering that the convention in Graphrex’s diagrams is that the matcher tries paths starting at the top of the diagram, we see that if the regex engine matches a left angle bracket, the next path it will try is the one that goes straight on to the closing angle bracket. If that match succeeds we are done. Otherwise the matcher backtracks to the preceding matched character and continues forward again, but this time takes the alternative path (the one just after the ‘<’) and tries to match a member of the predefined set Any, i.e. any character. This match must succeed if there’s a character in the input, in which case the matcher continues and tries to match the closing angle bracket. If that doesn’t work it backtracks in a special way to the previous match (i.e. Any), and this time takes the path with rounded corners and open arrows (which represents the Kleene star operation), which takes it back to try Any again.

Table of quantifiers

Visualizations of all the quantifiers used in regular expressions, for different levels of greed (greedy, reluctant and possessive) are shown in the following table:

Number of matches                                   Greedy     Reluctant   Possessive
Once or not at all                                  X?         X??         X?+
Zero or more times                                  X*         X*?         X*+
One or more times                                   X+         X+?         X++
Exactly n times (n=3)                               X{3}       X{3}?       X{3}+
At least n times (n=3)                              X{3,}      X{3,}?      X{3,}+
At least n but not more than m times (n=3, m=10)    X{3,10}    X{3,10}?    X{3,10}+
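The difference between the three levels of greed is easiest to see by running the same input through three variants of one expression. The sketch below uses the simplified expressions <.*>, <.*?> and <.*+>, which are chosen for illustration; QuantifierDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        // Greedy: .* takes everything, then gives back just enough for the final '>'.
        System.out.println(firstMatch("<.*>", input));
        // Reluctant: .*? takes as little as possible, so the match stops at the first '>'.
        System.out.println(firstMatch("<.*?>", input));
        // Possessive: .*+ takes everything and gives nothing back, so '>' can never match.
        System.out.println(firstMatch("<.*+>", input));
    }
}
```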

Logical operators

The three logical operators supported by regular expressions are concatenation XY, alternation X|Y and capture (X). In this section we’ll see how Graphrex helps you to visualize these operations.

Concatenation

In a regular expression, elements are concatenated by simply following one another in the expression; there is no explicit concatenation operator.

When elements are concatenated in a regular expression, Graphrex simply connects them with an arrow. The arrow shows the order in which the matcher will look for occurrences of the elements in the input. Here we see the concatenation of a digit followed by zero or more word characters, i.e. the regular expression \d\w*.

Concatenation of characters is a special case. When consecutive elements are simple characters, Graphrex shows them together as a string. Here we see the visualization of the regular expression foo.

Alternation

Alternation means that the matcher tries the first path; if that fails it tries the next alternative path, then the next one, and so on until there are no more alternatives, at which point the match fails.

A regular expression of this kind is X|Y|Z|last. Graphrex shows the alternative paths stacked one above the other. The matcher starts with the first one, which is at the top. This is a simple example — alternative paths are often complex and it may help to zoom out to see the overall structure of the diagram in the Graphrex editor.
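Because alternatives are tried in order, the order in which they are written can change the result. Here is a minimal sketch of that in Java; the expressions foo|foobar and foobar|foo are extra examples chosen for the illustration, and AlternationDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The last alternative is reached only after X, Y and Z have failed.
        System.out.println(firstMatch("X|Y|Z|last", "last"));
        // The first alternative that succeeds wins, even if a later one is longer.
        System.out.println(firstMatch("foo|foobar", "foobar"));
        System.out.println(firstMatch("foobar|foo", "foobar"));
    }
}
```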

Capture

We are following the Javadocs for the Pattern class in calling capture a logical operator. It might fit better with our discussion of other groups, in a later section.

Like other groups in Graphrex, a capturing group is visualized as an enclosing box with a gray title bar and an icon that represents its function. In this case the icon is the “record” symbol, since loosely speaking a capturing group remembers what it matches. We have illustrated this with the regular expression (a|b)(1|2), which has two consecutive capturing groups.

Capturing groups are numbered by the regular expression processor, and Graphrex shows their numbers in its diagrams. The exception is capturing group zero, which is the whole expression, and which Graphrex doesn’t show in any special way.
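The group numbers that appear in the diagram correspond to the indexes used by java.util.regex.Matcher. A small sketch, with CaptureDemo and groups as illustrative names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureDemo {
    static final Pattern PAIR = Pattern.compile("(a|b)(1|2)");

    // Returns group 0 (the whole match) followed by the numbered groups,
    // or an empty array if the input does not match.
    static String[] groups(String input) {
        Matcher m = PAIR.matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("b1");
        for (int i = 0; i < g.length; i++) {
            System.out.println("group " + i + " = " + g[i]);
        }
    }
}
```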

Back references

A back reference is a reference to the input that was captured in a previous capturing group. The regular expression (a|b|c)\1{3} is shown here. It will match the input aaaa, bbbb or cccc.
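The behavior of this back reference can be checked with a short Java sketch. BackrefDemo and isQuad are names invented for the example.

```java
import java.util.regex.Pattern;

public class BackrefDemo {
    // One captured letter followed by three repetitions of whatever was captured.
    static final Pattern QUAD = Pattern.compile("(a|b|c)\\1{3}");

    static boolean isQuad(String input) {
        return QUAD.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isQuad("aaaa"));
        System.out.println(isQuad("cccc"));
        // The back reference repeats the captured text, not the group's pattern.
        System.out.println(isQuad("abca"));
        System.out.println(isQuad("aaa"));
    }
}
```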

Quotation and character escapes

A quoted character in a regular expression will have its regular meaning, not any special meaning that might be assigned to it by the regular expression language. Graphrex shows such characters as members of a string. The example here shows the regular expression abc\(\Q.*\E which has a quoted single character \( and a quoted range \Q.*\E. Graphrex concatenates all the resulting characters with the preceding abc to form a single string.
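In Java the quoted expression matches the literal text abc(.* and nothing else. The sketch below shows this, together with Pattern.quote, the standard library method for quoting a whole string; QuoteDemo and its fields are illustrative names.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // The expression from the text, written as a Java string literal.
    static final Pattern QUOTED = Pattern.compile("abc\\(\\Q.*\\E");
    // An equivalent pattern built by quoting the special characters with Pattern.quote.
    static final Pattern VIA_QUOTE = Pattern.compile("abc" + Pattern.quote("(.*"));

    static boolean isLiteral(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLiteral(QUOTED, "abc(.*"));    // the literal characters
        System.out.println(isLiteral(QUOTED, "abc(xyz"));   // .* is not a wildcard here
        System.out.println(isLiteral(VIA_QUOTE, "abc(.*"));
    }
}
```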

Non-capturing groups

Non-capturing groups come in several flavors. A simple non-capturing group, in this example (?:a|b), is shown in a Graphrex diagram as a group box with an empty title bar. Non-capturing groups do not have an icon or a group number.

A match flags box in a Graphrex diagram

However, a non-capturing group can specify match flags that influence how the matching engine compares the input to the regular expression. If the non-capturing group has no content, the flags are honored from that point on in the regular expression.

A non-capturing group that contains only match flags, with no other content, is shown here. It represents the regular expression fragment (?xs-u). The user can open it by clicking on the down arrow symbol. When it is closed, it shows just the flag expression, in this case xs-u. When it is opened, the individual flags are shown with their current state.

The state of a match flag at any point in a regular expression can have one of three values: on (checked box), off (red X in the box) or not set (gray box). The last value means that the flag state is either the default or is determined by another match flag earlier in the regular expression.

We have already seen how the state of some flags influences what the user sees in a Graphrex diagram. Specifically, the flags s (‘.’ matches line terminators) and d (Unix lines mode) control which line terminator characters are included in the character set “any character” that is represented by a single dot in the regular expression. The effect of the x flag (permit whitespace and comments) is described next.

Whitespace and comments

When the match flag x is set, the regular expression matcher ignores whitespace in a regular expression, and enables comments. Comments begin with the # character and run to the end of a line.

The effect of the x flag in a Graphrex diagram is shown here for the following regular expression:

(?x)
#Note that the space is ignored
match this
(?-x)and this

When the x flag is in effect the comment is recognized and placed in a special inline box (consecutive comments go in the same box), while the spaces and newlines are ignored. Turning off the x flag restores the default behavior.
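To see what input this expression accepts, it can be compiled in Java with the line breaks written as \n in a string literal. This is a sketch under that assumption; CommentsDemo and its members are illustrative names.

```java
import java.util.regex.Pattern;

public class CommentsDemo {
    // The four-line expression from the text, joined with line feeds.
    static final String REGEX =
            "(?x)\n"
          + "#Note that the space is ignored\n"
          + "match this\n"
          + "(?-x)and this";

    static boolean accepts(String input) {
        return Pattern.matches(REGEX, input);
    }

    public static void main(String[] args) {
        // While x is on, the space in "match this" and the line breaks are ignored.
        // After (?-x) the space in "and this" is an ordinary character again.
        System.out.println(accepts("matchthisand this"));
        System.out.println(accepts("match thisand this"));
    }
}
```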

Match flags in a non-capturing group

A non-capturing group can include match flags that apply only to the sub-expression within the group. For example, the expression (?ds-u:X) means “match X with the flags d and s turned on and the flag u turned off.” Graphrex visualizes this construct by putting the match flags in the title bar of the group. As before, the match flags can be opened to show the state of individual flags as check boxes. Opening the match flags expands the title bar.

This fragment of a Graphrex diagram shows a non-capturing group to which match flags are applied. The regular expression is (?i:X).
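The scoping of the flag can be sketched in Java by putting ordinary text after the group. The expression (?i:foo)bar is an extra example chosen for the illustration, and ScopedFlagDemo and accepts are hypothetical names.

```java
import java.util.regex.Pattern;

public class ScopedFlagDemo {
    // Case-insensitive matching applies to "foo" only; "bar" stays case-sensitive.
    static final Pattern SCOPED = Pattern.compile("(?i:foo)bar");

    static boolean accepts(String input) {
        return SCOPED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("foobar"));
        System.out.println(accepts("FOObar")); // inside the group, case is ignored
        System.out.println(accepts("fooBAR")); // outside the group, it is not
    }
}
```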

Non-capturing groups with lookaround

Lookaround comes in two flavors: lookahead and lookbehind. Each flavor can be either positive or negative. Graphrex distinguishes these four different types with icons in the title bar of the non-capturing group.

An example of a non-capturing group with negative lookbehind is shown here. The arrow points to the left, which indicates “behind” (“ahead” points to the right). The red strikeout indicates negative (positive has no strikeout). The regular expression for this example is (?<!X).
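The four kinds of lookaround can be compared with a short Java sketch. The concrete expressions use a and b in place of X and are chosen for illustration; LookaroundDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("(?<!a)b", "ab")); // negative lookbehind: b must not follow a
        System.out.println(firstMatch("(?<!a)b", "cb"));
        System.out.println(firstMatch("(?<=a)b", "ab")); // positive lookbehind: b must follow a
        // Lookaround is zero-width: the b is checked but is not part of the match.
        System.out.println(firstMatch("a(?=b)", "ab"));  // positive lookahead
        System.out.println(firstMatch("a(?!b)", "ab"));  // negative lookahead
    }
}
```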

Independent non-capturing groups

Independent non-capturing groups in regular expressions are also called atomic groups. Here we see a piece of a Graphrex diagram that displays the independent non-capturing group whose regular expression is (?>X).
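What makes the group independent is that, once it has matched, the matcher will not backtrack into it to try another of its alternatives. A minimal Java sketch of this, using the illustrative expressions a(?>bc|b)c and a(?:bc|b)c and the hypothetical names AtomicDemo and accepts:

```java
import java.util.regex.Pattern;

public class AtomicDemo {
    static boolean accepts(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Ordinary group: after "bc" leaves nothing for the final c,
        // the matcher backtracks into the group and tries "b" instead.
        System.out.println(accepts("a(?:bc|b)c", "abc"));
        // Atomic group: the group keeps "bc", so the final c cannot match.
        System.out.println(accepts("a(?>bc|b)c", "abc"));
        System.out.println(accepts("a(?>bc|b)c", "abcc"));
    }
}
```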
