Regular expressions (regex) are powerful tools for pattern matching and string parsing. While the syntax can sometimes be difficult to understand, it can make many tasks more efficient. In this post, I'm going to demonstrate a feature of regular expressions called capture groups.
A few notes before we get started:
* JavaScript will be used for the examples in this post. The principles of regular expressions should be the same in other languages but some usage details may differ. If you aren't using JavaScript, refer to your language's documentation for specifics.
* Any discussion of overall efficiency with regard to software has to take development time and maintenance into account along with the runtime performance of the code. If runtime performance is your ultimate priority, regular expressions may not always be the best choice, because their performance can vary greatly depending on the characteristics of the input, the regular expression pattern, and the regular expression engine. In terms of development time and maintenance, regular expressions can reduce the amount and complexity of code necessary to accomplish some tasks. This is often a balancing act, and you can test your solutions with a benchmarking app like jsbench.me to be sure they align with your priorities. I also suggest writing unit tests for critical parsers to guard against regressions during future maintenance.
What is a capture group?
A capture group is a part of a regular expression, enclosed in parentheses, whose matched text is reported separately in the result of calling RegExp.prototype.exec, String.prototype.match or String.prototype.matchAll.
Let's start with some basic pattern matching. Say we want to parse a substring containing one or more digits from a pathname like '/items/42'. The basic regular expression for that would be '/\/\d+/g'. This pattern will search a string for all matches (the 'g' or global flag) of a forward slash ('\/', where the preceding backslash is an escape) followed by one or more digits ('\d+'). This works very well if you want to test that a string contains that pattern:
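Here's a minimal sketch of that check. The helper name 'hasSlashDigits' is just something I made up for illustration, and it builds a fresh regex on every call because 'test' on a reused global regex is stateful:

```javascript
// Hypothetical helper: true if the string contains a slash followed by digits.
// The regex literal is evaluated on each call, so every call gets a new
// RegExp object and never inherits a leftover lastIndex from an earlier call.
function hasSlashDigits(str) {
  return /\/\d+/g.test(str);
}

console.log(hasSlashDigits('/items/42'));           // true
console.log(hasSlashDigits('/items/42/options/1')); // true
console.log(hasSlashDigits('/items/abc'));          // false
```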
RegExp.prototype.test returns true for any string that contains a forward slash followed by one or more digits, and false for any other string. Now let's try using RegExp.prototype.exec to get some information about the match:
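A small sketch of what that might look like, using a fresh regex literal for each call (the variable names are my own):

```javascript
// exec returns an array-like match object, or null when nothing matches.
const match = /\/\d+/g.exec('/items/42');

console.log(match[0]);     // '/42' (the matched text)
console.log(match.length); // 1
console.log(match.index);  // 6 (where the match starts in the input)
console.log(match.input);  // '/items/42'

const noMatch = /\/\d+/g.exec('/items/abc');
console.log(noMatch);      // null
```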
'RegExp.prototype.exec' returns an array unless the string doesn't match the regex, in which case it returns 'null'. The array for each match contains one item, which is the string that matched. Additional properties attached to the array describe the 'index' at which the match begins and the original 'input' string. (You can access these properties via subscripting or dot notation, just like the properties of any object.) In cases where the input string contains more than one occurrence of the pattern '/\/\d+/' (e.g. `'/items/42/options/1'`), note that each call returns only one occurrence. Which one you get depends on the 'g' flag: a global regex remembers where its previous match ended in its 'lastIndex' property, and the next call to 'exec' (or 'test') on the same regex object resumes searching from that position. You may omit the 'g' flag if you only care about the first match, but we'll benefit from it in a moment.
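To see how the 'g' flag affects 'exec', here's a short sketch that calls it repeatedly on the same regex object and watches the 'lastIndex' property (the variable names are illustrative):

```javascript
const regex = /\/\d+/g;
const input = '/items/42/options/1';

// Each call resumes from regex.lastIndex and returns a single match.
const first = regex.exec(input);
console.log(first[0], first.index, regex.lastIndex);   // '/42', index 6, lastIndex 9

const second = regex.exec(input);
console.log(second[0], second.index, regex.lastIndex); // '/1', index 17, lastIndex 19

// No matches remain, so exec returns null and resets lastIndex to 0.
const third = regex.exec(input);
console.log(third, regex.lastIndex);                   // null, lastIndex 0
```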
Now let's try 'String.prototype.match':
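Something along these lines, as a rough sketch with the same pattern as before:

```javascript
const regex = /\/\d+/g;

// With the 'g' flag, match returns an array of every matched substring,
// or null when there are no matches.
console.log('/items/42'.match(regex));           // ['/42']
console.log('/items/42/options/1'.match(regex)); // ['/42', '/1']
console.log('/items/abc'.match(regex));          // null
```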
These results include every occurrence of the pattern in each string. If we had not used the 'g' flag on the pattern, we would have gotten the same result as calling 'RegExp.prototype.exec' without the 'g' flag — only the first match would be included, and the array would have additional properties for 'index' and 'input'. You can obtain the best of both approaches by using 'String.prototype.matchAll', which will return the input and index for every match:
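A minimal sketch of the 'matchAll' approach (variable names are my own):

```javascript
const regex = /\/\d+/g;

// matchAll returns an iterator; Array.from turns it into a regular array.
const matches = Array.from('/items/42/options/1'.matchAll(regex));

for (const match of matches) {
  console.log(match[0], match.index, match.input);
}
// Logs '/42' at index 6, then '/1' at index 17, each with the full input string.
```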
There are two things that are important to note:
- We're using 'Array.from' to coerce the result into an array. The return value of 'String.prototype.matchAll' is an iterator but in many cases it's easier to work with an array.
- 'String.prototype.matchAll' requires a global regex, meaning that you must include the 'g' flag.
We could actually work with this result if we wanted to hack around a bit. The following code will extract the digits from each match by replacing the forward slash with an empty string:
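As a sketch, the workaround might look something like this (the name 'ids' is just for illustration):

```javascript
const regex = /\/\d+/g;
const matches = Array.from('/items/42/options/1'.matchAll(regex));

// Each match[0] looks like '/42'; strip the leading slash to leave the digits.
const ids = matches.map((match) => match[0].replace('/', ''));

console.log(ids); // ['42', '1']
```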
This accomplishes our basic task of extracting the digits, but isn't particularly clean. Fortunately, there's a better way. We can use a capture group to isolate the digits. The capture group is indicated by parentheses surrounding the part of the pattern you wish to capture. In this case, that part is the one or more digits specified by '\d+':
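Here's roughly what that regex looks like, with a quick 'exec' call added only to show what the parentheses change:

```javascript
// Parentheses around \d+ capture the digits separately from the slash.
const regex = /\/(\d+)/g;

// Element 0 is the whole match; element 1 is the text captured by the group.
const [whole, digits] = regex.exec('/items/42');
console.log(whole);  // '/42'
console.log(digits); // '42'
```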
If we use 'String.prototype.matchAll' with this regex, we get a second array element for each match, which contains the digits we captured without the forward slash (because the forward slash was not included in the capture group). We can extract the digits simply by accessing the second element in the array for each match:
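A short sketch of that extraction (again, the variable names are mine):

```javascript
const regex = /\/(\d+)/g;
const matches = Array.from('/items/42/options/1'.matchAll(regex));

// match[0] is the full match ('/42'); match[1] is the captured digits ('42').
const ids = matches.map((match) => match[1]);

console.log(ids); // ['42', '1']
```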
Capture groups give you an efficient way to extract substrings that match certain patterns from a larger string. There are many cases in which this will be sufficient, but sometimes the results can be awkward to work with. Maybe you need to relate the values extracted in the previous example to property names, where the first value should be called 'itemId' and the second 'optionId'. This would require more code to create an object containing these properties. One way to do this is to reduce the matches into an object, in which each value is keyed by the property name:
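One possible sketch of that reduction. It assumes the values always appear in the order 'itemId' then 'optionId', and the 'propertyNames' array is a helper I've introduced for illustration:

```javascript
const regex = /\/(\d+)/g;

// Assumption: the first captured value is the item ID, the second the option ID.
const propertyNames = ['itemId', 'optionId'];

const matches = Array.from('/items/42/options/1'.matchAll(regex));

// Key each captured value by the property name at the same position.
const params = matches.reduce((result, match, i) => {
  result[propertyNames[i]] = match[1];
  return result;
}, {});

console.log(params); // { itemId: '42', optionId: '1' }
```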
Again, this works but isn't very graceful. Relying on code to relate the matches to the corresponding property names could be fragile in some cases, depending on the specific regex and string that are being parsed. This is where named capture groups may help.
What is a named capture group?
A named capture group is similar to a regular capture group, but the syntax allows us to embed a name for each capture group in the regex. The syntax for a named capture group with the name 'id' and pattern '\d+' will be '(?<id>\d+)'. The name is placed within angle brackets, preceded by a question mark, and followed by the pattern. If we apply this to our previous example, we can obtain the named values in the 'groups' property on each match:
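A minimal sketch, reusing the same input path as before:

```javascript
// Same pattern as before, but the capture group is now named 'id'.
const regex = /\/(?<id>\d+)/g;
const matches = Array.from('/items/42/options/1'.matchAll(regex));

// Each match exposes its named captures on the 'groups' property.
const ids = matches.map((match) => match.groups.id);

console.log(ids); // ['42', '1']
```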
Now, you might have noticed that this doesn't solve our problem of needing extra code to resolve the values to property names, because the values are still distributed across separate matches and all have the same name. There's still no direct route to convert them to the object we want, which is '{ itemId: '42', optionId: '1' }'. We can change that by redesigning our regular expression. The following example uses a much more specific regex that describes the complete path we expect to match, including named capture groups for 'itemId' and 'optionId':
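Here's a sketch of such a regex. I'm assuming the path always has the shape '/items/<digits>/options/<digits>':

```javascript
// Anchored at both ends, with one uniquely named group per value.
const regex = /^\/items\/(?<itemId>\d+)\/options\/(?<optionId>\d+)$/g;

// With the 'g' flag, match returns the full matches, or null if there are none.
console.log('/items/42/options/1'.match(regex)); // ['/items/42/options/1']
console.log('/items/42'.match(regex));           // null
```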
We can obtain the named values simply by accessing the 'groups' property of the first match:
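As a rough sketch (the destructuring and the 'params' name are my own choices):

```javascript
const regex = /^\/items\/(?<itemId>\d+)\/options\/(?<optionId>\d+)$/g;

// Take the first (and only possible) match from the iterator.
const [firstMatch] = Array.from('/items/42/options/1'.matchAll(regex));

// Copy the named captures into a plain object, or use null if nothing matched.
const params = firstMatch ? { ...firstMatch.groups } : null;

console.log(params.itemId);   // '42'
console.log(params.optionId); // '1'
```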
(It's safe to assume that we can take the first match, if there is one, because the regex is anchored to the beginning and end of the string with '^' and '$'.)
The benefit of using named capture groups in this scenario is that names can be assigned to the values at the same time as they are parsed.
Choose your own adventure
In the previous example, we gained the ability to parse named values from a string in one shot, with no code other than what was necessary to execute the regex and access the result. We also lost the ability to use the generic regular expression we started with, because of the need to assign a unique name to each value. In reality, neither approach is perfect for all situations. Unnamed capture groups enable you to use generic regex patterns because you don't have to worry about providing a unique name for each pattern, but traversing the results can be messy and error-prone. Named capture groups can make parsing much cleaner, but only if the strings you have to parse have a consistent structure. Regular expressions that contain named capture groups may also be harder to understand. So which should you choose? Asking a series of questions may help:
1. Would named capture groups provide a real benefit?
   * If yes, go to 2.
   * If no, go to 3.
2. Do the strings you need to parse have a consistent enough structure to make named capture groups feasible?
   * If yes, go to 4.
   * If no, go to 5.
3. Regular expressions that use unnamed capture groups are usually simpler and easier to understand, so it might be best to use them even if named capture groups might work in your scenario. Evaluate both options.
4. Sounds like you might have found a good use case for named capture groups! Go to 6.
5. Is there a reasonable workaround for the consistency problem? For example, could you design a regex that would cover all your input cases? (Without creating a nightmare for yourself and others?)
   * If yes, go to 6.
   * If no, go to 7.
6. Time to experiment! Test your workaround in comparison with regular capture groups and any other approaches that you have in mind, and choose whichever one works best.
7. Named capture groups are probably out of the question, so see if unnamed capture groups can do the job, or if you need to find a different solution.
Combining and nesting capture groups
Capture groups may be combined and nested. Consider the following regular expression:
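Here's a sketch of what such a regex could look like. I've left off the 'g' flag in this version, since a pattern anchored with '^' and '$' can match at most once:

```javascript
// The outer unnamed group wraps the optional '/options/<digits>' subpath,
// and the named 'optionId' group is nested inside it.
const regex = /^\/items\/(?<itemId>\d+)(\/options\/(?<optionId>\d+))?$/;

console.log(regex.test('/items/42/options/1')); // true
console.log(regex.test('/items/42'));           // true
console.log(regex.test('/items/42/options'));   // false (subpath is incomplete)
```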
This is similar to our previous regex, but a capture group has been added around the '/options' subpath: '(\/options\/(?<optionId>\d+))?'. The question mark after the closing parenthesis makes the capture group optional, which means that this regex will match the path with or without the '/options' subpath. (E.g. '/items/42/options/1' or '/items/42'.)
If we use this regex to match the full path, we get the following result:
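A sketch of that call, using 'String.prototype.match' with a non-global version of the regex (my assumption, since any of the matching methods would report the same groups):

```javascript
const regex = /^\/items\/(?<itemId>\d+)(\/options\/(?<optionId>\d+))?$/;
const match = '/items/42/options/1'.match(regex);

console.log(match.length);          // 4
console.log(match[0]);              // '/items/42/options/1'
console.log(match[1]);              // '42'
console.log(match[2]);              // '/options/1'
console.log(match[3]);              // '1'
console.log(match.groups.itemId);   // '42'
console.log(match.groups.optionId); // '1'
```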
The array contains four elements:
- The match for the full pattern (`'/items/42/options/1'`)
- The match for the 'itemId' named capture group (`'42'`)
- The match for the unnamed capture group around the '/options' subpath (`'/options/1'`)
- The match for the 'optionId' named capture group (`'1'`)
The 'groups' property contains the 'itemId' and 'optionId' values that were parsed from the path.
Now let's match the path without the '/options' subpath:
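The same sketch as above, this time with the shorter path:

```javascript
const regex = /^\/items\/(?<itemId>\d+)(\/options\/(?<optionId>\d+))?$/;
const match = '/items/42'.match(regex);

console.log(match.length);          // 4
console.log(match[0]);              // '/items/42'
console.log(match[1]);              // '42'
console.log(match[2]);              // undefined (optional subpath not present)
console.log(match[3]);              // undefined
console.log(match.groups.itemId);   // '42'
console.log(match.groups.optionId); // undefined
```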
Once again, the array contains four elements for the full pattern match, 'itemId', '/options' subpath and 'optionId', but the '/options' subpath and 'optionId' are both 'undefined' because they're optional and were not present in the input. The 'groups' property contains the 'itemId' and 'optionId' values, but 'optionId' is similarly 'undefined'.
We could also use a named capture group around the '/options' subpath to include it in the groups:
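A sketch with the outer group named 'subpath', run against both paths (the 'withOptions' and 'withoutOptions' names are just for illustration):

```javascript
// The outer group is now named, so it shows up in 'groups' as well.
const regex = /^\/items\/(?<itemId>\d+)(?<subpath>\/options\/(?<optionId>\d+))?$/;

// First example: the '/options' subpath is present.
const withOptions = '/items/42/options/1'.match(regex).groups;
console.log(withOptions.itemId);   // '42'
console.log(withOptions.subpath);  // '/options/1'
console.log(withOptions.optionId); // '1'

// Second example: the '/options' subpath is absent.
const withoutOptions = '/items/42'.match(regex).groups;
console.log(withoutOptions.itemId);   // '42'
console.log(withoutOptions.subpath);  // undefined
console.log(withoutOptions.optionId); // undefined
```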
Again, the 'subpath' group is 'undefined' in the second example because it wasn't present in the input.
The nested capture groups in these examples have enabled the regular expression to match variations on a path and parse the 'itemId' and 'optionId' whether the '/options' subpath is present or not, all in just a few lines of code. Try applying these techniques to simple string parsing tasks so you can get a feel for how they work.
Summary
In this post, we've learned how capture groups can make many string parsing tasks easier, but remember that regular expressions aren't always the right solution. Some use cases that could be awkward or slow with regular expressions may be satisfied perfectly well by manipulating strings with code. Others may call for a more robust solution like pegjs. As always, experiment with all the tools you have at your disposal and see what works best for your application.