Pattern Matching Reference
A comprehensive guide to regex syntax, patterns, and best practices across different flavors and implementations. Master the art of pattern matching.
Essential regex syntax at a glance. Your go-to cheat sheet for common patterns and metacharacters.
. Any character except newline
\d Digit [0-9]
\D Not digit [^0-9]
\w Word character [a-zA-Z0-9_]
\W Not word character
\s Whitespace [ \t\r\n\f\v]
\S Not whitespace
[abc] Any of a, b, or c
[^abc] Not a, b, or c
[a-z] Range a through z
* 0 or more (greedy)
+ 1 or more (greedy)
? 0 or 1 (greedy)
{n} Exactly n times
{n,} n or more times
{n,m} Between n and m times
*? 0 or more (lazy)
+? 1 or more (lazy)
?? 0 or 1 (lazy)
*+ Possessive (PCRE/Java)
^ Start of string/line
$ End of string/line
\b Word boundary
\B Not word boundary
\A Start of string (always)
\Z End of string (always)
\z Absolute end (PCRE/Java/.NET; Python 3.14+, earlier Python uses \Z)
\G Start of search (Perl/PCRE)
(...) Capturing group
(?:...) Non-capturing group
(?<name>...) Named group (PCRE/.NET)
(?P<name>...) Named group (Python)
\1, \2, ... Backreference
\k<name> Named backreference
(?>...) Atomic group
(?=...) Positive lookahead
(?!...) Negative lookahead
(?<=...) Positive lookbehind
(?<!...) Negative lookbehind
i Case-insensitive
g Global (match all)
m Multiline (^ $ match lines)
s Dotall (. matches newline)
x Extended (ignore whitespace)
u Unicode
Foundational regex concepts: literals, metacharacters, and character classes.
Most characters match themselves literally. Regular text is interpreted character by character:
cat Matches "cat"
hello Matches "hello"
123 Matches "123"
Special characters with reserved meanings. Must be escaped with backslash to match literally:
. ^ $ * + ? { } [ ] \ | ( )
To match these literally, prefix with backslash:
\. Matches a literal dot
\$ Matches a dollar sign
\(hello\) Matches "(hello)" literally
\* Matches a literal asterisk
The dot . matches any single character except newline:
. Matches any single character except newline
a.c Matches "abc", "a9c", "a c", etc.
..... Matches any 5 characters
With the /s flag (dotall mode), the dot matches newlines too.
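A minimal Python sketch of the dot's newline behavior. The sample string is made up for illustration, and `re.DOTALL` is Python's spelling of the /s flag:

```python
import re

sample = "a\nc"  # hypothetical input: 'a', a newline, 'c'

# By default the dot refuses to match the newline between 'a' and 'c'.
default_match = re.search(r"a.c", sample)

# With DOTALL (the /s flag) the dot matches the newline as well.
dotall_match = re.search(r"a.c", sample, re.DOTALL)

print(default_match)             # None
print(dotall_match is not None)  # True
```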
Define a set of characters to match. Use square brackets to create a class:
[abc] Matches 'a', 'b', or 'c'
[aeiou] Matches any vowel
[0-9] Matches any digit
[a-z] Matches any lowercase letter
[a-zA-Z] Matches any letter
[a-zA-Z0-9] Matches any alphanumeric character
Use ^ at the start to negate:
[^abc] Matches anything except 'a', 'b', or 'c'
[^0-9] Matches anything except digits
[^\s] Matches any non-whitespace character
\d Digit [0-9]
\D Non-digit [^0-9]
\w Word character [a-zA-Z0-9_]
\W Non-word character [^a-zA-Z0-9_]
\s Whitespace [ \t\r\n\f\v]
\S Non-whitespace [^ \t\r\n\f\v]
\h Horizontal whitespace (PCRE)
\v Vertical whitespace (PCRE)
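The classes above can be tried out with a short Python sketch. The sample string is hypothetical. One Python-specific caveat worth knowing: on `str` patterns, `\d` and `\w` also match non-ASCII digits and letters unless `re.ASCII` is set, so they are broader than the `[0-9]` and `[a-zA-Z0-9_]` equivalents listed here.

```python
import re

sample = "Order #42: 3 apples, 7 pears"  # hypothetical input

digits = re.findall(r"\d", sample)        # every single digit
vowels = re.findall(r"[aeiou]", sample)   # lowercase vowels only
symbols = re.findall(r"[^\w\s]", sample)  # neither word character nor whitespace

print(digits)   # ['4', '2', '3', '7']
print(vowels)   # ['e', 'a', 'e', 'e', 'a']
print(symbols)  # ['#', ':', ',']

# Python caveat: \d matches any Unicode decimal digit unless re.ASCII is given.
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
unicode_digit = re.fullmatch(r"\d", arabic_three) is not None
ascii_digit = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None
print(unicode_digit, ascii_digit)  # True False
```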
Control how many times an element repeats. Understanding greedy vs lazy is essential.
| Quantifier | Meaning | Example | Matches |
|---|---|---|---|
| * | 0 or more | a* | "", "a", "aa", "aaa", ... |
| + | 1 or more | a+ | "a", "aa", "aaa", ... |
| ? | 0 or 1 | a? | "", "a" |
| {n} | Exactly n | a{3} | "aaa" |
| {n,} | n or more | a{2,} | "aa", "aaa", "aaaa", ... |
| {n,m} | Between n and m | a{2,4} | "aa", "aaa", "aaaa" |
Match as much as possible. Quantifiers are greedy by default:
<.+> In "<p>Hello</p>", matches entire "<p>Hello</p>"
\d+ In "12345", matches "12345"
Match as little as possible. Add ? after quantifier:
<.+?> In "<p>Hello</p>", matches "<p>" and "</p>" separately
\d+? In "12345" with global flag, matches "1", "2", "3", "4", "5"
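The greedy and lazy examples above can be checked with a few lines of Python. This is only a sketch using `re.findall`, which plays the role of the global flag:

```python
import re

html = "<p>Hello</p>"

greedy = re.findall(r"<.+>", html)          # .+ runs to the last '>'
lazy = re.findall(r"<.+?>", html)           # .+? stops at the first '>'
lazy_digits = re.findall(r"\d+?", "12345")  # each match is as short as possible

print(greedy)       # ['<p>Hello</p>']
print(lazy)         # ['<p>', '</p>']
print(lazy_digits)  # ['1', '2', '3', '4', '5']
```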
Common lazy quantifiers:
*? 0 or more (lazy)
+? 1 or more (lazy)
?? 0 or 1 (lazy)
{n,}? n or more (lazy)
{n,m}? Between n-m (lazy)
Extract text between quotes:
// Greedy - matches from first to last quote
"(.*)" In '"Hello" and "World"' → '"Hello" and "World"'
// Lazy - matches each quoted string separately
"(.*?)" In '"Hello" and "World"' → '"Hello"' and '"World"'
Available in PCRE, Java, and Python 3.11+ (.NET has no possessive quantifiers; use an atomic group there). Never backtrack once matched:
*+ 0 or more (possessive)
++ 1 or more (possessive)
?+ 0 or 1 (possessive)
{n,}+ n or more (possessive)
{n,m}+ Between n-m (possessive)
\d++\d never matches because possessive ++ consumes all digits without backtracking. Use for performance optimization and preventing catastrophic backtracking.
Match positions, not characters. Zero-width assertions that anchor patterns to specific locations.
^ Start of string (or line in multiline mode)
$ End of string (or line in multiline mode)
Examples:
^Hello Matches "Hello" only at start of string
world$ Matches "world" only at end of string
^Hello$ Matches entire string "Hello" (nothing before or after)
Without /m flag:
^ Matches start of entire string
$ Matches end of entire string
With /m flag:
^ Matches start of string AND start of each line
$ Matches end of string AND end of each line
Line 1
Line 2
Line 3
/^Line/ Matches "Line 1" only (1 match)
/^Line/m Matches "Line" at start of each line (3 matches)
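The same comparison in Python, as a small sketch. `re.MULTILINE` corresponds to the /m flag:

```python
import re

text = "Line 1\nLine 2\nLine 3"

without_m = re.findall(r"^Line", text)             # ^ anchors only at the string start
with_m = re.findall(r"^Line", text, re.MULTILINE)  # ^ also anchors after each newline

print(len(without_m))  # 1
print(len(with_m))     # 3
```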
\A Start of string (always, ignores multiline mode)
\Z End of string, or just before a final newline (always; in Python, \Z is the absolute end)
\z Absolute end of string (PCRE/Java/.NET; Python 3.14+)
\b Word boundary (between \w and \W)
\B Not word boundary
Examples:
\bcat\b Matches "cat" in "the cat sat" but not in "concatenate"
\Bcat\B Matches "cat" in "concatenate" but not in "the cat sat"
\bcat Matches "cat" at start of word: "cat", "caterpillar"
cat\b Matches "cat" at end of word: "cat", "bobcat"
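A short Python sketch that exercises these boundary patterns. The word lists are illustrative only:

```python
import re

whole_word = re.compile(r"\bcat\b")
inside_word = re.compile(r"\Bcat\B")

in_sentence = whole_word.search("the cat sat") is not None  # True
in_concat = whole_word.search("concatenate") is not None    # False
inner_hit = inside_word.search("concatenate") is not None   # True
inner_miss = inside_word.search("the cat sat") is not None  # False

words = "cat caterpillar bobcat"
starts_with_cat = re.findall(r"\bcat\w*", words)  # 'cat' at the start of a word
ends_with_cat = re.findall(r"\w*cat\b", words)    # 'cat' at the end of a word

print(starts_with_cat)  # ['cat', 'caterpillar']
print(ends_with_cat)    # ['cat', 'bobcat']
```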
A word boundary occurs between \w and \W, between the start of the string and \w, and between \w and the end of the string.
Extract and reference matched substrings. Essential for complex pattern matching and replacements.
Parentheses create capturing groups that store matched text:
(abc) Captures "abc"
(\d{4}) Captures 4 digits
([a-z]+) Captures one or more lowercase letters
(\d{4})-(\d{2})-(\d{2})
Matching "2026-02-09" captures:
Group 1: "2026"
Group 2: "02"
Group 3: "09"
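In Python, the three groups can be read back as follows. This is a minimal sketch; the reordering into DD/MM/YYYY is just an illustration of using captures in a replacement:

```python
import re

date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

match = date_pattern.fullmatch("2026-02-09")
year, month, day = match.groups()  # groups are numbered left to right from 1

# Captures can also be reused in a replacement string via \1, \2, \3.
reordered = date_pattern.sub(r"\3/\2/\1", "2026-02-09")

print(year, month, day)  # 2026 02 09
print(match.group(1))    # 2026
print(reordered)         # 09/02/2026
```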
Access captured groups:
In replacement strings: $1, $2 (JavaScript, Perl) or \1, \2 (sed, vim)
In code: match.groups(), match[1], etc.
Use (?:...) when you need grouping but don't need to capture:
(?:abc)+ Matches "abc", "abcabc", "abcabcabc", ... (doesn't capture)
(?:\d{4})- Matches year followed by dash, doesn't capture year
(?P<name>pattern) Define named group
(?P=name) Backreference to named group
# Example
(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})
# Access: match.group('year'), match.group('month'), match.group('day')
(?<name>pattern) Define named group
\k<name> Backreference to named group
// JavaScript Example
/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
// Access: match.groups.year, match.groups.month, match.groups.day
Reference previously captured groups within the same pattern:
\1, \2, \3, ... Reference groups 1, 2, 3, ...
\k<name> Reference named group (PCRE/JavaScript)
(?P=name) Reference named group (Python)
Examples:
(\w+)\s+\1 Matches repeated words: "the the", "hello hello"
(["'])(.*?)\1 Matches quoted strings with same quote type
<([a-z]+)>.*?</\1> Matches HTML tags: <p>...</p>, <div>...</div>
\b(\w+)\s+\1\b Matches "word word", "test test"
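Two of the backreference patterns above, sketched in Python. The sample sentences are invented for the demonstration:

```python
import re

# Repeated words: \1 must match the same text that group 1 captured.
sentence = "this is is a test test case"
repeated = re.findall(r"\b(\w+)\s+\1\b", sentence)

# Quoted strings: \1 forces the closing quote to be the same kind as the opening one.
quoted = re.search(r"""(["'])(.*?)\1""", 'say "it\'s" now')

print(repeated)         # ['is', 'test']
print(quoted.group(2))  # it's
```

Because the closing quote must equal group 1, the apostrophe inside the double-quoted string does not end the match early.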
Prevent backtracking once matched (PCRE, Java, .NET):
(?>pattern) Atomic group
(?>a+)ab Never matches "aaab" (the atomic group keeps all the 'a's and cannot give one back)
a+ab Matches "aaab" (a+ backtracks to release one 'a')
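Python accepts the native `(?>...)` syntax only from version 3.11. The sketch below therefore uses a well-known emulation instead, not the native construct: a lookahead captures the text, then a backreference consumes it. Since a lookahead is not re-entered once it has succeeded, the captured text cannot be shortened by backtracking.

```python
import re

text = "aaab"

# Ordinary greedy quantifier: a+ first takes "aaa", then gives one 'a' back
# so that the literal "ab" can still match.
backtracking = re.search(r"a+ab", text)

# Emulated atomic group: (?=(a+))\1 behaves like (?>a+).
# The lookahead captures "aaa", \1 consumes it, and nothing is given back.
atomic_like = re.search(r"(?=(a+))\1ab", text)

print(backtracking.group(0))  # aaab
print(atomic_like)            # None
```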
Zero-width assertions that match positions without consuming characters. Essential for complex validation.
Assert that pattern CAN be matched ahead:
\d(?=px) Matches digit followed by "px" (doesn't include "px")
In "10px", matches "0" (use \d+(?=px) to match "10")
q(?=u) Matches "q" followed by "u"
In "question", matches "q"
Assert that pattern CANNOT be matched ahead:
\d(?!px) Matches digit NOT followed by "px"
In "10px 20em", matches "1", "2", and "0"
q(?!u) Matches "q" NOT followed by "u"
In "Iraq", matches "q"
Assert that pattern CAN be matched behind:
(?<=\$)\d+ Matches digits preceded by "$"
In "$100", matches "100"
(?<=[a-z])[A-Z] Matches uppercase letter after lowercase
In "testCase", matches "C"
Assert that pattern CANNOT be matched behind:
(?<!\$)\d+ Matches digits NOT preceded by "$"
In "$10 20", matches "0" and "20" (use (?<![$\d])\d+ to match only "20")
(?<![a-z])[A-Z] Matches uppercase letter NOT after lowercase
In "TestCase", matches "T"
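Lookaround results are easy to misjudge by eye, so here is a small Python sketch that runs the four kinds of assertion on short inputs. Note that a single `\d` matches one digit at a time, and that a negative lookbehind can still succeed in the middle of a number.

```python
import re

ahead_pos = re.findall(r"\d(?=px)", "10px")        # digit directly before "px"
ahead_neg = re.findall(r"\d(?!px)", "10px 20em")   # digits not directly before "px"
behind_pos = re.findall(r"(?<=\$)\d+", "$100")     # digits directly after "$"
behind_neg = re.findall(r"(?<!\$)\d+", "$10 20")   # digit runs not directly after "$"

# Also excluding a preceding digit stops the match from starting mid-number.
behind_neg_strict = re.findall(r"(?<![$\d])\d+", "$10 20")

print(ahead_pos)          # ['0']
print(ahead_neg)          # ['1', '2', '0']
print(behind_pos)         # ['100']
print(behind_neg)         # ['0', '20']
print(behind_neg_strict)  # ['20']
```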
Minimum 8 chars, requires uppercase, lowercase, digit, special:
^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[^A-Za-z0-9\s]).{8,}$
Breakdown:
^ Start of string
(?=.*[A-Z]) Must contain uppercase
(?=.*[a-z]) Must contain lowercase
(?=.*\d) Must contain digit
(?=.*[^A-Za-z0-9\s]) Must contain special character
.{8,} At least 8 characters total
$ End of string
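A minimal sketch of using this pattern in Python. The function name `is_strong` and the sample passwords are hypothetical. It uses `fullmatch` because `$` in Python also matches just before a trailing newline, which is usually not wanted in validation.

```python
import re

STRONG = re.compile(r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[^A-Za-z0-9\s]).{8,}$")

def is_strong(password: str) -> bool:
    """Return True if the whole string satisfies every lookahead and the length rule."""
    return STRONG.fullmatch(password) is not None

print(is_strong("Passw0rd!"))   # True
print(is_strong("password1!"))  # False (no uppercase letter)
print(is_strong("Password1"))   # False (no special character)
print(is_strong("Pw0!"))        # False (too short)

# With match(), $ tolerates one trailing newline; fullmatch() does not.
print(STRONG.match("Passw0rd!\n") is not None)  # True
print(is_strong("Passw0rd!\n"))                 # False
```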
(?<=\$)\d+(?:\.\d{2})?
Matches: "$10.53" → "10.53"
(?<=[a-z])(?=[A-Z])
Position between: "testCase" → "test" + "Case"
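Because this pattern matches a position rather than text, it is typically used to split or to insert. A small Python sketch follows; splitting on an empty match needs Python 3.7 or later, and the identifier `parseHttpResponse` is just an example:

```python
import re

CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")

# Split at the zero-width position between a lowercase and an uppercase letter.
parts = CAMEL_BOUNDARY.split("testCase")

# Insert an underscore at each boundary to convert camelCase to snake_case.
snake = CAMEL_BOUNDARY.sub("_", "parseHttpResponse").lower()

print(parts)  # ['test', 'Case']
print(snake)  # parse_http_response
```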
Alphanumeric, 3-16 chars, not all digits:
^(?!\d+$)[a-zA-Z0-9]{3,16}$
(?<=\d{4}) OK (fixed width: 4)
(?<=\d+) May fail (variable width)
JavaScript (ES2018+) and .NET support variable-width lookbehind. Python's built-in re module requires fixed-width lookbehind; the third-party regex module does not.
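Whether a given lookbehind is accepted can be checked by compiling it. The sketch below does this with Python's built-in `re` module, which rejects variable-width lookbehind; the helper name `compiles` is hypothetical.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if the re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

fixed_ok = compiles(r"(?<=\d{4})x")   # fixed width: exactly 4 characters
variable_ok = compiles(r"(?<=\d+)x")  # variable width: rejected by re

print(fixed_ok)     # True
print(variable_ok)  # False
```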
Modifiers that change regex behavior. Syntax varies by language.
/pattern/flags
new RegExp('pattern', 'flags')
/hello/i
/\d+/g
/^line/m
import re
re.compile(r'pattern', re.FLAG)
re.IGNORECASE or re.I
re.MULTILINE or re.M
re.DOTALL or re.S
re.VERBOSE or re.X
re.UNICODE or re.U
# Example
re.compile(r'hello', re.I | re.M)
/pattern/imsxg
m/pattern/imsxg
/hello/i
/\d+/g
Set flags within the pattern itself:
(?i)case-insensitive Turn on case-insensitive
(?-i)case-sensitive Turn off case-insensitive
(?i:pattern) Case-insensitive for this group only
(?ims) Multiple flags
Examples:
(?i)hello Matches "hello", "Hello", "HELLO"
hello(?i)world Only "world" is case-insensitive
(?i:hello) world Only "hello" is case-insensitive
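A short Python sketch of the scoped and leading inline flags. Python supports the `(?i:...)` group form from 3.6; the test strings are illustrative:

```python
import re

# Scoped flag: only "hello" is case-insensitive, " world" stays case-sensitive.
scoped = re.compile(r"(?i:hello) world")
scoped_hit = scoped.fullmatch("HELLO world") is not None
scoped_miss = scoped.fullmatch("HELLO WORLD") is not None

# Leading flag: applies to the whole pattern.
whole = re.compile(r"(?i)hello")
whole_hit = whole.fullmatch("HeLLo") is not None

print(scoped_hit, scoped_miss, whole_hit)  # True False True
```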
Different regex engines have varying features and syntax. Know your environment.
POSIX BRE (grep, sed default):
\(group\) Grouping
\{n,m\} Quantifiers
\| Alternation (GNU extension)
POSIX ERE (grep -E, sed -E, awk):
(group) Grouping
{n,m} Quantifiers
| Alternation
| Feature | BRE | ERE | PCRE | JavaScript | Python | Java |
|---|---|---|---|---|---|---|
| Basic matching | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Character classes | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Lazy quantifiers | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
| Possessive quantifiers | ✗ | ✗ | ✓ | ✗ | ✓ (3.11+) | ✓ |
| Non-capturing groups | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
| Named groups | ✗ | ✗ | ✓ | ✓ (ES2018+) | ✓ | ✓ |
| Lookahead | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
| Lookbehind | ✗ | ✗ | ✓ | ✓ (ES2018+) | ✓ | ✓ |
| Atomic groups | ✗ | ✗ | ✓ | ✗ | ✓ (3.11+) | ✓ |
| Unicode | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
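When in doubt about the engine at hand, one option is to probe it directly. The sketch below does this for Python's `re` module by trying to compile one tiny pattern per feature. The probe patterns and the `supported` helper are assumptions made for this illustration, and the results depend on the interpreter version.

```python
import re
import sys

# Hypothetical probe patterns, one per feature row.
PROBES = {
    "lazy quantifier": r"a+?",
    "possessive quantifier": r"a++",
    "non-capturing group": r"(?:a)",
    "named group": r"(?P<n>a)",
    "lookahead": r"a(?=b)",
    "lookbehind": r"(?<=a)b",
    "atomic group": r"(?>a)",
}

def supported(pattern: str) -> bool:
    """Return True if this interpreter's re module compiles the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

support = {feature: supported(pattern) for feature, pattern in PROBES.items()}
modern = sys.version_info >= (3, 11)  # possessive/atomic syntax arrived in 3.11

for feature, ok in support.items():
    print(f"{feature:22s} {'yes' if ok else 'no'}")
```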
Battle-tested regex patterns for everyday tasks. Copy, paste, adapt.
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
Breakdown:
[a-zA-Z0-9._%+-]+ - Local part (username)
@ - Literal @
[a-zA-Z0-9.-]+ - Domain name
\. - Literal dot
[a-zA-Z]{2,} - TLD (2+ letters)
^(https?:\/\/)?([\w-]+\.)+[a-zA-Z]{2,}(\/[\w\-\.~:/?#\[\]@!\$&'()\*\+,;=%.]*)?$
Breakdown:
(https?:\/\/)? - Optional protocol
([\w-]+\.)+ - Domain parts (subdomains)
[a-zA-Z]{2,} - TLD
(\/...)? - Optional path, query, fragment
Match URLs in text:
https?:\/\/[^\s]+
IPv4:
^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
Breakdown:
25[0-5] - 250-255
2[0-4][0-9] - 200-249
[01]?[0-9][0-9]? - 0-199
Simpler (less strict):
^(\d{1,3}\.){3}\d{1,3}$
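The difference between the strict and the simpler pattern shows up on out-of-range octets. A minimal Python sketch; the constant names and sample addresses are chosen for illustration:

```python
import re

IPV4_STRICT = re.compile(
    r"^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)
IPV4_LOOSE = re.compile(r"^(\d{1,3}\.){3}\d{1,3}$")

samples = ["192.168.1.1", "256.1.1.1", "1.2.3"]
strict = [IPV4_STRICT.fullmatch(s) is not None for s in samples]
loose = [IPV4_LOOSE.fullmatch(s) is not None for s in samples]

print(strict)  # [True, False, False]
print(loose)   # [True, True, False]
```

The loose pattern accepts "256.1.1.1" because it only checks the number of digits, not the range of each octet.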
^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$
// YYYY-MM-DD (ISO 8601)
^(0[1-9]|1[0-2])\/(0[1-9]|[12][0-9]|3[01])\/\d{4}$
// MM/DD/YYYY
^(0[1-9]|[12][0-9]|3[01])-(0[1-9]|1[0-2])-\d{4}$
// DD-MM-YYYY
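These patterns check the shape of a date, not the calendar: day 30 or 31 is accepted for every month. One way to close that gap, sketched here in Python under the assumption that ISO input is wanted, is to follow the regex with `datetime.strptime`. The helper name `is_real_date` is hypothetical.

```python
import re
from datetime import datetime

ISO_DATE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$")

def is_real_date(value: str) -> bool:
    """Check the shape with the regex, then the calendar with strptime."""
    if ISO_DATE.fullmatch(value) is None:
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True

shape_only = ISO_DATE.fullmatch("2026-02-30") is not None

print(shape_only)                  # True (the regex alone accepts February 30)
print(is_real_date("2026-02-30"))  # False
print(is_real_date("2026-02-09"))  # True
print(is_real_date("2026-13-01"))  # False
```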
^(\+1[-.]?)?(\(?\d{3}\)?[-.]?)?\d{3}[-.]?\d{4}$
Matches: 555-123-4567, (555)123-4567, 555.123.4567, +1-555-123-4567, 5551234567 (spaces are not accepted)
International E.164 format:
^\+[1-9]\d{1,14}$
^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[^A-Za-z0-9\s]).{8,}$
// Min 8 chars, 1 uppercase, 1 lowercase, 1 digit, 1 special
^(?=.*[A-Za-z])(?=.*\d).{8,}$
// Min 8 chars, at least 1 letter and 1 digit
^\S{6,20}$
// No whitespace, 6-20 characters
^[a-zA-Z0-9_]{3,16}$
// Alphanumeric, 3-16 characters, underscores allowed
^[a-zA-Z][a-zA-Z0-9_-]{2,15}$
// Start with letter, alphanumeric + underscore + hyphen, 3-16 chars
<([a-z]+)([^>]*)>(.*?)<\/\1>
// HTML tags with backreference
<!--.*?-->
// HTML comments
<[^>]*>
// Strip HTML tags
\.([a-zA-Z0-9]+)$
// File extension
^\/(?:[^\/\0]+\/)*[^\/\0]*$
// Unix absolute path
^[a-zA-Z]:\\(?:[^\\/:*?"<>|\r\n]+\\)*[^\\/:*?"<>|\r\n]*$
// Windows path
^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$
// 3 or 6 digit hex color
^[a-z0-9]+(?:-[a-z0-9]+)*$
// Lowercase letters, numbers, hyphens
Optimize regex for speed and prevent catastrophic backtracking. Security matters.
Certain patterns cause exponential time complexity, making them vulnerable to ReDoS (Regular Expression Denial of Service) attacks.
(a+)+ Dangerous
(a*)* Dangerous
(a+)* Dangerous
(a|a)* Dangerous
(a|b)*a Can be dangerous with long non-matching input
^(a+)+$
Input: "aaaaaaaaaaaaaaaaaaaX" (19 a's, then X)
The engine tries exponential combinations:
- 19 groups of 1 'a' each
- 1 group of 19 'a's
- 9 groups of 2, 1 of 1
- ... (2^19 combinations)
Result: Exponential time O(2^n)
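The growth can be observed directly with a small timing sketch in Python. The input sizes are deliberately kept small, since each extra 'a' roughly doubles the work for the nested pattern; absolute timings will vary by machine, so no expected numbers are shown. The helper `timed_match` is a name made up for this example.

```python
import re
import time

def timed_match(pattern: str, text: str):
    """Return (match_or_None, elapsed_seconds) for one anchored match attempt."""
    start = time.perf_counter()
    result = re.match(pattern, text)
    return result, time.perf_counter() - start

timings = []
for n in (10, 14, 18):  # kept small on purpose
    text = "a" * n + "X"  # the trailing X forces the match to fail
    nested_result, nested_seconds = timed_match(r"^(a+)+$", text)
    flat_result, flat_seconds = timed_match(r"^a+$", text)
    timings.append((n, nested_result, flat_result, nested_seconds, flat_seconds))
    print(f"n={n:2d}  nested: {nested_seconds:.4f}s  flat: {flat_seconds:.6f}s")
```

Both patterns reject every input here; only the time needed to reach that answer differs.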
Red flags:
(a+)+, (a*)*
(a|a)*, (a|ab)*
(a?)*
(abc|abd)*
// Vulnerable
^(a+)+$
// Fixed with atomic group
^(?>a+)+$
// Vulnerable
^(a+)+$
// Fixed with possessive quantifier (PCRE/Java)
^a++$
// Vulnerable
^(a+)+$
// Fixed - simplified
^a+$
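// Needlessly complex - quantified capturing groups (adding an inner +, as in ([a-zA-Z0-9]+)+, would make it vulnerable)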
^([a-zA-Z0-9])+@([a-zA-Z0-9])+$
// Fixed - removed unnecessary groups
^[a-zA-Z0-9]+@[a-zA-Z0-9]+$
// Vulnerable - unbounded repetition
.*
// Better - bounded
.{0,100}
// Better - specific character class
[a-zA-Z0-9]{0,100}
// Slow - engine must try every position
\d{4}-\d{2}-\d{2}
// Faster - engine knows to start at beginning
^\d{4}-\d{2}-\d{2}
// Slower - . matches anything, more backtracking
.*?@.*
// Faster - specific character classes
[^@]+@[^@]+
// Slower
(a|abc)
// Faster - longer/more specific first
(abc|a)
// Slower - unnecessary capturing
(\d+)\.(\d+)\.(\d+)\.(\d+)
// Faster - if you don't need captures
(?:\d+)\.(?:\d+)\.(?:\d+)\.(?:\d+)
// Fastest - no groups if structure is simple
\d+\.\d+\.\d+\.\d+
// Slower - greedy, lots of backtracking
<.*>
// Faster - lazy
<.*?>
// Fastest - specific negated class
<[^>]*>
import re
# Slower - re looks the pattern up in its cache on every iteration
for line in lines:
    if re.match(r'\d+', line):
        ...
# Faster - compile once, reuse the pattern object
pattern = re.compile(r'\d+')
for line in lines:
    if pattern.match(line):
        ...
When to use RE2:
Advanced techniques, debugging strategies, and real-world usage patterns.
# Basic regex (BRE)
grep 'pattern' file.txt
# Extended regex (ERE)
grep -E 'pattern' file.txt
egrep 'pattern' file.txt
# Perl regex (PCRE) - if supported
grep -P 'pattern' file.txt
# Case-insensitive
grep -i 'pattern' file.txt
# Invert match (lines NOT matching)
grep -v 'pattern' file.txt
# Show line numbers
grep -n 'pattern' file.txt
# Recursive search
grep -r 'pattern' directory/
# Show only matching part
grep -o 'pattern' file.txt
# Basic substitution (BRE)
sed 's/pattern/replacement/' file.txt
# Extended regex (ERE)
sed -E 's/pattern/replacement/' file.txt
# Global replacement (all occurrences per line)
sed 's/pattern/replacement/g' file.txt
# Case-insensitive (GNU sed)
sed 's/pattern/replacement/i' file.txt
# Backreferences
sed 's/\(word\)/[\1]/g' file.txt # BRE
sed -E 's/(word)/[\1]/g' file.txt # ERE
# Delete matching lines
sed '/pattern/d' file.txt
# Match lines with ERE
awk '/pattern/' file.txt
# Pattern in condition
awk '$1 ~ /pattern/' file.txt
# Negation
awk '$1 !~ /pattern/' file.txt
# Substitution
awk '{gsub(/pattern/, "replacement"); print}' file.txt
import re
# Basic matching
match = re.search(r'pattern', text)
if match:
    print(match.group(0))
# Find all matches
matches = re.findall(r'\d+', text)
# Substitution
result = re.sub(r'pattern', 'replacement', text)
# Split
parts = re.split(r'[,;]', text)
# Compiled regex (better performance)
pattern = re.compile(r'\d+')
for line in lines:
    match = pattern.search(line)
# Named groups
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
match = re.search(pattern, '2026-02-09')
print(match.group('year')) # '2026'
print(match.groupdict()) # {'year': '2026', ...}
// Sample input (illustrative)
const text = 'Released 2026-02-09, build 42';
// Literal syntax: flags follow the closing slash
const literal = /pattern/gi;
// Constructor (useful for dynamic patterns)
const dynamic = new RegExp('pattern', 'gi');
// Test (returns boolean)
if (/\d+/.test(text)) {
  console.log('Contains digits');
}
// Match
const first = text.match(/\d+/);
if (first) {
  console.log(first[0]); // Matched text
  console.log(first.index); // Position
}
// Global match
const matches = text.match(/\d+/g); // Array of all matches (null if none)
// Replace
const replaced = text.replace(/pattern/g, 'replacement');
// Replace with function
const doubled = text.replace(/\d+/g, (m) => String(Number(m) * 2));
// Named groups (ES2018+)
const dateRegex = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const dateMatch = text.match(dateRegex);
console.log(dateMatch.groups.year); // '2026'
Essential Tools:
# Python - verbose mode with comments
import re
pattern = re.compile(r'''
^ # Start of string
(?P<year>\d{4}) # Year (4 digits)
- # Separator
(?P<month>\d{2}) # Month (2 digits)
- # Separator
(?P<day>\d{2}) # Day (2 digits)
$ # End of string
''', re.VERBOSE)
Test Incrementally:
Start simple, add complexity step by step:
patterns = [
r'\d{4}', # Just year
r'\d{4}-\d{2}', # Year-month
r'\d{4}-\d{2}-\d{2}', # Full date
r'^\d{4}-\d{2}-\d{2}$', # With anchors
]
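One way to put the incremental approach into practice is to run each stage against a string that should match and one that should not. A minimal Python sketch; both sample strings are made up for the demonstration:

```python
import re

patterns = [
    r'\d{4}',                 # Just year
    r'\d{4}-\d{2}',           # Year-month
    r'\d{4}-\d{2}-\d{2}',     # Full date
    r'^\d{4}-\d{2}-\d{2}$',   # With anchors
]

good = "2026-02-09"          # should pass every stage
noisy = "on 2026-02-09"      # should fail once anchors are added

good_results = [re.search(p, good) is not None for p in patterns]
noisy_results = [re.search(p, noisy) is not None for p in patterns]

print(good_results)   # [True, True, True, True]
print(noisy_results)  # [True, True, True, False]
```

The first stage at which the results diverge from expectations points to the part of the pattern that needs attention.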