Regular Expressions in Python: A Complete Guide
Python's re module: core syntax, API functions (search, findall, sub, split), compiled patterns, and real-world email and log examples for placement coding rounds.
Python’s re module covers four operations on strings: test whether a pattern matches, extract the matching text, substitute matches with new content, and split on a pattern boundary.
Most placement coding rounds that involve strings include at least one problem where the re module is the cleanest solution. The six core functions take about 20 minutes to learn; the real skill is choosing when to use them and when Python’s built-in string methods are enough.
Regex Syntax: The Patterns That Matter Most
A regular expression is a string that describes a pattern. Before touching any re function, you need to read that description language. The Python Regular Expression HOWTO is the authoritative reference; this section covers the patterns that come up in most real problems.
Anchors
Anchors do not match characters. They match positions.
| Anchor | Matches |
|---|---|
| ^ | Start of the string (or start of each line with re.MULTILINE) |
| $ | End of the string (or end of each line with re.MULTILINE) |
| \b | Word boundary (between \w and \W) |
A pattern like ^hello$ matches only the exact string "hello", not "say hello" or "hello there".
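A short sketch of how the three anchors behave. The sample strings and variable names are made up for illustration.

```python
import re

# ^ and $ pin the pattern to both ends of the string
exact = bool(re.search(r"^hello$", "hello"))         # True
embedded = bool(re.search(r"^hello$", "say hello"))  # False

# With re.MULTILINE, ^ and $ also match at each line boundary
lines = "first\nhello\nlast"
per_line = bool(re.search(r"^hello$", lines, re.MULTILINE))  # True
whole_only = bool(re.search(r"^hello$", lines))              # False

# \b keeps "cat" from matching inside "concatenate"
words = re.findall(r"\bcat\b", "cat concatenate cat.")

print(exact, embedded, per_line, whole_only, words)
# True False True False ['cat', 'cat']
```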
Character Classes
| Pattern | Matches |
|---|---|
| [a-z] | Any lowercase ASCII letter |
| [A-Z] | Any uppercase ASCII letter |
| [0-9] | Any ASCII digit |
| [^aeiou] | Any character that is NOT a lowercase vowel (caret inside brackets negates) |
| \d | Any decimal digit. Same as [0-9] for bytes patterns or with re.ASCII; for str patterns it also matches other Unicode digits |
| \D | Any non-digit |
| \w | Any word character: letters, digits, and _ (Unicode-aware for str patterns) |
| \W | Any non-word character |
| \s | Any whitespace: space, tab, newline, carriage return, form feed, vertical tab |
| \S | Any non-whitespace |
| . | Any character except newline (use re.DOTALL to include newline) |
Use raw strings (r"pattern") when writing patterns in Python. Without the r prefix, backslashes need double-escaping: "\\d+" instead of r"\d+". Raw strings are the standard approach.
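To see why the r prefix matters, here is a small sketch using \b, which Python itself interprets as a backspace character in a normal string literal. The variable names are illustrative only.

```python
import re

# In a normal string literal, "\b" is a backspace character (ASCII 8),
# so the regex engine never sees a word-boundary escape
plain = re.findall("\bcat\b", "cat")

# A raw string passes the backslash through to the regex engine untouched
raw = re.findall(r"\bcat\b", "cat")

# Double-escaping also works, but is harder to read
escaped = re.findall("\\bcat\\b", "cat")

print(plain, raw, escaped)
# [] ['cat'] ['cat']
```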
Quantifiers
Quantifiers specify how many times the preceding element can repeat.
| Quantifier | Meaning |
|---|---|
| * | 0 or more |
| + | 1 or more |
| ? | 0 or 1 (also makes other quantifiers non-greedy when appended) |
| {n} | Exactly n times |
| {n,m} | Between n and m times (inclusive) |
| {n,} | At least n times |
All quantifiers are greedy by default: they match as much text as possible. Add ? after any quantifier to make it non-greedy: +?, *?, {n,m}?.
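The difference between greedy and non-greedy matching is easiest to see side by side. This is a minimal sketch on an invented key="value" string.

```python
import re

text = 'key="a" other="b"'

# Greedy: .* runs to the LAST double quote in the string
greedy = re.findall(r'".*"', text)

# Non-greedy: .*? stops at the FIRST closing double quote
lazy = re.findall(r'".*?"', text)

print(greedy)  # ['"a" other="b"']
print(lazy)    # ['"a"', '"b"']

# Bounded repetition is greedy too: {2,4} takes 4 digits when it can
most = re.search(r"\d{2,4}", "123456").group()
least = re.search(r"\d{2,4}?", "123456").group()
print(most, least)  # 1234 12
```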
Groups
- (pattern): capturing group. Stores the matched text and returns it from findall() and group(n).
- (?:pattern): non-capturing group. Groups for quantifier or alternation purposes without storing the match.
- (?P<name>pattern): named capturing group. The match is accessible as match.group('name').
- (?=pattern): lookahead. Asserts the pattern follows at this position without consuming characters.
- (?!pattern): negative lookahead. Asserts the pattern does NOT follow.
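The group types above can be compared on one input. This is a sketch only; the price string and variable names are invented for the example.

```python
import re

text = "price: 100 USD, 250 EUR, 75 USD"

# Two capturing groups: findall returns one tuple per match
pairs = re.findall(r"(\d+) (USD|EUR)", text)
print(pairs)  # [('100', 'USD'), ('250', 'EUR'), ('75', 'USD')]

# Non-capturing group: the alternation is grouped but not returned
amounts = re.findall(r"(\d+) (?:USD|EUR)", text)
print(amounts)  # ['100', '250', '75']

# Lookahead: amounts followed by " USD", without consuming " USD"
usd_only = re.findall(r"\d+(?= USD)", text)
print(usd_only)  # ['100', '75']

# Negative lookahead: whole numbers NOT followed by " USD"
not_usd = re.findall(r"\b\d+\b(?! USD)", text)
print(not_usd)  # ['250']
```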
For character-level string operations that do not need pattern matching, see character classification in Python which covers isupper(), isdigit(), and similar built-ins.
The re Module API: Six Functions
The Python re module documentation lists over a dozen functions, but six cover almost every use case. Here they are with minimal working examples.
re.match() and re.search()
import re
text = "Order placed on 2026-05-11 by user42"
# match() checks at the START of the string only
m = re.match(r"\d{4}-\d{2}-\d{2}", text)
print(m) # None — text does not start with a date
# search() scans the entire string
m = re.search(r"\d{4}-\d{2}-\d{2}", text)
print(m.group()) # 2026-05-11
A common source of bugs with re.match(): it anchors only at the start of the string, not the end. re.match(r"\d+", "123abc") returns a match for "123", not a failure. If you need a full-string match, add $ explicitly (r"^\d+$") or call re.fullmatch(), which also rejects a trailing newline that $ would allow.
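A quick sketch comparing the ways to demand a full-string match. re.fullmatch() is part of the standard re module; the variable names are illustrative.

```python
import re

prefix_only = re.match(r"\d+", "123abc")       # matches "123"
anchored = re.match(r"\d+$", "123abc")         # None
full_fail = re.fullmatch(r"\d+", "123abc")     # None
full_ok = re.fullmatch(r"\d+", "123")          # matches "123"

print(prefix_only.group(), anchored, full_fail, full_ok.group())
# 123 None None 123

# $ also matches just before a trailing newline; fullmatch does not
dollar_newline = bool(re.match(r"\d+$", "123\n"))
full_newline = bool(re.fullmatch(r"\d+", "123\n"))
print(dollar_newline, full_newline)  # True False
```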
re.findall()
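Returns a list of all non-overlapping matches. With no capturing groups, each item is the full matched string; with one group, each item is that group's text; with two or more groups, each item is a tuple of group values.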
import re
log = "Errors: 404 on /api/user, 500 on /api/order, 200 on /health"
# No groups: returns list of matched strings
codes = re.findall(r"\d{3}", log)
print(codes) # ['404', '500', '200']
# One capturing group: returns only the text captured by the group
paths = re.findall(r"\d{3} on ([^\s,]+)", log)
print(paths) # ['/api/user', '/api/order', '/health']
re.finditer()
Returns an iterator of Match objects instead of a list of strings. Use this when you need the position of each match or when the match list could be very large.
import re
text = "call 9876543210 or 9123456789 for support"
for m in re.finditer(r"\b\d{10}\b", text):
    print(f"Found {m.group()} at position {m.start()}")
# Found 9876543210 at position 5
# Found 9123456789 at position 19
re.sub()
Substitutes all matches with a replacement string. The replacement can reference captured groups using \1, \2, or \g<name>.
import re
# Redact 10-digit phone numbers in a log file
text = "Contact: 9876543210, backup: 9123456789"
cleaned = re.sub(r"\b\d{10}\b", "[REDACTED]", text)
print(cleaned) # Contact: [REDACTED], backup: [REDACTED]
# Reorder date format from YYYY-MM-DD to DD/MM/YYYY
date_text = "Invoice date: 2026-05-11"
reformatted = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date_text)
print(reformatted) # Invoice date: 11/05/2026
Pass count=1 to re.sub() to replace only the first occurrence.
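A minimal sketch of count=1, plus one further option: re.sub() also accepts a function as the replacement. The mask helper below is a hypothetical name introduced for this example.

```python
import re

text = "Contact: 9876543210, backup: 9123456789"

# count=1 stops after the first substitution
first_only = re.sub(r"\b\d{10}\b", "[REDACTED]", text, count=1)
print(first_only)  # Contact: [REDACTED], backup: 9123456789

# The replacement may be a function that receives each Match object
def mask(m):
    digits = m.group()
    return "*" * 6 + digits[-4:]  # keep only the last four digits

masked = re.sub(r"\b\d{10}\b", mask, text)
print(masked)  # Contact: ******3210, backup: ******6789
```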
re.split()
Splits a string on every occurrence of the pattern. More flexible than str.split() because the delimiter can itself be a pattern.
import re
# Split on one-or-more whitespace characters (handles tabs, multiple spaces)
parts = re.split(r"\s+", " name age score ")
print(parts) # ['', 'name', 'age', 'score', '']
# Split on commas or semicolons
row = "field1;field2,field3;field4"
parts = re.split(r"[;,]", row)
print(parts) # ['field1', 'field2', 'field3', 'field4']
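Two follow-ups that often come up with re.split(), sketched on invented input: dropping the empty strings produced by leading or trailing delimiters, and limiting the number of splits with maxsplit.

```python
import re

raw = "  name   age\tscore  "

# Leading/trailing whitespace yields empty strings; filter them out
parts = [p for p in re.split(r"\s+", raw) if p]
print(parts)  # ['name', 'age', 'score']

# For plain whitespace, str.split() with no arguments does the same job
same = raw.split()
print(same)  # ['name', 'age', 'score']

# maxsplit=1 splits on the first delimiter only
head, rest = re.split(r"[;,]", "field1;field2,field3;field4", maxsplit=1)
print(head, rest)  # field1 field2,field3;field4
```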
Groups, Named Groups, and Flags
Named Groups
Named groups make patterns with multiple fields easier to maintain. Instead of remembering that group 2 is the month and group 3 is the day, you name them.
import re
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = pattern.search("Invoice date: 2026-05-11")
if m:
    print(m.group("year"))   # 2026
    print(m.group("month"))  # 05
    print(m.group("day"))    # 11
Named groups also work in re.sub() replacement strings via \g<name>:
reformatted = pattern.sub(r"\g<day>/\g<month>/\g<year>", "Invoice date: 2026-05-11")
print(reformatted) # Invoice date: 11/05/2026
Flags
Flags modify how the pattern engine interprets the string.
| Flag | Short form | Effect |
|---|---|---|
| re.IGNORECASE | re.I | Case-insensitive matching |
| re.MULTILINE | re.M | ^ and $ match at each line start/end, not just string start/end |
| re.DOTALL | re.S | . matches any character including newline |
| re.VERBOSE | re.X | Allows whitespace and comments inside the pattern for readability |
Combine flags with the bitwise OR operator: re.IGNORECASE | re.MULTILINE.
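As an illustration of combining flags, the sketch below pairs re.VERBOSE with re.IGNORECASE on an invented date format, then shows re.MULTILINE on its own. The pattern and sample text are assumptions made for this example.

```python
import re

date_pattern = re.compile(
    r"""
    (?P<day>\d{1,2})      # day of month
    \s+
    (?P<month>[a-z]{3})   # three-letter month; IGNORECASE accepts MAY or May
    \s+
    (?P<year>\d{4})       # four-digit year
    """,
    re.VERBOSE | re.IGNORECASE,
)

m = date_pattern.search("Due on 11 MAY 2026")
fields = (m.group("day"), m.group("month"), m.group("year"))
print(fields)  # ('11', 'MAY', '2026')

# MULTILINE: ^ matches at the start of every line
starts = re.findall(r"^\w+", "alpha one\nbeta two", re.MULTILINE)
print(starts)  # ['alpha', 'beta']
```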
re.compile and Performance
Every call to re.search(pattern, text) must first obtain a compiled pattern: the re module compiles the string on first use and looks it up in an internal cache on later calls. For a single call, that overhead is negligible. In a tight loop over a large file, it adds up.
re.compile() parses the pattern once and returns a compiled pattern object. Call .search(), .findall(), .sub(), or .split() directly on that object.
import re
# Compile once
phone_pattern = re.compile(r"\b[6-9]\d{9}\b")
# Reuse in a loop
with open("contacts.txt") as f:
    for line in f:
        m = phone_pattern.search(line)
        if m:
            print(m.group())
Indian mobile numbers begin with a digit from 6 to 9, so [6-9] followed by nine more digits filters out 10-digit numbers outside the mobile range.
When string methods beat regex
Regex carries overhead from pattern parsing and the matching engine. For operations where the delimiter is a fixed string:
- text.split(",") is faster than re.split(r",", text)
- text.replace("foo", "bar") is faster than re.sub(r"foo", "bar", text)
- "pattern" in text is faster than bool(re.search(r"pattern", text)) for literal strings
Reserve regex for cases where the pattern actually varies: optional characters, character classes, quantifiers, or alternatives. For a broader set of Python string manipulation techniques, see string sorting in Python and the Python basic programs collection.
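One way to check the speed claim on your own machine is a rough timeit comparison. This is a sketch, not a benchmark: the input string is invented and absolute timings depend on the Python version and hardware.

```python
import re
import timeit

text = "a,b,c,d," * 1000

# Both approaches produce the same list for a fixed delimiter
same_result = text.split(",") == re.split(r",", text)

# Time 200 runs of each; only the relative difference is meaningful
str_time = timeit.timeit(lambda: text.split(","), number=200)
re_time = timeit.timeit(lambda: re.split(r",", text), number=200)

print(same_result)
print(f"str.split: {str_time:.4f}s  re.split: {re_time:.4f}s")
```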
Backtracking pitfall
Patterns with nested quantifiers on overlapping character classes can trigger catastrophic backtracking, where the engine tries an exponential number of paths. The classic example is (a+)+$ applied to "aaaaaaaaaaaaaaab". The match fails because of the trailing b, but only after the engine has tried every possible way of grouping the a characters. Avoid patterns of the form (X+)+ or (X|Y)+ where X and Y can match the same characters.
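The effect can be observed with a small timing sketch. It uses the nested pattern with a trailing $ so the overall match must fail, and keeps the input short on purpose, since each extra "a" roughly doubles the work. The timed helper is a hypothetical name used only here.

```python
import re
import time

subject = "a" * 18 + "b"  # deliberately short; longer input gets slow fast

def timed(pattern):
    """Return (match result, elapsed seconds) for one re.match call."""
    start = time.perf_counter()
    result = re.match(pattern, subject)
    return result, time.perf_counter() - start

slow_result, slow_time = timed(r"(a+)+$")  # nested quantifier
fast_result, fast_time = timed(r"a+$")     # same strings accepted, no nesting

# Both fail to match; only the time taken differs
print(slow_result, fast_result)
print(f"nested: {slow_time:.4f}s  flat: {fast_time:.6f}s")
```

Rewriting (a+)+ as a+ accepts exactly the same strings, which is why removing the nesting is usually the first fix to try.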
Real-World Patterns: Phone Numbers, Email, Log Lines
Indian phone number extraction
Indian mobile numbers are 10 digits starting with 6, 7, 8, or 9. Some strings include a +91 country code prefix.
import re
# Matches +91XXXXXXXXXX or plain XXXXXXXXXX
phone_pattern = re.compile(r"(?:\+91[-\s]?)?[6-9]\d{9}\b")
samples = [
    "Call us at +91 9876543210 for support",
    "Backup: 9123456789",
    "Invalid: 12345",
]
for s in samples:
    m = phone_pattern.search(s)
    if m:
        print(m.group())
# +91 9876543210
# 9123456789
# (no output for the invalid line)
Basic email format check
A regex can catch structurally wrong email formats. It is not a substitute for sending a verification email.
import re
email_pattern = re.compile(
    r"^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$"
)
# The two well-formed addresses are placeholders on the reserved example.com domain
addresses = ["user@example.com", "bad@", "also.bad", "first.last+tag@mail.example.com"]
for addr in addresses:
    status = "valid format" if email_pattern.match(addr) else "invalid format"
    print(f"{addr}: {status}")
# user@example.com: valid format
# bad@: invalid format
# also.bad: invalid format
# first.last+tag@mail.example.com: valid format
Note: r"^[a-zA-Z0-9._%+\-]+" allows the characters most common in real email addresses. Edge cases in RFC 5321 (quoted strings, IP-address literals) are not handled here. For production validation, use a dedicated library.
Log line parsing
Parsing structured fields from a log line is one of the cleaner regex use cases because the format is fixed and controlled.
import re
log_line = '192.168.1.1 - - [11/May/2026:08:30:00 +0530] "GET /api/data HTTP/1.1" 200 1452'
log_pattern = re.compile(
    r"(?P<ip>\d+\.\d+\.\d+\.\d+)"
    r".*?"
    r'"(?P<method>\w+) (?P<path>\S+)'
    r'.*?"'
    r"\s(?P<status>\d{3})"
    r"\s(?P<size>\d+)"
)
m = log_pattern.search(log_line)
if m:
    print(m.group("ip"))      # 192.168.1.1
    print(m.group("method"))  # GET
    print(m.group("path"))    # /api/data
    print(m.group("status"))  # 200
    print(m.group("size"))    # 1452
Named groups make it easy to add fields or reorder the output without renumbering group references.
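Because every field is named, Match.groupdict() can turn a line into a dictionary in one call. The sketch below wraps that in a hypothetical parse_line helper (not part of the re module) and converts the numeric fields to int.

```python
import re

log_pattern = re.compile(
    r'(?P<ip>\d+\.\d+\.\d+\.\d+).*?"(?P<method>\w+) (?P<path>\S+).*?"'
    r"\s(?P<status>\d{3})\s(?P<size>\d+)"
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    m = log_pattern.search(line)
    if m is None:
        return None
    record = m.groupdict()  # {'ip': ..., 'method': ..., 'path': ..., ...}
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record

log_line = '192.168.1.1 - - [11/May/2026:08:30:00 +0530] "GET /api/data HTTP/1.1" 200 1452'
record = parse_line(log_line)
print(record)
# {'ip': '192.168.1.1', 'method': 'GET', 'path': '/api/data', 'status': 200, 'size': 1452}

bad = parse_line("not a log line")
print(bad)  # None
```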
Regex and LLM Output Parsing
Parsing the output of a language model is pattern matching on free text. Checking whether a response contains a valid JSON block, extracting a structured tag, or confirming a safety prefix exists are all re.search() calls on model output strings. The same patterns from this article apply directly.
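As a sketch of the structured-tag case, the example below pulls a JSON payload out of a response string. The response text and the <answer> tag are assumptions made for illustration; real model output will follow whatever convention your prompt sets.

```python
import json
import re

# Hypothetical model output; the <answer> tag is an assumed convention
response = (
    "The two line items add up as follows.\n"
    '<answer>{"total": 42, "currency": "INR"}</answer>\n'
    "Let me know if you need a breakdown."
)

# DOTALL lets the payload span several lines; .*? stops at the first closing tag
answer_pattern = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

m = answer_pattern.search(response)
payload = json.loads(m.group(1)) if m else None
print(payload)  # {'total': 42, 'currency': 'INR'}

# A response without the tag yields None instead of raising
missing = answer_pattern.search("I cannot help with that.")
print(missing)  # None
```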
Frequently asked questions
What is the difference between re.match() and re.search() in Python?
re.match() checks for a match only at the beginning of the string. re.search() scans the entire string and returns the first match anywhere. If you want to check whether a pattern exists anywhere in the string, use re.search().
How do I make a Python regex case-insensitive?
Pass re.IGNORECASE (or re.I) as the flags argument: re.search(r'python', text, re.IGNORECASE). For compiled patterns, pass it to re.compile(): pattern = re.compile(r'python', re.IGNORECASE).
What does (?:...) mean and how does it differ from (...)?
(?:...) is a non-capturing group. It groups the pattern for quantifier or alternation purposes but does not store the matched text. A plain (...) is a capturing group that stores the match and returns it in findall() results and match.group(n) calls. Use (?:...) when you only need the grouping, not the captured value.
When should I use re.compile() instead of calling re.search() directly?
Use re.compile() when the same pattern is applied more than once, especially inside a loop over many strings. Holding the compiled pattern object skips the cache lookup that the module-level functions perform on every call, and it gives the pattern a readable name. For a one-off search in a short script, re.search(pattern, text) is fine.
Why does my greedy regex match more text than expected?
Quantifiers like *, +, and ? are greedy by default and match as much text as possible. To make them non-greedy (match as little as possible), append a question mark: *?, +?, ??. For example, r'<.*>' matches the entire string '<b>bold</b><i>italic</i>', while r'<.*?>' matches only '<b>'.
Is regex good for validating email addresses in Python?
A basic regex like r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' catches obviously invalid formats. For production systems, use the email-validator library or Python's email.headerregistry module instead, as RFC 5321 edge cases are complex enough that hand-rolled regex misses them.