Regular Expressions in Python: A Complete Guide

Python's re module: core syntax, API functions (search, findall, sub, split), compiled patterns, and real-world email and log examples for placement coding rounds.

By FACE Prep Team 7 min read

Python’s re module covers four operations on strings: test whether a pattern matches, extract the matching text, substitute matches with new content, and split on a pattern boundary.

Placement coding rounds that involve strings often include a problem where the re module is the cleanest solution. The six core functions take about 20 minutes to learn; the real skill is choosing when to use them and when Python’s built-in string methods are enough.

Regex Syntax: The Patterns That Matter Most

A regular expression is a string that describes a pattern. Before touching any re function, you need to be able to read that description language. The Python Regular Expression HOWTO is the authoritative reference; this section covers the patterns that come up in most real problems.

Anchors

Anchors do not match characters. They match positions.

| Anchor | Matches |
|--------|---------|
| ^ | Start of the string (or start of each line with re.MULTILINE) |
| $ | End of the string (or end of each line with re.MULTILINE) |
| \b | Word boundary (between \w and \W) |

A pattern like ^hello$ matches only the exact string "hello", not "say hello" or "hello there".
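
As a quick check of how the three anchors behave, here is a minimal sketch. The sample strings are made up for illustration.

```python
import re

# ^ and $ pin the pattern to the start and end of the string
exact = [bool(re.search(r"^hello$", s)) for s in ("hello", "say hello", "hello there")]
print(exact)  # [True, False, False]

# \b stops "cat" from matching inside "concatenate"
words = re.findall(r"\bcat\b", "cat concatenate cat.")
print(words)  # ['cat', 'cat']

# With re.MULTILINE, ^ also matches right after each newline
starts = re.findall(r"^\w+", "alpha one\nbeta two", flags=re.MULTILINE)
print(starts)  # ['alpha', 'beta']
```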

Character Classes

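| Pattern | Matches |
|---------|---------|
| [a-z] | Any lowercase ASCII letter |
| [A-Z] | Any uppercase ASCII letter |
| [0-9] | Any ASCII digit |
| [^aeiou] | Any character that is NOT a lowercase vowel (caret inside brackets negates) |
| \d | Any decimal digit; same as [0-9] on ASCII text (on str patterns it also matches other Unicode digits unless re.ASCII is set) |
| \D | Any non-digit |
| \w | Any word character: letters, digits, and _ |
| \W | Any non-word character |
| \s | Any whitespace: space, tab, newline |
| \S | Any non-whitespace |
| . | Any character except newline (use re.DOTALL to include newline) |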

Use raw strings (r"pattern") when writing patterns in Python. Without the r prefix, backslashes need double-escaping: "\\d+" instead of r"\d+". Raw strings are the standard approach.
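
The classes above can be tried on a single string. This is a small sketch; the sample text is invented for the example.

```python
import re

sample = "Room 42B, floor_3 (east wing)"

print(re.findall(r"\d+", sample))      # ['42', '3']
print(re.findall(r"[A-Z]", sample))    # ['R', 'B']
print(re.findall(r"\w+", sample))      # ['Room', '42B', 'floor_3', 'east', 'wing']
print(re.findall(r"[^\w\s]", sample))  # [',', '(', ')']

# The raw string and the double-escaped string are the same pattern
same_pattern = r"\d+" == "\\d+"
print(same_pattern)  # True
```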

Quantifiers

Quantifiers specify how many times the preceding element can repeat.

| Quantifier | Meaning |
|------------|---------|
| * | 0 or more |
| + | 1 or more |
| ? | 0 or 1 (also makes other quantifiers non-greedy when appended) |
| {n} | Exactly n times |
| {n,m} | Between n and m times (inclusive) |
| {n,} | At least n times |

All quantifiers are greedy by default: they match as much text as possible. Add ? after any quantifier to make it non-greedy: +?, *?, {n,m}?.
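
The difference is easiest to see side by side. A minimal sketch, using a made-up key="value" string:

```python
import re

text = 'key="alpha" other="beta"'

# Greedy: .* runs to the LAST closing quote
greedy = re.findall(r'".*"', text)
print(greedy)  # ['"alpha" other="beta"']

# Non-greedy: .*? stops at the FIRST closing quote it can
lazy = re.findall(r'".*?"', text)
print(lazy)  # ['"alpha"', '"beta"']

# Counted quantifiers follow the same rule
longest = re.search(r"\d{2,4}", "123456").group()
shortest = re.search(r"\d{2,4}?", "123456").group()
print(longest, shortest)  # 1234 12
```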

Groups

  • (pattern) — capturing group. Stores the matched text and returns it from findall() and group(n).
  • (?:pattern) — non-capturing group. Groups for quantifier or alternation purposes without storing the match.
  • (?P<name>pattern) — named capturing group. The match is accessible as match.group('name').
  • (?=pattern) — lookahead. Asserts the pattern follows at this position without consuming characters.
  • (?!pattern) — negative lookahead. Asserts the pattern does NOT follow.
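
The group types listed above can be compared on one input. This is an illustrative sketch; the key=value line and the currency codes are invented sample data.

```python
import re

line = "total=150 INR, tax=27 INR, discount=10 USD"

# Capturing groups: findall returns the captured parts as tuples
pairs = re.findall(r"(\w+)=(\d+)", line)
print(pairs)  # [('total', '150'), ('tax', '27'), ('discount', '10')]

# Non-capturing group: used only for alternation, whole match is returned
units = re.findall(r"\b(?:INR|USD)\b", line)
print(units)  # ['INR', 'INR', 'USD']

# Named groups: read fields by name instead of by number
m = re.search(r"(?P<key>\w+)=(?P<value>\d+)", line)
print(m.group("key"), m.group("value"))  # total 150

# Lookahead: digits followed by " INR"; the " INR" itself is not consumed
inr_amounts = re.findall(r"\d+(?= INR)", line)
print(inr_amounts)  # ['150', '27']

# Negative lookahead: the \b stops the engine from backtracking to "15"
non_inr = re.findall(r"\d+\b(?! INR)", line)
print(non_inr)  # ['10']
```

The \b in the last pattern matters: without it, \d+ can give back a digit and match "15", because "15" is followed by "0" rather than " INR".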

For character-level string operations that do not need pattern matching, see character classification in Python which covers isupper(), isdigit(), and similar built-ins.

The re Module API: Six Functions

The Python re module documentation lists over a dozen functions, but six cover almost every use case. Here they are with minimal working examples.

re.match() and re.search()

import re

text = "Order placed on 2026-05-11 by user42"

# match() checks at the START of the string only
m = re.match(r"\d{4}-\d{2}-\d{2}", text)
print(m)  # None — text does not start with a date

# search() scans the entire string
m = re.search(r"\d{4}-\d{2}-\d{2}", text)
print(m.group())  # 2026-05-11

A common source of bugs with re.match(): it anchors only at the start of the string, not the end. re.match(r"\d+", "123abc") returns a match for "123", not a failure. If the whole string has to match, use re.fullmatch(r"\d+", text) or anchor the end explicitly: r"^\d+$".
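
A short sketch of the three ways to check a string, including re.fullmatch() from the standard library. It also shows one difference that is easy to miss: $ still matches just before a trailing newline, while fullmatch() does not.

```python
import re

prefix_only = re.match(r"\d+", "123abc")
print(prefix_only.group())  # 123

anchored = re.match(r"\d+$", "123abc")
print(anchored)  # None

whole = re.fullmatch(r"\d+", "123abc")
print(whole)  # None

# $ tolerates one trailing newline; fullmatch does not
dollar_newline = bool(re.match(r"\d+$", "123\n"))
full_newline = bool(re.fullmatch(r"\d+", "123\n"))
print(dollar_newline, full_newline)  # True False
```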

re.findall()

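Returns a list of all non-overlapping matches. With no capturing groups, each item is the full matched string. With one capturing group, each item is that group's text. With two or more groups, each item is a tuple of the group values.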

import re

log = "Errors: 404 on /api/user, 500 on /api/order, 200 on /health"

# No groups: returns list of matched strings
codes = re.findall(r"\d{3}", log)
print(codes)  # ['404', '500', '200']

# One capturing group: returns only the text captured by the group
paths = re.findall(r"\d{3} on ([^\s,]+)", log)
print(paths)  # ['/api/user', '/api/order', '/health']
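
When the pattern has more than one capturing group, findall() switches to tuples. A minimal sketch on the same log string:

```python
import re

log = "Errors: 404 on /api/user, 500 on /api/order, 200 on /health"

# Two capturing groups: each result is a (status, path) tuple
pairs = re.findall(r"(\d{3}) on ([^\s,]+)", log)
print(pairs)  # [('404', '/api/user'), ('500', '/api/order'), ('200', '/health')]

# Non-capturing group: findall goes back to returning the whole match
whole = re.findall(r"(?:\d{3}) on [^\s,]+", log)
print(whole)  # ['404 on /api/user', '500 on /api/order', '200 on /health']
```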

re.finditer()

Returns an iterator of Match objects instead of a list of strings. Use this when you need the position of each match or when the match list could be very large.

import re

text = "call 9876543210 or 9123456789 for support"

for m in re.finditer(r"\b\d{10}\b", text):
    print(f"Found {m.group()} at position {m.start()}")
# Found 9876543210 at position 5
# Found 9123456789 at position 19

re.sub()

Substitutes all matches with a replacement string. The replacement can reference captured groups using \1, \2, or \g<name>.

import re

# Redact 10-digit phone numbers in a log file
text = "Contact: 9876543210, backup: 9123456789"
cleaned = re.sub(r"\b\d{10}\b", "[REDACTED]", text)
print(cleaned)  # Contact: [REDACTED], backup: [REDACTED]

# Reorder date format from YYYY-MM-DD to DD/MM/YYYY
date_text = "Invoice date: 2026-05-11"
reformatted = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date_text)
print(reformatted)  # Invoice date: 11/05/2026

Pass count=1 to re.sub() to replace only the first occurrence.
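
A sketch of count=1 in use. It also shows that re.sub() accepts a function as the replacement, which is part of the documented API; the helper name mask is hypothetical.

```python
import re

text = "Contact: 9876543210, backup: 9123456789"

# count=1: only the first phone number is replaced
first_only = re.sub(r"\b\d{10}\b", "[REDACTED]", text, count=1)
print(first_only)  # Contact: [REDACTED], backup: 9123456789

# A function replacement receives each Match object and returns a string
def mask(m):
    number = m.group()
    return "*" * 6 + number[-4:]  # keep the last four digits

masked = re.sub(r"\b\d{10}\b", mask, text)
print(masked)  # Contact: ******3210, backup: ******6789
```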

re.split()

Splits a string on every occurrence of the pattern. More flexible than str.split() because the delimiter can itself be a pattern.

import re

# Split on one-or-more whitespace characters (handles tabs, multiple spaces)
parts = re.split(r"\s+", "  name   age   score  ")
print(parts)  # ['', 'name', 'age', 'score', '']

# Split on commas or semicolons
row = "field1;field2,field3;field4"
parts = re.split(r"[;,]", row)
print(parts)  # ['field1', 'field2', 'field3', 'field4']
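
Two more re.split() behaviours are worth knowing: maxsplit limits the number of splits, and a capturing group in the pattern keeps the delimiters in the result. A minimal sketch, which also shows one way to avoid the empty strings from the first example:

```python
import re

row = "field1;field2,field3;field4"

# maxsplit=1: split once, leave the rest untouched
head, rest = re.split(r"[;,]", row, maxsplit=1)
print(head, rest)  # field1 field2,field3;field4

# Capturing group: the delimiters appear in the output list
with_delims = re.split(r"([;,])", row)
print(with_delims)  # ['field1', ';', 'field2', ',', 'field3', ';', 'field4']

# Strip leading/trailing whitespace first to avoid empty strings at the ends
clean = re.split(r"\s+", "  name   age   score  ".strip())
print(clean)  # ['name', 'age', 'score']
```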

Groups, Named Groups, and Flags

Named Groups

Named groups make patterns with multiple fields easier to maintain. Instead of remembering that group 2 is the month and group 3 is the day, you name them.

import re

pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

m = pattern.search("Invoice date: 2026-05-11")
if m:
    print(m.group("year"))   # 2026
    print(m.group("month"))  # 05
    print(m.group("day"))    # 11

Named groups also work in re.sub() replacement strings via \g<name>:

reformatted = pattern.sub(r"\g<day>/\g<month>/\g<year>", "Invoice date: 2026-05-11")
print(reformatted)  # Invoice date: 11/05/2026

Flags

Flags modify how the pattern engine interprets the string.

| Flag | Short form | Effect |
|------|------------|--------|
| re.IGNORECASE | re.I | Case-insensitive matching |
| re.MULTILINE | re.M | ^ and $ match at each line start/end, not just string start/end |
| re.DOTALL | re.S | . matches any character including newline |
| re.VERBOSE | re.X | Allows whitespace and comments inside the pattern for readability |

Combine flags with the bitwise OR operator: re.IGNORECASE | re.MULTILINE.
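
A short sketch of all four flags in action. The multi-line log text is invented for the example.

```python
import re

text = "Error: disk full\nerror: retry\nWARNING: low memory"

# IGNORECASE | MULTILINE: match "error:" at the start of any line, any case
hits = re.findall(r"^error: (.+)$", text, flags=re.IGNORECASE | re.MULTILINE)
print(hits)  # ['disk full', 'retry']

# DOTALL: without it, . cannot cross the newline between the two lines
no_dotall = re.search(r"disk.*retry", text)
with_dotall = re.search(r"disk.*retry", text, flags=re.DOTALL)
print(no_dotall is None, with_dotall is not None)  # True True

# VERBOSE: whitespace and # comments inside the pattern are ignored
date_re = re.compile(r"""
    (?P<year>\d{4})  -   # four-digit year
    (?P<month>\d{2}) -   # two-digit month
    (?P<day>\d{2})       # two-digit day
""", re.VERBOSE)
fields = date_re.search("Invoice date: 2026-05-11").groupdict()
print(fields)  # {'year': '2026', 'month': '05', 'day': '11'}
```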

re.compile and Performance

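A module-level call such as re.search(pattern, text) has to turn the pattern string into a compiled object before matching. The re module keeps a cache of recently compiled patterns, so the string is not re-parsed on every call, but each call still pays for a cache lookup. For a single call, that overhead is negligible. In a tight loop over a large file, it adds up.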

re.compile() parses the pattern once and returns a compiled pattern object. Call .search(), .findall(), .sub(), or .split() directly on that object.

import re

# Compile once
phone_pattern = re.compile(r"\b[6-9]\d{9}\b")

# Reuse in a loop
with open("contacts.txt") as f:
    for line in f:
        m = phone_pattern.search(line)
        if m:
            print(m.group())

Indian mobile numbers start with a digit from 6 to 9, so [6-9]\d{9} matches that first digit plus nine more and skips 10-digit numbers that fall outside the mobile range.

When string methods beat regex

Regex carries overhead from pattern parsing and the matching engine. For operations where the delimiter is a fixed string:

  • text.split(",") is faster than re.split(r",", text)
  • text.replace("foo", "bar") is faster than re.sub(r"foo", "bar", text)
  • "pattern" in text is faster than bool(re.search(r"pattern", text)) for literal strings

Reserve regex for cases where the pattern actually varies: optional characters, character classes, quantifiers, or alternatives. For a broader set of Python string manipulation techniques, see string sorting in Python and the Python basic programs collection.
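
The sketch below checks that the string methods and their regex counterparts return the same results for fixed delimiters, then prints a rough timing. The sample text is made up, and the timing numbers depend on the machine and Python version, so no expected values are shown for them.

```python
import re
import timeit

text = "alpha,beta,gamma,delta" * 50

# For a literal delimiter, both approaches give identical results
same_split = text.split(",") == re.split(r",", text)
same_replace = text.replace("beta", "BETA") == re.sub(r"beta", "BETA", text)
same_contains = ("gamma" in text) == bool(re.search(r"gamma", text))
print(same_split, same_replace, same_contains)  # True True True

# Rough timing comparison (values vary by machine)
t_str = timeit.timeit(lambda: text.split(","), number=2000)
t_re = timeit.timeit(lambda: re.split(r",", text), number=2000)
print(f"str.split: {t_str:.4f}s  re.split: {t_re:.4f}s")

# A delimiter that varies is where regex is the better tool
messy = "alpha, beta;gamma ,delta"
fields = re.split(r"\s*[;,]\s*", messy)
print(fields)  # ['alpha', 'beta', 'gamma', 'delta']
```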

Backtracking pitfall

Patterns with nested quantifiers on overlapping character classes can trigger catastrophic backtracking, where the engine tries an exponential number of paths. The classic example is (a+)+$ applied to "aaaaaaaaaaaaaaab". The trailing b means the match must fail, but before giving up the engine tries every possible way of grouping the a characters. Avoid patterns of the form (X+)+ or (X|Y)+ where X and Y can match the same characters.
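
The effect can be observed on small inputs. This sketch uses (a+)+$ with a trailing $ so that the overall match has to fail, and compares it with a+$, which accepts the same strings without nesting. The input lengths are kept small on purpose; the printed times vary by machine, and each extra a roughly doubles the work for the nested pattern.

```python
import re
import time

risky = re.compile(r"(a+)+$")  # nested quantifier
safe = re.compile(r"a+$")      # accepts the same strings, no nesting

for n in (8, 12, 16):
    subject = "a" * n + "b"    # trailing "b" forces the match to fail

    t0 = time.perf_counter()
    risky_result = risky.match(subject)
    risky_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    safe_result = safe.match(subject)
    safe_time = time.perf_counter() - t0

    # Both return None; only the time taken differs
    print(n, risky_result, safe_result, f"{risky_time:.5f}s vs {safe_time:.5f}s")
```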

Real-World Patterns: Phone Numbers, Email, Log Lines

Indian phone number extraction

Indian mobile numbers are 10 digits starting with 6, 7, 8, or 9. Some strings include a +91 country code prefix.

import re

# Matches +91XXXXXXXXXX or plain XXXXXXXXXX
phone_pattern = re.compile(r"(?:\+91[-\s]?)?[6-9]\d{9}\b")

samples = [
    "Call us at +91 9876543210 for support",
    "Backup: 9123456789",
    "Invalid: 12345",
]

for s in samples:
    m = phone_pattern.search(s)
    if m:
        print(m.group())
# +91 9876543210
# 9123456789
# (no output for the invalid line)

Basic email format check

A regex can catch structurally wrong email formats. It is not a substitute for sending a verification email.

import re

email_pattern = re.compile(
    r"^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$"
)

# Sample addresses (example.com placeholders)
addresses = ["user@example.com", "bad@", "also.bad", "first.last@example.co.in"]
for addr in addresses:
    status = "valid format" if email_pattern.match(addr) else "invalid format"
    print(f"{addr}: {status}")
# user@example.com: valid format
# bad@: invalid format
# also.bad: invalid format
# first.last@example.co.in: valid format

Note: r"^[a-zA-Z0-9._%+\-]+" allows the characters most common in real email addresses. Edge cases in RFC 5321 (quoted strings, IP-address literals) are not handled here. For production validation, use a dedicated library.

Log line parsing

Parsing structured fields from a log line is one of the cleaner regex use cases because the format is fixed and controlled.

import re

log_line = '192.168.1.1 - - [11/May/2026:08:30:00 +0530] "GET /api/data HTTP/1.1" 200 1452'

log_pattern = re.compile(
    r"(?P<ip>\d+\.\d+\.\d+\.\d+)"
    r".*?"
    r'"(?P<method>\w+) (?P<path>\S+)'
    r'.*?"'
    r"\s(?P<status>\d{3})"
    r"\s(?P<size>\d+)"
)

m = log_pattern.search(log_line)
if m:
    print(m.group("ip"))      # 192.168.1.1
    print(m.group("method"))  # GET
    print(m.group("path"))    # /api/data
    print(m.group("status"))  # 200
    print(m.group("size"))    # 1452

Named groups make it easy to add fields or reorder the output without renumbering group references.
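
One way to build on this is Match.groupdict(), which returns all named groups as a dictionary. The sketch below reuses the pattern from this section; the helper name parse_line and the second and third sample lines are hypothetical additions for illustration.

```python
import re

log_pattern = re.compile(
    r"(?P<ip>\d+\.\d+\.\d+\.\d+)"
    r".*?"
    r'"(?P<method>\w+) (?P<path>\S+)'
    r'.*?"'
    r"\s(?P<status>\d{3})"
    r"\s(?P<size>\d+)"
)

def parse_line(line):
    """Return the named fields as a dict, or None if the line does not match."""
    m = log_pattern.search(line)
    if m is None:
        return None
    record = m.groupdict()
    record["status"] = int(record["status"])  # convert numeric fields
    record["size"] = int(record["size"])
    return record

lines = [
    '192.168.1.1 - - [11/May/2026:08:30:00 +0530] "GET /api/data HTTP/1.1" 200 1452',
    '10.0.0.7 - - [11/May/2026:08:31:12 +0530] "POST /api/order HTTP/1.1" 500 87',
    "not a log line",
]

records = [r for r in map(parse_line, lines) if r is not None]
errors = [r["path"] for r in records if r["status"] >= 500]
print(len(records))  # 2
print(errors)        # ['/api/order']
```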

Regex and LLM Output Parsing

Parsing the output of a language model is pattern matching on free text. Checking whether a response contains a valid JSON block, extracting a structured tag, or confirming a safety prefix exists are all re.search() calls on model output strings. The same patterns from this article apply directly.
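
A minimal sketch of those three checks. The response text, the <verdict> tag, and the helper name extract_json are all assumptions made up for this example; real model output will look different and usually needs more defensive handling.

```python
import json
import re

# Hypothetical model response: a prefix, a JSON object, and a tagged verdict
response = (
    "Here is the summary.\n"
    '{"name": "Asha", "score": 87}\n'
    "<verdict>PASS</verdict>"
)

def extract_json(text):
    """Return the first {...} span parsed as JSON, or None if absent or invalid."""
    m = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if m is None:
        return None
    try:
        return json.loads(m.group())
    except json.JSONDecodeError:
        return None

data = extract_json(response)
print(data)  # {'name': 'Asha', 'score': 87}

# Extract a structured tag with a non-greedy capturing group
tag = re.search(r"<verdict>(.*?)</verdict>", response)
verdict = tag.group(1) if tag else None
print(verdict)  # PASS

# Confirm the response starts with an expected prefix
has_prefix = re.match(r"Here is", response) is not None
print(has_prefix)  # True
```

The regex only locates the candidate span; json.loads() does the actual validation, which is why the helper returns None when parsing fails.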

TinkerLLM includes exercises where you write re.search() and re.findall() code against live model responses, extracting structured data from unstructured completions. Entry price is ₹299 at tinkerllm.com, browser-based, no setup needed.

Frequently asked questions

What is the difference between re.match() and re.search() in Python?

re.match() checks for a match only at the beginning of the string. re.search() scans the entire string and returns the first match anywhere. If you want to check whether a pattern exists anywhere in the string, use re.search().

How do I make a Python regex case-insensitive?

Pass re.IGNORECASE (or re.I) as the flags argument: re.search(r'python', text, re.IGNORECASE). For compiled patterns, pass it to re.compile(): pattern = re.compile(r'python', re.IGNORECASE).

What does (?:...) mean and how does it differ from (...)?

(?:...) is a non-capturing group. It groups the pattern for quantifier or alternation purposes but does not store the matched text. A plain (...) is a capturing group that stores the match and returns it in findall() results and match.group(n) calls. Use (?:...) when you only need the grouping, not the captured value.

When should I use re.compile() instead of calling re.search() directly?

Use re.compile() when the same pattern is applied more than once, especially inside a loop over many strings. The re module already caches recently compiled patterns, so module-level calls do not re-parse the string each time, but holding a compiled pattern object skips the cache lookup on every call and gives the pattern a readable name. For a one-off search in a short script, re.search(pattern, text) is fine.

Why does my greedy regex match more text than expected?

Quantifiers like *, +, and ? are greedy by default and match as much text as possible. To make them non-greedy (match as little as possible), append a question mark: *?, +?, ??. For example, r'<.*>' matches the entire string '<b>bold</b><i>italic</i>', while r'<.*?>' matches only '<b>'.

Is regex good for validating email addresses in Python?

A basic regex like r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' catches obviously invalid formats. For production systems, use the email-validator library or Python's email.headerregistry module instead, as RFC 5321 edge cases are complex enough that hand-rolled regex misses them.
