Python regex quick reference

Aug. 9, 2020

Search Patterns

Regex pattern	Match
^	Beginning of the string
$	End of the string
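A minimal sketch of the two anchors in use; the sample string and variable names are made up for illustration.

```python
import re

text = "cat concat"  # illustrative sample string

at_start = re.findall(r"^cat", text)  # only the "cat" at the very start
at_end = re.findall(r"cat$", text)    # only the "cat" that ends the string
whole = re.search(r"^cat$", text)     # the whole string would have to be "cat"

print(at_start)  # ['cat']
print(at_end)    # ['cat']
print(whole)     # None
```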

[a-e]	= [abcde]
[0-5]	= [012345]
[A-Z]	= [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
[A-Za-z]	= all ASCII letters
[-az] or [az-]	= "-" or "a" or "z"
[-a-z]	= "-" or "a...z"
[^abc]	= not ("a", "b" or "c")
[a^bc]	= "a", "b", "c" or "^"
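The character-class rows above can be checked with a short sketch like the following; the input strings are arbitrary and chosen only for illustration.

```python
import re

# A range matches any single character between its endpoints.
in_range = re.findall(r"[a-e]", "badge")

# A hyphen placed first (or last) in the class is a literal "-".
with_hyphen = re.findall(r"[-az]", "a-b-z")

# "^" negates the class only when it is the first character.
negated = re.findall(r"[^abc]", "abcd^")
literal_caret = re.findall(r"[a^bc]", "abcd^")

print(in_range)       # ['b', 'a', 'd', 'e']
print(with_hyphen)    # ['a', '-', '-', 'z']
print(negated)        # ['d', '^']
print(literal_caret)  # ['a', 'b', 'c', '^']
```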

()	Defines a capturing group
.	Any character except a newline (any character at all if re.DOTALL is set)

>>> re.findall(r"^A(.*)B","A123B")
['123']

a|b	a or b

>>> re.findall("1(a|b)2","001a20001b20")
['a', 'b']

a{4}	Exactly 4 a's
a{4,8}	Between (inclusive) 4 and 8 a's
a{9,}	9 or more a's
?	Match 0 or 1 repetitions of the preceding RE: ab? will match either ‘a’ or ‘ab’.

>>> re.findall("ab?","123abaacd")
['ab', 'a', 'a']

*	Match 0 or more repetitions of the preceding RE: ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

>>> re.findall("ab*","123abbbbabacd")
['abbbb', 'ab', 'a']

+	Match 1 or more repetitions of the preceding RE: ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.

>>> re.findall("ab+","123abbbbabacd")
['abbbb', 'ab']
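The counted forms from the table ({4}, {4,8}, {9,}) can be exercised the same way. This is only a sketch: the sample text is invented, and the \b word boundaries are added here so that each run of a's is tested as a whole rather than in pieces.

```python
import re

text = "aa aaaa aaaaaa aaaaaaaaaa"  # runs of 2, 4, 6 and 10 a's

exactly_four = re.findall(r"\ba{4}\b", text)
four_to_eight = re.findall(r"\ba{4,8}\b", text)
nine_or_more = re.findall(r"\ba{9,}\b", text)

print(exactly_four)   # ['aaaa']
print(four_to_eight)  # ['aaaa', 'aaaaaa']
print(nine_or_more)   # ['aaaaaaaaaa']
```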

The quantifiers ?, *, + and {m,n} are "greedy": they match as much text as possible.
Appending a '?' (??, *?, +?, {m,n}?) makes them "non-greedy": as few characters as possible will be matched.

>>> re.findall(r"<.*>","<a> b <c>")
['<a> b <c>']
>>> re.findall(r"<.*>","<a> b <c")
['<a>']
>>> re.findall(r"<.*?>","<a> b <c>")
['<a>', '<c>']

\d	Any decimal digit: [0-9]
\D	complement of \d. Any non-digit character: [^0-9]
\s	Any whitespace character: [ \t\n\r\f\v]
\S	Complement of \s. Any non-whitespace character: [^ \t\n\r\f\v]
\w	Any word character (alphanumeric or underscore): [a-zA-Z0-9_]
\W	Complement of \w. Any non-word character: [^a-zA-Z0-9_]
\b	A word boundary (empty string, but only at the start or end of a word)
\B	A non-word boundary (empty string, but not at the start or end of a word)
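A short sketch of the shorthand classes and boundaries in practice; the sample strings are invented for illustration.

```python
import re

text = "Order 66 shipped on 2020-08-09 to cat_lover99"

digits = re.findall(r"\d+", text)  # runs of decimal digits
words = re.findall(r"\w+", text)   # runs of letters, digits and underscores

sample = "cat concat cat_lover cat"
whole_word = re.findall(r"\bcat\b", sample)   # "cat" as a standalone word
inside_word = re.findall(r"\Bcat\b", sample)  # "cat" ending a longer word

print(digits)       # ['66', '2020', '08', '09', '99']
print(words)        # ['Order', '66', 'shipped', 'on', '2020', '08', '09', 'to', 'cat_lover99']
print(whole_word)   # ['cat', 'cat']
print(inside_word)  # ['cat']
```

Note that "cat_lover" does not count as a whole-word match because the underscore is itself a word character.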

Escape Sequences in Strings

Escape Sequence	Meaning Notes
\newline	Ignored
\\	Backslash (\)
\'	Single quote (')
\"	Double quote (")
\a	ASCII Bell (BEL)
\b	ASCII Backspace (BS)
\f	ASCII Formfeed (FF)
\n	ASCII Linefeed (LF)
\N{name}	Character named name in the Unicode database (str literals only, not bytes)
\r	ASCII Carriage Return (CR)
\t	ASCII Horizontal Tab (TAB)
\uxxxx	Character with 16-bit hex value xxxx (str literals only, not bytes)
\Uxxxxxxxx	Character with 32-bit hex value xxxxxxxx (str literals only, not bytes)
\v	ASCII Vertical Tab (VT)
\ooo	Character with octal value ooo
\xhh	Character with hex value hh
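These string escapes are processed before the regex engine sees the pattern, which is why patterns are usually written as raw strings. The sketch below shows the clash for \b (backspace in a string literal, word boundary in a regex); the variable names are illustrative.

```python
import re

# In a normal literal, "\b" is one character (backspace, 0x08);
# in a raw literal, r"\b" is two characters: a backslash and "b".
plain_len = len("\b")
raw_len = len(r"\b")

# Without the r prefix the pattern looks for literal backspace characters.
no_match = re.findall("\bcat\b", "a cat")
# With the r prefix the regex engine receives \b and treats it as a word boundary.
match = re.findall(r"\bcat\b", "a cat")

# Doubling the backslash in a normal literal is equivalent to a raw string.
same = "\\d" == r"\d"

print(plain_len, raw_len)  # 1 2
print(no_match)            # []
print(match)               # ['cat']
print(same)                # True
```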

Groups

(?P<name>...)	Define a capturing group named 'name'
(?P=name)	Refer back to the text captured by the group named 'name'
\number	Refer back to the n'th captured group (e.g. \1, \2)
(?#...)	A comment; its contents are ignored

>>> match = re.search(r"<([a-z]+)>(.*)</\1>","<name>Samuel</name>")
>>> match.group(1)
'name'
>>> match.group(2)
'Samuel'

>>> re.findall(r"<([a-z]+)>(.*)</\1>","<name>Samuel</name>")
[('name', 'Samuel')]

>>> m = re.search(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> m.group('last_name')
'Reynolds'
>>> m.groupdict()
{'first_name': 'Malcolm', 'last_name': 'Reynolds'}

>>> re.findall(r"<(?P<tag>[a-z]+)>(.*)</(?P=tag)>", "<name>Malcolm Reynolds</name>")
[('name', 'Malcolm Reynolds')]

(?=...)	Positive lookahead: matches only if ... matches next; the lookahead text is not consumed

>>> re.findall('abc (?=def)', 'abc def')
['abc ']

(?!...)	Negative lookahead: matches only if ... does not match next

>>> re.findall('abc(?!def)', 'abcde')
['abc']

(?<=...)	Positive lookbehind: matches only if the current position is preceded by a match for ...

>>> re.findall('(?<=abc)def', 'abcdef')
['def']

>>> re.findall(r'(?<=-)\w+', 'spam-egg')
['egg']

>>> re.findall(r'(?<=:).*\.(?#find the list)', 'This is an list: 1, 2, 3, 4 .')
[' 1, 2, 3, 4 .']

(?<!...)	Negative lookbehind: matches only if the current position is not preceded by a match for ...

>>> re.findall('(?<!abc)def', 'abcdef defabc')
['def']

Example usage:

>>> import re
>>> match = re.search(r"at","A cat in a hat.")
>>> match
<re.Match object; span=(3, 5), match='at'>
>>> m = re.search(r"(at)","A cat in a hat.")
>>> m.group(1)
'at'
>>> m.group(0)
'at'
>>> m.span()
(3, 5)
>>> m.start()
3
>>> m.end()
5

>>> re.findall(r"at","A cat in a hat.")
['at', 'at']

>>> re.sub("at","**", "A cat in a hat.")
'A c** in a h**.'
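re.sub can also reuse captured groups in the replacement, limit the number of substitutions, or call a function for each match. A brief sketch, with sample strings and variable names chosen only for illustration:

```python
import re

# Numbered backreferences in the replacement string.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "Malcolm Reynolds")

# Named groups are referenced with \g<name>.
named = re.sub(
    r"(?P<first>\w+) (?P<last>\w+)", r"\g<last>, \g<first>", "Malcolm Reynolds"
)

# count limits how many matches are replaced.
first_only = re.sub("at", "**", "A cat in a hat.", count=1)

# A callable receives each match object and returns the replacement text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 cats, 10 hats")

print(swapped)     # Reynolds Malcolm
print(named)       # Reynolds, Malcolm
print(first_only)  # A c** in a hat.
print(doubled)     # 6 cats, 20 hats
```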

>>> compiled_re = re.compile("at")
>>> compiled_re.search("A cat in a hat.")
<re.Match object; span=(3, 5), match='at'>

>>> re.findall(r"at","A cat in a hat.\nA rAt and a bAt.", re.IGNORECASE)
['at', 'at', 'At', 'At']

>>> "A cat  and  a  \n  rat".split(" ")
['A', 'cat', '', 'and', '', 'a', '', '\n', '', 'rat']
>>> "A cat  and  a  \n  rat".split(None)
['A', 'cat', 'and', 'a', 'rat']
>>> "A cat  and  a  \n  rat".split(r"at")
['A c', '  and  a  \n  r', '']
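The calls above use str.split, which treats its argument as a literal separator. For comparison, a sketch of re.split on the same string, where the separator is a pattern:

```python
import re

text = "A cat  and  a  \n  rat"

# Split on any run of whitespace.
on_whitespace = re.split(r"\s+", text)

# Split on the pattern "at"; here the result matches str.split("at").
on_at = re.split(r"at", text)

# A capturing group keeps the separators in the result list.
keep_separators = re.split(r"(at)", text)

print(on_whitespace)    # ['A', 'cat', 'and', 'a', 'rat']
print(on_at)            # ['A c', '  and  a  \n  r', '']
print(keep_separators)  # ['A c', 'at', '  and  a  \n  r', 'at', '']
```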

Abbreviation	Full name	Description
re.I	re.IGNORECASE	Makes the regular expression case-insensitive
re.L	re.LOCALE	Makes special sequences such as \w, \W, \b and \B dependent on the current locale, i.e. the user's language, country and so on. In Python 3 this flag can only be used with bytes patterns.
re.M	re.MULTILINE	^ and $ will match at the beginning and at the end of each line and not just at the beginning and the end of the string
re.S	re.DOTALL	The dot "." will match every character, including the newline
re.U	re.UNICODE	Makes \w, \W, \b, \B, \d, \D, \s, \S dependent on Unicode character properties. This is already the default for str patterns in Python 3; re.A (re.ASCII) restricts them to ASCII.
re.X	re.VERBOSE	Allows "verbose" regular expressions, in which whitespace in the pattern is ignored: spaces, tabs and newlines are not matched as such. To match a space in a verbose regular expression, escape it with a backslash or put it in a character class. A "#" starts a comment that runs to the end of the line, except when the "#" is inside a character class or preceded by an unescaped backslash.
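A short sketch of the flags from the table in use. The sample text and the date pattern are invented for illustration; flags are combined with the | operator.

```python
import re

text = "first line\nsecond line"

# Without re.MULTILINE, ^ matches only at the start of the string.
default_anchor = re.findall(r"^\w+", text)
multiline_anchor = re.findall(r"^\w+", text, re.MULTILINE)

# Without re.DOTALL, "." stops at the newline.
default_dot = re.findall(r"first.*second", text)
dotall_dot = re.findall(r"first.*second", text, re.DOTALL)

# Flags can be combined with |.
combined = re.findall(r"^SECOND", text, re.IGNORECASE | re.MULTILINE)

# re.VERBOSE allows whitespace and comments inside the pattern.
date_re = re.compile(
    r"""
    (?P<year>\d{4}) -    # four-digit year, then a literal hyphen
    (?P<month>\d{2}) -   # two-digit month, then a literal hyphen
    (?P<day>\d{2})       # two-digit day
    """,
    re.VERBOSE,
)
parts = date_re.search("Posted on 2020-08-09").groupdict()

print(default_anchor)    # ['first']
print(multiline_anchor)  # ['first', 'second']
print(default_dot)       # []
print(dotall_dot)        # ['first line\nsecond']
print(combined)          # ['second']
print(parts)             # {'year': '2020', 'month': '08', 'day': '09'}
```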

Links:

https://www.python-course.eu/python3_re.php

https://www.python-course.eu/python3_re_advanced.php

https://github.com/amalshehu/legendary-regex/blob/master/README.md

https://www.rexegg.com/regex-lookarounds.html

https://www.regular-expressions.info/lookaround.html
