Understanding Regex in Python — a full understanding of the functions and how they work

DC
Mar 2, 2020

One of the struggles I have had as a developer is getting a full understanding of regex.

You have probably run across regular expressions several times in your development work and been confused by a daunting set of characters grouped together like this:

/^\w+([\.-]?\w)+@\w+([\.]?\w)+(\.[a-zA-Z]{2,3})+$/

You probably asked yourself: what is this?
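To make the pattern a little less mysterious, here is a minimal sketch of how it could be used from Python with the re module, which is introduced later in this article. The slashes at either end are JavaScript-style delimiters and are not part of the pattern, so they are dropped here. The function name looks_like_email and the sample addresses are my own, purely for illustration, and the pattern is only a rough check, not a complete email validator.

```python
import re

# The pattern from above without the surrounding / delimiters.
# A raw string (r"...") keeps the backslashes intact for the regex engine.
EMAIL_PATTERN = r"^\w+([\.-]?\w)+@\w+([\.]?\w)+(\.[a-zA-Z]{2,3})+$"


def looks_like_email(text):
    """Hypothetical helper: True if the whole string fits the pattern."""
    return re.search(EMAIL_PATTERN, text) is not None


print(looks_like_email("jane.doe@example.com"))  # True
print(looks_like_email("not-an-email"))          # False
```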

In this article, I would like to walk through the basics of how regex works and how you can start using it in your own projects. In the past, whenever I needed to manipulate data, I would endlessly google how to get it into the shape I needed. Most of the time the answer lay in some borrowed code that used regex. I would copy it into my code, and the next time I needed to do something similar, the Google search began again.

Regular expressions (regex or regexp) are extremely useful when you want to sharpen your algorithm skills, and they will help you solve many text-processing problems faster and more efficiently. Their syntax can be intimidating, but it is very rewarding once you grasp how the patterns work and can get them to behave properly.

What is Regex?

A regex is a pattern, written as a string, that helps you extract information from any kind of text by searching it for exactly what you need. You can find things like numbers, letters, punctuation, or white space. It allows you to check for and match any combination of characters in a string.

This is useful for tasks such as matching phone numbers or email addresses. You can check whether a pattern is present in a string, and then extract, replace, or validate the matching substrings. A regex is like your own search bar, where you define the criteria that meet your needs.

How to create a Regex?

Python has a built-in package for handling Regex called the re module.

The first thing you must do is import the module:

import re

The example below uses a regex to check whether a sentence starts with "The" and ends with "Spain". The search() function returns a Match object if the pattern is found, or None if it is not, so the result can be used directly in an if statement.

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)
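As a minimal sketch of how the result of search() might be used in practice, the snippet below tests the result in an if statement. The variable names and the second pattern (ending in "Portugal") are my own additions for illustration.

```python
import re

txt = "The rain in Spain"

# search() gives back a Match object (truthy) or None (falsy).
match = re.search(r"^The.*Spain$", txt)
if match:
    print("Matched:", match.group())  # Matched: The rain in Spain
else:
    print("No match")

# The same sentence does not end with "Portugal", so this is None.
no_match = re.search(r"^The.*Portugal$", txt)
print(no_match)  # None
```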

Functions of Regex

findall()

The findall() function returns a list containing all matches found in a string.

txt = "The rain in Spain"
x = re.findall("ai", txt)

Returns: ['ai', 'ai']

The list contains the matches in the order they are found.

If no matches are found, an empty list is returned:

txt = "The rain in Spain"
x = re.findall("Portugal", txt)

Returns: []

search()

The search() function searches the string for a match, and returns a Match object if there is a match. If there is more than one match, only the first occurrence of the match will be returned.

The search below returns the position of the first white-space character:

txt = "The rain in Spain"
x = re.search(r"\s", txt)
x.start()

Returns: 3

If no match is found, None is returned.

txt = "The rain in Spain"
x = re.search("Portugal", txt)

Returns: None

When it finds a match, the search() function returns a Match object. A Match object contains information about the search and the result.

txt = "The rain in Spain"
x = re.search("ai", txt)
print(x) #this will print an object

Returns: <re.Match object; span=(5, 7), match='ai'> (older Python versions print <_sre.SRE_Match object; span=(5, 7), match='ai'> instead)

The object has properties and methods used to retrieve information about the search, and the result:

.span() returns a tuple containing the start and end positions of the match.
.string returns the string passed into the function
.group() returns the part of the string where there was a match
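A short sketch of these three in action, using the same sentence as above. The variable name m is my own, and the final line simply shows that the span can be used to slice the original string.

```python
import re

txt = "The rain in Spain"
m = re.search(r"ai", txt)

print(m.span())    # (5, 7)  start and end position of the first match
print(m.string)    # The rain in Spain  (the string that was searched)
print(m.group())   # ai  (the text that matched)

# The span indexes into the original string like a normal slice.
start, end = m.span()
print(txt[start:end])  # ai
```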

split()

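The split() function returns a list where the string has been split at each match. You can limit the number of splits with the maxsplit parameter. The example below splits the string only at the first white-space character:

txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)

Returns: ['The', 'rain in Spain']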
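To see what the limit on splits does, this sketch splits the same sentence with and without it. It assumes the limit is passed by keyword as maxsplit, which is the name the re module uses. The variable names are my own.

```python
import re

txt = "The rain in Spain"

# No limit: split at every white-space character.
every_space = re.split(r"\s", txt)
print(every_space)  # ['The', 'rain', 'in', 'Spain']

# maxsplit=1: split only at the first match and keep the rest intact.
first_only = re.split(r"\s", txt, maxsplit=1)
print(first_only)   # ['The', 'rain in Spain']
```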

sub()

The sub() function replaces or substitutes the matches with the text of your choice:

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)

Returns: "The9rain9in9Spain"

The sub() function lets you limit the number of replacements with the count parameter:

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)

Returns: "The9rain9in Spain"
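A common practical use of sub() is cleaning up messy spacing. The sketch below is my own example rather than one of the cases above: it uses the + metacharacter (covered later in this article) so that each run of white space counts as a single match and is replaced by one space.

```python
import re

# Hypothetical messy input: repeated spaces and a tab between words.
messy = "The   rain\tin  Spain"

# \s+ matches one or more white-space characters in a row.
clean = re.sub(r"\s+", " ", messy)
print(clean)  # The rain in Spain
```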

Metacharacters, Special Sequences and Sets

We have only scratched the surface so far. The functions above were demonstrated with white space or literal text, but the real power of regex lies in the metacharacters, special sequences, and sets that let you search strings in more elaborate ways.

Metacharacters

The [] allows for a set of characters to be searched for.

txt = "The rain in Spain"
# Find all lowercase characters alphabetically between "a" and "m":
x = re.findall("[a-m]", txt)

Returns

['h', 'e', 'a', 'i', 'i', 'a', 'i']

The \ signals a special sequence (can also be used to escape special characters).

txt = "That will be 59 dollars"
# Find all digit characters:
x = re.findall(r"\d", txt)

Returns

['5', '9']

The . allows for any character (except newline character).

txt = "hello world"
# Search for a sequence that starts with "he", followed by two (any) characters, and an "o":
x = re.findall("he..o", txt)

Returns

['hello']

The ^ matches at the start of the string.

txt = "hello world"
# Check if the string starts with "hello":
x = re.findall("^hello", txt)

Returns

['hello']

The $ matches at the end of the string.

txt = "hello world"
# Check if the string ends with "world":
x = re.findall("world$", txt)

Returns

['world']
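Note that findall() gives back a list of matches, not True or False. If a plain yes/no answer is what you need, one option (a minimal sketch, with variable names of my own choosing) is to pass the result of search() to bool(), since a Match object is truthy and None is falsy.

```python
import re

txt = "hello world"

# bool() turns "Match object or None" into True or False.
starts_with_hello = bool(re.search(r"^hello", txt))
ends_with_world = bool(re.search(r"world$", txt))
ends_with_hello = bool(re.search(r"hello$", txt))

print(starts_with_hello)  # True
print(ends_with_world)    # True
print(ends_with_hello)    # False
```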

The * looks for zero or more occurrences.

txt = "The rain in Spain falls mainly in the plain!"
# Check if the string contains "ai" followed by 0 or more "x" characters:
x = re.findall("aix*", txt)

Returns

['ai', 'ai', 'ai', 'ai']

The + looks for one or more occurrences. An example is "aix+".

txt = "The rain in Spain falls mainly in the plain!"
# Check if the string contains "ai" followed by 1 or more "x" characters:
x = re.findall("aix+", txt)

Returns

[]

The {} looks for exactly the specified number of occurrences.

txt = "The rain in Spain falls mainly in the plain!"
# Check if the string contains "a" followed by exactly two "l" characters:
x = re.findall("al{2}", txt)

Returns

['all']
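The braces also accept a range, written {min,max}. The email pattern at the top of this article uses {2,3} in exactly this way. The sketch below is my own example with made-up input. It shows that the match is taken greedily from left to right, so a four-digit number yields only its first three digits.

```python
import re

# Hypothetical sample text containing numbers of different lengths.
txt = "7 42 365 2020"

# Exactly two digits in a row.
exactly_two = re.findall(r"\d{2}", txt)
print(exactly_two)   # ['42', '36', '20', '20']

# Between two and three digits in a row (as many as possible).
two_to_three = re.findall(r"\d{2,3}", txt)
print(two_to_three)  # ['42', '365', '202']
```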

The | looks for either one of the values. An example is "falls|stays".

txt = "The rain in Spain falls mainly in the plain!"
# Check if the string contains either "falls" or "stays":
x = re.findall("falls|stays", txt)

Returns

['falls']

Special Sequences

A special sequence is a \ followed by one of the characters below that has special meaning.

The \A will return a match if the specified characters are at the beginning of the string. For example, the following returns ['The'].

txt = "The rain in Spain"
# Check if the string starts with "The":
x = re.findall(r"\AThe", txt)

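The \b will return a match where the specified characters are at the beginning or at the end of a word. The r in front of the pattern makes it a raw string, so Python passes \b to the regex engine instead of treating it as a backspace character.

txt = "The rain in Spain"
# Check if "ain" is present at the beginning of a word:
x = re.findall(r"\bain", txt)

Returns

[]

Example at the end of a word:

txt = "The rain in Spain"
# Check if "ain" is present at the end of a word:
x = re.findall(r"ain\b", txt)

Returns

['ain', 'ain']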

The \B will return a match where the specified characters are present, but NOT at the beginning (or at the end) of a word.

Example where "ain" is found, but not at the beginning of a word:

txt = "The rain in Spain"
# Check if "ain" is present, but NOT at the beginning of a word:
x = re.findall(r"\Bain", txt)

Returns

['ain', 'ain']

Example where no match is found, because every "ain" sits at the end of a word:

txt = "The rain in Spain"
# Check if "ain" is present, but NOT at the end of a word:
x = re.findall(r"ain\B", txt)

Returns

[]
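Since \b and \B are easy to mix up, here is a side-by-side sketch. The sample text is my own, chosen so that "ain" appears as a whole word, at the end of a word, and in the middle of a word.

```python
import re

# Hypothetical sample: "ain" alone, at the end of "rain", inside "rainy".
txt = "ain rain rainy"

word_start = re.findall(r"\bain", txt)      # "ain" at the start of a word
word_end = re.findall(r"ain\b", txt)        # "ain" at the end of a word
not_start = re.findall(r"\Bain", txt)       # "ain" NOT at the start of a word
not_end = re.findall(r"ain\B", txt)         # "ain" NOT at the end of a word
whole_word = re.findall(r"\bain\b", txt)    # "ain" as a complete word

print(word_start)  # ['ain']
print(word_end)    # ['ain', 'ain']
print(not_start)   # ['ain', 'ain']
print(not_end)     # ['ain']
print(whole_word)  # ['ain']
```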

The \d will return a match for every digit (0-9) in the string.

This example returns no match, because the string contains no digits:

txt = "The rain in Spain"
# Check if the string contains any digits (numbers from 0-9):
x = re.findall(r"\d", txt)

The \D will return a match for every character that is NOT a digit.

The example below returns a list with every character in the string, since none of them are digits.

txt = "The rain in Spain"
# Return a match at every non-digit character:
x = re.findall(r"\D", txt)

The \s will return a match for every white-space character in the string.

txt = "The rain in Spain"
# Return a match at every white-space character:
x = re.findall(r"\s", txt)

Returns

[' ', ' ', ' ']

The \S will return a match for every character that is NOT white space.

The example below returns a match at every non-white-space character.

txt = "The rain in Spain"
# Return a match at every NON white-space character:
x = re.findall(r"\S", txt)

Returns

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']

The \w will return a match for every word character in the string (letters from a to z in either case, digits from 0-9, and the underscore _ character).

txt = "The rain in Spain"
# Return a match at every word character (letters, digits 0-9, and the underscore _):
x = re.findall(r"\w", txt)

Returns

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']

The \W returns a match for every character that is NOT a word character.

txt = "The rain in Spain"
# Return a match at every NON word character (such as "!", "?", or white space):
x = re.findall(r"\W", txt)

Returns

[' ', ' ', ' ']
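Each lowercase sequence (\d, \s, \w) has an uppercase counterpart that matches everything the lowercase one does not. As a rough sketch with a sample string of my own, the two halves of each pair always add up to the full string:

```python
import re

# Hypothetical sample containing letters, digits, spaces, and punctuation.
txt = "Room 42, floor 3"

digits = re.findall(r"\d", txt)
non_digits = re.findall(r"\D", txt)
word_chars = re.findall(r"\w", txt)
non_word_chars = re.findall(r"\W", txt)

print(digits)          # ['4', '2', '3']
print(non_word_chars)  # [' ', ',', ' ', ' ']

# Every character falls into exactly one half of each pair.
print(len(digits) + len(non_digits) == len(txt))          # True
print(len(word_chars) + len(non_word_chars) == len(txt))  # True
```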

The \Z will return a match if the specified characters are at the end of the string.

txt = "The rain in Spain"
# Check if the string ends with "Spain":
x = re.findall(r"Spain\Z", txt)

Returns

['Spain']

Sets

The [arn] will return a match wherever one of the specified characters (a, r, or n) is present.

txt = "The rain in Spain"
# Check if the string has any a, r, or n characters:
x = re.findall("[arn]", txt)

Returns

['r', 'a', 'n', 'n', 'a', 'n']

The [a-n] will return a match for any lowercase character alphabetically between a and n.

txt = "The rain in Spain"
# Check if the string has any characters between a and n:
x = re.findall("[a-n]", txt)

Returns

['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']

The [^arn] will return a match for any character EXCEPT a, r, and n.

txt = "The rain in Spain"
# Check if the string has characters other than a, r, or n:
x = re.findall("[^arn]", txt)

Returns

['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i']

The [0123] returns a match wherever any of the specified digits (0, 1, 2, or 3) is present.

txt = "The rain in Spain"
# Check if the string has any 0, 1, 2, or 3 digits:
x = re.findall("[0123]", txt)

Returns

[]

The [0-9] returns a match for any digit between 0 and 9.

txt = "8 times before 11:45 AM"
# Check if the string has any digits:
x = re.findall("[0-9]", txt)

Returns

['8', '1', '1', '4', '5']

The [0-5][0-9] returns a match for any two-digit number from 00 to 59.

txt = "8 times before 11:45 AM"
# Check if the string has any two-digit numbers, from 00 to 59:
x = re.findall("[0-5][0-9]", txt)

Returns

['11', '45']
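Sets can be combined with literal characters to describe a whole format. As a sketch, assuming times are written as HH:MM, the pattern below pulls the full time out of the same sentence. It is deliberately loose (it would also accept something like 29:59), so treat it as an illustration rather than a strict time validator.

```python
import re

txt = "8 times before 11:45 AM"

# Two digits for the hour, a literal colon, then minutes from 00 to 59.
times = re.findall(r"[0-2][0-9]:[0-5][0-9]", txt)
print(times)  # ['11:45']
```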

The [a-zA-Z] returns a match for any character alphabetically between a and z, lowercase OR uppercase.

txt = "8 times before 11:45 AM"
# Check if the string has any characters from a to z lowercase, and A to Z uppercase:
x = re.findall("[a-zA-Z]", txt)

Returns

['t', 'i', 'm', 'e', 's', 'b', 'e', 'f', 'o', 'r', 'e', 'A', 'M']

[+] In sets, +, *, ., |, (), $, and {} have no special meaning, so [+] means: return a match for any + character in the string.

txt = "8 times before 11:45 AM"
# Check if the string has any + characters:
x = re.findall("[+]", txt)

Returns

[]
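To see the difference a set makes, the sketch below uses sample strings of my own that actually contain the special characters. Inside a set the character is taken literally. Outside a set it needs a backslash to lose its special meaning.

```python
import re

# Hypothetical inputs that contain a + and a . character.
sum_txt = "1+1=2"
pi_txt = "3.14"

plus_in_set = re.findall(r"[+]", sum_txt)   # literal + inside a set
plus_escaped = re.findall(r"\+", sum_txt)   # literal + via backslash escape
dot_in_set = re.findall(r"[.]", pi_txt)     # literal . inside a set
dot_bare = re.findall(r".", pi_txt)         # bare . matches any character

print(plus_in_set)   # ['+']
print(plus_escaped)  # ['+']
print(dot_in_set)    # ['.']
print(dot_bare)      # ['3', '.', '1', '4']
```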
