search iconsearch icon
Type something to search...

Using regexs with python

Using regexs with python

1. Regex basics

A regular expression, or regex for short, is a pattern describing a certain amount of text. The most simple regex is a regular string, like hello which will match the literal text hello.

There are some characters with special meanings. For example . will match any character (except line break). As an example ip.d will match ipad, ipod, ip@d etc.

If we want to only match ipod or ipad we can use ip[ao]d.

1.1. Brackets []

Brackets ([ and ]) are used for specifying which characters are allowed. It is similar to an or operator. As an example:

  • [abc]: a, b or c
  • [a-c]: a, b or c
  • [^a-c]: anything but a, b or c
  • [0-9a-fA-F]: any number or any letter between a and f or between A and F. This represents hexadecimal

1.1. Position

It is possible to specify matches based on the position with:

  • ^: Start
  • $: End

For example ^ab will only match the first ab in ab abc ab (Domain LogoDemo)

1.2. Classes

There are certain keywords that will match different types of characters. For example \d will match any digit (any of those 0123456789). Those classes can be inversed with capital letters. As an example \D will match any character except a digit.

The main classes available are:

regexdescriptionexampleinverse
\dDigitsDomain Logod_example\D
\wAlphanumeric and/or _Domain Logow_example\W
\sWhitespaces and equivalents (like tabs)Domain Logos_example\S

1.3. Quantifiers

It is possible to specify how many times a letter or group (defined in the next section) should appear.

As an example we want the letter a + the letter b which should appear different number of times:

regexdescriptionexample
ab*a + b zero times or moreDomain Logoexample_ab*
ab+a + b one time or moreDomain Logoexample_ab+
ab?a + optional b (zero or one time)Domain Logoexample_ab?
ab{2}a + b exactly 2 timesDomain Logoexample_ab{2}
ab{2,5}a + b between 2 and 5 times (inclusive)Domain Logoexample_ab{2,5}

1.4. Groups

It is possible to define groups in regex. They are use to capture part of the text and are declared with parenthesis (). We can use any regex inside the group. For example if we want to math hello:

  • (hello): capturing group (Domain LogoDemo)
  • (?:hello): non-capturing group
  • (?P<name>hello): same but giving the group a name (name in this case)
  • (hello|bye): or operator in groups.

1.5. Boundaries

It is possible to declare a boundary with \b (with the inverse \B). A boundary means that is surrounded with a non word type character. It is similar to the start (^) and end ($).

It is easier to understand it with some examples:

regextext: Impossible to dotext: I'm possible :)text: possiblerexample
possibleImpossible to doI’m possible :)possiblerDomain Logoboundaries_1
\bpossibleImpossible to doI’m possible :)possiblerDomain Logoboundaries_2
\bpossible\bImpossible to doI’m possible :)possiblerDomain Logoboundaries_3
possible\BImpossible to doI’m possible :)possiblerDomain Logoboundaries_4
\BpossibleImpossible to doI’m possible :)possiblerDomain Logoboundaries_5

1.6. Flags

It is possible to change the behaviour of the regex with some flags:

  • (?i): case insensitive. For example (?i)hello will match HELLO (Domain LogoDemo)
  • (?m): multi line. When this is active the anchor ^ will be start of line instead of start of text (Domain LogoDemo)
  • (?s): single line.

1.7. Greedy vs lazy

Imagine that we have the following text:

<h1> Title </h1>
<p> Hello </p>

If we want to capture the start of an html tag (<h1> and <p> in this case) one might be tempted to use something like <\w+>. However that will match the whole line (for both lines) since we are specifying < followed by any text and ended with > (Domain LogoDemo). If we want to stop it at the first appearence of > we can use the non greedy regex <.+?> (Domain LogoDemo).

Some better alternatives are <\w+> or <[^<>]+>.

To sum up:

  • <.+>: greedy
  • <.+?>: non greedy

It can also be used with other quantifiers that are not the +.

2. Regex with python

2.1. re package

To use regex in python you first need to import the re package with

import re

Then to check if there is a match in a text with one regex using out = re.match(regex, text):

out = re.match(r"hello", "hello world")

# out is a class if we only want to know if there is a match we can use `bool(out)`

When declaring regexs in python it is a good practise to declare them as raw strings with r"regex"

We have use re.match since it’s faster. However this only looks at the begining of the string. To do a more in deep search we can use re.search.

Let’s compare them by doing checking different regexs against the following text:

text = """hello world
bye world"""

The results would be:

regexre.matchre.search
r"hello"✔️✔️
r"world"✔️
r"bye"✔️

2.3. re.compile

If we plan to reuse the same regex we can compile it with regex = re.compile(regex):

regex = re.compile(r"hello")

And then we can use both match and/or search with:

regex.match("hello world")

regex.search("hello world")

The advantage is that a compiled regex is faster than a non-compiled one.

2.4. Working with groups

Imagine that we have the following file names:

filenames = [
    "Informe mensual Indexa Capital - AABBCCDD - 2020-01.pdf",
    "Informe mensual Indexa Capital - XXYYZZWW - 2020-01.pdf",
    "Informe mensual Indexa Capital - AABBCCDD - 2021-03.pdf",
    "Informe mensual Indexa Capital - XXYYZZWW - 2020-12.pdf",
]

2.4.1. Unnamed groups

It looks more or less clear that we can extract some data from those names:

regex = re.compile(r"Informe mensual Indexa Capital - (\w{8}) - (\d{4})-(\d{2}).pdf")

out = regex.match(filenames[0])
out.groups()

> (‘AABBCCDD’, ‘2020’, ‘01’)

2.4.2. Named groups

We can use named groups to better extract the data

regex = re.compile(r"Informe mensual Indexa Capital - (?P<account>\w{8}) - (?P<year>\d{4})-(?P<month>\d{2}).pdf")

out = regex.match(filenames[1])
out.groupdict()

> {'account': 'XXYYZZWW', 'year': '2020', 'month': '01'}

Clearly we can use that to rename the files doing something like:

out_pattern = "{year}_{month} Indexa capital {account}.pdf"

for filename in filenames:
    groups = regex.match(filename).groupdict()
    new_name = out_pattern.format(**groups)
  
    print(new_name)

2020_01 Indexa capital AABBCCDD.pdf 2020_01 Indexa capital XXYYZZWW.pdf 2021_03 Indexa capital AABBCCDD.pdf 2020_12 Indexa capital XXYYZZWW.pdf

3. Regex with pandas

There are multiple functions in pandas that can be used with regex. They are:

  1. count: count occurrences that match a regex
  2. replace: replace based on a regex
  3. match: same as re.match
  4. contains: same as re.search
  5. findall: same as re.findall
  6. split / rsplit: splits text based on a regex
  7. extract: extract groups using a regex

In order to explain them let’s create a really dummy dataframe with:

import pandas as pd

queries = [
    "select *  FROM  users",
    "SELECT count(id) AS cnt FROM public.users",
    "SELECT * FROM orders",
    "SELECT city, count(1) FROM orders GROUP BY 1",
]
df = pd.DataFrame(queries, columns=["query"])

3.1. series.str.count

In this case we want to count the number of queries that come from the users table. It doesn’t matter if the schema is specified or not. It can be done with:

df["query"].str.count(r"\s+FROM\s+(\w+\.)?users\b")

1 1 0 0

3.2. series.str.replace

We can use this for deleting the renamings (like AS cnt):

df["query"].str.replace(r"\s+AS\s+\w+\b", "", regex=True)

select * FROM users SELECT count(id) FROM public.users SELECT * FROM orders SELECT city, count(1) FROM orders GROUP BY 1

3.3. series.str.match

In this case we can look which queries are selecting all columns (SELECT * FROM):

df["query"].str.match(r"(?i)(SELECT)\s+\*\s+(?i)(FROM)")

True False True False

3.4. series.str.contains

Similar to the replace example, let’s get the queries that have a rename AS x:

df["query"].str.contains(r"\s+AS\s+\w+")

False True False False

3.5. series.str.findall

We can use this for getting a list with all words that have between 5 and 8 letters:

df["query"].str.findall(r"\w{5,8}")

[select, users] [SELECT, count, public, users] [SELECT, orders] [SELECT, count, orders, GROUP]

Note that the output is one list per row.

3.6. series.str.split / series.str.rsplit

In this case let’s split the query in 2 parts, all before the FROM and all after it:

df["query"].str.split(r"\s+(?i)from\s+")

[select *, users] [SELECT count(id) AS cnt, public.users] [SELECT *, orders] [SELECT city, count(1), orders GROUP BY 1]

3.7. series.str.extract

With this function we can extract data from groups. The output of this function is a DataFrame where each group is a column.

regex = r"(?i)SELECT\s+(?:.+)\s+(?i)FROM\s+(?P<schema>\w+\.)?(?P<table>\w+)\b"
df["query"].str.extract(regex)

The output in this case is:

schematable
Nanusers
public.users
Nanorders
Nanorders

The regex used in this example might be a little bit complex, let’s analyze it by parts:

  • (?i)SELECT: SELECT statement case insensitive
  • \s+: one or more spaces
  • (?:.+): any number of characters in a non-capturing group (so that it does not appear as column in the output)
  • \s+: one or more spaces
  • (?i)FROM: FROM statement case insensitive
  • \s+: one or more spaces
  • (?:(?P<schema>\w+)\.)?: Optional non-capturing group composed with:
    • (?P<schema>\w+): Capturing group named schema made only with words
    • \.: a dot (.)
  • (?P<table>\w+)\b: Capturing group named table made only with words and that has a right boundary

4. Regex with redshift

In redshift there are 4 string functions that work with regular expressions (regex). They are:

FunctionDescription
Domain LogoREGEXP_COUNTReturns an integer that indicates the number of times the pattern occurs in the string. If no match is found, then the function returns 0.
Domain LogoREGEXP_INSTRReturns an integer that indicates the beginning position or ending position of the matched substring. If no match is found, then the function returns 0.
Domain LogoREGEXP_REPLACEReplaces every occurrence of the pattern with the specified string.
Domain LogoREGEXP_SUBSTRReturns the characters extracted from a string by searching for a regular expression pattern.

There is also the SIMILAR operator for using regexs inside WHERE statements. For example:

SELECT count(*) FROM event WHERE name SIMILAR TO '%(Ring|Die)%';

-- It is the same as:
SELECT count(*) FROM event where name LIKE '%Ring%' OR name LIKE '%Die%';