Using regexes with Python
1. Regex basics
A regular expression, or regex for short, is a pattern describing a certain amount of text.
The simplest regex is a plain string, like `hello`, which will match the literal text `hello`.
There are some characters with special meanings. For example `.` will match any character (except a line break).
As an example, `ip.d` will match `ipad`, `ipod`, `ip@d`, etc.
If we only want to match `ipod` or `ipad` we can use `ip[ao]d`.
1.1. Brackets []
Brackets (`[` and `]`) are used to specify which characters are allowed. They work like an or operator. As an example:

* `[abc]`: `a`, `b` or `c`
* `[a-c]`: `a`, `b` or `c`
* `[^a-c]`: anything but `a`, `b` or `c`
* `[0-9a-fA-F]`: any digit or any letter between `a` and `f` or between `A` and `F`. This represents a hexadecimal character
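As a quick sketch of how character classes behave in Python (the sample strings are made up for illustration):

```python
import re

# `ip[ao]d` only allows `a` or `o` between `ip` and `d`
print(re.findall(r"ip[ao]d", "ipad ipod ip@d"))  # ['ipad', 'ipod']

# `[0-9a-fA-F]+` matches runs of hexadecimal characters
print(re.findall(r"[0-9a-fA-F]+", "#1A2B3C vs #GHIJKL"))  # ['1A2B3C']
```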
1.2. Position
It is possible to specify matches based on the position with:
* `^`: Start
* `$`: End

For example `^ab` will only match the first `ab` in `ab abc ab`.
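This can be checked with a minimal Python sketch (the sample string comes from the example above):

```python
import re

text = "ab abc ab"

print(re.findall(r"ab", text))   # ['ab', 'ab', 'ab'] -> every occurrence
print(re.findall(r"^ab", text))  # ['ab'] -> only the one at the start
```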
1.3. Classes
There are certain keywords that will match different types of characters.
For example `\d` will match any digit (any of `0123456789`).
Those classes can be inverted with capital letters. As an example `\D` will match any character except a digit.
The main classes available are:
regex | description | inverse |
---|---|---|
`\d` | Digits | `\D` |
`\w` | Alphanumeric characters and/or `_` | `\W` |
`\s` | Whitespace and equivalents (like tabs) | `\S` |
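A short sketch of those classes in Python (the sample text is made up for illustration):

```python
import re

text = "Order 66 shipped on 2020-01-31"

print(re.findall(r"\d+", text))  # ['66', '2020', '01', '31']
print(re.findall(r"\w+", text))  # ['Order', '66', 'shipped', 'on', '2020', '01', '31']
print(re.findall(r"\D+", text))  # ['Order ', ' shipped on ', '-', '-']
```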
1.4. Quantifiers
It is possible to specify how many times a letter or group (defined in the next section) should appear.
As an example, we want the letter `a` followed by the letter `b`, where `b` appears a different number of times:
regex | description |
---|---|
`ab*` | `a` + `b` zero times or more |
`ab+` | `a` + `b` one time or more |
`ab?` | `a` + optional `b` (zero or one time) |
`ab{2}` | `a` + `b` exactly 2 times |
`ab{2,5}` | `a` + `b` between 2 and 5 times (inclusive) |
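For instance, a minimal check of `ab{2,5}` (the word list is just for illustration):

```python
import re

# `ab{2,5}` requires an `a` followed by 2 to 5 `b`s
for word in ["a", "ab", "abb", "abbb", "abbbbbb"]:
    print(word, bool(re.fullmatch(r"ab{2,5}", word)))

# a False
# ab False
# abb True
# abbb True
# abbbbbb False
```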
1.5. Groups
It is possible to define groups in regex.
They are used to capture part of the text and are declared with parentheses (`(` and `)`).
We can use any regex inside the group. For example, if we want to match `hello`:

* `(hello)`: capturing group
* `(?:hello)`: non-capturing group
* `(?P<name>hello)`: same, but giving the group a name (`name` in this case)
* `(hello|bye)`: the or operator (`|`) inside a group
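A small sketch of a named group in Python (the sample sentence is made up):

```python
import re

out = re.search(r"(?P<greeting>hello|bye) world", "she said hello world")

print(out.group(0))           # 'hello world' (the whole match)
print(out.group("greeting"))  # 'hello' (only the named group)
```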
1.6. Boundaries
It is possible to declare a word boundary with `\b` (with the inverse `\B`).
A boundary is a position where a word character sits next to a non-word character (or next to the start or end of the text).
It is similar to the start (`^`) and end (`$`) anchors.
It is easier to understand with some examples:
regex | text: `Impossible to do` | text: `I'm possible :)` | text: `possibler` |
---|---|---|---|
`possible` | ✔️ | ✔️ | ✔️ |
`\bpossible` | ❌ | ✔️ | ✔️ |
`\bpossible\b` | ❌ | ✔️ | ❌ |
`possible\B` | ❌ | ❌ | ✔️ |
`\Bpossible` | ✔️ | ❌ | ❌ |
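The table can be reproduced with a short Python sketch:

```python
import re

texts = ["Impossible to do", "I'm possible :)", "possibler"]
patterns = [r"possible", r"\bpossible", r"\bpossible\b", r"possible\B", r"\Bpossible"]

for pattern in patterns:
    print(pattern, [bool(re.search(pattern, text)) for text in texts])

# possible     [True, True, True]
# \bpossible   [False, True, True]
# \bpossible\b [False, True, False]
# possible\B   [False, False, True]
# \Bpossible   [True, False, False]
```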
1.7. Flags
It is possible to change the behaviour of a regex with some flags. The most common ones are:

* `i` (ignore case): case-insensitive matching. It can also be set inline with `(?i)`.
* `m` (multiline): `^` and `$` match the start and end of each line instead of the whole text.
* `s` (dotall): `.` also matches line breaks.
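In Python these flags can be passed to the `re` functions or set inline. A minimal sketch:

```python
import re

text = "Hello World"

print(bool(re.search(r"hello", text)))                 # False (case sensitive)
print(bool(re.search(r"hello", text, re.IGNORECASE)))  # True (flag argument)
print(bool(re.search(r"(?i)hello", text)))             # True (inline flag)
```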
1.8. Greedy vs lazy
Imagine that we have the following text:
<h1> Title </h1>
<p> Hello </p>
If we want to capture the start of an html tag (`<h1>` and `<p>` in this case) one might be tempted to use something like `<.+>`.
However, that will match the whole line (for both lines) since we are specifying `<` followed by any text and ending with `>`.
If we want to stop at the first appearance of `>` we can use the non-greedy (lazy) regex `<.+?>`.
Some better alternatives are `<\w+>` or `<[^<>]+>`.
To sum up:

* `<.+>`: greedy
* `<.+?>`: non greedy

Laziness can also be used with quantifiers other than `+`.
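A quick comparison in Python using the text above:

```python
import re

text = """<h1> Title </h1>
<p> Hello </p>"""

print(re.findall(r"<.+>", text))   # ['<h1> Title </h1>', '<p> Hello </p>'] (greedy)
print(re.findall(r"<.+?>", text))  # ['<h1>', '</h1>', '<p>', '</p>'] (lazy)
print(re.findall(r"<\w+>", text))  # ['<h1>', '<p>'] (only opening tags)
```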
2. Regex with Python
2.1. The `re` package
To use regex in Python you first need to import the `re` package with:
import re
Then, to check if a regex matches a text, you can use `out = re.match(regex, text)`:
out = re.match(r"hello", "hello world")
# out is a re.Match object (or None); to only check if there is a match use `bool(out)`
When declaring regexes in Python it is good practice to declare them as raw strings, like `r"regex"`.
2.2. `match` vs `search`
We have used `re.match` since it's faster. However, it only looks at the beginning of the string.
To search anywhere in the string we can use `re.search`.
Let's compare them by checking different regexes against the following text:
text = """hello world
bye world"""
The results would be:
regex | re.match | re.search |
---|---|---|
r"hello" | ✔️ | ✔️ |
r"world" | ❌ | ✔️ |
r"bye" | ❌ | ✔️ |
2.3. re.compile
If we plan to reuse the same regex we can compile it with `regex = re.compile(regex)`:
regex = re.compile(r"hello")
Then we can use both `match` and/or `search` with:
regex.match("hello world")
regex.search("hello world")
The advantage is that a compiled regex is faster than a non-compiled one.
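One rough way to check this yourself is with `timeit` (a sketch; the exact numbers will depend on the machine and the pattern):

```python
import re
import timeit

text = "hello world " * 100

# Compiling once and reusing the pattern object
compiled = re.compile(r"hello")

print(timeit.timeit(lambda: re.search(r"hello", text), number=100_000))
print(timeit.timeit(lambda: compiled.search(text), number=100_000))
```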
2.4. Working with groups
Imagine that we have the following file names:
filenames = [
"Informe mensual Indexa Capital - AABBCCDD - 2020-01.pdf",
"Informe mensual Indexa Capital - XXYYZZWW - 2020-01.pdf",
"Informe mensual Indexa Capital - AABBCCDD - 2021-03.pdf",
"Informe mensual Indexa Capital - XXYYZZWW - 2020-12.pdf",
]
2.4.1. Unnamed groups
It looks more or less clear that we can extract some data from those names:
regex = re.compile(r"Informe mensual Indexa Capital - (\w{8}) - (\d{4})-(\d{2})\.pdf")
out = regex.match(filenames[0])
out.groups()
> ('AABBCCDD', '2020', '01')
2.4.2. Named groups
We can use named groups to better extract the data:
regex = re.compile(r"Informe mensual Indexa Capital - (?P<account>\w{8}) - (?P<year>\d{4})-(?P<month>\d{2})\.pdf")
out = regex.match(filenames[1])
out.groupdict()
>
{'account': 'XXYYZZWW', 'year': '2020', 'month': '01'}
We can then use this to rename the files with something like:
out_pattern = "{year}_{month} Indexa capital {account}.pdf"

for filename in filenames:
    groups = regex.match(filename).groupdict()
    new_name = out_pattern.format(**groups)
    print(new_name)

2020_01 Indexa capital AABBCCDD.pdf
2020_01 Indexa capital XXYYZZWW.pdf
2021_03 Indexa capital AABBCCDD.pdf
2020_12 Indexa capital XXYYZZWW.pdf
3. Regex with pandas
There are multiple functions in pandas that can be used with regex. They are:
* `count`: count occurrences that match a regex
* `replace`: replace based on a regex
* `match`: same as `re.match`
* `contains`: same as `re.search`
* `findall`: same as `re.findall`
* `split` / `rsplit`: split text based on a regex
* `extract`: extract groups using a regex
In order to explain them, let's create a dummy dataframe with:
import pandas as pd
queries = [
"select * FROM users",
"SELECT count(id) AS cnt FROM public.users",
"SELECT * FROM orders",
"SELECT city, count(1) FROM orders GROUP BY 1",
]
df = pd.DataFrame(queries, columns=["query"])
3.1. series.str.count
In this case we want to `count`, for each query, how many times it reads from the `users` table.
It doesn't matter if the schema is specified or not. It can be done with:
df["query"].str.count(r"\s+FROM\s+(\w+\.)?users\b")
1 1 0 0
3.2. series.str.replace
We can use this for deleting column renamings (like `AS cnt`):
df["query"].str.replace(r"\s+AS\s+\w+\b", "", regex=True)
select * FROM users
SELECT count(id) FROM public.users
SELECT * FROM orders
SELECT city, count(1) FROM orders GROUP BY 1
3.3. series.str.match
In this case we can check which queries select all columns (`SELECT * FROM`):
df["query"].str.match(r"(?i)(SELECT)\s+\*\s+(?i)(FROM)")
True False True False
3.4. series.str.contains
Similar to the `replace` example, let's get the queries that have a rename (`AS x`):
df["query"].str.contains(r"\s+AS\s+\w+")
False True False False
3.5. series.str.findall
We can use this for getting a list with all words that have between 5 and 8 letters:
df["query"].str.findall(r"\w{5,8}")
[select, users]
[SELECT, count, public, users]
[SELECT, orders]
[SELECT, count, orders, GROUP]
Note that the output is one list per row.
3.6. series.str.split / series.str.rsplit
In this case let's split the query in two parts: everything before the `FROM` and everything after it:
df["query"].str.split(r"\s+(?i)from\s+")
[select *, users]
[SELECT count(id) AS cnt, public.users]
[SELECT *, orders]
[SELECT city, count(1), orders GROUP BY 1]
3.7. series.str.extract
With this function we can extract data from groups. The output of this function is a DataFrame where each group is a column.
regex = r"(?i)SELECT\s+(?:.+)\s+FROM\s+(?P<schema>\w+\.)?(?P<table>\w+)\b"
df["query"].str.extract(regex)
The output in this case is:
schema | table |
---|---|
NaN | users |
public. | users |
NaN | orders |
NaN | orders |
The regex used in this example might be a little complex, so let's analyze it by parts:

* `(?i)`: inline flag that makes the whole regex case insensitive
* `SELECT`: the `SELECT` keyword
* `\s+`: one or more spaces
* `(?:.+)`: any number of characters in a non-capturing group (so that it does not appear as a column in the output)
* `\s+`: one or more spaces
* `FROM`: the `FROM` keyword
* `\s+`: one or more spaces
* `(?P<schema>\w+\.)?`: optional capturing group named `schema` made of word characters followed by a dot (`\.`)
* `(?P<table>\w+)\b`: capturing group named `table` made of word characters and ending at a word boundary
4. Regex with Redshift
In Redshift there are 4 string functions that work with regular expressions. They are:
Function | Description |
---|---|
REGEXP_COUNT | Returns an integer that indicates the number of times the pattern occurs in the string. If no match is found, then the function returns 0. |
REGEXP_INSTR | Returns an integer that indicates the beginning position or ending position of the matched substring. If no match is found, then the function returns 0. |
REGEXP_REPLACE | Replaces every occurrence of the pattern with the specified string. |
REGEXP_SUBSTR | Returns the characters extracted from a string by searching for a regular expression pattern. |
There is also the `SIMILAR TO` operator for using regex-like patterns inside `WHERE` clauses. For example:
SELECT count(*) FROM event WHERE name SIMILAR TO '%(Ring|Die)%';
-- It is the same as:
SELECT count(*) FROM event WHERE name LIKE '%Ring%' OR name LIKE '%Die%';