Using regexs with python

2020-04-05
Python regex



regex

Table of Contents

1. Regex basics

A regular expression, or regex for short, is a pattern describing a certain amount of text. The most simple regex is a regular string, like hello which will match the literal text hello.

There are some characters with special meanings. For example . will match any character (except line break). As an example ip.d will match ipad, ipod, ip@d etc.

If we want to only match ipod or ipad we can use ip[ao]d.

1.1. Brackets []

Brackets ([ and ]) are used for specifying which characters are allowed. It is similar to an or operator. As an example:

1.1. Position

It is possible to specify matches based on the position with:

For example ^ab will only match the first ab in ab abc ab (Demo)

1.2. Classes

There are certain keywords that will match different types of characters. For example \d will match any digit (any of those 0123456789). Those classes can be inversed with capital letters. As an example \D will match any character except a digit.

The main classes available are:

regex description example inverse
\d Digits d_example \D
\w Alphanumeric and/or _ w_example \W
\s Whitespaces and equivalents (like tabs) s_example \S

1.3. Quantifiers

It is possible to specify how many times a letter or group (defined in the next section) should appear.

As an example we want the letter a + the letter b which should appear different number of times:

regex description example
ab* a + b zero times or more example_ab*
ab+ a + b one time or more example_ab+
ab? a + optional b (zero or one time) example_ab?
ab{2} a + b exactly 2 times example_ab{2}
ab{2,5} a + b between 2 and 5 times (inclusive) example_ab{2,5}

1.4. Groups

It is possible to define groups in regex. They are use to capture part of the text and are declared with parenthesis (). We can use any regex inside the group. For example if we want to math hello:

1.5. Boundaries

It is possible to declare a boundary with \b (with the inverse \B). A boundary means that is surrounded with a non word type character. It is similar to the start (^) and end ($).

It is easier to understand it with some examples:

regex text: Impossible to do text: I'm possible :) text: possibler example
possible Impossible to do I'm possible :) possibler boundaries_1
\bpossible Impossible to do I'm possible :) possibler boundaries_2
\bpossible\b Impossible to do I'm possible :) possibler boundaries_3
possible\B Impossible to do I'm possible :) possibler boundaries_4
\Bpossible Impossible to do I'm possible :) possibler boundaries_5

1.6. Flags

It is possible to change the behaviour of the regex with some flags:

1.7. Greedy vs lazy

Imagine that we have the following text:

<h1> Title </h1>
<p> Hello </p>

If we want to capture the start of an html tag (<h1> and <p> in this case) one might be tempted to use something like <\w+>. However that will match the whole line (for both lines) since we are specifying < followed by any text and ended with > (demo). If we want to stop it at the first appearence of > we can use the non greedy regex <.+?> (demo).

Some better alternatives are <\w+> or <[^<>]+>.

To sum up:

It can also be used with other quantifiers that are not the +.

2. Regex with python

2.1. re package

To use regex in python you first need to import the re package with

import re

Then to check if there is a match in a text with one regex using out = re.match(regex, text):

out = re.match(r"hello", "hello world")

# out is a class if we only want to know if there is a match we can use `bool(out)`

When declaring regexs in python it is a good practise to declare them as raw strings with r"regex"

We have use re.match since it's faster. However this only looks at the begining of the string. To do a more in deep search we can use re.search.

Let's compare them by doing checking different regexs against the following text:

text = """hello world
bye world"""

The results would be:

regex re.match re.search
r"hello" ✔️ ✔️
r"world" ✔️
r"bye" ✔️

2.3. re.compile

If we plan to reuse the same regex we can compile it with regex = re.compile(regex):

regex = re.compile(r"hello")

And then we can use both match and/or search with:

regex.match("hello world")

regex.search("hello world")

The advantage is that a compiled regex is faster than a non-compiled one.

2.4. Working with groups

Imagine that we have the following file names:

filenames = [
    "Informe mensual Indexa Capital - AABBCCDD - 2020-01.pdf",
    "Informe mensual Indexa Capital - XXYYZZWW - 2020-01.pdf",
    "Informe mensual Indexa Capital - AABBCCDD - 2021-03.pdf",
    "Informe mensual Indexa Capital - XXYYZZWW - 2020-12.pdf",
]

2.4.1. Unnamed groups

It looks more or less clear that we can extract some data from those names:

regex = re.compile(r"Informe mensual Indexa Capital - (\w{8}) - (\d{4})-(\d{2}).pdf")

out = regex.match(filenames[0])
out.groups()
Out: ('AABBCCDD', '2020', '01')

2.4.2. Named groups

We can use named groups to better extract the data

regex = re.compile(r"Informe mensual Indexa Capital - (?P<account>\w{8}) - (?P<year>\d{4})-(?P<month>\d{2}).pdf")

out = regex.match(filenames[1])
out.groupdict()
Out: {'account': 'XXYYZZWW', 'year': '2020', 'month': '01'}

Clearly we can use that to rename the files doing something like:

out_pattern = "{year}_{month} Indexa capital {account}.pdf"

for filename in filenames:
    groups = regex.match(filename).groupdict()
    new_name = out_pattern.format(**groups)

    print(new_name)
  • 2020_01 Indexa capital AABBCCDD.pdf
  • 2020_01 Indexa capital XXYYZZWW.pdf
  • 2021_03 Indexa capital AABBCCDD.pdf
  • 2020_12 Indexa capital XXYYZZWW.pdf
  • 3. Regex with pandas

    There are multiple functions in pandas that can be used with regex. They are:

    1. count: count occurrences that match a regex
    2. replace: replace based on a regex
    3. match: same as re.match
    4. contains: same as re.search
    5. findall: same as re.findall
    6. split / rsplit: splits text based on a regex
    7. extract: extract groups using a regex

    In order to explain them let's create a really dummy dataframe with:

    import pandas as pd
    
    queries = [
        "select *  FROM  users",
        "SELECT count(id) AS cnt FROM public.users",
        "SELECT * FROM orders",
        "SELECT city, count(1) FROM orders GROUP BY 1",
    ]
    df = pd.DataFrame(queries, columns=["query"])
    

    3.1. series.str.count

    In this case we want to count the number of queries that come from the users table. It doesn't matter if the schema is specified or not. It can be done with:

    df["query"].str.count(r"\s+FROM\s+(\w+\.)?users\b")
    
    Out:
  • 1
  • 1
  • 0
  • 0
  • 3.2. series.str.replace

    We can use this for deleting the renamings (like AS cnt):

    df["query"].str.replace(r"\s+AS\s+\w+\b", "", regex=True)
    
    Out:
  • select * FROM users
  • SELECT count(id) FROM public.users
  • SELECT * FROM orders
  • SELECT city, count(1) FROM orders GROUP BY 1
  • 3.3. series.str.match

    In this case we can look which queries are selecting all columns (SELECT * FROM):

    df["query"].str.match(r"(?i)(SELECT)\s+\*\s+(?i)(FROM)")
    
    Out:
  • True
  • False
  • True
  • False
  • 3.4. series.str.contains

    Similar to the replace example, let's get the queries that have a rename AS x:

    df["query"].str.contains(r"\s+AS\s+\w+")
    
    Out:
  • False
  • True
  • False
  • False
  • 3.5. series.str.findall

    We can use this for getting a list with all words that have between 5 and 8 letters:

    df["query"].str.findall("\w{5,8}")
    
    Out:
  • [select, users]
  • [SELECT, count, public, users]
  • [SELECT, orders]
  • [SELECT, count, orders, GROUP]
  • Note that the output is one list per row.

    3.6. series.str.split / series.str.rsplit

    In this case let's split the query in 2 parts, all before the FROM and all after it:

    df["query"].str.split("\s+(?i)from\s+")
    
    Out:
  • [select *, users]
  • [SELECT count(id) AS cnt, public.users]
  • [SELECT *, orders]
  • [SELECT city, count(1), orders GROUP BY 1]
  • 3.7. series.str.extract

    With this function we can extract data from groups. The output of this function is a DataFrame where each group is a column.

    regex = r"(?i)SELECT\s+(?:.+)\s+(?i)FROM\s+(?P<schema>\w+\.)?(?P<table>\w+)\b"
    df["query"].str.extract(regex)
    

    The output in this case is:

    schema table
    Nan users
    public. users
    Nan orders
    Nan orders

    The regex used in this example might be a little bit complex, let's analyze it by parts:

    4. Regex with redshift

    In redshift there are 4 string functions that work with regular expressions (regex). They are:

    Function Description
    REGEXP_COUNT Returns an integer that indicates the number of times the pattern occurs in the string. If no match is found, then the function returns 0.
    REGEXP_INSTR Returns an integer that indicates the beginning position or ending position of the matched substring. If no match is found, then the function returns 0.
    REGEXP_REPLACE Replaces every occurrence of the pattern with the specified string.
    REGEXP_SUBSTR Returns the characters extracted from a string by searching for a regular expression pattern.

    There is also the SIMILAR operator for using regexs inside WHERE statements. For example:

    SELECT count(*) FROM event WHERE name SIMILAR TO '%(Ring|Die)%';
    
    -- It is the same as:
    SELECT count(*) FROM event where name LIKE '%Ring%' OR name LIKE '%Die%';