Python course 3, week 2

Regular Expressions in Python 3

^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character (except a newline, by default)
\s Matches a whitespace character
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a character one or more times
+? Repeats a character one or more times (non-greedy)
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
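
As a quick illustration of several symbols from the table, here is a minimal sketch. The sample string is hypothetical and chosen only to exercise the patterns.

```python
import re

sample = 'X-DSPAM-Confidence: 0.8475'  # hypothetical sample line

# ^ anchors at the start; \S+ is one or more non-whitespace characters
starts_with_header = re.search(r'^X-\S+:', sample) is not None

# [0-9.]+ is one or more digits or dots; the parentheses mark what to extract
numbers = re.findall(r'^X-\S+: ([0-9.]+)', sample)

# [aeiou] matches a single vowel each time
vowels = re.findall(r'[aeiou]', 'regular')

# $ anchors at the end of the line
ends_with_digit = re.search(r'[0-9]$', sample) is not None

print(starts_with_header)  # True
print(numbers)             # ['0.8475']
print(vowels)              # ['e', 'u', 'a']
print(ends_with_digit)     # True
```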

Import the library

Use “import re” to load Python's regular expression library.

re.search()

Use re.search() to see if a string matches a regular expression, similar to using the find() method of strings. It returns a match object when the pattern is found and None otherwise.
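
A minimal sketch of that comparison, using a hypothetical sample line:

```python
import re

line = 'From: alice@example.com'  # hypothetical sample line

match = re.search(r'From:', line)       # a match object, which is truthy
missing = re.search(r'Subject:', line)  # None, which is falsy

print(match is not None)        # True
print(line.find('From:') >= 0)  # True, the string-method equivalent
print(missing)                  # None
print(line.find('Subject:'))    # -1
```

Because a match object is truthy and None is falsy, the result of re.search() can be used directly as the condition of an if statement.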

re.findall()

Use re.findall() to extract the portions of a string that match your regular expression, similar to a combination of find() and slicing. It returns the matches as a list of strings.
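
To make the find()-plus-slicing comparison concrete, the sketch below extracts the same piece of text both ways. The sample line and variable names are hypothetical.

```python
import re

line = 'From alice@example.com Sat Jan 5'  # hypothetical sample line

# find() plus slicing: locate the '@', then the next space, then slice between
at_pos = line.find('@')
space_pos = line.find(' ', at_pos)
host_sliced = line[at_pos + 1:space_pos]

# findall(): describe the same piece of text with a pattern instead
host_matched = re.findall(r'@(\S+)', line)

print(host_sliced)   # example.com
print(host_matched)  # ['example.com']
```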

Extracting Data

import re

with open('mbox-short.txt') as fileHandler:
    for line in fileHandler:
        line = line.rstrip()

        # Search for lines that contain 'stephen'
        if re.search(r'stephen', line):
            # is the same as:
            # if line.find('stephen') >= 0:
            print('Has "stephen":', line)

        # Search for lines that start with 'From:'
        if re.search(r'^From:', line):
            # is the same as:
            # if line.startswith('From:'):
            print('Starts with "From:"', line)

        # Search for lines that start with 'X-', followed by one or more
        # non-whitespace characters and a colon
        if re.search(r'^X-\S+:', line):
            print('Starts with "X-" and has a colon:', line)

re.findall() returns every match in a string as a list:

import re

x = 'My 2 favourite numbers are 9 and 32.'
# Find all numbers [0-9] in x; '+' means one or more times
# findall() returns a list. Its elements are strings, not numbers.
y = re.findall(r'[0-9]+', x)
print(y)  # ['2', '9', '32']
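
Since findall() returns strings, the results need converting before any arithmetic. A minimal sketch, with hypothetical variable names:

```python
import re

x = 'My 2 favourite numbers are 9 and 32.'

# Convert each matched string to an int before doing arithmetic
numbers = [int(s) for s in re.findall(r'[0-9]+', x)]
total = sum(numbers)

print(numbers)  # [2, 9, 32]
print(total)    # 43
```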

Greedy and Non-Greedy Matching

The repeat characters (* and +) are greedy: they match the longest string possible.

a = 'From: Using : character'
b_greedy = re.findall('^F.+:', a)
print(b_greedy) # ['From: Using :']

b_not_greedy = re.findall('^F.+?:', a)
print(b_not_greedy) # ['From:']

The regular expression looks for text that starts with “F” and ends with “:”. Two substrings fit that description: “From:” and “From: Using :”. Because “+” is greedy, it matches as many characters as possible, so the first example returns the longer one.
The second example, b_not_greedy, uses “+?”, which is non-greedy. It stops at the first “:” and returns the shorter match.
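
The difference is easier to see when a string contains several possible end points. The sketch below uses a hypothetical string with angle-bracket tags.

```python
import re

text = '<b>bold</b> and <i>italic</i>'  # hypothetical sample string

# Greedy: '.+' runs to the last '>' in the string
greedy = re.findall(r'<.+>', text)

# Non-greedy: '.+?' stops at the first '>' it can, so each tag matches alone
non_greedy = re.findall(r'<.+?>', text)

print(greedy)      # ['<b>bold</b> and <i>italic</i>']
print(non_greedy)  # ['<b>', '</b>', '<i>', '</i>']
```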

Parentheses

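# Note: the e-mail address in this line is obscured in the source ('[email protected]')
c = 'From [email protected] Sat Jan 5 11:16 pm'
# Use parentheses to extract only the inner part (the address, without 'From ')
b_parentheses = re.findall(r'^From (\S+@\S+)', c)
print(b_parentheses) # ['[email protected]']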
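
The sketch below shows how moving the parentheses changes what findall() returns. The address in the sample line is a hypothetical placeholder, not the one from the original data.

```python
import re

# Hypothetical sample line; the address is a placeholder
c = 'From alice@example.com Sat Jan 5 11:16 pm'

# No parentheses: the whole match is returned
whole = re.findall(r'^From \S+@\S+', c)

# Parentheses around the address: only the address is returned
address = re.findall(r'^From (\S+@\S+)', c)

# Parentheses around the part after '@': only the host is returned
host = re.findall(r'^From \S+@(\S+)', c)

print(whole)    # ['From alice@example.com']
print(address)  # ['alice@example.com']
print(host)     # ['example.com']
```

The full pattern still has to match in every case. The parentheses only select which part of the match is returned.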

Tips

[^ ]: matches a single non-blank character (any character except a space)
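
Note that [^ ] excludes only the space character, whereas \S excludes every kind of whitespace. A small sketch with a hypothetical string containing a tab shows the difference:

```python
import re

text = 'a b\tc'  # 'b' and 'c' are separated by a tab, not a space

not_space = re.findall(r'[^ ]+', text)      # a tab still counts as non-blank
not_whitespace = re.findall(r'\S+', text)   # a tab ends the match

print(not_space)       # ['a', 'b\tc']
print(not_whitespace)  # ['a', 'b', 'c']
```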