Regular Expressions in Python 3

^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character (except a newline)
\s Matches a whitespace character
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a character one or more times
+? Repeats a character one or more times (non-greedy)
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
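
As a quick check of the table above, here is a minimal sketch that applies a few of these patterns. The sample string and the variable names are made up for illustration and are not part of the course material.

```python
import re

# Hypothetical sample text, used only for illustration
sample = 'X-Priority: 3 apples, 12 pears'

# ^ anchors the match to the beginning of the string
starts_with_x = re.search(r'^X-', sample) is not None

# \S+ matches a run of one or more non-whitespace characters
first_word = re.findall(r'^\S+', sample)

# [0-9]+ matches one or more digits
numbers = re.findall(r'[0-9]+', sample)

# [aeiou] matches a single character from the listed set
vowels = re.findall(r'[aeiou]', 'pears')

print(starts_with_x)  # True
print(first_word)     # ['X-Priority:']
print(numbers)        # ['3', '12']
print(vowels)         # ['e', 'a']
```

The patterns are written as raw strings (r'...') so that Python passes backslashes such as \S to the re module unchanged.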

Import the library

Use “import re” to import the regular expression library.

re.search()

Use re.search() to check whether a string matches a regular expression, similar to using the find() method of strings.
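
A small sketch of how this might look in practice, assuming a made-up sample line (the address is a placeholder). re.search() returns a match object when the pattern is found and None otherwise, so the result can be tested directly in an if statement.

```python
import re

# Hypothetical sample line, used only for illustration
line = 'From: someone@example.com'

# A match object is truthy, None is falsy
match = re.search(r'^From:', line)
if match:
    print('Matched from position', match.start(), 'to', match.end())

# Roughly equivalent checks with plain string methods
found_with_find = line.find('From:') >= 0
found_with_startswith = line.startswith('From:')
print(found_with_find, found_with_startswith)  # True True

# When the pattern is not found, re.search() returns None
no_match = re.search(r'^To:', line)
print(no_match)  # None
```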

re.findall()

Use re.findall() to extract the portions of a string that match a regular expression, similar to combining find() with slicing. It returns a list of the matching strings.
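
To make the comparison with find() and slicing concrete, here is a short sketch on a made-up string (the text and variable names are assumptions for illustration only).

```python
import re

# Hypothetical sample text, used only for illustration
text = 'Room 12, floor 3'

# With find() and slicing, each piece has to be located by hand
start = text.find('Room ') + len('Room ')
end = text.find(',', start)
room_by_slicing = text[start:end]

# With findall(), one pattern returns every matching piece as a list
all_numbers = re.findall(r'[0-9]+', text)

print(room_by_slicing)  # 12
print(all_numbers)      # ['12', '3']

# When nothing matches, findall() returns an empty list
nothing = re.findall(r'[0-9]+', 'no digits here')
print(nothing)          # []
```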

Extracting Data

import re


fileHandler = open('mbox-short.txt')

for line in fileHandler:
    line = line.rstrip()
    # Search the lines that contain 'stephen'
    if re.search('stephen', line):
        # is the same as:
        # if line.find('stephen') >= 0:
        print('Has "stephen": ', line)
    
    # Search the lines that start with 'From:'
    if re.search('^From:', line):
        # is the same as:
        # if line.startswith('From:'):
        print('Starts with "From:"', line)

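    # Search the lines that start with 'X-', followed by
    # one or more non-whitespace characters and a colon
    if re.search(r'^X-\S+:', line):
        print('Starts with "X-" and has a colon before any whitespace', line)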

x = 'My 2 favourite numbers are 9 and 32.'
# Find all numbers [0-9] in x, '+' means one or more times
# findall() returns a list. Its elements are strings, not numbers.
y = re.findall('[0-9]+', x)
print(y) # ['2', '9', '32']

Greedy and Non-Greedy Matching

The repeat characters (* and +) are greedy: they match the longest possible string.

a = 'From: Using : character'
b_greedy = re.findall('^F.+:', a)
print(b_greedy) # ['From: Using :']

b_not_greedy = re.findall('^F.+?:', a)
print(b_not_greedy) # ['From:']

The regular expression looks for text that starts with “F” and ends with “:”. Because “+” is greedy, it matches as much as possible.
Two substrings fit the pattern, “From:” and “From: Using :”. The greedy version returns the longest one, “From: Using :”.
The second example, b_not_greedy, uses “+?”, which is non-greedy, so it returns the shortest match, “From:”.
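
The same behaviour applies to “*” and “*?”. The sketch below uses a made-up string with angle-bracket tags (an assumption chosen only because it contains several possible end points) to show that the non-greedy form can also produce several short matches instead of one long one.

```python
import re

# Hypothetical sample text, used only for illustration
html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* runs to the last '>' in the string
greedy = re.findall(r'<.*>', html)

# Non-greedy: .*? stops at the first '>' it can, so each tag matches separately
non_greedy = re.findall(r'<.*?>', html)

print(greedy)      # ['<b>bold</b> and <i>italic</i>']
print(non_greedy)  # ['<b>', '</b>', '<i>', '</i>']
```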

Parentheses

c = 'From user@example.com Sat Jan 5 11:16 pm'  # placeholder address
# Use parentheses to extract only the inner part:
# \S+@\S+ is a run of non-whitespace characters on each side of '@'
b_parentheses = re.findall(r'^From (\S+@\S+)', c)
print(b_parentheses) # ['user@example.com']
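
Parentheses can also be placed around a smaller part of the pattern, or used more than once. The following sketch assumes a made-up line with a placeholder address; with one group, findall() returns a list of strings, and with several groups it returns a list of tuples.

```python
import re

# Hypothetical sample line with a placeholder address
line = 'From user@example.com Sat Jan 5 11:16 pm'

# One group: only the part after '@' is extracted
domain = re.findall(r'^From \S+@(\S+)', line)

# Two groups: each match becomes a tuple with one item per group
user_and_domain = re.findall(r'^From (\S+)@(\S+)', line)

print(domain)           # ['example.com']
print(user_and_domain)  # [('user', 'example.com')]
```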

Tips

[^ ] matches a single non-blank character, that is, any character other than a space.
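
A short sketch of this tip on made-up strings (the sample text and placeholder address are assumptions). It also shows how [^ ] differs from \S: the set excludes only the space character, so it still matches a tab.

```python
import re

# Hypothetical sample line with a placeholder address
line = 'From user@example.com Sat Jan 5 11:16 pm'

# [^ ]+ matches a run of characters up to the next space
after_at = re.findall(r'@([^ ]+)', line)
print(after_at)  # ['example.com']

# [^ ] excludes only the space, while \S excludes all whitespace
not_space = re.findall(r'[^ ]', 'a\tb c')
not_whitespace = re.findall(r'\S', 'a\tb c')
print(not_space)       # ['a', '\t', 'b', 'c']
print(not_whitespace)  # ['a', 'b', 'c']
```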
