Object-Oriented Programming in Python

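Object-oriented programming is commonly abbreviated as OOP. C++ and Java are typical object-oriented languages. Python supports object-oriented features as well, although its object model differs from theirs in some respects.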

Basic concepts of object orientation

class Document():
    def __init__(self, title, author, context):
        print("init function called")
        self.title = title
        self.author = author
        self.__context = context  # a leading double underscore marks a private attribute

    def get_context_length(self):
        return len(self.__context)

    def intercept_context(self, length):
        self.__context = self.__context[:length]

harry_potter_book = Document('Harry Potter', 'J. K. Rowling', '... Forever Do not believe any thing is capable of thinking independently ...')


print(harry_potter_book.title)
print(harry_potter_book.author)
print(harry_potter_book.get_context_length())

harry_potter_book.intercept_context(10)

print(harry_potter_book.get_context_length())

print(harry_potter_book.__context)

########## Output ##########

init function called
Harry Potter
J. K. Rowling
77
10

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-5-b4d048d75003> in <module>()
22 print(harry_potter_book.get_context_length())
23
---> 24 print(harry_potter_book.__context)

AttributeError: 'Document' object has no attribute '__context'

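Here are a few concepts, stated loosely but in a way that is easy to grasp:

  • Class: a group of similar things, corresponding to class in Python
  • Object: one thing in that group, corresponding to an instance (object) created from a class, such as harry_potter_book in the code
  • Attribute: a static characteristic of an object, such as title, author, and __context in the code above
  • Function: a dynamic capability of an object, such as the intercept_context() function in the code above

A more rigorous definition: a class is a collection of objects that share the same attributes and functions.

As the code shows, class Document defines the Document class, which contains three functions.

__init__ is the constructor, a function that is called automatically whenever an object is created. The other two are ordinary functions that operate on the object's attributes. Note that a private attribute (one whose name starts with two underscores, such as __context) cannot be accessed directly from outside the class, which is why the last print statement raises an AttributeError.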
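To make the point about private attributes concrete, here is a minimal sketch. It uses a cut-down, hypothetical version of Document (only title and context) and relies on Python's name mangling, which stores a double-underscore attribute under a name of the form _ClassName__attribute. The attribute is hidden by convention rather than truly locked away.

```python
class Document:
    def __init__(self, title, context):
        self.title = title          # public attribute
        self.__context = context    # stored on the object as _Document__context

    def get_context_length(self):
        # Inside the class, the double-underscore name resolves normally.
        return len(self.__context)


book = Document('Harry Potter', 'some text')
print(book.get_context_length())    # access through a method works

try:
    book.__context                  # outside the class the name is not found
except AttributeError as error:
    print('AttributeError:', error)

# The mangled name is still reachable, so privacy here is a convention.
print(book._Document__context)
```

In practice the mangled name should be left alone; it is shown only to explain where the AttributeError comes from.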

Here is another class

class Document():
    WELCOME_STR = 'Welcome! The context for this book is {}.'

    def __init__(self, title, author, context):
        print('init function called')
        self.title = title
        self.author = author
        self.__context = context

    # class function
    @classmethod
    def create_empty_book(cls, title, author):
        return cls(title=title, author=author, context='nothing')

    # member function
    def get_context_length(self):
        return len(self.__context)

    # static function
    @staticmethod
    def get_welcome(context):
        return Document.WELCOME_STR.format(context)

empty_book = Document.create_empty_book('What Every Man Thinks About Apart from Sex', 'Professor Sheridan Simove')


print(empty_book.get_context_length())
print(empty_book.get_welcome('indeed nothing'))

########## Output ##########

init function called
7
Welcome! The context for this book is indeed nothing.

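A name in all capitals denotes a constant. Inside the class it can be reached through self or the class name (the code above uses Document.WELCOME_STR), and outside the class through ClassName.CONSTANT_NAME.

The three kinds of functions in a class are:

  • Static functions are usually decorated with @staticmethod. Their first parameter has no special meaning. They typically handle small, self-contained tasks, which makes debugging easier and keeps the code well organized.
  • Class functions are decorated with @classmethod and by convention take cls as their first parameter. Their most common use is to provide alternative constructors, that is, different ways of reaching __init__.
  • Member functions are the ordinary functions of a class. Their first parameter, self, refers to the current object, and they are used to query or modify it.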
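The three kinds of functions can be put side by side in one small class. The Temperature class below and all of its names are hypothetical and exist only for illustration; the sketch shows how each kind is declared and called.

```python
class Temperature:
    FREEZING_C = 0.0  # class-level constant, written in capitals

    def __init__(self, celsius):
        self.celsius = celsius

    @classmethod
    def from_fahrenheit(cls, fahrenheit):
        # Class function: cls is the class itself, used as an alternative constructor.
        return cls((fahrenheit - 32) * 5 / 9)

    @staticmethod
    def is_valid(celsius):
        # Static function: no self or cls, just a helper grouped with the class.
        return celsius >= -273.15

    def is_freezing(self):
        # Member function: self is the current object.
        return self.celsius <= self.FREEZING_C


reading = Temperature.from_fahrenheit(32)   # called on the class
print(reading.celsius)
print(reading.is_freezing())                # called on the object
print(Temperature.is_valid(-300))           # called without any object
```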

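Inheritance

Inheritance means that one class has the characteristics of another class while also having characteristics of its own. The first class is called the subclass and the other the parent class. The characteristics in question are the class's attributes and functions.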

class Entity():
    def __init__(self, object_type):
        print('parent class init called')
        self.object_type = object_type

    def get_context_length(self):
        raise Exception('get_context_length not implemented')

    def print_title(self):
        print(self.title)

class Document(Entity):
    def __init__(self, title, author, context):
        print('Document class init called')
        Entity.__init__(self, 'document')
        self.title = title
        self.author = author
        self.__context = context

    def get_context_length(self):
        return len(self.__context)

class Video(Entity):
    def __init__(self, title, author, video_length):
        print('Video class init called')
        Entity.__init__(self, 'video')
        self.title = title
        self.author = author
        self.__video_length = video_length

    def get_context_length(self):
        return self.__video_length

harry_potter_book = Document('Harry Potter(Book)', 'J. K. Rowling', '... Forever Do not believe any thing is capable of thinking independently ...')
harry_potter_movie = Video('Harry Potter(Movie)', 'J. K. Rowling', 120)

print(harry_potter_book.object_type)
print(harry_potter_movie.object_type)

harry_potter_book.print_title()
harry_potter_movie.print_title()

print(harry_potter_book.get_context_length())
print(harry_potter_movie.get_context_length())

########## Output ##########

Document class init called
parent class init called
Video class init called
parent class init called
document
video
Harry Potter(Book)
Harry Potter(Movie)
77
120

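Document and Video have a lot in common, such as a title, an author, and some content, so we abstract an Entity class from them to serve as their parent class.

Creating an object of a subclass does not automatically call the parent class's constructor, so the subclass's __init__() function must call it explicitly. The execution order is subclass constructor -> parent class constructor.

Consider get_context_length in the parent class. If an object is created directly from Entity, calling this function raises an exception and halts the program, so each subclass must override it with its own implementation.

Finally, the subclasses can freely call print_title, which is defined in the parent class. This shows the benefit of inheritance: less duplicated code and lower entropy (complexity) in the system.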
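The explicit call to the parent constructor can also be written with super(). The sketch below uses trimmed, hypothetical versions of Entity and Document, plus a deliberately incomplete BrokenDocument, to show what goes missing when a subclass defines its own __init__ and skips that call.

```python
class Entity:
    def __init__(self, object_type):
        self.object_type = object_type


class Document(Entity):
    def __init__(self, title):
        # Explicitly run the parent constructor so object_type gets set.
        super().__init__('document')
        self.title = title


class BrokenDocument(Entity):
    def __init__(self, title):
        # The parent constructor is never called here.
        self.title = title


doc = Document('Harry Potter')
broken = BrokenDocument('Harry Potter')

print(doc.object_type)
print(hasattr(broken, 'object_type'))   # the attribute was never created
```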

Abstract functions and abstract classes

from abc import ABCMeta, abstractmethod

class Entity(metaclass=ABCMeta):
    @abstractmethod
    def get_title(self):
        pass

    @abstractmethod
    def set_title(self, title):
        pass

class Document(Entity):
    def get_title(self):
        return self.title

    def set_title(self, title):
        self.title = title

document = Document()
document.set_title('Harry Potter')
print(document.get_title())

entity = Entity()
entity.set_title('Test')

########## Output ##########

Harry Potter

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-266b2aa47bad> in <module>()
21 print(document.get_title())
22
---> 23 entity = Entity()
24 entity.set_title('Test')

TypeError: Can't instantiate abstract class Entity with abstract methods get_title, set_title

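The Entity class has little meaning on its own; it only needs to define a few basic elements. To prevent an Entity object from being created by accident, we introduce abstract classes.

An abstract class is a special kind of class that exists only to be a parent class; instantiating it raises an error. Abstract functions are defined in an abstract class, decorated with @abstractmethod, and must be overridden by subclasses.

This is an important concept in software engineering: defining interfaces. An abstract class is exactly that, a top-down design style in which you describe what needs to be done in a small amount of code, define the interface, and then hand it over to developers to implement and integrate.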
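The rule that subclasses must override every abstract function can be checked directly. This is a small sketch using abc.ABC (equivalent to metaclass=ABCMeta); the HalfDone and Complete classes are hypothetical and not part of the article's code.

```python
from abc import ABC, abstractmethod


class Entity(ABC):
    @abstractmethod
    def get_title(self):
        ...

    @abstractmethod
    def set_title(self, title):
        ...


class HalfDone(Entity):
    # Only one of the two abstract functions is overridden.
    def get_title(self):
        return 'untitled'


class Complete(Entity):
    def get_title(self):
        return self.title

    def set_title(self, title):
        self.title = title


try:
    HalfDone()                      # still abstract, so this fails
except TypeError as error:
    print('TypeError:', error)

item = Complete()                   # every abstract function is implemented
item.set_title('Harry Potter')
print(item.get_title())
```

The interface therefore acts as a checklist: a subclass that forgets part of it fails at instantiation instead of failing later, when the missing function is first called.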

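An object-oriented example (building a simple search engine)

A search engine consists of four parts: the searcher, the indexer, the retriever, and the user interface.

The searcher is the crawler: it fetches content from websites across the internet and passes it to the indexer. The indexer processes the pages and their content into an index, which is stored in a database to await retrieval. The user interface is the familiar web page or app front end, through which the user sends a query to the search engine. The parsed query reaches the retriever, which searches efficiently and returns the results to the user.

The crawler is not the focus here, so assume the search samples are on the local disk (five files are provided for searching).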

# 1.txt
I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

# 2.txt
I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

# 3.txt
I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

# 4.txt
This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

# 5.txt
And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"

Next, here is the SearchEngineBase base class

class SearchEngineBase(object):
    def __init__(self):
        pass

    def add_corpus(self, file_path):
        with open(file_path, 'r') as fin:
            text = fin.read()
        self.process_corpus(file_path, text)

    def process_corpus(self, id, text):
        raise Exception('process_corpus not implemented.')

    def search(self, query):
        raise Exception('search not implemented.')

def main(search_engine):
    for file_path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
        search_engine.add_corpus(file_path)

    while True:
        query = input()
        results = search_engine.search(query)
        print('found {} result(s):'.format(len(results)))
        for result in results:
            print(result)

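SearchEngineBase is meant to be inherited. Each subclass represents a different engine algorithm and should implement process_corpus() and search(), which correspond to the indexer and the retriever described above. The main() function provides the searcher and the user interface. In detail:

  • add_corpus() reads the file content and passes it, with the file path as the id, to process_corpus()
  • process_corpus() processes the content and stores the result under the file path as its id; what is stored is the index
  • search() takes a query, processes it, looks it up in the index, and returns the results

Now let's implement a basic search engine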

class SimpleEngine(SearchEngineBase):
    def __init__(self):
        super(SimpleEngine, self).__init__()
        self.__id_to_texts = {}

    def process_corpus(self, id, text):
        self.__id_to_texts[id] = text

    def search(self, query):
        results = []
        for id, text in self.__id_to_texts.items():
            if query in text:
                results.append(id)
        return results

search_engine = SimpleEngine()
main(search_engine)

########## Output ##########


simple
found 0 result(s):
little
found 2 result(s):
1.txt
2.txt

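Let's analyze this code. SimpleEngine is a subclass of SearchEngineBase that overrides process_corpus and search.

In the new constructor, self.__id_to_texts = {} initializes a private variable: the dictionary that maps file names to file contents.

It is easy to see that this search is very inefficient, because every query scans the full text of every file, so we should optimize it. The most direct approach is to tokenize the corpus, treating each text as individual words, so that only a set of words needs to be stored per document. According to Zipf's law, in a natural-language corpus the frequency of a word is inversely proportional to its rank in the frequency table, following a power-law distribution. Tokenizing the corpus can therefore improve efficiency considerably.

The implementation is as follows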

import re

class BOWEngine(SearchEngineBase):
    def __init__(self):
        super(BOWEngine, self).__init__()
        self.__id_to_words = {}

    def process_corpus(self, id, text):
        self.__id_to_words[id] = self.parse_text_to_words(text)

    def search(self, query):
        query_words = self.parse_text_to_words(query)
        results = []
        for id, words in self.__id_to_words.items():
            if self.query_match(query_words, words):
                results.append(id)
        return results

    @staticmethod
    def query_match(query_words, words):
        for query_word in query_words:
            if query_word not in words:
                return False
        return True

    @staticmethod
    def parse_text_to_words(text):
        text = re.sub(r'[^\w]', ' ', text)    # replace punctuation and newlines with spaces
        text = text.lower()                   # convert to lowercase
        word_list = text.split(' ')           # build the word list
        word_list = filter(None, word_list)   # drop empty strings
        return set(word_list)

search_engine = BOWEngine()
main(search_engine)


########## Output ##########


i have a dream
found 3 result(s):
1.txt
2.txt
3.txt
freedom children
found 1 result(s):
5.txt

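First, a concept: the bag-of-words model. It treats a text purely as a collection of words, ignoring grammar, syntax, paragraphs, and so on, so we only need to store the words and can disregard their order.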
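To see what the bag-of-words model keeps and what it throws away, here is a standalone sketch of the tokenizing step. The function mirrors parse_text_to_words from the engine, and the sample sentence is made up for illustration.

```python
import re


def parse_text_to_words(text):
    text = re.sub(r'[^\w]', ' ', text)    # punctuation and newlines become spaces
    words = text.lower().split(' ')       # lowercase, then split on spaces
    return set(filter(None, words))       # drop empty strings, keep unique words


words = parse_text_to_words('I have a dream today. I have a dream!')

# Word order and repetition are gone; only the distinct words remain.
print(sorted(words))

# A query matches when all of its words are in the document's set.
print({'dream', 'today'} <= words)
print({'dream', 'freedom'} <= words)
```

The subset test with <= is a compact equivalent of the loop in query_match.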

The code optimized with an inverted index is as follows

import re

class BOWInvertedIndexEngine(SearchEngineBase):
    def __init__(self):
        super(BOWInvertedIndexEngine, self).__init__()
        self.inverted_index = {}

    def process_corpus(self, id, text):
        words = self.parse_text_to_words(text)
        for word in words:
            if word not in self.inverted_index:
                self.inverted_index[word] = []
            self.inverted_index[word].append(id)

    def search(self, query):
        query_words = list(self.parse_text_to_words(query))
        query_words_index = list()
        for query_word in query_words:
            query_words_index.append(0)

        # If the inverted list of any query word is empty, return immediately
        for query_word in query_words:
            if query_word not in self.inverted_index:
                return []

        result = []
        while True:

            # First, collect the current id from every inverted list
            current_ids = []

            for idx, query_word in enumerate(query_words):
                current_index = query_words_index[idx]
                current_inverted_list = self.inverted_index[query_word]

                # Reached the end of one inverted list, so the search is finished
                if current_index >= len(current_inverted_list):
                    return result

                current_ids.append(current_inverted_list[current_index])

            # If all elements of current_ids are equal, every query word occurs in that document
            if all(x == current_ids[0] for x in current_ids):
                result.append(current_ids[0])
                query_words_index = [x + 1 for x in query_words_index]
                continue

            # Otherwise, advance the position in the list holding the smallest id
            min_val = min(current_ids)
            min_val_pos = current_ids.index(min_val)
            query_words_index[min_val_pos] += 1

    @staticmethod
    def parse_text_to_words(text):
        text = re.sub(r'[^\w]', ' ', text)    # replace punctuation and newlines with spaces
        text = text.lower()                   # convert to lowercase
        word_list = text.split(' ')           # build the word list
        word_list = filter(None, word_list)   # drop empty strings
        return set(word_list)


search_engine = BOWInvertedIndexEngine()
main(search_engine)


########## Output ##########


little
found 2 result(s):
1.txt
2.txt
little vicious
found 1 result(s):
2.txt

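An inverted index is a dictionary that maps each word to the ids of the documents containing it. During a search, we only need to pull out the inverted lists for the query words and find the elements (ids) common to those lists, which avoids going through the entire index.

process_corpus builds the inverted index.

search fetches the inverted lists for all the query words, returning immediately if any of them does not exist. It then runs an algorithm for merging K sorted arrays. This is not optimal, though: the better approach would use a min-heap to store the indexes. That algorithm is fairly hard, and I don't really understand it either, emmm.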
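For readers curious about the heap idea, here is one possible sketch, not necessarily the algorithm the paragraph above has in mind. It assumes every inverted list is sorted and free of duplicates, lets the standard library's heapq.merge do the heap-based K-way merge, and keeps the ids that show up once per list. The function name intersect_sorted_lists and the tiny index are hypothetical.

```python
import heapq
from itertools import groupby


def intersect_sorted_lists(lists):
    """Return the ids present in every sorted, duplicate-free list."""
    if not lists:
        return []
    # heapq.merge lazily yields all ids in sorted order, using a heap internally.
    merged = heapq.merge(*lists)
    # Equal ids are adjacent after merging; an id that appears len(lists)
    # times must have come from every list.
    return [doc_id for doc_id, group in groupby(merged)
            if sum(1 for _ in group) == len(lists)]


inverted_index = {
    'little': ['1.txt', '2.txt'],
    'vicious': ['2.txt'],
}
query_lists = [inverted_index['little'], inverted_index['vicious']]
print(intersect_sorted_lists(query_lists))
```

Unlike the loop in search, this version always reads every list to the end, so it trades the early exit for brevity.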

LRU and multiple inheritance

import pylru

class LRUCache(object):
    def __init__(self, size=32):
        self.cache = pylru.lrucache(size)

    def has(self, key):
        return key in self.cache

    def get(self, key):
        return self.cache[key]

    def set(self, key, value):
        self.cache[key] = value

class BOWInvertedIndexEngineWithCache(BOWInvertedIndexEngine, LRUCache):
    def __init__(self):
        super(BOWInvertedIndexEngineWithCache, self).__init__()
        LRUCache.__init__(self)

    def search(self, query):
        if self.has(query):
            print('cache hit!')
            return self.get(query)

        result = super(BOWInvertedIndexEngineWithCache, self).search(query)
        self.set(query, result)

        return result

search_engine = BOWInvertedIndexEngineWithCache()
main(search_engine)


########## Output ##########


little
found 2 result(s):
1.txt
2.txt
little
cache hit!
found 2 result(s):
1.txt
2.txt

LRUCache defines a cache class, and you can call its methods by inheriting from it. LRU (least recently used) is a classic caching scheme.
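For readers without pylru installed, the behaviour of an LRU cache can be sketched with the standard library. This is an illustrative stand-in built on collections.OrderedDict, not how pylru is implemented; the SimpleLRUCache name and the tiny size of 2 are assumptions chosen to make eviction visible.

```python
from collections import OrderedDict


class SimpleLRUCache:
    def __init__(self, size=2):
        self.size = size
        self.data = OrderedDict()   # oldest entry first, newest last

    def has(self, key):
        return key in self.data

    def get(self, key):
        self.data.move_to_end(key)  # reading a key makes it the most recent
        return self.data[key]

    def set(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.size:
            self.data.popitem(last=False)   # evict the least recently used entry


cache = SimpleLRUCache(size=2)
cache.set('a', 1)
cache.set('b', 2)
cache.get('a')       # 'a' is now more recent than 'b'
cache.set('c', 3)    # the cache is full, so 'b' is evicted
print(cache.has('a'), cache.has('b'), cache.has('c'))
```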

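As you can see, BOWInvertedIndexEngineWithCache inherits from two classes. With multiple inheritance there are two ways to initialize the parents:

  • The first is super(Subclass, self).__init__(), which directly initializes the first parent class; the parent class must inherit from object
  • The second, used when several constructors have to be called, is ParentClass.__init__(self)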
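Both initialization styles can be seen working together in a small sketch. The Engine, Cache, and CachedEngine classes are hypothetical stand-ins for the search-engine and cache classes; the last line prints the method resolution order that super() follows.

```python
class Engine:
    def __init__(self):
        print('Engine init')
        self.index = {}


class Cache:
    def __init__(self, size=32):
        print('Cache init')
        self.size = size


class CachedEngine(Engine, Cache):
    def __init__(self):
        # First style: super() reaches Engine, the first parent. Engine does
        # not call super() itself, so the chain stops there.
        super().__init__()
        # Second style: name the other parent and pass self explicitly.
        Cache.__init__(self, 8)


engine = CachedEngine()
print(engine.index, engine.size)
print([cls.__name__ for cls in CachedEngine.__mro__])
```

In Python 3 every class inherits from object, so the requirement in the first bullet is always met there.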

Forcing a call to an overridden parent-class function

super(Subclass, self).function(arguments)
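As a concrete instance of that pattern, here is a short sketch with two hypothetical classes, Base and Child. The child overrides describe() but still reuses the parent's version.

```python
class Base:
    def describe(self):
        return 'base'


class Child(Base):
    def describe(self):
        # Call the overridden parent function, then extend its result.
        return super(Child, self).describe() + ' + child'


print(Child().describe())
```

In Python 3 the shorter form super().describe() behaves the same way inside the class body.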

Object orientation is an important programming paradigm. It takes plenty of study and practice to master it well!