
- Skip Pointer: is techniques to speed up the merging posting
- Phrase Query:
- Bi-word indexes
- Positional index
2.2 Search Engine Evaluation
| Recall | Precison (分母动态) |
|---|---|
| tp/(tp + fp) | tp/(tp + fn) |
| (A & B)/A A:relevant, B: retrieved | (A & B)/B |
- harmonic mean of the precision and the recall
F = 2PR/(P + R)
- accuracy of an Engine:
(tp + tn)/(tp + tn + fp + fn)
- Precision, Recall:
- Defination:
-
- Precision: # of retrieved relevant / # of retrieved
- Recall: # of retrieved relevant /# of relevant
-
- Precision = tp / (tp + fp)
- Recall = tp/(tp + fn)
-
- Harmonic mean of precision, recall (F-measure):
- Harmonic mean of n numbers: n/(1/a1 + 1/a2 + 1/a3 + ….1/an)
- F-measure = 2RP/(R + P)
- MAP = Sum(AvgP(i))/Q
- Defination:
- Distance Measure:
- 4 Properties:
-
- No negative
- D(x, y) iff x = y
- D(x, y) = D(y, x) symmetric
- D(x, y) <= D(x, z) + D(z, y) Triangle Inequality
-
- 5 Specific distance measure :
-
- Jaccard Distance: (Jaccard Similarity: (A &B) /( A or B); Jaccard Distance: 1- JS)
- Hamming Distance:
- Euclidean Distance
-
- 4 Properties:
- Coordination of Distributed Crawlers
- 3 different ways the crawlers can interact:
-
- Independent: no coordination, every process follows its extracted links
- Dynamic assignment: a central coordinator dynamically divides the web into small partitions and assign each partition to a process
- static assignment: web is partitioned and assigned without a central coordinator before the crawler starts.
-
- 3 different ways the crawlers can interact:
- DCG (Discounted Cumulative Gain) :
= REL(1) + REL(2)/log(3) + REL(3)/log(4) + ….REL(n)/log(n + 1)
- Power Law: y = K*x^C
- Zipf’s Law: y = 1/x
- y is the frequency of the word, and x is its ranking in the frequency table. C = -1.
- Zipf’s law—log-log scale:
since data has exponetial grow, when using log-log scale, it becomes a straight line with slope c like: logy = logK + Clogx
- Heap’s Law: V= Kn^C
- If V is the size of the vocabulary and n is the number of words
- K = 10 ~ 100, C= 0.4 ~ 0.6
- Zipf’s Law: y = 1/x
- Skip pointer:
- speed up the merging of positing in an inverted index
- SoundEx algorithm:
- For term with the same pronunciation to be encoded to the same string so that matching can occur despite minor difference in spelling.
- Cosine Similarity:
- sum(Ai Bi)/sqrt(sum(Ai^2)) sum(Bi^2)
- HITS: Hubs and Authority
- MRR(Mean Reciprocal Rank): 把rank倒数过来, 求平均数 sum(1/rank(Qi))/N
- = sum(1/rank(Qi))/N
- Lexicon (词典字典): The database of the vocabulary of a particular domain
Tokenization(语义切分): The task of chopping a document into pieces called tokens,and possibly throw away possible characters.
- properties of Crytographic Hash Function:
- three main properties:
- easy to calculate hash value for any given data (对已知data hash 容易)
- difficult to get text that has a given hash vlaue (从hash value得到text难)
- a small change to data yields a different hash value (较小的改变可能产生不同的hash value)
- ——不记了
- hash functions: MD5, SHA-1(SHA-2)
- three main properties:
- Collection frquency of term: is the num of times term t appears in he collection of document, counting multiple appearances.




近期评论