572 复习 2.2 Search Engine Evaluation

  1. Skip Pointer: is techniques to speed up the merging posting
  2. Phrase Query:
    • Bi-word indexes
    • Positional index

2.2 Search Engine Evaluation

Recall Precison (分母动态)
tp/(tp + fp) tp/(tp + fn)
(A & B)/A A:relevant, B: retrieved (A & B)/B
  • harmonic mean of the precision and the recall

    F = 2PR/(P + R)

  • accuracy of an Engine:

    (tp + tn)/(tp + tn + fp + fn)


  • Precision, Recall:
    • Defination:
        1. Precision: # of retrieved relevant / # of retrieved
        2. Recall: # of retrieved relevant /# of relevant
        1. Precision = tp / (tp + fp)
        2. Recall = tp/(tp + fn)
    • Harmonic mean of precision, recall (F-measure):
      • Harmonic mean of n numbers: n/(1/a1 + 1/a2 + 1/a3 + ….1/an)
      • F-measure = 2RP/(R + P)
    • MAP = Sum(AvgP(i))/Q
  • Distance Measure:
    • 4 Properties:
        1. No negative
        2. D(x, y) iff x = y
        3. D(x, y) = D(y, x) symmetric
        4. D(x, y) <= D(x, z) + D(z, y) Triangle Inequality
    • 5 Specific distance measure :
        1. Jaccard Distance: (Jaccard Similarity: (A &B) /( A or B); Jaccard Distance: 1- JS)
        2. Hamming Distance:
        3. Euclidean Distance
  • Coordination of Distributed Crawlers
    • 3 different ways the crawlers can interact:
        1. Independent: no coordination, every process follows its extracted links
        2. Dynamic assignment: a central coordinator dynamically divides the web into small partitions and assign each partition to a process
        3. static assignment: web is partitioned and assigned without a central coordinator before the crawler starts.
  • DCG (Discounted Cumulative Gain) :

    = REL(1) + REL(2)/log(3) + REL(3)/log(4) + ….REL(n)/log(n + 1)

  • Power Law: y = K*x^C
    • Zipf’s Law: y = 1/x
      • y is the frequency of the word, and x is its ranking in the frequency table. C = -1.
      • Zipf’s law—log-log scale:

        since data has exponetial grow, when using log-log scale, it becomes a straight line with slope c like: logy = logK + Clogx

    • Heap’s Law: V= Kn^C
      • If V is the size of the vocabulary and n is the number of words
      • K = 10 ~ 100, C= 0.4 ~ 0.6
  • Skip pointer:
    • speed up the merging of positing in an inverted index
  • SoundEx algorithm:
    • For term with the same pronunciation to be encoded to the same string so that matching can occur despite minor difference in spelling.
  • Cosine Similarity:
    • sum(Ai Bi)/sqrt(sum(Ai^2)) sum(Bi^2)
  • HITS: Hubs and Authority
  • MRR(Mean Reciprocal Rank): 把rank倒数过来, 求平均数 sum(1/rank(Qi))/N
    • = sum(1/rank(Qi))/N
  • Lexicon (词典字典): The database of the vocabulary of a particular domain

    Tokenization(语义切分): The task of chopping a document into pieces called tokens,and possibly throw away possible characters.

  • properties of Crytographic Hash Function:
    • three main properties:
      1. easy to calculate hash value for any given data (对已知data hash 容易)
      2. difficult to get text that has a given hash vlaue (从hash value得到text难)
      3. a small change to data yields a different hash value (较小的改变可能产生不同的hash value)
      4. ——不记了
    • hash functions: MD5, SHA-1(SHA-2)
  • Collection frquency of term: is the num of times term t appears in he collection of document, counting multiple appearances.