uva 10476 – spam or not spam

Problem

判斷訊息是否為垃圾訊息。

給數組垃圾訊息、非垃圾訊息，從中統計語言的出現頻率。例如 ABCDEFG 取 k-shingle 中 k = 3，那麼總共出現 ABC、BCD、CDE、DEF … 等。

接著按照題目給定的公式，計算輸入的訊息離垃圾訊息比較接近、還是非垃圾訊息。

n-gram 通常指的是以單詞 (word) 為切割單位，而 k-shingle 則是以字符 (char) 為切割單位。在巨量資料的開放課程中，以 shingle 去描述這樣的切割方式。

Sample Input

2 1 1
AAAA
BBBB CCCC
ENDMESSAGE
BBBB
ENDMESSAGE
AAAABBBB
ENDMESSAGE
AAABB
ENDMESSAGE
2 1 2
AAAA
BBBB CCCC
ENDMESSAGE
BBBB
ENDMESSAGE
AAAABBBB
ENDMESSAGE
AAABB
ENDMESSAGE
AAABB
ENDMESSAGE
0 0 0

Sample Output

Set 1:
0.21822 0.73030
non-spam
Set 2:
0.21822 0.73030
non-spam
0.21822 0.73030
non-spam

Solution

相當有趣的題目，終於把自然語言學習到的部分拿出來玩。但可惜題目沒有巨量資料那樣瘋狂，用 hash 去失真的操作，由於記憶體不夠，用 1-pass/2-pass 找 frequent items 之類的。

為了加速運算，改成 hash 會比較快，但只要一碰撞輸出機率時就會炸掉，所以還是別貿然嘗試。

特別用了 std::inner_product 來寫中間的數學運算，但仔細想想如果是兩個 std::map 要進行 std::inner_product() 仍然會因為元素不見得都會在兩個的 key set 而無法運作。std::inner_product 內建實作中只有單純的迭代，用在 vector 類型比較合適。

#include <stdio.h> 
#include <map>
#include <iostream>
#include <algorithm>
#include <numeric>
#include <math.h>
#include <string.h>
using namespace std;
int map_product(pair<string, int> l, pair<string, int> r) {
	return l.second * r.second;
}
class Stat {
public:
	int kShingle;
	int sqsum;
	map<string, int> count;
	void init(int k) {
		kShingle = k, sqsum = 0;
		count.clear();
	}
	string toKey(string &s) { // retain for hash function
		return s;
	}
	void add(char s[]) {
		if (s[0] == '

uva 10476 – spam or not spam

contents

Problem

Sample Input

Sample Output

Solution

近期文章

近期评论

标签

热门

文章归档

分类目录

功能