N-Gram (Map Reduce)
lintcode
给出若干字符串和数字 N。用 Map Reduce 的方法统计所有 N-Grams 及其出现次数 。以字母为粒度。
Example
Given N = 3
doc_1: "abcabc"
doc_2: "abcabc"
doc_3: "bbcabc"
The final result should be:
[
"abc": 5,
"bbc": 1,
"bca": 3,
"cba": 3
]
思路
public class NGram {
public static class Map {
public void map(String _, int n, String str,
OutputCollector<String, Integer> output) {
for (int index = 0; index < str.length() - n + 1; ++index) {
output.collect(str.substring(index, index + n), 1);
}
}
}
public static class Reduce {
public void reduce(String key, Iterator<Integer> values,
OutputCollector<String, Integer> output) {
int sum = 0;
while (values.hasNext()) {
sum += values.next();
}
output.collect(key, sum);
}
}
}