For a school research project we are building an automatic grading system for English essays, starting with spelling error checking and grammar error checking.
1. The model
LingPipe's spelling correction is based on the Noisy Channel Model: the original message Input passes through a black box that changes it into Output, and our task is to infer Input from the observed Output.
P(I) is the prior probability and P(O | I) is the channel probability; the two can be obtained from a language model built on a corpus and from a channel model (error model, transition matrix) respectively.
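Writing the task out, we want the input that maximizes the posterior probability. Since P(O) is fixed once the output is observed, Bayes' rule reduces the decision to the channel model times the prior:

\hat{I} = \arg\max_{I} P(I \mid O) = \arg\max_{I} \frac{P(O \mid I)\,P(I)}{P(O)} = \arg\max_{I} P(O \mid I)\,P(I)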
2. Spellcheck in LingPipe
a. LingPipe Spelling Tutorial
The LingPipe website has a spelling tutorial that explains what spelling correction is, the scenarios in which it is used, and how LingPipe implements spell checking.
Basic model
The search engine (Lucene) supplies the document index (search index), and LingPipe builds the character n-gram language model. The language model stores phrases and their statistics. When a query is submitted, src/QuerySpellCheck.java checks the possible character edits: namely substitutions, insertions, transpositions, and deletions. If you type 'Gretski', the language model will suggest the correction 'Gretzky'.
Running the program
QuerySpellCheck.java combines the program's initialization with that of the Lucene engine.
Basic workflow:
- Initialization: set up the language model, the spell checker, and the search-engine classes.
- Build the language model and the Lucene index.
- Read the phrases to check from the command line.
Initializing the Spell Checker
The TrainSpellChecker is trained on the sample data while the search-engine index is built; the search engine uses an inverted index to count occurrences of correctly spelled words. TrainSpellChecker is set up as follows:
FixedWeightEditDistance fixedEdit =
new FixedWeightEditDistance(MATCH_WEIGHT,
DELETE_WEIGHT,
INSERT_WEIGHT,
SUBSTITUTE_WEIGHT,
TRANSPOSE_WEIGHT);
NGramProcessLM lm = new NGramProcessLM(NGRAM_LENGTH);
TokenizerFactory tokenizerFactory
= new IndoEuropeanTokenizerFactory();
TrainSpellChecker sc
= new TrainSpellChecker(lm,fixedEdit,
tokenizerFactory);
FixedWeightEditDistance sets the weight of each kind of edit; NGramProcessLM supplies the language model; TokenizerFactory sets the tokenizer. This tokenizer needs to match Lucene's tokenizer, which means implementing org.apache.lucene.analysis.Analyzer.
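These weights are log-scale penalties: matching a character is free, and every real edit costs something. A minimal sketch of the constants with illustrative values (my own, not necessarily the tutorial's):

static final double MATCH_WEIGHT = 0.0;        // matching a character is free
static final double DELETE_WEIGHT = -5.0;      // illustrative log-scale penalties:
static final double INSERT_WEIGHT = -5.0;      // the more negative the weight,
static final double SUBSTITUTE_WEIGHT = -5.0;  // the less likely the checker is
static final double TRANSPOSE_WEIGHT = -5.0;   // to propose that edit
static final int NGRAM_LENGTH = 5;             // order of the character n-gram LM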
Setting up the Lucene search engine
IndexWriter is instantiated with a directory to write the index to, an analyzer to tokenize the data, and a flag controlling whether a new index is created or an existing index is appended to.
static final File LUCENE_INDEX_DIR = new File("lucene");
static final StandardAnalyzer ANALYZER = new StandardAnalyzer(Version.LUCENE_30);
static final String TEXT_FIELD = "text";
...
FSDirectory fsDir
= new SimpleFSDirectory(LUCENE_INDEX_DIR,
new NativeFSLockFactory());
IndexWriter luceneIndexWriter
= new IndexWriter(fsDir, // where the index is stored
ANALYZER, // analyzer used to tokenize the data
IndexWriter.MaxFieldLength.LIMITED); // max field length; this constructor creates the index if it doesn't exist, otherwise appends
...
Indexing the files and training the spell checker
Adding the data is simple; the more involved part is building the index for the search engine.
String[] filesToIndex = DATA.list(); // list the training-data file names
for (int i = 0; i < filesToIndex.length; ++i) {
System.out.println(" File="
+ DATA + "/"
+filesToIndex[i]); // tip: shows which file was being read if the data turns out to be bad
File file = new File(DATA,filesToIndex[i]);
String charSequence
= Files.readFromFile(file); // read the whole file into a string (LingPipe's Files utility)
sc.handle(charSequence); // tokenize the text and train the language model on it
Document luceneDoc = new Document(); // the Lucene document to be indexed
Field textField
= new Field(TEXT_FIELD,charSequence,
Field.Store.YES,Field.Index.ANALYZED); // Index.TOKENIZED was renamed ANALYZED in Lucene 3.x
luceneDoc.add(textField);
luceneIndexWriter.addDocument(luceneDoc);
}
System.out.println("Writing model to file="
+ MODEL_FILE);
writeModel(sc,MODEL_FILE);
System.out.println("Writing lucene index to ="
+ LUCENE_INDEX_DIR);
luceneIndexWriter.close();
Getting the data into the Lucene engine is a bit more involved. First a Document is constructed for the subsequent add operations; then a Field is instantiated from the field name (TEXT_FIELD), the data (charSequence), and a couple of flags. I still have questions about how Field is meant to be used and haven't dug into it. To keep the demo simple, the Lucene analyzer is not made to match the LingPipe tokenizer. The tutorial's authors suggest:
- Do case conversion and stoplisting in a LingPipe TokenizerFactory
- Adapt the LingPipe tokenizer factory to implement a Lucene Analyzer
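The tutorial stops there, but such an adapter essentially has to turn LingPipe tokens into a Lucene TokenStream. Below is a rough, untested sketch of my own, assuming the Lucene 3.0 TermAttribute API (the class name LingPipeAnalyzer is hypothetical):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

// Hypothetical adapter: expose a LingPipe TokenizerFactory as a Lucene Analyzer
// so that indexing and spell-checker training tokenize identically.
public class LingPipeAnalyzer extends Analyzer {
    private final TokenizerFactory mFactory;
    public LingPipeAnalyzer(TokenizerFactory factory) {
        mFactory = factory;
    }
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        char[] cs = readAll(reader); // buffer the whole field; fine for short documents
        final Tokenizer tokenizer = mFactory.tokenizer(cs, 0, cs.length);
        return new TokenStream() {
            private final TermAttribute mTermAttr = addAttribute(TermAttribute.class);
            @Override
            public boolean incrementToken() {
                String token = tokenizer.nextToken(); // null once the input is exhausted
                if (token == null) return false;
                clearAttributes();
                mTermAttr.setTermBuffer(token);
                return true;
            }
        };
    }
    private static char[] readAll(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            for (int n; (n = reader.read(buf)) != -1; )
                sb.append(buf, 0, n);
            char[] cs = new char[sb.length()];
            sb.getChars(0, sb.length(), cs, 0);
            return cs;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

With something like this, the same IndoEuropeanTokenizerFactory could drive both the IndexWriter and the TrainSpellChecker, so query tokens line up with indexed tokens.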
sc.handle(charSequence)
To me this is the key line of the tutorial: the complexity of the whole system is hidden behind this one call. Tracing through it gave me a feel for industrial-grade code, and a new appreciation for how good JetBrains' tooling is for navigating it.
TrainSpellChecker trains the language model directly; the model is later compiled together with the weighted edit distance. Its handle method looks like this:
public void handle(CharSequence cSeq) {
this.mLM.train(this.normalizeQuery(cSeq)); // train the character LM on the normalized text
this.mNumTrainingChars += (long)cSeq.length(); // running total of training characters
}
mLM is an NGramProcessLM, so I kept tracing upward.
NGramProcessLM supports dynamic training, estimation, and pruning of the model.
This class implements a generative language model based on the chain rule. The maximum likelihood estimator is smoothed by linear interpolation with the next lower-order context model.
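In symbols, as I read that javadoc (with λ the context-dependent interpolation weight):

\hat{P}(c \mid c_1,\dots,c_{n-1}) = \lambda\, P_{ML}(c \mid c_1,\dots,c_{n-1}) + (1-\lambda)\, \hat{P}(c \mid c_2,\dots,c_{n-1})

The train methods below just delegate down to substring counting: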
public void train(CharSequence cSeq) {
this.train(cSeq, 1);
}
public void train(CharSequence cSeq, int incr) {
char[] cs = Strings.toCharArray(cSeq);
this.train(cs, 0, cs.length, incr);
}
public void train(char[] cs, int start, int end, int incr) {
Strings.checkArgsStartEnd(cs, start, end);
this.mTrieCharSeqCounter.incrementSubstrings(cs, start, end, incr);
}
// now trace into TrieCharSeqCounter's incrementSubstrings method
public void incrementSubstrings(char[] cs, int start, int end, int count) {
Strings.checkArgsStartEnd(cs, start, end);
int i;
for(i = start; i + this.mMaxLength <= end; ++i) {
this.incrementPrefixes(cs, i, i + this.mMaxLength, count);
}
for(i = Math.max(start, end - this.mMaxLength + 1); i < end; ++i) {
this.incrementPrefixes(cs, i, end, count);
}
}
A TrieCharSeqCounter stores counts for substrings of strings. When the counter is constructed, a maximum length is specified, and counts are only stored for strings up to that length. For instance, an n-gram language model needs only counts for strings up to length n.
So TrieCharSeqCounter is mainly a store of substring counts.
From the above, train does nothing but statistics: it counts occurrence frequencies, which later drive the didYouMean feature.
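A tiny check of the loop above (my own example, untested): with maxLength 3, training once on "abcd" should count every substring of length at most 3 exactly once, and nothing longer.

TrieCharSeqCounter counter = new TrieCharSeqCounter(3); // store counts for substrings up to length 3
char[] cs = "abcd".toCharArray();
counter.incrementSubstrings(cs, 0, cs.length, 1); // counts a, b, c, d, ab, bc, cd, abc, bcd
System.out.println(counter.count("abc".toCharArray(), 0, 3));  // 1
System.out.println(counter.count("bc".toCharArray(), 0, 2));   // 1
System.out.println(counter.count("abcd".toCharArray(), 0, 4)); // 0: longer than maxLength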
Once the model is trained, it has to be written out to disk.
System.out.println("Writing model to file="
+ MODEL_FILE);
writeModel(sc,MODEL_FILE);
System.out.println("Writing lucene index to ="
+ LUCENE_INDEX_DIR);
luceneIndexWriter.close(); // wraps closing and writing the index to disk
Writing out the Spell Checker is a bit more involved, because it has to be compiled. For more robust code you would wrap the stream handling in try/finally (a sketch follows the listing below).
private static void writeModel(TrainSpellChecker sc,
File modelFile)
throws IOException {
FileOutputStream fileOut
= new FileOutputStream(modelFile);
BufferedOutputStream bufOut
= new BufferedOutputStream(fileOut);
ObjectOutputStream objOut
= new ObjectOutputStream(bufOut); // tip: streams layered one inside another
// write the spell checker to the file
sc.compileTo(objOut);
Streams.closeOutputStream(objOut);
Streams.closeOutputStream(bufOut);
Streams.closeOutputStream(fileOut);
}
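The try/finally variant might look like this (my own sketch, not from the tutorial); closing the outermost stream also closes the streams it wraps:

private static void writeModel(TrainSpellChecker sc,
                               File modelFile)
    throws IOException {
    ObjectOutputStream objOut = null;
    try {
        objOut = new ObjectOutputStream(
                     new BufferedOutputStream(
                         new FileOutputStream(modelFile)));
        sc.compileTo(objOut); // compile and serialize the spell checker
    } finally {
        if (objOut != null)
            Streams.closeOutputStream(objOut); // runs even if compileTo throws
    }
}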
Suggesting the correct word
while (true) {
// collect query or end if null
System.out.print(">");
System.out.flush();
String queryString = bufReader.readLine();
if (queryString == null || queryString.length() == 0)
break;
Query query = queryParser.parse(queryString);
TopDocs results = searcher.search(query,MAX_HITS); // how many documents the query hits; TopDocs also lets you iterate over the hits in order of match score
System.out.println("Found "
+ results.totalHits
+ " document(s) that matched query '"
+ query + "':");
// compute alternative spelling
String bestAlternative = sc.didYouMean(queryString);
Query alternativeQuery
= queryParser.parse(bestAlternative);
TopDocs results2
= searcher.search(alternativeQuery,MAX_HITS);
System.out.println("Found " + results2.totalHits
+ " document(s) matched best alternate '"
+ bestAlternative + "':");
}
}
Of all of these, didYouMean is the coolest method. It uses the language model we built earlier to propose the most suitable alternative, and returns the original input if it finds nothing better.
bestAlternative = sc.didYouMean(queryString);
Note that by this point sc is no longer the TrainSpellChecker: the compiled model read back from disk is a CompiledSpellChecker, which implements the SpellChecker interface that declares didYouMean. This is the most interesting and most challenging part of the whole tutorial.
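For completeness, this is roughly how the compiled checker comes back from disk (a sketch assuming LingPipe's AbstractExternalizable utility; the tutorial reads the model back in an equivalent way):

import com.aliasi.spell.CompiledSpellChecker;
import com.aliasi.util.AbstractExternalizable;

CompiledSpellChecker sc = (CompiledSpellChecker)
    AbstractExternalizable.readObject(MODEL_FILE); // deserialize the compiled model
String suggestion = sc.didYouMean("Gretski");      // e.g. "Gretzky", given enough training data

The decompiled didYouMean itself: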
public String didYouMean(String receivedMsg) {
String msg = this.normalizeQuery(receivedMsg);
if(msg.length() == 0) {
return msg;
} else {
CompiledSpellChecker.DpSpellQueue queue = new CompiledSpellChecker.DpSpellQueue();
CompiledSpellChecker.DpSpellQueue finalQueue = new CompiledSpellChecker.DpSpellQueue();
this.computeBestPaths(msg, queue, finalQueue);
if(finalQueue.isEmpty()) {
return msg;
} else {
CompiledSpellChecker.State bestState = (CompiledSpellChecker.State)finalQueue.poll();
return bestState.output().trim();
}
}
}