Share as bigheiniu.

Lingpipe001

Posted on By Big heiniu

学校里面科研立项要做一个 英文作文自动批改系统 ,打算先实现 拼写错误检查语法错误检查 .

1.模型使用

LingPipe 拼写错误检查 是基于Noisy Channel Model ,原始信息Input 经过一个黑箱发生变化,变成 Output.而我们的任务是通过 Output 来推测 Input 信息. download

P(I)为先验概率,P(O I)为转移概率,二者可以通过语料库建立的语言模型和转移矩阵(error model,channel model)得到.

2.LingPipe中 spellcheck

a. LingPipe Spelling Tutorial

在 LingPipe的官网中有拼写错误的教程,介绍了什么是拼写纠错,在什么样的场景下使用以及 LingPipe 是如何实现拼写检查.

基本模型

搜索引擎(Lucene)提供文件的索引(Search Index)以及语言模型构建(NGram). 语言模型存储短语以及短语的统计学规律. 当一篇文章被提交时, LingPipe 中的src/QuerySpellCheck.java 检查 possible character edits : namely substitutions , insertions , replacment , 和 deletions. 如果你使用’Gretski’, language model将会建议修改成 ‘Gretzky’.

程序运行

QuerySpellCheck.java 综合了程序初始化以及Lucene 引擎的初始化.

基本工作顺序:

  • 初始化 : 设置语言模型,拼写检查器,搜索引擎类.
  • 语言模型构建, Lucene索引构建
  • 通过命令行输入要检查的短语

初始化 Spell Checker

通过事例数据训练TrainSpellChecker 和构建搜索引擎的索引. 搜索引擎是通过 inverse index 来统计正确词汇拼写出现次数. TranSpellChecker设置如下:

FixedWeightEditDistance fixedEdit =
    new FixedWeightEditDistance(MATCH_WEIGHT,
                                DELETE_WEIGHT,
                                INSERT_WEIGHT,
                                SUBSTITUTE_WEIGHT,
                                TRANSPOSE_WEIGHT);

NGramProcessLM lm = new NGramProcessLM(NGRAM_LENGTH);
TokenizerFactory tokenizerFactory
    = new IndoEuropeanTokenizerFactory();
TrainSpellChecker sc
    = new TrainSpellChecker(lm,fixedEdit,
                            tokenizerFactory);

FixedWeightEditDistance 设置每种编辑距离的权重; NGramProcessLM 设置语言模型; NGramProcessLM 是设置 tokenizer, 这个设置需要和 Lucene 的 tokennizer相匹配,需要 implements org.apache.lucene.analysis.Analyzer .

设置Lucene 搜索引擎

IndexWriter 通过 a directory to write the index to, an analyzer to tokenize the data and a boolean flag controlling whether a new index is created or the index is to be added to来实例化

 static final File LUCENE_INDEX_DIR = new File("lucene");
    static final StandardAnalyzer ANALYZER = new StandardAnalyzer(Version.LUCENE_30);
    static final String TEXT_FIELD = "text";
...
        FSDirectory fsDir 
            = new SimpleFSDirectory(LUCENE_INDEX_DIR,
                                    new NativeFSLockFactory());
        IndexWriter luceneIndexWriter 
            = new IndexWriter(fsDir, //索引存储位置
                              ANALYZER, //tokensize
                              IndexWriter.MaxFieldLength.LIMITED); //构建新索引还是添加(??)
...
文件索引和训练拼写检查器

添加数据很简单,复杂的是如何为搜索引擎构建索引.

String[] filesToIndex = DATA.list(); //获取训练数据的文件名
for (int i = 0; i < filesToIndex.length; ++i) {
    System.out.println("     File="
                       + DATA + "/"
                       +filesToIndex[i]); //tips: 方便数据错误时检查,是哪一个文件读入有错误
    File file = new File(DATA,filesToIndex[i]);
    String charSequence
        = Files.readFromFile(file); //开始读文件,学会使用 File
    sc.handle(charSequence); //对读入的字符串进行处理,切成一个小段,并且开始训练语言模型
    Document luceneDoc = new Document(); //Lucene 索引文件
    Field textField
        = new Field(TEXT_FIELD,charSequence,
                    Field.Store.YES,Field.Index.TOKENIZED);
    luceneDoc.add(textField); 
    luceneIndexWriter.addDocument(luceneDoc);
}

System.out.println("Writing model to file="
                   + MODEL_FILE);
writeModel(sc,MODEL_FILE);

System.out.println("Writing lucene index to ="
                   + LUCENE_INDEX_DIR);
luceneIndexWriter.close();

将数据导入到 Lucene 引擎中有点复杂,首先需要构建一个 Document 来为后续的 add 进操作, 在讲文件存储位置(TEXT_FIELD),数据(charsequence)以及一些 flag 变量来实例化 Field ,针对 Field 的使用还存有疑问,还未研究过.为了是这个 demo 保持简洁,并没有让 Lucene 的 analyzer匹配 LingPipe 的 tokenizer. 英文作者建议:

  • Do case conversations and stoplisting in a lingpipe TokenizerFactory
  • Adapt the LingPipe tokenizer factory to implement a lucence Analyzer
sc.handle(charsequent)

这句话我感觉是这个教程中比较关键的一行程序,他将整个系统的复杂性都隐藏在这个语句上,通过这个语句我感受到工业界编程的风格,也体会到了 Jetbrain 是真的强.

TrainSpellChecker类直接训练language model,然后和 weighted edit distance压缩(Compiled). 其中 handle 方法如下

public void handle(CharSequence cSeq) {
    this.mLM.train(this.normalizeQuery(cSeq));
    this.mNumTrainingChars += (long)cSeq.length();
}

mlMNGramProcessLM 类的,我继续往上找.

NGramProcessLM 能动态提供模型的训练,估计,以及减除.

This class implements a generative language model based on chain rule. The maximum likelihood estimartor is smoothed by linear interpolation with the next lower-order context model.

public void train(CharSequence cSeq) {
    this.train(cSeq, 1);
}

public void train(CharSequence cSeq, int incr) {
    char[] cs = Strings.toCharArray(cSeq);
    this.train(cs, 0, cs.length, incr);
}

public void train(char[] cs, int start, int end, int incr) {
        Strings.checkArgsStartEnd(cs, start, end);
        this.mTrieCharSeqCounter.incrementSubstrings(cs, start, end, incr);
    }

//  继续往上翻寻TrieCharSeqCounter的incrementSubstrings方法
  
  public void incrementSubstrings(char[] cs, int start, int end, int count) {
        Strings.checkArgsStartEnd(cs, start, end);

        int i;
        for(i = start; i + this.mMaxLength <= end; ++i) {
            this.incrementPrefixes(cs, i, i + this.mMaxLength, count);
        }

        for(i = Math.max(start, end - this.mMaxLength + 1); i < end; ++i) {
            this.incrementPrefixes(cs, i, end, count);
        }

    }

A TrieCharSeqCounter stores counts for substrings of strings. When the counter is constructed, a maximum length is specified, and counts are only stored for strings up to that length. For instance, an n-gram language model needs only counts for strings up to length n.

TrieCharSeqCounter 类主要还是用来存储 strings 的数量.

由以上可见, train 仅仅是进行统计学的统计,计算出现频数,方便为后面didyouMean功能的实现.

训练好模型后,需要将其写入到 disk 中.

System.out.println("Writing model to file="
                   + MODEL_FILE);
writeModel(sc,MODEL_FILE);

System.out.println("Writing lucene index to ="
                   + LUCENE_INDEX_DIR);
luceneIndexWriter.close(); // wraps closing and writing the index to disk

而输出 Spell Checker 有些复杂,需要进行压缩.如果想鲁棒性更好的操作,需要使用 try/finally 操作.

private static void writeModel(TrainSpellChecker sc,
                               File modelFile)
    throws IOException {

    FileOutputStream fileOut
        = new FileOutputStream(modelFile);
    BufferedOutputStream bufOut
        = new BufferedOutputStream(fileOut);
    ObjectOutputStream objOut
        = new ObjectOutputStream(bufOut); //Tip:层层 stream

    // write the spell checker to the file
    sc.compileTo(objOut);

    Streams.closeOutputStream(objOut);
    Streams.closeOutputStream(bufOut);
    Streams.closeOutputStream(fileOut);
}
提示正确单词
 while (true) {
        // collect query or end if null
        System.out.print(">");
        System.out.flush();
        String queryString = bufReader.readLine();
        if (queryString == null || queryString.length() == 0)
            break;

        Query query = queryParser.parse(queryString);
        TopDocs results = searcher.search(query,MAX_HITS); //寻找这个单词在多少个文章总有 hits. also providing method for iterating over the found documents in order of maching score
        System.out.println("Found "
            + results.totalHits
            + " document(s) that matched query '"
            + query + "':");

        // compute alternative spelling
        String bestAlternative = sc.didYouMean(queryString);

        Query alternativeQuery
            = queryParser.parse(bestAlternative);
        TopDocs results2
            = luceneIndex.search(alternativeQuery,MAX_HITS);

        System.out.println("Found " + results2.totalHits
            + " document(s) matched best alternate '"
            + bestAlternative + "':");
    }
}

在这之中 didYouMean 是最酷的方法.他通过我们之前构建的语言模型来推测最适合的替代单词,如果没有找到就返回原来的结果.

bestAlternative = sc.didYouMean(query);

sc是 TrainSpellChecker 类,这个方法是 CompiledSpellChecker implement SpellChecker 类.而这是整个教程中最有趣也是最有挑战性的一个地方.

public String didYouMean(String receivedMsg) {
    String msg = this.normalizeQuery(receivedMsg);
    if(msg.length() == 0) {
        return msg;
    } else {
        CompiledSpellChecker.DpSpellQueue queue = new CompiledSpellChecker.DpSpellQueue();
        CompiledSpellChecker.DpSpellQueue finalQueue = new CompiledSpellChecker.DpSpellQueue();
        this.computeBestPaths(msg, queue, finalQueue);
        if(finalQueue.isEmpty()) {
            return msg;
        } else {
            CompiledSpellChecker.State bestState = (CompiledSpellChecker.State)finalQueue.poll();
            return bestState.output().trim();
        }
    }
}