Natural Language Processing answers

【Question title】: Natural Language Processing
【Posted】: 2010-09-25 06:06:28
【Question description】:

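I have thousands of sentences in a file. I want to find only the correct/useful English words. Is this possible with natural language processing?

Example sentence:

~@^.^@~ tic but sometimes world good famous tac Zorooooooooooooo

I only want to extract the English words, like this:

tic world good famous

Any suggestions on how I can do this? Thanks in advance.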

【Question comments】:

But you don't want to extract "sometimes"?
You may be interested in the paper "The WiLI benchmark dataset for written language identification" and in lidtk.

Tags: java php nlp

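【Solution 1】:

You can use the WordNet API to look up the words.

【Comments】:

@Shahid Also, using WordNet (English), the valid words in that example are: {tic, but, sometimes, world, good, famous}. If you want to avoid certain words (i.e. ones that are not "useful" to you), you need a stop word list, as @regexhacks mentions. If you want other languages, there are WordNet libraries for languages other than English: en.wikipedia.org/wiki/WordNet#Other_languages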

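【Solution 2】:

You need to compile a list of stop words (the words you don't want included in your search) and then filter your search with that list. For details, consider looking at these Wikipedia articles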
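
A minimal Python sketch of stop-word filtering, assuming a tiny hand-written list. `STOP_WORDS` and `remove_stop_words` are hypothetical names used only for illustration, and a real stop list would be much longer:

```python
import re

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"but", "sometimes", "the", "a", "an", "and", "or", "is"}


def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Return the alphabetic tokens of `sentence` that are not stop words."""
    # Keep only runs of ASCII letters; this drops symbols such as ~@^.^@~
    tokens = re.findall(r"[A-Za-z]+", sentence)
    # Compare case-insensitively against the stop list
    return [t for t in tokens if t.lower() not in stop_words]


print(remove_stop_words(
    "~@^.^@~ tic but sometimes world good famous tac Zorooooooooooooo"))
```

A stop list only removes known words you don't want. Tokens such as "Zorooooooooooooo" pass through this filter, so a dictionary check (for example against WordNet) is still needed to reject non-words.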

【Comments】:

【Solution 3】:

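You can use a language guesser based on character n-gram statistics. Usually only a small amount of material is needed, both for training and for classification. Links to the literature and to implementations can be found here: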

http://odur.let.rug.nl/~vannoord/TextCat/

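The method is simple:

1. Collect a small amount of text for each language.
2. Extract and count the 1-grams through 5-grams that occur in the text.
3. Sort these n-grams by frequency and keep the top ones, say 300. This forms the fingerprint of the language.

To classify a text or a sentence, apply steps 2 and 3 to it and compare the resulting fingerprint with the fingerprints collected during training. Compute a score from the differences in n-gram ranks; the language with the lowest score wins.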
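
The procedure above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the TextCat implementation: the function names, the underscore word padding, the penalty for missing n-grams, and the two tiny training texts are all choices made for this sketch.

```python
from collections import Counter


def fingerprint(text, max_n=5, top=300):
    """Return the `top` most frequent character n-grams (n = 1..max_n),
    most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries (sketch assumption)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the result is deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_fp, lang_fp):
    """Sum of rank differences between two fingerprints.
    An n-gram missing from lang_fp costs len(lang_fp) (sketch assumption)."""
    rank = {gram: i for i, gram in enumerate(lang_fp)}
    missing_penalty = len(lang_fp)
    return sum(abs(i - rank[g]) if g in rank else missing_penalty
               for i, g in enumerate(doc_fp))


def classify(text, profiles):
    """Return the language whose fingerprint has the lowest score."""
    doc_fp = fingerprint(text)
    return min(profiles, key=lambda lang: out_of_place(doc_fp, profiles[lang]))


# Toy training material; real profiles need more text than this.
training = {
    "en": "the quick brown fox jumps over the lazy dog and the world is good and famous",
    "de": "der schnelle braune fuchs springt über den faulen hund und die welt ist gut",
}
profiles = {lang: fingerprint(text) for lang, text in training.items()}
print(classify("the world is good", profiles))
```

With so little training text the result on short inputs is fragile; the point is only to show the fingerprint and rank-comparison steps.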

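【Comments】:

【Solution 4】:

You can do this with Python. What you are looking for is a filter for English words.

First, tokenize the sentences (split them into words).
Use the Python langdetect library to check whether each word is English.
Keep only the words that langdetect reports as English.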

How to install the library:

$ sudo pip install langdetect
Supported Python versions 2.6, 2.7, 3.x.

>>> from langdetect import detect

>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("Ein, zwei, drei, vier")
'de'

https://pypi.python.org/pypi/langdetect?

P.S.: Don't expect it to always work correctly:

>>> detect("today is a good day")
'so'
>>> detect("today is a good day.")
'so'
>>> detect("la vita e bella!")
'it'
>>> detect("khoobi? khoshi?")
'so'
>>> detect("wow")
'pl'
>>> detect("what a day")
'en'
>>> detect("yay!")
'so'
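
The tokenize, detect, filter steps of this answer could be sketched as follows. `english_words` is a hypothetical helper, and the detector is passed in as a parameter so the sketch does not depend on langdetect being installed; the stand-in detector at the bottom is made up purely for demonstration.

```python
import re


def english_words(sentence, detect_lang):
    """Keep the tokens of `sentence` that `detect_lang` labels as 'en'.

    `detect_lang` is any callable mapping a string to a language code,
    for example langdetect's `detect`.
    """
    # Step 1: tokenize into runs of ASCII letters
    tokens = re.findall(r"[A-Za-z]+", sentence)
    # Steps 2 and 3: detect each token's language and keep the English ones
    return [t for t in tokens if detect_lang(t) == "en"]


# Stand-in detector for demonstration only (not langdetect).
def fake_detect(word):
    return "en" if word.lower() in {"tic", "world", "good", "famous"} else "so"


print(english_words("~@^.^@~ tic world good famous tac", fake_detect))
```

With langdetect installed, `from langdetect import detect` followed by `english_words(sentence, detect)` plugs in the real detector. As the examples above show, detection on single short words is unreliable, and results can vary between runs unless `DetectorFactory.seed` is set to a fixed value.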

【Comments】:

@Novice Not sure whether you can use a Python solution. If you can, please upvote. Otherwise I can drop this answer.

【Solution 5】:

package com;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class TranscriberDemo {       

    public static void main(String[] args) throws Exception {

        Configuration configuration = new Configuration();

        configuration.setAcousticModelPath("en-us");
        configuration.setDictionaryPath("Sample Dict File_2.dic");
        configuration.setLanguageModelPath("Sample Language Modeller_2.lm");

        //configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        //configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        //configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/language/en-us.lm.dmp");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        InputStream stream = new FileInputStream(new File("test.wav"));

        recognizer.startRecognition(stream);
        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}

【Comments】:

The code above compiles fine, but once it has compiled, the Java console stops automatically. I'm using Eclipse Juno with natural language processing.