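How to determine if a string is an English sentence or code? Answers

【Question title】: How to determine if a string is an English sentence or code?
【Posted】: 2014-12-16 04:34:37
【Question description】:

Consider the following two strings: the first is code, and the second is an English sentence (a phrase, to be precise). How can I detect that the first one is code and the second one is not?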

1. for (int i = 0; i < b.size(); i++) {
2. do something in English (not necessary to be a sentence).

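I am thinking of counting special characters (such as "=", ";", "++", etc.) and checking whether the count exceeds some threshold. Is there a better way to do this? Are there any Java libraries for it?

Note that the code may not be parsable, because it is not necessarily a complete method, statement, or expression.

My assumption is that English sentences are quite regular: they most likely contain only ",", ".", "_", "(", ")", etc. They do not contain something like this: @987654324 @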
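
The counting idea described above can be sketched roughly as follows. This is only a minimal illustration: the class name `SymbolDensity`, the token list, and the threshold are assumptions made for the example, not something taken from the question, and they would need tuning on real data.

```java
import java.util.List;

class SymbolDensity {
    // Hypothetical list of "code-like" tokens. Two-character tokens come first
    // so that "++" is counted once rather than as two separate characters.
    private static final List<String> CODE_TOKENS = List.of(
            "++", "--", "==", "!=", "<=", ">=", "&&", "||",
            "=", ";", "{", "}", "[", "]", "<", ">");

    /** Counts occurrences of code-like tokens, scanning left to right. */
    static int countCodeTokens(String s) {
        int count = 0;
        int i = 0;
        while (i < s.length()) {
            boolean matched = false;
            for (String token : CODE_TOKENS) {
                if (s.startsWith(token, i)) {
                    count++;
                    i += token.length();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                i++;
            }
        }
        return count;
    }

    /** Treats the string as code when it has at least `threshold` code-like tokens. */
    static boolean looksLikeCode(String s, int threshold) {
        return countCodeTokens(s) >= threshold;
    }

    public static void main(String[] args) {
        String code = "for (int i = 0; i < b.size(); i++) {";
        String english = "do something in English (not necessary to be a sentence).";
        System.out.println(countCodeTokens(code) + " " + looksLikeCode(code, 2));
        System.out.println(countCodeTokens(english) + " " + looksLikeCode(english, 2));
    }
}
```

With the two sample strings from the question, the first yields six code-like tokens and the second none, so any small threshold separates them. Real input will be far less clean, for example prose that quotes operators.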

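【Question discussion】:

Gosh, this is a hard one. Honestly, I would do some research on it and bring it back here once you have some code.
I am looking for some shortcuts.
Fair enough, but we are programmers, not brainstormers. We can't help you come up with ideas, especially for something as open-ended as this... Please come back with code, and then we will be able to help you.
I believe you would need to do more than solve the halting problem. Good luck! You might be able to cheat: you could tag the text manually, e.g. "text:"
Is the code guaranteed to be Java code? Code in some languages is also valid English. en.wikipedia.org/wiki/Shakespeare_(programming_language)

Tags: java string nlp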

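【Solution 1】:

Look into lexical analysis and parsing (as if you were writing a compiler). If you don't need complete statements, you may not even need a parser.

【Discussion】:

Your answer gave me some hints, and I have some ideas now. +1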

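【Solution 2】:

You could use a Java parser, or create a parser from a BNF grammar, but the problem here is that you said the code may not be parsable, so that would fail.

My suggestion: use some custom regular expressions to detect patterns that are specific to code. Use as many as possible to get a good success rate.

Some examples:

for\s*\( (for loop)
while\s*\( (while loop)
[a-zA-Z_$][a-zA-Z\d_$]*\s*\( (constructor or method call)
\)\s*\{ (start of a block/method)
...

Yes, it's a long shot, but given what you want, you don't have many options.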
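
A minimal sketch of how these patterns might be combined in Java is shown below. The class name `RegexCodeDetector` and the "at least N patterns must match" rule are assumptions made for illustration; the answer itself only lists the patterns.

```java
import java.util.List;
import java.util.regex.Pattern;

class RegexCodeDetector {
    // The four example patterns from the answer, escaped for Java string literals.
    private static final List<Pattern> CODE_PATTERNS = List.of(
            Pattern.compile("for\\s*\\("),                        // for loop
            Pattern.compile("while\\s*\\("),                      // while loop
            Pattern.compile("[a-zA-Z_$][a-zA-Z\\d_$]*\\s*\\("),   // identifier followed by "("
            Pattern.compile("\\)\\s*\\{"));                       // start of a block/method

    /** Returns how many of the patterns occur somewhere in the string. */
    static int matchCount(String s) {
        int count = 0;
        for (Pattern p : CODE_PATTERNS) {
            if (p.matcher(s).find()) {
                count++;
            }
        }
        return count;
    }

    /** Hypothetical decision rule: require several independent hits. */
    static boolean looksLikeCode(String s, int minMatches) {
        return matchCount(s) >= minMatches;
    }

    public static void main(String[] args) {
        String code = "for (int i = 0; i < b.size(); i++) {";
        String english = "do something in English (not necessary to be a sentence).";
        System.out.println(matchCount(code) + " " + looksLikeCode(code, 2));
        System.out.println(matchCount(english) + " " + looksLikeCode(english, 2));
    }
}
```

Note that the identifier pattern also fires on ordinary prose with a parenthesis, such as "English (not ...", which is why this sketch asks for at least two matching patterns rather than one.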

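【Discussion】:

【Solution 3】:

You can try the OpenNLP sentence parser. It returns the n best parses for a sentence. For most English sentences it returns at least one. I believe that for most code snippets it won't return any, and hence you can be quite sure the input is not an English sentence.

Use this code for parsing: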

    // Initialize the sentence detector
    final SentenceDetectorME sdetector = EasyParserUtils
            .getOpenNLPSentDetector(Constants.SENTENCE_DETECTOR_DATA);

    // Initialize the parser
    final Parser parser = EasyParserUtils
            .getOpenNLPParser(Constants.PARSER_DATA_LOC);

    // Get sentences of the text
    final String sentences[] = sdetector.sentDetect(essay);

    // Go through the sentences and parse each
    for (final String sentence : sentences) {
        // Parse the sentence, requesting up to 10 parses
        final Parse[] parses = ParserTool.parseLine(sentence, parser, 10);
        if (parses.length == 0) {
            // Most probably this is code
        }
        else {
            // An English sentence
        }
    }

These are the two helper methods (from EasyParserUtils) used in the code:

public static Parser getOpenNLPParser(final String parserDataURL) {
    try (final InputStream isParser = new FileInputStream(parserDataURL);) {
        // Get model for the parser and initialize it
        final ParserModel parserModel = new ParserModel(isParser);
        return ParserFactory.create(parserModel);
    }
    catch (final IOException e) {
        e.printStackTrace();
        return null;
    }
}

and

public static SentenceDetectorME getOpenNLPSentDetector(
        final String sentDetDataURL) {
    try (final InputStream isSent = new FileInputStream(sentDetDataURL)) {
        // Get models for sentence detector and initialize it
        final SentenceModel sentDetModel = new SentenceModel(isSent);
        return new SentenceDetectorME(sentDetModel);
    }
    catch (final IOException e) {
        e.printStackTrace();
        return null;
    }
}

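【Discussion】:

【Solution 4】:

There is no need to reinvent the wheel; compilers already do this job for you. The first phase of any compilation process checks whether the tokens in the file belong to the language. That of course does not help us, since English and Java do not differ in that respect. The second phase, however, syntactic analysis, will report an error for any English sentence that is not Java code (or for anything else that is not correct Java). So instead of using external libraries and trying alternative approaches, why not use the Java compiler that is already available?

You can have a wrapper class such as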

public class Test{

    public static void main(String[] args) {

         /*Insert code to check here*/

    }

}

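You compile it, and if that goes well, you know it is valid code. Of course, this will not work for incomplete code snippets, such as the for loop in your example, which has no closing brace. If it does not compile, you can treat the string in several ways, for example by trying to parse it with your own home-made pseudo-English grammar analyzer built with flex and bison, the tools GNU uses to build GCC, for instance. I don't know what you are trying to accomplish with your program, but this way you can tell whether the input is code, a hand-written English sentence, or garbage you shouldn't care about. Parsing natural language is really hard, and modern approaches rely on statistical methods that are not always accurate, which is something you may not want in your program.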
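
As a rough sketch of the wrapper idea, the standard `javax.tools` API can be used to feed the wrapped snippet to the compiler from memory. Two things here are assumptions rather than part of the answer: the class name `SnippetSyntaxCheck`, and the choice to run only the parse step (via `JavacTask.parse()`) instead of a full compile, so that undefined names such as `b` do not count as failures. It requires a JDK, since `ToolProvider.getSystemJavaCompiler()` returns null on a plain JRE.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.List;
import javax.tools.Diagnostic;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import com.sun.source.util.JavacTask;

class SnippetSyntaxCheck {
    /** In-memory source file holding the generated wrapper class. */
    private static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Returns true if the snippet, wrapped in a method body, parses without syntax errors. */
    static boolean parsesAsJava(String snippet) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler available (JRE only?)");
        }
        String source = "public class Test {\n"
                + "    public static void main(String[] args) {\n"
                + snippet + "\n"
                + "    }\n"
                + "}\n";
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, diagnostics, null, null,
                List.of(new StringSource("Test", source)));
        try {
            task.parse(); // syntax analysis only; no name resolution, no class files
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return diagnostics.getDiagnostics().stream()
                .noneMatch(d -> d.getKind() == Diagnostic.Kind.ERROR);
    }

    public static void main(String[] args) {
        System.out.println(parsesAsJava("for (int i = 0; i < b.size(); i++) { }"));
        System.out.println(parsesAsJava("do something in English (not necessary to be a sentence)."));
    }
}
```

As the answer warns, an incomplete fragment such as the question's for loop without its closing brace is rejected by this check, so a failure only means "not complete, syntactically valid Java", not "definitely English".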

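【Discussion】:

This assumes the code is not a complete class. It also assumes there are no programming errors.

【Solution 5】:

Here is a very simple approach that seems to work quite well on some samples. Take out the System.out calls; they are only there for illustration. As you can see from the sample output, code comments look like text, so you may get false positives if large non-javadoc block comments are mixed into the code. The hard-coded thresholds are my estimates. Feel free to fine-tune them.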

public static void main(String[] args) {
    for(String arg : args){
        System.out.println(arg);
        System.out.println(codeStatus(arg));
    }
}

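static CodeStatus codeStatus (String string) {
    // Split at word boundaries: yields words plus the punctuation/whitespace runs between them
    String[] words = string.split("\\b");
    int nonText = 0;
    for(String word: words){
        // A chunk counts as text if it is a word that is lowercase after its first letter,
        // a number, a single space/period/comma, or any one character followed by a space
        if(!word.matches("^[A-Za-z][a-z]*|[0-9]+(.[0-9]+)?|[ .,]|. $")){
            nonText ++;
        }
    }
    System.out.print("\n");
    // Fraction of chunks that do not look like plain text
    double percentage = ((double) nonText) / words.length;
    System.out.println(percentage);
    // Above 20% non-text chunks: code; below 10%: text; anything in between is undecided
    if(percentage > .2){
        return CodeStatus.CODE;
    }
    if(percentage < .1){
        return CodeStatus.TEXT;
    }
    return CodeStatus.INDETERMINATE;
}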

enum CodeStatus {
    CODE, TEXT, INDETERMINATE
}

Sample output:

You can try the OpenNLP sentence parser. It returns the n best parses for a sentence. For most English sentences it returns at least one. I believe, that for most code snippets it won't return any and hence you can be quite sure it is not an English sentence.

0.0297029702970297
TEXT
Use this code for parsing:

0.18181818181818182
INDETERMINATE
    // Initialize the sentence detector

0.125
INDETERMINATE
    final SentenceDetectorME sdetector = EasyParserUtils
            .getOpenNLPSentDetector(Constants.SENTENCE_DETECTOR_DATA);

0.6
CODE
    // Initialize the parser

0.16666666666666666
INDETERMINATE
    final Parser parser = EasyParserUtils
            .getOpenNLPParser(Constants.PARSER_DATA_LOC);

0.5333333333333333
CODE
    // Get sentences of the text

0.1
INDETERMINATE
    final String sentences[] = sdetector.sentDetect(essay);

0.38461538461538464
CODE
    // Go through the sentences and parse each

0.07142857142857142
TEXT
    for (final String sentence : sentences) {
        // Parse the sentence, produce only 1 parse
        final Parse[] parses = ParserTool.parseLine(sentence, parser, 10);
        if (parses.length == 0) {
            // Most probably this is code
        }
        else {
            // An English sentence
        }
    }

0.2537313432835821
CODE
and these are the two helper methods (from EasyParserUtils) used in the code:

0.14814814814814814
INDETERMINATE
public static Parser getOpenNLPParser(final String parserDataURL) {
    try (final InputStream isParser = new FileInputStream(parserDataURL);) {
        // Get model for the parser and initialize it
        final ParserModel parserModel = new ParserModel(isParser);
        return ParserFactory.create(parserModel);
    }
    catch (final IOException e) {

0.3835616438356164
CODE

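【Discussion】:

【Solution 6】:

The basic idea is to convert the string into a sequence of tokens. For example, the line of code above might become "KEY,SEPARATOR,ID,ASSIGN,NUMBER,SEPARATOR,...". Then we can use simple rules to separate code from English.

Check out the code here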
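
Since the linked code is not reproduced here, the following is only a hedged sketch of the token-based idea, not the answer's actual implementation. The class name `TokenClassifier`, the keyword subset, the token categories beyond those named in the answer, and the "more than half of the tokens are not plain identifiers" rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenClassifier {
    enum TokenType { KEY, ID, NUMBER, ASSIGN, OPERATOR, SEPARATOR }

    // Small illustrative subset of Java keywords (an assumption, not the full list).
    private static final Set<String> KEYWORDS = Set.of(
            "for", "while", "do", "if", "else", "return", "new", "class",
            "int", "long", "double", "boolean", "void",
            "public", "private", "static", "final");

    // Alternatives are tried in order, so "==" is an operator before "=" is an assignment.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<word>[A-Za-z_$][A-Za-z0-9_$]*)"
            + "|(?<number>\\d+(?:\\.\\d+)?)"
            + "|(?<operator>\\+\\+|--|==|!=|<=|>=|&&|\\|\\||[-+*/%<>!])"
            + "|(?<assign>=)"
            + "|(?<separator>[(){}\\[\\];,.])");

    /** Converts the string into a list of token types; unmatched characters are skipped. */
    static List<TokenType> tokenize(String s) {
        List<TokenType> types = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            if (m.group("word") != null) {
                types.add(KEYWORDS.contains(m.group("word")) ? TokenType.KEY : TokenType.ID);
            } else if (m.group("number") != null) {
                types.add(TokenType.NUMBER);
            } else if (m.group("operator") != null) {
                types.add(TokenType.OPERATOR);
            } else if (m.group("assign") != null) {
                types.add(TokenType.ASSIGN);
            } else {
                types.add(TokenType.SEPARATOR);
            }
        }
        return types;
    }

    /** Fraction of tokens that are something other than a plain identifier/word. */
    static double nonIdRatio(String s) {
        List<TokenType> types = tokenize(s);
        if (types.isEmpty()) {
            return 0.0;
        }
        long nonIds = types.stream().filter(t -> t != TokenType.ID).count();
        return (double) nonIds / types.size();
    }

    /** Hypothetical rule: code if more than half of the tokens are not plain words. */
    static boolean looksLikeCode(String s) {
        return nonIdRatio(s) > 0.5;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("for (int i = 0; i < b.size(); i++) {"));
        System.out.println(looksLikeCode("do something in English (not necessary to be a sentence)."));
    }
}
```

English words that happen to be keywords, such as "do", are tagged KEY here, so a rule based on a single token type would be fragile; a ratio over the whole string is a little more forgiving.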

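【Discussion】:

【Solution 7】:

Here is a perfect and safe solution. The basic idea is to first collect all available keywords and special characters, and then use that set to build a tokenizer. For example, the line of code in the question becomes "KEY,SEPARATOR,ID,ASSIGN,NUMBER,SEPARATOR,...". Then we can use simple rules to separate code from English.

【Discussion】: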