【Question title】: Most efficient way to check if a pair is present in a text
【Posted】: 2014-01-20 18:14:53
【Question description】:

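Introduction:

One of the features used by many sentiment analysis programs is computed by assigning specific scores to the relevant unigrams, bigrams, or pairs according to a lexicon. In more detail:

An example lexicon could be: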

//unigrams
good 1
bad -1
great 2
//bigrams
good idea 1
bad idea -1
//pairs (--- stands for whatever):
hold---up   -0.62
how---i still -0.62

Given an example text T, for each unigram, bigram, or pair in T, I want to check whether a corresponding entry exists in the lexicon.

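The unigram/bigram part is easy: I load the lexicon into a Map, then iterate over my text and check whether each word is present in the map. My problem is detecting the pairs.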
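For reference, the unigram/bigram lookup described above might look like the following minimal sketch. The class name `NgramLexiconScorer`, the whitespace tokenization, and the lowercasing are assumptions made for illustration; the question does not show this code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not from the question: scores unigrams and adjacent bigrams.
class NgramLexiconScorer {
    // Keys are unigrams ("good") or space-joined bigrams ("good idea").
    private final Map<String, Double> lexicon = new HashMap<>();

    void put(String ngram, double score) {
        lexicon.put(ngram, score);
    }

    // Sums the scores of every unigram and every adjacent bigram found in the lexicon.
    double score(String text) {
        // Assumption: tokens are separated by whitespace and matching is case-insensitive.
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        double total = 0.0;
        for (int i = 0; i < tokens.length; i++) {
            total += lexicon.getOrDefault(tokens[i], 0.0);
            if (i > 0) {
                total += lexicon.getOrDefault(tokens[i - 1] + " " + tokens[i], 0.0);
            }
        }
        return total;
    }
}
```

Each token costs one or two hash lookups, so the scan is linear in the length of the text and independent of the lexicon size. That is the property the pair detection is missing.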

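My question:

One way to check whether a specific pair is present in a text is to iterate over the whole lexicon of pairs and run a regex against the text, checking for each entry whether "start_of_pair.*end_of_pair" occurs in the text. This seems very wasteful, because I have to iterate over the entire lexicon for every text I analyze. Any ideas on how to do this in a smarter way?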
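To make the cost concrete, here is a rough sketch of that regex baseline. The class name `NaivePairScanner`, the flattened `first---second` key format, and the word-boundary anchors are illustrative assumptions, not the asker's actual code.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical baseline: one regex search over the text per lexicon entry.
class NaivePairScanner {
    // pairLexicon keys are assumed to look like "hold---up"; values are the pair scores.
    static double score(String text, Map<String, Double> pairLexicon) {
        double total = 0.0;
        for (Map.Entry<String, Double> entry : pairLexicon.entrySet()) {
            String[] parts = entry.getKey().split("---", 2);
            if (parts.length != 2) {
                continue; // skip malformed keys
            }
            // Pattern.quote keeps lexicon text from being interpreted as regex syntax.
            Pattern pattern = Pattern.compile(
                    "\\b" + Pattern.quote(parts[0]) + "\\b.*\\b" + Pattern.quote(parts[1]) + "\\b");
            if (pattern.matcher(text).find()) {
                total += entry.getValue();
            }
        }
        return total;
    }
}
```

Every text triggers one pattern compilation and one search per lexicon entry, so the work grows with the size of the lexicon rather than with the number of tokens in the text.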

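【Comments】:

As a simple first pass, you could look for the first word of the pair and, if it is found, look for the second word in the rest of the text.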
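A minimal sketch of that first pass, assuming whitespace tokenization and single-word halves (the class and method names are hypothetical):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the comment's suggestion.
class FirstPassPairCheck {
    // Returns true if `first` occurs as a token and `second` occurs as a later token.
    static boolean containsPair(String text, String first, String second) {
        List<String> tokens = Arrays.asList(text.trim().split("\\s+"));
        int start = tokens.indexOf(first);
        if (start < 0) {
            return false; // the first half never occurs
        }
        // Checking after the earliest occurrence is enough: anything that follows
        // a later occurrence of `first` also follows the earliest one.
        return tokens.subList(start + 1, tokens.size()).contains(second);
    }
}
```

This avoids the regex engine, but it still has to be called once per lexicon entry, so it does not by itself remove the iteration over the whole lexicon.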

Tags: java regex dictionary pattern-matching sentiment-analysis

【Solution 1】:

A frequency map for the bigrams could be implemented as:

Map<String, Map<String, Integer>> bigramFrequencyMap = new TreeMap<>();

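Fill the map with the desired bigrams, each with an initial frequency of 0. The outer key is the first lexeme and the inner key is the second lexeme, which maps to the frequency count.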

static final int MAX_DISTANCE = 5;

The lexical scan would then keep the last MAX_DISTANCE lexemes.

// Second-lexeme frequency maps of the most recent bigram-starting lexemes, newest first.
List<Map<String, Integer>> lastLexemesSecondFrequencies = new ArrayList<>();

void processLexeme() {
     String lexeme = readLexeme();

     // Check whether this lexeme completes a bigram started by a recent lexeme:
     for (Map<String, Integer> prior : lastLexemesSecondFrequencies) {
          Integer freq = prior.get(lexeme);
          if (freq != null) {
              prior.put(lexeme, 1 + freq);
          }
     }

     // If this lexeme starts a registered bigram, remember its map of second lexemes.
     Map<String, Integer> lexemeSecondFrequencies =
             bigramFrequencyMap.get(lexeme);
     if (lexemeSecondFrequencies != null) {
         // Could remove lexemeSecondFrequencies if present in lastLexemesSecondFrequencies.
         lastLexemesSecondFrequencies.add(0, lexemeSecondFrequencies); // addFirst
         if (lastLexemesSecondFrequencies.size() > MAX_DISTANCE) {
             lastLexemesSecondFrequencies.remove(lastLexemesSecondFrequencies.size() - 1); // removeLast
         }
     }
}

The optimization is to keep the second halves of the bigrams and to process only the registered bigrams.
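A self-contained variant of the same idea can be sketched as follows. It differs from the snippet above in one labeled way: the window holds the last MAX_DISTANCE lexemes of any kind, not only those that start a registered pair. The class name `WindowedPairCounter` and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: counts registered (first, second) pairs whose halves
// occur at most MAX_DISTANCE lexemes apart.
class WindowedPairCounter {
    static final int MAX_DISTANCE = 5;

    // first lexeme -> (second lexeme -> frequency); only registered pairs are counted.
    private final Map<String, Map<String, Integer>> pairFrequencies = new HashMap<>();
    // The most recent lexemes, newest first, capped at MAX_DISTANCE entries.
    private final Deque<String> window = new ArrayDeque<>();

    void register(String first, String second) {
        pairFrequencies.computeIfAbsent(first, k -> new HashMap<>()).put(second, 0);
    }

    void process(String lexeme) {
        // Does the current lexeme complete a pair opened by a lexeme still in the window?
        for (String prior : window) {
            Map<String, Integer> seconds = pairFrequencies.get(prior);
            if (seconds != null && seconds.containsKey(lexeme)) {
                seconds.merge(lexeme, 1, Integer::sum);
            }
        }
        window.addFirst(lexeme);
        if (window.size() > MAX_DISTANCE) {
            window.removeLast();
        }
    }

    int frequency(String first, String second) {
        Map<String, Integer> seconds = pairFrequencies.get(first);
        return seconds == null ? 0 : seconds.getOrDefault(second, 0);
    }
}
```

Each lexeme costs at most MAX_DISTANCE hash lookups regardless of the lexicon size. Note that a first half appearing twice inside the window makes a matching second half count twice.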

【Discussion】:

【Solution 2】:

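In the end I solved it this way: I load the pair lexicon as a Map<String, Map<String, Float>>, where the outer key is the first half of the pair and the inner map contains all possible endings for that first half together with the corresponding sentiment values.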
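Building that nested map from lexicon lines such as `hold---up   -0.62` might look like the sketch below. The loader class, its name, and the line format handling (score as the last whitespace-separated field, `//` comment lines skipped) are assumptions based on the example lexicon in the question, not the author's actual loading code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical loader for the pair section of the lexicon.
class PairLexiconLoader {
    static Map<String, Map<String, Float>> load(List<String> lines) {
        Map<String, Map<String, Float>> firstPartMap = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("//")) {
                continue; // skip blank lines and section comments
            }
            // Split at the last run of whitespace: left side is the pair, right side the score.
            String[] fields = trimmed.split("\\s+(?=\\S+$)", 2);
            if (fields.length != 2) {
                continue;
            }
            String[] halves = fields[0].split("---", 2);
            if (halves.length != 2) {
                continue; // not a pair entry
            }
            firstPartMap
                    .computeIfAbsent(halves[0], k -> new HashMap<>())
                    .put(halves[1], Float.parseFloat(fields[1]));
        }
        return firstPartMap;
    }
}
```

With this layout, a single lookup on the current token returns every ending it can open, which is what removes the per-text iteration over the whole lexicon.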

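Basically, I keep a list of possible endings (enabledTokens) that grows each time I read a new token; I then search this list to see whether the current token is the ending of a pair started by some earlier token.

With a few modifications to prevent the previous token from being used immediately as an ending, here is my code: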

private Map<String, Map<String, Float>> firstPartMap;
private List<LexiconPair> enabledTokensForUnigrams, enabledTokensForBigrams;
private Queue<List<LexiconPair>> pairsForBigrams; //is initialized with two empty lists
private Token oldToken;

public void parseToken(Token token) {
    String unigram = token.getText();
    String bigram = null;
    if (oldToken != null) {
        bigram = oldToken.getText() + " " + token.getText();
    }

    checkIfPairMatchesAndUpdateFeatures(unigram, enabledTokensForUnigrams);
    checkIfPairMatchesAndUpdateFeatures(bigram, enabledTokensForBigrams);

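    // Endings of the pairs that start with the current unigram or bigram
    // (toPairs is assumed to return a mutable, possibly empty list, even for null input).
    List<LexiconPair> pairEndings = toPairs(firstPartMap.get(unigram));
    if (bigram != null) {
        pairEndings.addAll(toPairs(firstPartMap.get(bigram)));
    }
    pairsForBigrams.add(pairEndings);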

    enabledTokensForUnigrams.addAll(pairEndings);
    enabledTokensForBigrams.addAll(pairsForBigrams.poll());

    oldToken = token;
}
private void checkIfPairMatchesAndUpdateFeatures(String text, List<LexiconPair> listToCheck) {
    Iterator<LexiconPair> iter = listToCheck.iterator();
    while (iter.hasNext()) {
        LexiconPair next = iter.next();
        if (next.getText().equals(text)) {
            float val = next.getValue();
            POLARITY polarity = getPolarity(val);
            for (LexiconFeatureSubset lfs : lexiconsFeatures) {
                lfs.handleNewValue(Math.abs(val), polarity);
            }
            //iter.remove();
            //return; //remove only 1 occurrence
        }
    }
}
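
The core of this approach can be condensed into the following stand-alone sketch. It is a simplification under stated assumptions: it handles single-token halves only, sums raw scores instead of updating feature subsets, and indexes the enabled endings by text in a hash map where the code above iterates over a list. The class name `EnabledEndingsScanner` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical condensed version of the "enabled endings" scan.
class EnabledEndingsScanner {
    private final Map<String, Map<String, Float>> firstPartMap;
    // ending text -> scores of the pairs that this ending would complete
    private final Map<String, List<Float>> enabledEndings = new HashMap<>();
    private float total = 0f;

    EnabledEndingsScanner(Map<String, Map<String, Float>> firstPartMap) {
        this.firstPartMap = firstPartMap;
    }

    void parseToken(String token) {
        // 1. Does the current token complete a pair opened by an earlier token?
        for (float value : enabledEndings.getOrDefault(token, Collections.<Float>emptyList())) {
            total += value;
        }
        // 2. Enable the endings this token opens. Doing this after step 1 means
        //    a token can never complete a pair that it has just opened itself.
        Map<String, Float> endings = firstPartMap.get(token);
        if (endings != null) {
            for (Map.Entry<String, Float> e : endings.entrySet()) {
                enabledEndings.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
    }

    float total() {
        return total;
    }
}
```

As in the code above with `iter.remove()` commented out, enabled endings are never removed, so every later occurrence of an ending is counted again.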

【Discussion】: