【Question title】: Fuzzy matching multiple words in string
【Posted】: 2014-03-04 03:15:02
【Question description】:

I am trying to use Levenshtein distance to find fuzzy keywords (static text) on an OCR'd page.
To do this, I want to specify the percentage of errors that is allowed (say, 15%).

string Keyword = "past due electric service";

Since the keyword is 25 characters long, I want to allow 4 errors (25 * .15, rounded up).
I need to be able to compare it against...

string Entire_OCR_Page = "previous bill amount payment received on 12/26/13 thank " +
                         "you! current electric service total balances unpaid 7 " +
                         "days after the total due date are subject to a late " +
                         "charge of 7.5% of the amount due or $2.00, whichever/5 " +
                         "greater. ";

This is how I am doing it now...

int LevenshteinDistance = LevenshteinAlgorithm(Keyword, Entire_OCR_Page); // = 202
int NumberOfErrorsAllowed = 4;
int Allowance = (Entire_OCR_Page.Length - Keyword.Length) + NumberOfErrorsAllowed; // = 205

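Clearly, Keyword is not found in Entire_OCR_Page (and it should not be). However, using the Levenshtein distance, the number of errors is less than the 15% leeway (so my logic says it was found).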
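Why the whole-page check passes: the Levenshtein distance between a short keyword and a long page can never be smaller than their length difference, and it equals that difference whenever the keyword's characters occur somewhere in the page in the right order, even when they are scattered far apart. The sketch below reproduces the check with made-up strings; the names `WholePageCheck` and `NaiveMatch` are illustrative and do not come from the question.

```csharp
using System;

public static class WholePageCheck
{
    // Standard Levenshtein distance: insert, delete and substitute each cost 1.
    public static int Levenshtein(string s, string t)
    {
        int[,] d = new int[s.Length + 1, t.Length + 1];
        for (int i = 0; i <= s.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= t.Length; j++) d[0, j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[s.Length, t.Length];
    }

    // The check from the question: distance <= (length difference + allowed errors).
    public static bool NaiveMatch(string keyword, string page, int errorsAllowed)
    {
        int allowance = (page.Length - keyword.Length) + errorsAllowed;
        return Levenshtein(keyword, page) <= allowance;
    }
}
```

For example, `WholePageCheck.NaiveMatch("abc", "a---b---c", 0)` reports a match with zero errors allowed, although "abc" never appears as a contiguous run: deleting the six dashes is exactly the length difference.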

Does anyone know a better way to do this?

【Question comments】:

Tags: c# ocr levenshtein-distance fuzzy-search


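【Solution 1】:

I answered my own question by using substrings. Posting it in case anyone else runs into the same kind of problem. It is a little unorthodox, but it works well for me.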

decimal PossibleStringLength = PossibleString.Length; // Length of the string to search
decimal StaticTextLength = StaticText.Length; // Length of the text to search for
decimal NumberOfErrorsAllowed = Math.Round(StaticTextLength * (ErrorAllowance / 100m), MidpointRounding.AwayFromZero); // Number of errors allowed for the given ErrorAllowance percentage
int TextLengthBuffer = (int)StaticTextLength - 1; // Start looking with one character fewer than the text should have (declared after StaticTextLength, which it depends on)
int LowestLevenshteinNumber = 999999; // Initialize to a very high maximum

// Look for the best match with 1 character fewer than it should have, then with the correct
// number of characters, and last with 1 more character. (This is because one letter can be
// recognized as two (W -> VV) and vice versa.)

for (int i = 0; i < 3; i++) 
{
    for (int e = TextLengthBuffer; e <= (int)PossibleStringLength; e++)
    {
        string possibleResult = (PossibleString.Substring((e - TextLengthBuffer), TextLengthBuffer));
        int lAllowance = (int)(Math.Round((possibleResult.Length - StaticTextLength) + (NumberOfErrorsAllowed), MidpointRounding.AwayFromZero));
        int lNumber = LevenshteinAlgorithm(StaticText, possibleResult);

        if (lNumber <= lAllowance && ((lNumber < LowestLevenshteinNumber) || (TextLengthBuffer == StaticText.Length && lNumber <= LowestLevenshteinNumber)))
        {
            PossibleResult = (new StaticTextResult { text = possibleResult, errors = lNumber });
            LowestLevenshteinNumber = lNumber;
        }
    }
    TextLengthBuffer++;
}
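The loop above relies on variables declared elsewhere (`PossibleString`, `StaticText`, `ErrorAllowance`, `PossibleResult`). As a rough, self-contained sketch of the same sliding-window idea, the version below tries every substring whose length is one less than, equal to, or one more than the keyword. It is a simplification, not the answer's exact logic: it uses a flat error budget instead of adjusting the allowance by the window length, and it keeps the first window with the lowest distance. The names `WindowedFuzzySearch` and `BestWindowDistance` are hypothetical.

```csharp
using System;

public static class WindowedFuzzySearch
{
    // Standard Levenshtein distance: insert, delete and substitute each cost 1.
    public static int Levenshtein(string s, string t)
    {
        int[,] d = new int[s.Length + 1, t.Length + 1];
        for (int i = 0; i <= s.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= t.Length; j++) d[0, j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[s.Length, t.Length];
    }

    // Returns the lowest distance among windows of length keyword.Length - 1,
    // keyword.Length and keyword.Length + 1 that stays within the error budget,
    // or -1 (with bestText == null) when no window qualifies.
    public static int BestWindowDistance(string keyword, string page,
                                         decimal errorPercent, out string bestText)
    {
        bestText = null;
        int allowed = (int)Math.Round(keyword.Length * (errorPercent / 100m),
                                      MidpointRounding.AwayFromZero);
        int best = int.MaxValue;

        for (int len = keyword.Length - 1; len <= keyword.Length + 1; len++)
        {
            if (len < 1 || len > page.Length) continue; // window does not fit

            for (int start = 0; start + len <= page.Length; start++)
            {
                string window = page.Substring(start, len);
                int distance = Levenshtein(keyword, window);
                if (distance <= allowed && distance < best)
                {
                    best = distance;
                    bestText = window;
                }
            }
        }
        return bestText == null ? -1 : best;
    }
}
```

With the keyword "past due electric service" and a 15% allowance, the budget is 4 errors, so a page containing "pa5t due electr1c service" is accepted with a distance of 2. Each window costs a full distance computation, so this is quadratic in the keyword length for every position on the page.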




public static int LevenshteinAlgorithm(string s, string t) // Levenshtein Algorithm
{
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];

    if (n == 0)
    {
        return m;
    }

    if (m == 0)
    {
        return n;
    }

    for (int i = 0; i <= n; d[i, 0] = i++)
    {
    }

    for (int j = 0; j <= m; d[0, j] = j++)
    {
    }

    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= m; j++)
        {
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    return d[n, m];
}
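An alternative that avoids re-running the algorithm for every window is the approximate substring variant of the same dynamic program, often attributed to Sellers. This is a different technique from the answer's, named here plainly: the row for the empty pattern is set to zero so a match may start anywhere in the text, and the minimum over the pattern's final row gives the best distance to any substring. The sketch below is a minimal version under those assumptions; `ApproximateSubstring` and `BestSubstringDistance` are hypothetical names.

```csharp
using System;

public static class ApproximateSubstring
{
    // Lowest edit distance between `pattern` and any substring of `text`.
    public static int BestSubstringDistance(string pattern, string text)
    {
        int n = pattern.Length;
        int[] prev = new int[n + 1]; // column for the previous text position
        int[] curr = new int[n + 1]; // column for the current text position

        // Against an empty text, matching i pattern characters costs i deletions.
        for (int i = 0; i <= n; i++) prev[i] = i;
        int best = prev[n];

        for (int j = 1; j <= text.Length; j++)
        {
            curr[0] = 0; // a match may start at any text position at no cost

            for (int i = 1; i <= n; i++)
            {
                int cost = pattern[i - 1] == text[j - 1] ? 0 : 1;
                curr[i] = Math.Min(
                    Math.Min(prev[i] + 1, curr[i - 1] + 1),
                    prev[i - 1] + cost);
            }

            // curr[n] is the best distance of a match ending at text position j.
            best = Math.Min(best, curr[n]);

            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return best;
    }
}
```

The result can be compared directly against the allowed error count (4 for the 25-character keyword at 15%), with no length-difference term. It runs in time proportional to pattern length times text length and only keeps two columns in memory.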

【Comments】:

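    【Solution 2】:

    I think it is not working because a large part of your string counts as a match. So what I would do is try splitting your keyword into individual words.

    Then find all the places in your OCR_TEXT where those words match.

    Then look at all of those matching places and see whether any 4 of them are consecutive and match the original phrase.

    Not sure if my explanation is clear?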
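    The word-by-word idea can be sketched roughly as follows, assuming words are separated by whitespace and each word gets its own error budget. The names `WordwiseFuzzySearch`, `FindPhrase` and `errorsPerWord` are illustrative and not part of the answer.

```csharp
using System;

public static class WordwiseFuzzySearch
{
    // Standard Levenshtein distance: insert, delete and substitute each cost 1.
    public static int Levenshtein(string s, string t)
    {
        int[,] d = new int[s.Length + 1, t.Length + 1];
        for (int i = 0; i <= s.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= t.Length; j++) d[0, j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[s.Length, t.Length];
    }

    // Returns the index of the first page word at which every keyword word
    // matches the page words consecutively, each within errorsPerWord edits,
    // or -1 if there is no such position.
    public static int FindPhrase(string keyword, string page, int errorsPerWord)
    {
        char[] separators = { ' ', '\t', '\r', '\n' };
        string[] keyWords = keyword.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        string[] pageWords = page.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        if (keyWords.Length == 0) return -1;

        for (int start = 0; start + keyWords.Length <= pageWords.Length; start++)
        {
            bool allMatch = true;
            for (int k = 0; k < keyWords.Length; k++)
            {
                if (Levenshtein(keyWords[k], pageWords[start + k]) > errorsPerWord)
                {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) return start;
        }
        return -1;
    }
}
```

    One consequence of this design is that the budget applies per word rather than to the phrase as a whole, and an OCR error that merges or splits words shifts the word boundaries so the consecutive comparison no longer lines up.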

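    【Comments】:

    • If I understand your answer correctly, I would lose the ability to declare NumberOfErrorsAllowed. No?
    • Yes and no; it would be per word.
    • Per word won't work. A word could be "I", and if it is recognized as "1", I lose the result. See the answer I came up with. Thanks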