【Question title】:C++ string diff (a la Python's difflib)
【Posted】:2010-09-21 03:49:50
【Question】:

I'm trying to diff two strings to determine whether they differ in only a single numeric subset of the string structure; for example,

varies_in_single_number_field('foo7bar', 'foo123bar')
# Returns True, because 7 != 123, and there's only one varying
# number region between the two strings.

In Python, I can do this with difflib:

import difflib, doctest

def varies_in_single_number_field(str1, str2):
    """
    A typical use case is as follows:
        >>> varies_in_single_number_field('foo7bar00', 'foo123bar00')
        True

    Numerical variation in two dimensions is no good:
        >>> varies_in_single_number_field('foo7bar00', 'foo123bar01')
        False

    Varying in a nonexistent field is okay:
        >>> varies_in_single_number_field('foobar00', 'foo123bar00')
        True

    Identical strings don't *vary* in any number field:
        >>> varies_in_single_number_field('foobar00', 'foobar00')
        False
    """
    in_differing_substring = False
    passed_differing_substring = False # There should be only one.
    differ = difflib.Differ()
    for letter_diff in differ.compare(str1, str2):
        letter = letter_diff[2:]
        if letter_diff.startswith(('-', '+')):
            if passed_differing_substring: # Already saw a varying field.
                return False
            in_differing_substring = True
            if not letter.isdigit(): return False # Non-digit diff character.
        elif in_differing_substring: # Diff character not found - end of diff.
            in_differing_substring = False
            passed_differing_substring = True
    return passed_differing_substring # No variation if no diff was passed.

if __name__ == '__main__': doctest.testmod()

But I don't know how to find something like difflib for C++. Alternative approaches are welcome. :)

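【Question discussion】:

  • Just to clarify: is it the letters that matter, or the numbers? It sounds like, out of every pair of number runs, you want exactly one pair to differ?
  • All characters must be identical, except that one "numeric string position" must vary numerically. Does that make more sense?
  • So basically, you are looking for A1*B1 == A2B2 where * is a sequence of digits?
  • I don't think that description is quite right... more like: a =~ /(.*?)(\d*)(.*)/; b =~ /(.*?)(\d*)(.*)/ where at least one of the center groups must be non-empty, the numeric center groups must be numerically unequal, and the first and third groups must be equal.
  • OK, I think I almost have a solution, one more minute :)

Tags: c++ python algorithm diff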


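【Solution 1】:

This might work; it at least passes your demo tests. EDIT: I made a few changes to deal with some string indexing issues. I believe it should be good now.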

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <cctype>

bool starts_with(const std::string &s1, const std::string &s2) {
    return (s1.length() <= s2.length()) && (s2.substr(0, s1.length()) == s1);
}

bool ends_with(const std::string &s1, const std::string &s2) {
    return (s1.length() <= s2.length()) && (s2.substr(s2.length() - s1.length()) == s1);
}

bool is_numeric(const std::string &s) {
    for(std::string::const_iterator it = s.begin(); it != s.end(); ++it) {
        // Cast first: passing a negative char to std::isdigit is undefined.
        if(!std::isdigit(static_cast<unsigned char>(*it))) {
            return false;
        }
    }
    return true;
}

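bool varies_in_single_number_field(std::string s1, std::string s2) {

    if(s1 == s2) {
        return false;
    }

    if((s1.empty() && is_numeric(s2)) || (s2.empty() && is_numeric(s1))) {
        return true;
    }

    // Make s1 the longer string, so the indices below always refer to it.
    if(s1.length() < s2.length()) {
        s1.swap(s2);
    }

    // Set the indices only after the swap; s1 is non-empty at this point.
    size_t index1 = 0;
    size_t index2 = s1.length() - 1;

    // Grow the prefix of s1 until it stops being a prefix of s2.
    while(index1 < s1.length() && starts_with(s1.substr(0, index1), s2)) { index1++; }
    // Grow the suffix of s1 until it stops being a suffix of s2.
    while(ends_with(s1.substr(index2), s2)) { index2--; }

    // Whatever lies between the matched prefix and suffix must be all digits.
    return is_numeric(s1.substr(index1 - 1, (index2 + 1) - (index1 - 1)));

}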

int main() {
    std::cout << std::boolalpha << varies_in_single_number_field("foo7bar00", "foo123bar00") << std::endl;
    std::cout << std::boolalpha << varies_in_single_number_field("foo7bar00", "foo123bar01") << std::endl;
    std::cout << std::boolalpha << varies_in_single_number_field("foobar00", "foo123bar00") << std::endl;
    std::cout << std::boolalpha << varies_in_single_number_field("foobar00", "foobar00") << std::endl;
    std::cout << std::boolalpha << varies_in_single_number_field("7aaa", "aaa") << std::endl;
    std::cout << std::boolalpha << varies_in_single_number_field("aaa7", "aaa") << std::endl;
    std::cout << std::boolalpha << varies_in_single_number_field("aaa", "7aaa") << std::endl;
    std::cout << std::boolalpha << varies_in_single_number_field("aaa", "aaa7") << std::endl;
}

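Basically, it looks for a string made of three parts, where string2 starts with part1, string2 ends with part3, and part2 consists only of digits.

【Discussion】:

  • You may want to rework this to make it O(n). Right now you have a quadratic solution. (You don't need to keep checking whether the string starts with a given substring; just check character by character.)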
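
One way to make the prefix and suffix scans linear, as the comment suggests, is to walk both strings once with std::mismatch instead of re-testing ever-longer substrings. The sketch below is only an illustration; common_prefix_length and common_suffix_length are hypothetical helper names, not part of the answer's code.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Length of the longest common prefix of a and b, found in a single pass.
std::size_t common_prefix_length(const std::string &a, const std::string &b) {
    // Iterate over the shorter string so the scan never runs off the end.
    const std::string &shorter = a.size() <= b.size() ? a : b;
    const std::string &longer  = a.size() <= b.size() ? b : a;
    // std::mismatch stops at the first position where the two ranges differ.
    return static_cast<std::size_t>(
        std::mismatch(shorter.begin(), shorter.end(), longer.begin()).first
        - shorter.begin());
}

// Length of the longest common suffix: the same scan on reverse iterators.
std::size_t common_suffix_length(const std::string &a, const std::string &b) {
    const std::string &shorter = a.size() <= b.size() ? a : b;
    const std::string &longer  = a.size() <= b.size() ? b : a;
    return static_cast<std::size_t>(
        std::mismatch(shorter.rbegin(), shorter.rend(), longer.rbegin()).first
        - shorter.rbegin());
}
```

When the two lengths are used together, the suffix should be capped at the shorter string's length minus the prefix, so that the matched prefix and suffix do not overlap.
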
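【Solution 2】:

This may be overkill, but you could use boost to interface with Python. At worst, difflib is implemented in pure Python, and it isn't too long. It should be possible to port it from Python to C...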
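
As a rough illustration of what a port might look like, here is a minimal character-level diff in C++ with the question's check layered on top. It is a sketch under assumptions: it uses a plain longest-common-subsequence table rather than the matching heuristic difflib uses, so its output can differ from difflib's on some inputs, and char_diff and DiffEntry are hypothetical names. Unlike the Python loop in the question, it also counts a difference that runs to the end of the string.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One diff entry: first is ' ' (common), '-' (only in a) or '+' (only in b);
// second is the character itself.
typedef std::pair<char, char> DiffEntry;

// Character-level diff built from a longest-common-subsequence table.
// Assumption: this stands in for difflib.Differ, which matches differently.
std::vector<DiffEntry> char_diff(const std::string &a, const std::string &b) {
    const std::size_t n = a.size(), m = b.size();
    // lcs[i][j] holds the LCS length of the suffixes a[i..] and b[j..].
    std::vector<std::vector<std::size_t> > lcs(
        n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = n; i-- > 0;) {
        for (std::size_t j = m; j-- > 0;) {
            lcs[i][j] = (a[i] == b[j])
                ? lcs[i + 1][j + 1] + 1
                : std::max(lcs[i + 1][j], lcs[i][j + 1]);
        }
    }
    // Walk the table from the front, emitting one entry per character.
    std::vector<DiffEntry> out;
    std::size_t i = 0, j = 0;
    while (i < n && j < m) {
        if (a[i] == b[j]) {
            out.push_back(DiffEntry(' ', a[i])); ++i; ++j;
        } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
            out.push_back(DiffEntry('-', a[i])); ++i;
        } else {
            out.push_back(DiffEntry('+', b[j])); ++j;
        }
    }
    for (; i < n; ++i) out.push_back(DiffEntry('-', a[i]));
    for (; j < m; ++j) out.push_back(DiffEntry('+', b[j]));
    return out;
}

// Same idea as the Python function: exactly one differing region, digits only.
bool varies_in_single_number_field(const std::string &a, const std::string &b) {
    bool in_diff = false, passed_diff = false;
    const std::vector<DiffEntry> diff = char_diff(a, b);
    for (std::size_t k = 0; k < diff.size(); ++k) {
        if (diff[k].first != ' ') {
            if (passed_diff) return false;  // a second varying region
            if (!std::isdigit(static_cast<unsigned char>(diff[k].second)))
                return false;               // a non-digit character differs
            in_diff = true;
        } else if (in_diff) {
            in_diff = false;
            passed_diff = true;
        }
    }
    // A region still open at the end of the strings counts as well.
    return passed_diff || in_diff;
}
```

The table costs O(n*m) time and memory, which is fine for short identifiers but worth keeping in mind for long inputs.
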

【Discussion】:

  • I really wish I could use a Python library, but there are some outside forces preventing me from doing so. :) Maybe a port is in order.
【Solution 3】:

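You can take an ad hoc approach: you want to match strings s and s', where s=abc and s'=ab'c, and b and b' should be two different numbers (possibly empty). So:

  1. Compare the strings from the left, character by character, until you find a differing character, then stop.
  2. Similarly, compare the strings from the right until you find a differing character or hit the left marker.
  3. Then check the remainders in the middle to see whether they are both numbers.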
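
The three steps above can be sketched in C++ roughly as follows. This is a minimal sketch, not the code from any of the answers: all_digits is a hypothetical helper, and "different" is taken to mean textually different digit runs (so "007" and "7" count as different), which is an assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// True when every character of s is a decimal digit (also true for "").
static bool all_digits(const std::string &s) {
    return std::all_of(s.begin(), s.end(),
                       [](unsigned char c) { return std::isdigit(c) != 0; });
}

bool varies_in_single_number_field(const std::string &a, const std::string &b) {
    const std::size_t limit = std::min(a.size(), b.size());

    // Step 1: scan from the left to find the length of the common prefix.
    std::size_t prefix = 0;
    while (prefix < limit && a[prefix] == b[prefix]) ++prefix;

    // Step 2: scan from the right, never crossing the left marker.
    std::size_t suffix = 0;
    while (suffix < limit - prefix &&
           a[a.size() - 1 - suffix] == b[b.size() - 1 - suffix]) {
        ++suffix;
    }

    // Step 3: the middles must both be digit-only and must not be equal.
    const std::string mid_a = a.substr(prefix, a.size() - prefix - suffix);
    const std::string mid_b = b.substr(prefix, b.size() - prefix - suffix);
    return mid_a != mid_b && all_digits(mid_a) && all_digits(mid_b);
}
```

Each character is visited a bounded number of times, so the whole check runs in O(n). For example, varies_in_single_number_field("foo7bar00", "foo123bar00") leaves the middles "7" and "123" and returns true, while identical strings leave two empty middles and return false.
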

【Discussion】:

  • LOL, that's a description of the algorithm implemented in my answer :-P
  • Good call. Simultaneous answers, I guess.
【Solution 4】:

How about using something like boost::regex?

// Pseudocode; may or may not compile.
#include <string>
#include <boost/regex.hpp>

bool match_except_numbers(const std::string& s1, const std::string& s2) {
    static const boost::regex fooNumberBar("foo\\d+bar");
    return boost::regex_match(s1, fooNumberBar) && boost::regex_match(s2, fooNumberBar);
}

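【Discussion】:

  • Unfortunately, regular expressions aren't powerful enough to handle the case where the varying /\d+/ appears at an unknown position in the string. I believe you'd need at least a context-free grammar, although I haven't actually sat down and worked out a grammar for this problem.
【Solution 5】: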

@Evan Teran: Looks like we were working in parallel. My O(n) implementation is noticeably less readable:

#include <cassert>
#include <cctype>
#include <string>
#include <sstream>
#include <iostream>

using namespace std;

ostringstream debug;
const bool DEBUG = true;

bool varies_in_single_number_field(const string &str1, const string &str2) {
    bool in_difference = false;
    bool passed_difference = false;
    string str1_digits, str2_digits;
    size_t str1_iter = 0, str2_iter = 0;
    while (str1_iter < str1.size() && str2_iter < str2.size()) {
        const char &str1_char = str1.at(str1_iter);
        const char &str2_char = str2.at(str2_iter);
        debug << "str1: " << str1_char << "; str2: " << str2_char << endl;
        if (str1_char == str2_char) {
            if (in_difference) {
                in_difference = false;
                passed_difference = true;
            }
            ++str1_iter, ++str2_iter;
            continue;
        }
        in_difference = true;
        if (passed_difference) { /* Already passed a difference. */
            debug << "Already passed a difference." << endl;
            return false;
        }
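        /* Cast first: passing a negative char to isdigit is undefined. */
        bool str1_char_is_digit = isdigit(static_cast<unsigned char>(str1_char));
        bool str2_char_is_digit = isdigit(static_cast<unsigned char>(str2_char));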
        if (str1_char_is_digit && !str2_char_is_digit) {
            ++str1_iter;
            str1_digits.push_back(str1_char);
        } else if (!str1_char_is_digit && str2_char_is_digit) {
            ++str2_iter;
            str2_digits.push_back(str2_char);
        } else if (str1_char_is_digit && str2_char_is_digit) {
            ++str1_iter, ++str2_iter;
            str1_digits.push_back(str1_char);
            str2_digits.push_back(str2_char);
        } else { /* Both are non-digits and they're different. */
            return false;
        }
    }
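    /* Digits left over at the end of either string extend the differing
       field, as long as no earlier difference has already been closed. */
    if (!passed_difference) {
        while (str1_iter < str1.size()
                && isdigit(static_cast<unsigned char>(str1.at(str1_iter)))) {
            str1_digits.push_back(str1.at(str1_iter++));
            in_difference = true;
        }
        while (str2_iter < str2.size()
                && isdigit(static_cast<unsigned char>(str2.at(str2_iter)))) {
            str2_digits.push_back(str2.at(str2_iter++));
            in_difference = true;
        }
    }
    if (in_difference) {
        in_difference = false;
        passed_difference = true;
    }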
    string str1_remainder = str1.substr(str1_iter);
    string str2_remainder = str2.substr(str2_iter);
    debug << "Got to exit point; passed difference: " << passed_difference
        << "; str1 digits: " << str1_digits
        << "; str2 digits: " << str2_digits
        << "; str1 remainder: " << str1_remainder
        << "; str2 remainder: " << str2_remainder
        << endl;
    return passed_difference
        && (str1_digits != str2_digits)
        && (str1_remainder == str2_remainder);
}

int main() {
    assert(varies_in_single_number_field("foo7bar00", "foo123bar00") == true);
    assert(varies_in_single_number_field("foo7bar00", "foo123bar01") == false);
    assert(varies_in_single_number_field("foobar00", "foo123bar00") == true);
    assert(varies_in_single_number_field("foobar00", "foobar00") == false);
    assert(varies_in_single_number_field("foobar00", "foobaz00") == false);
    assert(varies_in_single_number_field("foo00bar", "foo01barz") == false);
    assert(varies_in_single_number_field("foo01barz", "foo00bar") == false);
    if (DEBUG) {
        cout << debug.str();
    }
    return 0;
}

【Discussion】:
