Match elements in two strings when whitespace is inserted in one of them: answers

【Question title】: Match elements in two strings when whitespace is inserted in one of them
【Posted】: 2021-11-13 14:13:01
【Question】:

I have a large number of string pairs, for example:

s1 = 'newyork city lights are yellow'
s2 = ' the city of new york is large'

I want to write a function that takes s1 and s2 (in either order) and outputs:

s1_output = 'new york city lights are yellow'
s2_output = 'the city of new york is large'

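so that newyork in s1 is split into new york. Failing that, I would at least like a way to find the elements of one string that match elements of the other after a single character is inserted.

The matching tokens are not known in advance and are not guaranteed to appear in the text. Any ideas?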

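【Comments】:

Maybe something like s.replace('newyork', 'new york').strip()?
That is just an example. You don't know the elements in advance.
In that case, why would we replace newyork with new york? That part isn't really clear to me.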
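Suppose I have two strings with an obvious fuzzy match between one of their elements (e.g. baseball and "base ball"). I want a way to extract that element and normalize both texts to the same format.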
Does this answer your question? stackoverflow.com/a/50534532/10237506

Tags: python string nlp string-matching

【Solution 1】:

Something like this could work:

s1 = 'newyork city lights are yellow'
s2 = ' the city of new york is large'

# split() with no arguments splits on runs of whitespace and ignores
# leading/trailing whitespace, so a separate strip() is not needed
words_s1 = s1.split()
words_s2 = s2.split()

# For each word in list 1, compare it to adjacent (concatenated) words in list 2.
# enumerate() reports the actual position even if a word occurs more than once
# (list.index() would always return the first occurrence).
for idx, word in enumerate(words_s1):
    for i in range(len(words_s2) - 1):
        if word == words_s2[i] + words_s2[i + 1]:
            print(f"Word #{idx} in s1 matches words #{i} and #{i + 1} in s2")

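This matches words the way you describe. The idea is to iterate over list 1 and compare each word against every pair of adjacent words in list 2, joined together.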

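You can then loop the other way as well (go through s2 and check whether each word equals two adjacent words in s1) so that both directions are covered.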
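A minimal sketch of checking both directions, assuming exact matches on two adjacent words only. The helper name find_joined_matches is made up for illustration; it returns index pairs instead of printing them so the positions can be reused later.

```python
def find_joined_matches(words_a, words_b):
    """Return (i, j) pairs where words_a[i] == words_b[j] + words_b[j + 1]."""
    matches = []
    for i, word in enumerate(words_a):
        for j in range(len(words_b) - 1):
            if word == words_b[j] + words_b[j + 1]:
                matches.append((i, j))
    return matches


s1 = 'newyork city lights are yellow'
s2 = ' the city of new york is large'
words_s1, words_s2 = s1.split(), s2.split()

# Direction 1: a joined word in s1 equals two adjacent words in s2
print(find_joined_matches(words_s1, words_s2))  # [(0, 3)]
# Direction 2: a joined word in s2 equals two adjacent words in s1
print(find_joined_matches(words_s2, words_s1))  # []
```

The same function handles both directions just by swapping the arguments, so there is no need to write the nested loop twice.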

You need to keep track of where the matches occur; after that, you just use that information to build the new strings.
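One possible way to build the new strings, as a sketch rather than a complete solution. Instead of storing match positions, it builds a lookup from every joined pair of adjacent words in the other string to its spaced form, then replaces matching words. The names split_joined_words and normalize_pair are hypothetical, and it assumes exact matches on exactly two adjacent words.

```python
def split_joined_words(words, other_words):
    # Map each concatenation of two adjacent words in other_words to its
    # spaced-out form, e.g. 'newyork' -> 'new york'
    joined = {a + b: f"{a} {b}" for a, b in zip(other_words, other_words[1:])}
    # Replace a word only if it equals one of those concatenations
    return ' '.join(joined.get(word, word) for word in words)


def normalize_pair(s1, s2):
    words_s1, words_s2 = s1.split(), s2.split()
    # Apply the replacement in both directions
    return (split_joined_words(words_s1, words_s2),
            split_joined_words(words_s2, words_s1))


s1_output, s2_output = normalize_pair('newyork city lights are yellow',
                                      ' the city of new york is large')
print(s1_output)  # new york city lights are yellow
print(s2_output)  # the city of new york is large
```

Be aware that this can produce false positives when two adjacent words happen to concatenate into a real word (for example 'to' followed by 'day' would split 'today' in the other string), so some extra filtering may be needed on real data.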

【Comments】: