How to extract a substring from a string in a column that matches another string in a list in Python: answers

【Question title】: How to extract a substring from a string in a column, that matches another string in a list in python
【Posted】: 2020-09-20 20:24:02
【Question description】:

I have a dataframe like the following:

     col 1                                     col 2
0       59       538 Walton Avenue, Chester, FY6 7NP
1       62 42 Chesterton Road, Peterborough, FR7 2NY
2      179       3 Wallbridge Street, Essex, 4HG 3HT
3      180     6 Stevenage Avenue, Coventry, 7PY 9NP

The list looks like this:

[Stevenage, Essex, Coventry, Chester]

I followed the solution from "How to check if Pandas rows contain any full string or substring of a list?", as shown below:

city_list = list(cities["name"])
df["col3"] = np.where(df["col2"].str.contains('|'.join(city_list)), df["col2"], '')

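I found that some entries in col 2 match strings in the list, but col3 ends up identical to col2. I want col3 to hold the matching value from the list rather than a copy of col2. That would be: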

     col 1                                     col 2     col3
0       59       538 Walton Avenue, Chester, FY6 7NP  Chester 
1       62 42 Chesterton Road, Peterborough, FR7 2NY 
2      179       3 Wallbridge Street, Essex, 4HG 3HT    Essex
3      180     6 Stevenage Avenue, Coventry, 7PY 9NP Coventry

I tried:

pat = "|".join(cities.name)
df.insert(0, "name", df["col2"].str.extract('(' + pat + ')', expand = False))

But this returns an error saying 456 inputs were given while 1 was expected.

I also tried:

df["col2"] = df["col2"].apply(lambda x: difflib.get_close_matches(x, cities["name"])[0])
df.merge(cities)

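But this comes back with the error "list index out of range".

Is there a way to do this? df1 has about 160,000 entries, and the addresses in col2 come from different countries, so there is no standard way of writing them. The city list has about 170,000 entries.

Thanks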
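
A note on the difflib attempt above: difflib.get_close_matches returns an empty list when no candidate reaches its similarity cutoff (0.6 by default), and indexing that empty list with [0] raises "list index out of range". A whole address compared against a short city name rarely reaches the cutoff. The following is a minimal sketch of a guard, not a full solution; the helper name closest_city is hypothetical.

```python
import difflib

def closest_city(text, cities, cutoff=0.6):
    """Return the closest city name, or "" when nothing is close enough."""
    matches = difflib.get_close_matches(text, cities, n=1, cutoff=cutoff)
    # get_close_matches returns [] when no candidate reaches the cutoff,
    # so check before indexing
    return matches[0] if matches else ""

cities = ["Stevenage", "Essex", "Coventry", "Chester"]

# a single misspelled word is close enough to "Chester"
hit = closest_city("Chestr", cities)

# a full address is much longer than any city name, so nothing reaches the cutoff
miss = closest_city("538 Walton Avenue, Chester, FY6 7NP", cities)
```

The guard only removes the exception. Fuzzy matching a full address against city names is still unlikely to find the city, which is why the answers below split the address first.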

【Question discussion】:

What have you tried so far? Show us some code.

Tags: python string merge

【Solution 1】:

You can do it like this:

import difflib

city_list = ["Stevenage", "Essex", "Coventry", "Chester"]

def get_match(row):
    # Split the address into words; adapt this preprocessing to your data
    col_2 = row["col 2"].replace(",", " ").split()
    for c in city_list:
        # get_close_matches(word, possibilities): look for the city among the address words
        if difflib.get_close_matches(c, col_2):
            return c
    return ""

df["col 3"] = df.apply(get_match, axis=1)

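【Discussion】:

.replace(",", " ").split() does not work because the column is stored as an object, so matching is currently very hard (it takes over an hour to match a subset of 10,000 rows).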
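
On the speed concern in the comment above: looping over every city for every row scales with the size of the city list. One alternative, sketched here under assumptions, is to put the city names in a dictionary once and look up runs of words from each address. The names build_city_index and find_cities are hypothetical, the tokenization is deliberately crude (a name containing a period such as "St. Louis" would not be found), and the timing on the real data has not been measured.

```python
import re

def build_city_index(cities):
    """Map normalized (lowercase, single-spaced) names to their original spelling."""
    index = {" ".join(c.lower().split()): c for c in cities}
    max_words = max(len(key.split()) for key in index)
    return index, max_words

def find_cities(address, index, max_words):
    """Return every city that appears as a run of whole words, in order of appearance."""
    words = re.findall(r"[\w'-]+", address.lower())
    found, start = [], 0
    while start < len(words):
        # try the longest run first so a multi-word name is matched as a whole
        for size in range(min(max_words, len(words) - start), 0, -1):
            candidate = " ".join(words[start:start + size])
            if candidate in index:
                found.append(index[candidate])
                start += size
                break
        else:
            start += 1  # no city starts at this word
    return found

cities = ["Stevenage", "Essex", "Coventry", "Chester", "San Diego"]
index, max_words = build_city_index(cities)

# With the question's dataframe this could be applied as:
# df["col 3"] = [find_cities(a, index, max_words) for a in df["col 2"]]
```

Because whole words are compared, "Chesterton" does not match "Chester". Note that the row "6 Stevenage Avenue, Coventry, 7PY 9NP" yields both Stevenage and Coventry, so choosing between a street name and the city still needs a rule of its own.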

【Solution 2】:

Rely on a helper function like this:

import pandas as pd

df = pd.DataFrame({'col 1': [59, 62, 179, 180],
                   'col 2': ['538 Walton Avenue, Chester, FY6 7NP',
                             '42 Chesterton Road, Peterborough, FR7 2NY',
                             '3 Wallbridge Street, Essex, 4HG 3HT',
                             '6 Stevenage Avenue, Coventry, 7PY 9NP'
                             ]})

def aux_func(x):

    # split by comma and select the interesting part ([1])
    x = x.split(',')
    x = x[1]

    aux_list = ['Stevenage', 'Essex', 'Coventry', 'Chester']
    for v in aux_list:
        if v in x:
            return v
    return ""

df['col 3'] = [aux_func(name) for name in df['col 2']]

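【Discussion】:

Unfortunately, these are not standard addresses, so indexing x with [1] does not always return the part of interest.
Thanks for your comment. In the end I added a new answer; I hope it will be useful.

【Solution 3】:

Take a look at the str.contains function, which tests whether a pattern matches each element of a Series: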

import pandas as pd

df = pd.DataFrame([[59, '538 Walton Avenue, Chester,', 'FY6 7NP'],
                   [62, '42 Chesterton Road, Peterborough', '4HG 3HT'],
                   [179, '3 Wallbridge Street, Essex', '4HG 3HT'],
                   [180, '6 Stevenage Avenue, Coventry', '7PY 9NP']])
city_list = ["Stevenage", "Essex", "Coventry", "Chester"]
for city in city_list:
    df.loc[df[1].str.contains(city), 'match'] = city

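【Discussion】:

Strange, I copied and pasted my code and everything worked again. Are you using your own data, or the data from my answer?
My own data, which may be the source of the problem. It is running now, so I will see what the output looks like. I had to reset the notebook.
So I tried again and it managed to get some matches, although not all of them are perfect. For example, it found San Diego and Stuttgart, but some matches (such as Wangen and Eutze-Eltze) came out wrong or only partially matched.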
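
The partial matches mentioned in the comment above are what plain substring tests produce, since "Chester" is contained in "Chesterton". A minimal sketch of one way to require whole-name matches with the standard re module follows; the names compile_city_pattern and first_city are hypothetical, and whether this resolves the specific Wangen and Eutze-Eltze cases is an assumption that has not been checked against the real data.

```python
import re

def compile_city_pattern(cities):
    # longest names first, so a longer name wins over a shorter one at the same position
    ordered = sorted(cities, key=len, reverse=True)
    # re.escape keeps characters such as "(" or "." in a name from acting as regex syntax
    alternation = "|".join(re.escape(c) for c in ordered)
    # (?<!\w) and (?!\w) reject a match that is embedded in a longer word
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")

pattern = compile_city_pattern(["Stevenage", "Essex", "Coventry", "Chester"])

def first_city(address):
    """Return the first whole-name city match in address, or ""."""
    m = pattern.search(address)
    return m.group(0) if m else ""
```

With a list of about 170,000 names the alternation becomes very large and may be slow; that has not been measured here. Escaping the names also keeps parentheses in a name from turning into extra capture groups, which may be relevant to the str.extract attempt in the question, though that is only a guess.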

【Solution 4】:

Thanks for your last reply. Try this:

def aux_func(address):
    aux_list = ['Stevenage', 'Essex', 'Coventry', 'Chester']

    # split the address into comma-separated chunks
    address = address.split(',')

    # avoid matches with the first part of the address (usually the street)
    if len(address)>1:
        # drop the first chunk of the address
        address = address[1:]

    for v in aux_list:
        for chunk in address:
            if v in chunk:
                return v

    return ""

df['col 3'] = [aux_func(address) for address in df['col 2']]

【Discussion】: