Python: find and replace patterns in dictionary values that are lists of strings

【Question title】: Python: find and replace patterns in the value of a dictionary that is a list of strings
【Posted】: 2020-01-03 09:36:56
【Question】:

I have a dictionary of key: value pairs in which each value is a list of strings:

dictionarylst = {0:["example inside some sentence", "something else", "some blah"], 1:["testing", "some other word"], 2:["a new expression", "my cat is cute"]}

I also have a list of words, each of which can be a single token or a bigram:

wordslist = ["expression 1", "my expression", "other", "blah"]

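I am trying to match each word in the word list against every text in every value of the dictionary. When there is a match, I want to replace the matched pattern with a space (keeping the rest of the text) and store the output in a new dictionary with the same keys.

Here is what I have tried so far: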

dictionarycleaned = {}
for key,value in dictionarylst.items():
    for text in value :
        for word in wordslist :
            if word in value :
                pattern = re.compile(r'\b({})\b'.format(word))
                matches = re.findall(pattern, text)
                dictionarycleaned[key] = [re.sub(i,' ', text) for i in matches]
            else :
                dictionarycleaned[key] = value

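This only matches a small fraction of the patterns in my word list. I have tried different variations, such as matching the pattern against the whole list of strings in each value, or iterating over wordslist before dictionarylst, but nothing seems to clean all of my data (which is very large).

Thanks for your suggestions.

【Comments】:

What is your expected output?
The expected output is a dictionary like the input, but with the texts cleaned (hence the dictionarycleaned = {} in the code).

Tags: python regex string list dictionary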

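【Solution 1】:

Pako's answer is good, but you can optimize it further in two ways: compile the regular expressions once and use them for the substitutions, and skip the copy of the dictionary by simply replacing each value with a new list.

Full code: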

import re
import pprint

dictionarylst = {
    0: ["example inside some sentence", "something else", "some blah"],
    1: ["testing", "some other word"],
    2: ["a new expression", "my cat is cute"],
}
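wordslist = ["expression 1", "my expression", "other", "blah"]
# Compile each pattern once; re.escape keeps any regex metacharacters in a word literal.
regexs = [re.compile(r"\b({})\b".format(re.escape(word))) for word in wordslist]

for key, value in dictionarylst.items():
    words = []
    for w in value:
        # Apply every compiled pattern in turn to the same string.
        for regex in regexs:
            w = regex.sub(" ", w)
        words.append(w)
    # Rebinding the value of an existing key is safe while iterating over items().
    dictionarylst[key] = words

pprint.pprint(dictionarylst)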
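
If the word list is long, one possible variation is to join all words into a single alternation pattern so that each string is scanned only once. This is only a sketch under the assumption that every word should be replaced by the same single space; the names `alternatives`, `combined` and `cleaned` are made up for illustration and do not come from the answer above.

```python
import re

dictionarylst = {
    0: ["example inside some sentence", "something else", "some blah"],
    1: ["testing", "some other word"],
    2: ["a new expression", "my cat is cute"],
}
wordslist = ["expression 1", "my expression", "other", "blah"]

# Longest entries first, so a longer phrase is tried before a shorter one
# that happens to be its prefix.
alternatives = sorted(wordslist, key=len, reverse=True)

# One compiled pattern for the whole word list (hypothetical name);
# re.escape makes each word match literally.
combined = re.compile(
    r"\b(?:{})\b".format("|".join(re.escape(w) for w in alternatives))
)

# Build a new dictionary with the same keys and cleaned lists.
cleaned = {
    key: [combined.sub(" ", text) for text in value]
    for key, value in dictionarylst.items()
}
print(cleaned)
```

This version builds a new dictionary and leaves `dictionarylst` untouched, which may or may not be what you want for very large data.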

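【Comments】:

【Solution 2】:

str.replace() is a built-in string method in Python. It returns a copy of the string in which all occurrences of a substring are replaced with another substring.

For example: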

dictionarylst = {
    0: ["example inside some sentence", "something else", "some blah"],
    1: ["testing", "some other word"],
    2: ["a new expression", "my cat is cute"],
}

wordslist = ["expression 1", "my expression", "other", "blah"]
dictionarycleaned = {}

def match_pattern(wordslist,value):
    new_list = []
    for text in value:
        # temp variable hold latest updated text
        temp = text
        for word in wordslist:
            if word in text:
                # remove the word from the text (replace it with an empty string)
                temp = temp.replace(word,"")
        new_list.append(temp)
    return new_list


for k,v in dictionarylst.items():
    dictionarycleaned[k] = match_pattern(wordslist, v)

print(dictionarycleaned)

Output:

{0: ['example inside some sentence', 'something else', 'some '], 1: ['testing', 'some  word'], 2: ['a new expression', 'my cat is cute']}

【Comments】:

I tried it with re.sub and it works well (I switched to re.sub because I need to match on word boundaries). Thank you very much.

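【Solution 3】:

Since this is a flat string replacement, and as long as the words in wordslist cannot contain a double quote ("), you can simply create a JSON string from the dict, do the replacements on it, and regenerate the dict from the modified JSON string.

An example program is given below: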

import json

d = {0:["example inside some sentence", "something else", "some blah"], 1:["testing", "some other word"], 2:["a new expression", "my cat is cute"]}
words = ["expression 1", "my expression", "other", "blah"]

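json_str = json.dumps(d)
for w in words:
    # Apply each replacement to the serialized JSON text.
    json_str = json_str.replace(w, " ")

# Note: JSON object keys are strings, so the integer keys come back as "0", "1", "2".
req_dict = json.loads(json_str)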

This way you avoid the nested loops.
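
One thing to watch with the JSON round trip: JSON object keys are always strings, so integer keys do not survive `json.loads`. The sketch below shows the effect on a shortened, made-up dictionary and one way to convert the keys back, assuming every original key was an `int`. The names `round_tripped` and `restored` are hypothetical.

```python
import json

d = {0: ["some blah"], 1: ["some other word"]}  # shortened sample data
words = ["other", "blah"]

json_str = json.dumps(d)
for w in words:
    json_str = json_str.replace(w, " ")

round_tripped = json.loads(json_str)
# The keys are now the strings "0" and "1", not the integers 0 and 1.
print(list(round_tripped))

# Convert the keys back, assuming all original keys were ints.
restored = {int(k): v for k, v in round_tripped.items()}
print(restored)
```

Note also that the replacement runs over the whole serialized text, so a word that happens to match a key or part of the JSON syntax would be altered too.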

【Comments】:

【Solution 4】:

Try this:

import re
import pprint

dictionarylst = {
    0: ["example inside some sentence", "something else", "some blah"],
    1: ["testing", "some other word"],
    2: ["a new expression", "my cat is cute"],
}
wordslist = ["expression 1", "my expression", "other", "blah"]

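# Copy each inner list; dict.copy() alone is shallow and would share them.
dictionarycleaned = {key: list(value) for key, value in dictionarylst.items()}
for key, value in dictionarycleaned.items():
    for n, text in enumerate(value):
        for word in wordslist:
            # Substitute into the running result so several words can match one text.
            text = re.sub(r"\b({})\b".format(word), " ", text)
        value[n] = text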

pprint.pprint(dictionarycleaned)

Output:

pako@b00s:~/tests$ python dict.py 
{0: ['example inside some sentence', 'something else', 'some  '],
 1: ['testing', 'some   word'],
 2: ['a new expression', 'my cat is cute']}
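
A related detail, in case the input dictionary has to stay unmodified: `dict.copy()` is shallow, so the copy and the original share the same inner list objects. The following sketch, with a made-up one-entry dictionary, shows the difference from `copy.deepcopy`.

```python
import copy

original = {0: ["some blah"]}  # made-up sample data

shallow = original.copy()        # new dict, but the same inner list objects
deep = copy.deepcopy(original)   # new dict and new inner lists

shallow[0][0] = "changed"

print(original[0][0])  # the original list was modified through the shallow copy
print(deep[0][0])      # the deep copy still holds the old value
```

For a dictionary whose values are flat lists of strings, copying each list with `list(value)` is enough; `copy.deepcopy` is the general-purpose option for deeper nesting.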

【Comments】: