Altering a string using a list of dictionaries: answers

【Question title】: Altering string using a list of dictionaries
【Posted】: 2019-07-11 14:29:31
【Question description】:

Background

I am using NeuroNER (http://neuroner.com/) to tag the text data sample_string, shown below.

sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2000 and her number is 1111112222'

Output (using NeuroNER)

My output is a list of dictionaries, dic_list:

dic_list = [
 {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},    
 {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
 {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},  
 {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2000'},   
 {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '1111112222'}]

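Legend

id = text ID

type = type of text being identified

start = starting position of the identified text

end = ending position of the identified text

text = the text that was identified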
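The field definitions above do not say whether end is inclusive. A minimal sketch for checking this against the data, assuming the sample_string and dic_list values shown in the question (the helper name span_matches is made up for illustration): if end is inclusive, slicing up to end + 1 should reproduce each text value.

```python
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2000 and her number is 1111112222'

dic_list = [
    {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
    {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
    {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
    {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2000'},
    {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '1111112222'},
]


def span_matches(text, entity):
    """Return True if text[start..end], with end inclusive, equals entity['text']."""
    return text[entity['start']:entity['end'] + 1] == entity['text']


# Collect the ids of any entries whose offsets do not line up with their text
mismatches = [d['id'] for d in dic_list if not span_matches(sample_string, d)]
print(mismatches)  # prints []
```

An empty list here suggests that, for this data, start and end are 0-based character offsets with end inclusive, which is why the index-based answers slice with end + 1.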

Goal

Since the location of each text (e.g. Jane) is given by start and end, I would like to replace each text from dic_list with **BLOCK** in my sample_string.

Desired output

sample_string = 'Patient **BLOCK** **BLOCK** was seen by Dr. **BLOCK** on **BLOCK** and her number is **BLOCK**'

Question

I have tried "Replacing a character from a certain index" and "Edit the values in a list of dictionaries?", but they are not quite what I am looking for.

How do I achieve my desired output?

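【Question comments】:

Please show the actual code you tried and explain what specifically did not work.
Note: start and end do not seem to match either the length of "text" or its position in the file for some of the fields.
dic_list has been updated. I apologize for the confusion.

Tags: python python-3.x list loops dictionary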

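【Solution 1】:

If you want a solution based on the start and end indices,

you can use the intervals between the dic_list entries to work out which parts of the string you need to keep, then join those parts with **BLOCK**.

Try this: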

sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
 {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
 {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
 {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
 {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
 {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]

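# Pieces of the string to keep: before the first entity, between consecutive
# entities, and after the last one. 'end' is inclusive, so each kept piece starts at end + 1.
parts_to_take = [(0, dic_list[0]['start'])] + [(first["end"]+1, second["start"]) for first, second in zip(dic_list, dic_list[1:])] + [(dic_list[-1]['end']+1, len(sample_string))]
parts = [sample_string[start:end] for start, end in parts_to_take]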

sample_string = '**BLOCK**'.join(parts)

print(sample_string)
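The same interval idea can also be written as a small function that walks the string with a cursor. This is only a sketch, not part of the original answer: the name mask_spans is hypothetical, and it assumes the spans do not overlap and that end is inclusive. Sorting by start means the entries need not arrive in order.

```python
def mask_spans(text, entities, filler='**BLOCK**'):
    """Replace each [start, end] span (end inclusive) in text with filler.

    Assumes the spans do not overlap.
    """
    pieces = []
    cursor = 0  # index of the first character not yet copied
    for ent in sorted(entities, key=lambda e: e['start']):
        pieces.append(text[cursor:ent['start']])  # untouched text before the span
        pieces.append(filler)
        cursor = ent['end'] + 1  # skip past the inclusive end
    pieces.append(text[cursor:])  # whatever follows the last span
    return ''.join(pieces)


sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
    {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
    {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
    {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
    {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
    {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'},
]

print(mask_spans(sample_string, dic_list))
```

Collecting the pieces in a list and joining once avoids rebuilding the whole string on every replacement, and an empty entity list simply returns the text unchanged.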

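【Comments】:

There is a more readable way; see this solution.
@Error-SyntacticalRemorse Nice, that works on the input string itself.
The link may go dead, but you could add it to your answer as an alternative. My answer uses replace, while yours does use start and end.

【Solution 2】:

I may be missing something, but you could use .replace():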

sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
 {'id': 'T1', 'type': 'PATIENT', 'start': 0, 'end': 6, 'text': 'Jane'},    
 {'id': 'T2', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Candy'},
 {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},  
 {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},   
 {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]

for dic in dic_list:
    sample_string = sample_string.replace(dic['text'], '**BLOCK**')
print(sample_string)

Although a regex may be faster:

import re
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
 {'id': 'T1', 'type': 'PATIENT', 'start': 0, 'end': 6, 'text': 'Jane'},    
 {'id': 'T2', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Candy'},
 {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},  
 {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},   
 {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]

# Escape each text so that any regex metacharacters in it are matched literally
pattern = re.compile('|'.join(re.escape(dic['text']) for dic in dic_list))
result = pattern.sub('**BLOCK**', sample_string)
print(result)

Both output:

Patient **BLOCK** **BLOCK** was seen by Dr. **BLOCK** on **BLOCK** and her number is **BLOCK**

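【Comments】:

This will work unless other parts of the text match the parts being replaced but should not be replaced (for example, a person or thing named "seen").
Yes, I'll update it. Though if they are just removing the meaningful information, this should work. It would help if start and end lined up correctly.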
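The caveat raised in the first comment can be illustrated with a short sketch. The sentence and entity below are made up for this example and do not come from NeuroNER output: the tagged text "Ann" also occurs inside the untagged word "Annual", so a text-based replace touches both, while an index-based slice touches only the tagged span.

```python
# Hypothetical input, invented for illustration only
text = 'Patient Ann had her Annual checkup'
entity = {'start': 8, 'end': 10, 'text': 'Ann'}  # 'end' is inclusive
filler = '**BLOCK**'

# Text-based: replaces every occurrence of the substring, tagged or not
by_text = text.replace(entity['text'], filler)

# Index-based: replaces only the characters from start to end
by_index = text[:entity['start']] + filler + text[entity['end'] + 1:]

print(by_text)   # Patient **BLOCK** had her **BLOCK**ual checkup
print(by_index)  # Patient **BLOCK** had her Annual checkup
```

Whether the extra replacements matter depends on the use case; for de-identification, over-masking may be acceptable, as the answer's author suggests.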

【Solution 3】:

Following the suggestion from @Error - Syntactical Remorse:

sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
 {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
 {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
 {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
 {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
 {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]

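offset = 0  # running change in string length caused by earlier replacements
filler = '**BLOCK**'
for dic in dic_list:
    # Shift the original (inclusive) indices by the offset before slicing
    sample_string = sample_string[:dic['start'] + offset] + filler + sample_string[dic['end'] + offset + 1:]
    # Each replacement changes the length by len(filler) - (end - start + 1)
    offset += dic['start'] - dic['end'] + len(filler) - 1
print(sample_string)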
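One possible variation on the loop above, offered as a sketch rather than as part of the original answer: processing the entries from right to left means that each replacement only changes text after the spans still to be handled, so no offset bookkeeping is needed. The function name mask_from_right is hypothetical, and the spans are assumed not to overlap.

```python
def mask_from_right(text, entities, filler='**BLOCK**'):
    """Replace each [start, end] span (end inclusive), starting from the rightmost one."""
    for ent in sorted(entities, key=lambda e: e['start'], reverse=True):
        # Earlier (left-hand) indices stay valid because only text to their right has changed
        text = text[:ent['start']] + filler + text[ent['end'] + 1:]
    return text


sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
    {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
    {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
    {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
    {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
    {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'},
]

print(mask_from_right(sample_string, dic_list))
```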

【Comments】: