【Question title】: Match and tag the string of one column to the substring of another column
【Posted】: 2020-11-27 20:04:09
【Question description】:

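I need Python code that takes the strings in columns x and y, finds them as substrings in column Z, and replaces each matched substring with a tagged version of itself, as shown below.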

Input: untagged substrings

    Target  Effect  Sentence
0   "hsp90"    "insulin sensitivity"   "treatment of fhrs with doxycycline attenuated the decrease in enos and hsp90 expression but did not improve insulin sensitivity."
1   "hsp90"    "apoptosis"   "radicicol, an inhibitor of hsp90, enhances trail-induced apoptosis in human epithelial ovarian carcinoma cells by promoting activation of apoptosis-related proteins."

Output: tagged substrings

    Target  Effect  Sentence
0   "hsp90"    "insulin sensitivity"   "treatment of fhrs with doxycycline attenuated the decrease in enos and <e1>hsp90</e1> expression but did not improve <e2>insulin sensitivity</e2>."
1   "hsp90"    "apoptosis"    "radicicol, an inhibitor of <e1>hsp90</e1>, enhances trail-induced apoptosis in human epithelial ovarian carcinoma cells by promoting activation of <e2>apoptosis</e2>-related proteins."

I would like to do this with pandas and a DataFrame. Using the example above, how would I accomplish this?

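【Comments】:

  • Images are highly discouraged; instead, include sample input and expected output that can be copied and pasted, along with the code snippet you have tried.
  • I improved the formatting of the question.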

Tags: python pandas automation nlp jupyter-notebook


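【Solution 1】:

This is straightforward with apply(): for each row, treat the Target and Effect values as regular expression patterns and substitute the tagged version of each match into Sentence.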

import re
import pandas as pd

data = '''    Target  Effect  Sentence
0   hsp90   insulin sensitivity   "treatment of fhrs with doxycycline attenuated the decrease in enos and hsp90 expression but did not improve insulin sensitivity."
1   hsp90    apoptosis   "radicicol, an inhibitor of hsp90, enhances trail-induced apoptosis in human epithelial ovarian carcinoma cells by promoting activation of apoptosis-related proteins."'''

# Build the sample DataFrame: drop the leading row index on each line,
# then split the remainder on runs of two or more spaces.
a = [re.split(r"\s{2,}", re.sub(r"^\d+\s+", "", l.strip())) for l in data.split("\n")]
df = pd.DataFrame(a[1:], columns=a[0])

# Tag Target with <e1> first, then Effect with <e2>.
# re.escape makes each term match literally even if it contains regex metacharacters.
df["Sentence"] = df.apply(
    lambda r: re.sub(f"({re.escape(r['Effect'])})", r"<e2>\1</e2>",
                     re.sub(f"({re.escape(r['Target'])})", r"<e1>\1</e1>", r["Sentence"])),
    axis=1)
print(df.to_string(index=False))


Output

Target               Effect                                                                                                                                                                                            Sentence
 hsp90  insulin sensitivity                                                "treatment of fhrs with doxycycline attenuated the decrease in enos and <e1>hsp90</e1> expression but did not improve <e2>insulin sensitivity</e2>."
 hsp90            apoptosis  "radicicol, an inhibitor of <e1>hsp90</e1>, enhances trail-induced <e2>apoptosis</e2> in human epithelial ovarian carcinoma cells by promoting activation of <e2>apoptosis</e2>-related proteins."
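
Note that the output above tags every occurrence of "apoptosis" in the second row, whereas the expected output in the question tags only one. If a single tag per term is wanted, one option is to limit the substitution with the count argument of re.sub. The following is a minimal sketch rather than part of the original answer: the helper name tag_first, the Tagged column, and the choice to tag the first occurrence are assumptions made for illustration.

```python
import re
import pandas as pd


def tag_first(sentence, term, tag):
    """Wrap the first literal occurrence of `term` in <tag>...</tag>.

    Hypothetical helper, not from the original answer.
    """
    # re.escape makes the term match literally, even with regex metacharacters.
    pattern = re.escape(term)
    # A callable replacement avoids backslash handling in the replacement text;
    # count=1 stops after the first match.
    return re.sub(pattern, lambda m: f"<{tag}>{m.group(0)}</{tag}>", sentence, count=1)


df = pd.DataFrame({
    "Target": ["hsp90"],
    "Effect": ["apoptosis"],
    "Sentence": [
        "radicicol, an inhibitor of hsp90, enhances trail-induced apoptosis "
        "in human epithelial ovarian carcinoma cells by promoting activation "
        "of apoptosis-related proteins."
    ],
})

# Tag Target with <e1> first, then Effect with <e2>, one occurrence each.
df["Tagged"] = df.apply(
    lambda r: tag_first(tag_first(r["Sentence"], r["Target"], "e1"), r["Effect"], "e2"),
    axis=1,
)
print(df["Tagged"].iloc[0])
```

This tags the first occurrence only. The question's expected output tags the later "apoptosis-related" occurrence instead, which would need a different rule for choosing the match and is not covered by this sketch.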

【Comments】:
