【Question title】: Python: Replace one word in a sentence with a list of words and put the new sentences in another column in pandas
【Posted】: 2025-11-28 18:45:01
【Question description】:

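I have a dataframe in which some sentences contain the word 'o'clock'. I want to replace the hour mentioned before it with each of the hours in a list I have, and put the new sentences in another column, like this: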

import pandas as pd

data = pd.DataFrame({"sentences": ["I have a class at ten o'clock", "she is my friend", "she goes to school at eight o'clock"]})
my_list = ['two', 'three', 'five', 'ten']

What I would like to see is an extra column with the new sentences, where the hour is replaced by every hour in the list:

Output:

     sentences                            new_sentences
0    I have a class at ten o'clock        I have a class at two o'clock, I have a class at three o'clock,...
1    she is my friend                     she is my friend
2    she goes to school at eight o'clock  she goes to school at two o'clock,....

Duplicates in the new_sentences column are fine. I tried using np.where:

np.where(data.str.contains('o\'clock', regex=False, case=False, na=False), data["sentence"].replace()... )

But I don't know how to replace the word before 'o'clock'.

Thanks in advance.

【Question discussion】:

    Tags: python regex pandas list dataframe


    【Solution 1】:

    Use:

    # STEP 1
    df1 = data['sentences'].str.extract(
        r"(?i)(?P<before>.*)\s(?P<clock>\w+(?=\so'clock))\s(?P<after>.*)")
    
    # STEP 2
    # regex=True is required in pandas 2.0+, where patterns are literal by default
    df1['clock'] = df1['clock'].str.replace(
        r'\w+', ','.join(my_list), regex=True).str.split(',')
    
    # STEP 3
    data['new_sentences'] = df1.dropna().explode('clock').agg(
        ' '.join, 1).groupby(level=0).agg(', '.join)
    
    # STEP 4
    data['new_sentences'] = data['new_sentences'].fillna(data['sentences'])
    

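    Explanation / steps:

    Step 1: Use Series.str.extract with the given regex pattern to create a three-column dataframe: the first column holds the part of the sentence before the hour word (e.g. ten), the middle column holds the hour word itself, and the last column holds the rest of the sentence, starting at o'clock.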

    # df1
                      before  clock    after
    0      I have a class at    ten  o'clock
    1                    NaN    NaN      NaN
    2  she goes to school at  eight  o'clock
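
    To see how the named groups in the pattern turn into columns, here is a minimal sketch on a two-row Series. The names s and parts are only illustrative and do not appear in the answer.

```python
import pandas as pd

s = pd.Series(["I have a class at ten o'clock", "she is my friend"])

# Each named group becomes a column; the lookahead (?=\so'clock) makes
# `clock` match only a word that is directly followed by " o'clock".
pattern = r"(?i)(?P<before>.*)\s(?P<clock>\w+(?=\so'clock))\s(?P<after>.*)"
parts = s.str.extract(pattern)

print(parts)
```

    Rows that do not match the pattern come back as all-NaN, which is what lets the later dropna step skip them.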
    

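    Step 2: Use Series.str.replace to replace the token in the clock column with all the items of my_list, joined by commas. Then use Series.str.split to split the replaced string on the delimiter , so that each cell holds a list.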

    # df1
                      before                    clock    after
    0      I have a class at  [two, three, five, ten]  o'clock
    1                    NaN                      NaN      NaN
    2  she goes to school at  [two, three, five, ten]  o'clock
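
    A small sketch of this step in isolation, assuming a hypothetical clock Series in place of df1['clock']. In pandas 2.0 and later Series.str.replace treats the pattern as a literal string unless regex=True is passed, so the sketch passes it explicitly.

```python
import pandas as pd

my_list = ['two', 'three', 'five', 'ten']
clock = pd.Series(['ten', None, 'eight'])  # stand-in for df1['clock']

# Replace the whole token with "two,three,five,ten", then split it into a list.
# Missing values pass through both string methods untouched.
hours = clock.str.replace(r'\w+', ','.join(my_list), regex=True).str.split(',')

print(hours)
```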
    

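    Step 3: Use DataFrame.explode to expand df1 on the clock column, then use .agg along axis 1 to join the columns of each row into one sentence. Then group by index level 0 and aggregate again to join all the sentences that came from the same original row.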

    # data
                                 sentences                                      new_sentences
    0        I have a class at ten o'clock  I have a class at two o'clock, I have a class ...
    1                     she is my friend                                                NaN
    2  she goes to school at eight o'clock  she goes to school at two o'clock, she goes to...
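
    The explode and regroup round trip can be sketched on a one-row frame. The intermediate names long, sentences and joined are illustrative only, and the list is shortened to two hours to keep the output readable.

```python
import pandas as pd

df1 = pd.DataFrame({
    "before": ["I have a class at"],
    "clock": [["two", "three"]],
    "after": ["o'clock"],
})

# explode: one row per list element; the index label 0 is repeated.
long = df1.explode("clock")
# Join the three columns of each row into a single sentence.
sentences = long.agg(" ".join, axis=1)
# Rows that share an index label are collapsed back into one string.
joined = sentences.groupby(level=0).agg(", ".join)

print(joined.loc[0])
```

    Because explode keeps the original index label on every new row, groupby(level=0) is enough to reassemble the variants per source sentence.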
    

    Step 4: Finally, use Series.fillna to fill the missing values in the new_sentences column from the corresponding rows of the sentences column.

    # data
                                 sentences                                      new_sentences
    0        I have a class at ten o'clock  I have a class at two o'clock, I have a class ...
    1                     she is my friend                                   she is my friend
    2  she goes to school at eight o'clock  she goes to school at two o'clock, she goes to...
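
    The last two steps rely on index alignment, which this sketch shows with made-up two-row data: assigning a Series that lacks a label leaves NaN there, and fillna with another Series fills each gap from the same label.

```python
import pandas as pd

data = pd.DataFrame({"sentences": ["a ten o'clock", "no time here"]})

# Only index label 0 has a result, so label 1 becomes NaN on assignment.
data["new_sentences"] = pd.Series({0: "a two o'clock, a three o'clock"})
# Fill each missing value from the same row of the sentences column.
data["new_sentences"] = data["new_sentences"].fillna(data["sentences"])

print(data)
```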
    

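    【Discussion】:

    • Thank you very much for the answer and the detailed explanation, I really appreciate it.
    【Solution 2】:

    Does this do what you expect?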

    import re
    data= {"sentences":["I have a class at ten o'clock", "she is my friend", "she goes to school at eight o'clock"]}
    my_list=['two', 'three','five','ten']
    
    regex = re.compile(r"(\w+) (?=o'clock)", re.IGNORECASE)
    new = []
    
    for i in data["sentences"]:
        for j in my_list:
            new.append(re.sub(regex, j + ' ', i))
    
    new = list(set(new))
    
    # Print one sentence per line
    print(*new, sep='\n')
    

    Output (the order may vary because a set is unordered):

    I have a class at two o'clock
    I have a class at ten o'clock
    she goes to school at two o'clock
    she goes to school at five o'clock
    I have a class at five o'clock
    I have a class at three o'clock
    she goes to school at ten o'clock
    she goes to school at three o'clock
    she is my friend
    

    Or, equivalently:

    import re
    data= {"sentences":["I have a class at ten o'clock", "she is my friend", "she goes to school at eight o'clock"]}
    my_list=['two', 'three','five','ten']
    regex = re.compile(r"(\w+) (?=o'clock)", re.IGNORECASE)
    x = list(set([re.sub(regex, j + ' ', i) for j in my_list for i in data["sentences"]]))
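
    Since set does not keep insertion order, the result can come out shuffled. If a stable order matters, one possible variation (not part of the original answer) is to de-duplicate with dict.fromkeys, which keeps the first occurrence of each sentence in place. The shortened inputs and the names candidates and unique are illustrative.

```python
import re

sentences = ["I have a class at ten o'clock", "she is my friend"]
my_list = ['two', 'ten']
regex = re.compile(r"\w+ (?=o'clock)", re.IGNORECASE)

candidates = [regex.sub(j + ' ', i) for i in sentences for j in my_list]
# dict keys keep insertion order (Python 3.7+), so duplicates are
# dropped without reordering the remaining sentences.
unique = list(dict.fromkeys(candidates))

print(*unique, sep='\n')
```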
    

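    【Discussion】:

    • Thank you very much for your reply. However, what I would like to see is a new column in the dataframe that contains the new sentences as a list or a string, like the one in the question. Do you know if there is a way to do this in a column of the df?
    • Apologies, I missed the part where you said it was a dataframe. I'll give it a try.
    • Thank you very much. Yes, the dataframe is the important part, because I would like to see the sentences laid out like the output above.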
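
    Following up on the comments above, here is a minimal sketch of how the re-based approach could be applied per row to produce the requested column. It assumes data is a pandas DataFrame, and expand is a hypothetical helper introduced only for this illustration.

```python
import re
import pandas as pd

data = pd.DataFrame({"sentences": [
    "I have a class at ten o'clock",
    "she is my friend",
    "she goes to school at eight o'clock",
]})
my_list = ['two', 'three', 'five', 'ten']
# Match the word directly before " o'clock" without consuming " o'clock".
regex = re.compile(r"\w+(?= o'clock)", re.IGNORECASE)

def expand(sentence):
    # Hypothetical helper: one variant per hour, or the sentence unchanged
    # when it contains no "<word> o'clock".
    if not regex.search(sentence):
        return sentence
    return ', '.join(regex.sub(hour, sentence) for hour in my_list)

data["new_sentences"] = data["sentences"].apply(expand)
print(data)
```

    This trades the vectorized string methods of Solution 1 for a plain Python function, which is usually easier to read and is fine for small frames.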