【Question title】: Replace one word of a string if that word is within a specific number of words of another word
【Posted】: 2017-11-14 16:41:44
【Question】:

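I have a text column called "DESCRIPTION" in a dataframe. I need to find all instances where the word "tile" or "tiles" is within 6 words of the word "roof" and then change "tile/s" to "rooftiles". I need to do the same for "floor" and "tiles" (changing "tiles" to "floortiles"). This will help us tell apart which building trade we are looking at when certain words are used in conjunction with other words.

To illustrate what I mean, here is an example of the data, followed by my latest incorrect attempt: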

import re
import pandas as pd

s1=pd.Series(["After the storm the roof was damaged and some of the tiles are missing"])
s2=pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"])
s3=pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"])
df=pd.DataFrame([list(s1), list(s2),  list(s3)],  columns =  ["DESCRIPTION"])
df

The result I am after should look like this (in dataframe format):

1.After the storm the roof was damaged and some of the rooftiles are missing      
2.I dropped the saw and it fell on the floor and damaged some of the floortiles
3.the roof was leaking and when I checked I saw that some of the tiles were cracked

Here I tried using a regex pattern to replace the word "tiles", but it is completely wrong... Is there a way to do what I am trying to do? I am new to Python...

regex=r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*tiles)"
replacedString=re.sub(regex, r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*rooftiles)", df['DESCRIPTION'])

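Update: solution

Thanks everyone for your help! I managed to get it working using Jan's code with a few additions/tweaks. The final working code is below (using the real file and data rather than the example):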

import re
import pandas as pd

claims_file = pd.read_csv(project_path + claims_filename) # Read input file
# Avoid errors caused by entries that were just 'NA' and were read in as NaN
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].fillna('NA')
#create the REGEX    
rx =  re.compile(r'''
        (                      # outer group
            \b(floor|roof)     # floor or roof
            (?:\W+\w+){0,6}\s* # up to six "words"
        )
        \b(tiles?)\b           # tile or tiles
        ''', re.VERBOSE)

#create the reverse REGEX
rx2 =  re.compile(r'''
        (                      # outer group
            \b(tiles?)         # tile or tiles
            (?:\W+\w+){0,6}\s* # up to six "words"
        )
        \b(floor|roof)\b           # roof or floor
        ''', re.VERBOSE)
#apply it to every row of Loss Description:
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x)) 

#apply the reverse regex (\3\1\3 prefixes the tile word and keeps the trailing floor/roof word intact):
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx2.sub(r'\3\1\3', x))

# Write results into CSV file and check results
claims_file.to_csv(project_path + output_filename, index = False
                       , encoding = 'utf-8')

【Question comments】:

  • Could you post your desired output?

Tags: python regex conditional


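【Solution 1】:

I will show you a quick-and-dirty, incomplete implementation. You can surely make it more robust and useful. Suppose s is one of your descriptions: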

s = "I dropped the saw and it fell on the roof and damaged roof " +\
    "and some of the tiles"

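Let's first break it into words (tokenize; you can strip punctuation if you like):

import nltk  # word_tokenize requires NLTK's tokenizer data to be downloaded
tokens = nltk.word_tokenize(s)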

Now select the tokens of interest and sort them alphabetically, but remember their original positions in s:

my_tokens = sorted((w.lower(), i) for i,w in enumerate(tokens)
                    if w.lower() in ("roof", "tiles"))
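#[('roof', 9), ('roof', 12), ('tiles', 17)]

Combine identical tokens and build a dictionary in which the tokens are the keys and the lists of their positions are the values. Use a dictionary comprehension:

import itertools

token_dict = {name: [p0 for _, p0 in pos] 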
              for name,pos 
              in itertools.groupby(my_tokens, key=lambda a:a[0])}
#{'roof': [9, 12], 'tiles': [17]}

Walk through the list of tiles positions, if any, and check whether there is a roof nearby; if so, change the word:

for i in token_dict['tiles']:
    for j in token_dict['roof']:
        if abs(i-j) <= 6: 
            tokens[i] = 'rooftiles'

Finally, put the words back together:

' '.join(tokens)
#'I dropped the saw and it fell on the roof and damaged roof and some of the rooftiles'
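
The steps above can be folded into one reusable function. The following is only a sketch under stated assumptions: `tag_tiles` and its parameters are hypothetical names, it uses a naive whitespace split instead of nltk (so punctuation attached to a word is not handled), and it measures distance in token positions exactly like the `abs(i-j) <= 6` check above.

```python
def tag_tiles(text, anchors=("roof", "floor"), targets=("tile", "tiles"), window=6):
    """Prefix each target word with the closest anchor word within `window` tokens."""
    tokens = text.split()  # assumption: plain whitespace tokenization, not nltk
    lowered = [t.lower() for t in tokens]
    # Remember where every anchor word ("roof", "floor") occurs.
    anchor_positions = [(i, w) for i, w in enumerate(lowered) if w in anchors]
    for i, word in enumerate(lowered):
        if word not in targets:
            continue
        # Anchors close enough to this target, on either side of it.
        nearby = [(abs(i - j), a) for j, a in anchor_positions if abs(i - j) <= window]
        if nearby:
            _, anchor = min(nearby)  # the closest anchor wins
            tokens[i] = anchor + tokens[i]
    return " ".join(tokens)


s = ("I dropped the saw and it fell on the roof and damaged roof "
     "and some of the tiles")
print(tag_tiles(s))
```

Because the function looks at anchors on both sides of the target, it also covers sentences in which "tile" comes before "roof".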

【Comments】:

  • Thanks DYZ! I got this working on the test set, but ran into some trouble when I tried to run it on my csv file... I found Jan's solution easier to implement.
【Solution 2】:

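The main problem you have is the .* in front of tiles in your regex. It allows any number of any characters to sit there and still match. The \b's are unnecessary because they fall on boundaries between whitespace and non-whitespace anyway. The () groups were not being used either, so I removed them.

r"roof\s+([^\s]+\s+){0,6}tiles" will only match "tiles" that follows "roof" within 6 "words" (groups of non-whitespace characters separated by whitespace). To replace it, take everything except the last 5 characters of the string matched by the regex, append "rooftiles", and then replace the matched string with the updated one. Alternatively, you can group everything except tiles with () in the regex and replace the match with that group followed by "rooftiles". You cannot pass a fixed replacement string to re.sub for this, because it would replace the entire match from roof to tiles rather than just the word tiles.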
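
A minimal sketch of the grouping approach described above. The names `pattern`, `text`, and `result` are only illustrative, and the non-capturing group `(?:...)` is my own choice rather than part of the answer: everything from "roof" up to "tiles" is captured in one group and re-inserted with a backreference, so only the final word changes.

```python
import re

# Group 1 captures "roof" plus up to six following whitespace-separated words;
# the literal "tiles" stays outside the group so only it gets rewritten.
pattern = re.compile(r"(roof\s+(?:\S+\s+){0,6})tiles")

text = "After the storm the roof was damaged and some of the tiles are missing"

# \1 puts the captured prefix back unchanged, then "rooftiles" replaces "tiles".
result = pattern.sub(r"\1rooftiles", text)
print(result)
```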

【Comments】:

    【Solution 3】:

    I could generalize this to more substrings than "roof" and "floor", but this seems like simpler code:

    for idx,r in enumerate(df.loc[:,'DESCRIPTION']):
        if "roof" in r and "tile" in r:
            fill=r[r.find("roof")+4:]
            fill = fill[0:fill.replace(' ','_',7).find(' ')]
            sixWords = fill if fill.find('.') == -1 else ''
            df.loc[idx,'DESCRIPTION'] = r.replace(sixWords,sixWords.replace("tile", "rooftile"))
        elif "floor" in r and "tile" in r:
            fill=r[r.find("floor")+5:]
            fill = fill[0:fill.replace(' ','_',7).find(' ')]
            sixWords = fill if fill.find('.') == -1 else ''
            df.loc[idx,'DESCRIPTION'] = r.replace(sixWords,sixWords.replace("tile", "floortile"))
    

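    Note that this also includes a check for a full stop ("."). You can remove the check by dropping the sixWords variable and using fill in its place.

    【Comments】:

    • Thanks for your help! But I get this error with this code: TypeError: argument of type 'float' is not iterable
    【Solution 4】: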
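
That TypeError is what Python raises when `in` is applied to a float, which is consistent with the column containing missing values (pandas represents them as NaN, a float). A small sketch of one way to guard against it, assuming a hypothetical two-row column with one missing entry:

```python
import pandas as pd

# Hypothetical data: the second entry is missing (NaN).
df = pd.DataFrame({"DESCRIPTION": ["the roof lost a tile", float("nan")]})

# `"roof" in value` fails for NaN because a float is not iterable.
# Replacing missing values with an empty string makes every entry a string.
df["DESCRIPTION"] = df["DESCRIPTION"].fillna("")

has_both = [("roof" in r and "tile" in r) for r in df["DESCRIPTION"]]
print(has_both)
```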

    You can use a regex-based solution here:

    (                      # outer group
        \b(floor|roof)     # floor or roof
        (?:\W+\w+){1,6}\s* # one to six "words"
    )
    \b(tiles?)\b           # tile or tiles
    

    a demo for the regex on regex101.com


    Afterwards, just put the captured parts back together with rx.sub() and apply it to every item of the DESCRIPTION column, so that you end up with the following code:
    import pandas as pd, re
    
    s1 = pd.Series(["After the storm the roof was damaged and some of the tiles are missing"])
    s2 = pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"])
    s3 = pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"])
    
    df = pd.DataFrame([list(s1), list(s2),  list(s3)],  columns =  ["DESCRIPTION"])
    
    rx = re.compile(r'''
                (                      # outer group
                    \b(floor|roof)     # floor or roof
                    (?:\W+\w+){1,6}\s* # one to six "words"
                )
                \b(tiles?)\b           # tile or tiles
                ''', re.VERBOSE)
    
    # apply it to every row of "DESCRIPTION"
    df["DESCRIPTION"] = df["DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x))
    print(df["DESCRIPTION"])
    


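    Note, although your original question is not very clear on this: the solution only finds tile or tiles after roof, which means that a sentence like Can you give me the tile for the roof, please? will not be matched (even though the word tile is within six words of roof).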
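
To cover the reverse order as well, one option is a second pattern with the roles swapped. The sketch below is an illustration under my own assumptions rather than part of the answer: `rx_after`, `rx_before`, and `tag_tiles` are hypothetical names, and a lookahead keeps the later "roof"/"floor" out of the match so that only the tile word is rewritten.

```python
import re

# "roof"/"floor" first, then up to six words, then tile(s).
rx_after = re.compile(r"\b((floor|roof)(?:\W+\w+){0,6}?\W+)(tiles?)\b")

# tile(s) first; the lookahead checks for "roof"/"floor" within the next
# six words without consuming them, so they are left untouched.
rx_before = re.compile(r"\b(tiles?)\b(?=(?:\W+\w+){0,6}?\W+(floor|roof)\b)")


def tag_tiles(text):
    # Pass 1: prefix tile(s) that follow roof/floor.
    text = rx_after.sub(r"\1\2\3", text)
    # Pass 2: prefix tile(s) that precede roof/floor. Words already rewritten
    # to "rooftiles"/"floortiles" no longer match \b(tiles?).
    return rx_before.sub(r"\2\1", text)


print(tag_tiles("Can you give me the tile for the roof, please?"))
```

Applied to a dataframe, this could be used as `df["DESCRIPTION"].apply(tag_tiles)`.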

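    【Comments】:

    • Thanks Jan! This works perfectly! I see what you mean about the regex not working in both directions... I found a way around this by simply running the code twice... I am not sure whether that is the best approach, but it seems to work! I have posted the final code I used as an update.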