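【Posted】: 2017-11-14 16:41:44
【Question】:
I have a text column named "DESCRIPTION" in a data frame. I need to find all instances where the word "tile" or "tiles" occurs within 6 words of the word "roof", and then change "tile/s" to "rooftile/s". I need to do the same for "floor" and "tiles" (changing "tiles" to "floortiles"). This will help us tell which building trade we are looking at when certain words are used in combination with others.
To illustrate what I mean, here is some sample data, followed by my latest (incorrect) attempt: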
import pandas as pd

s1 = pd.Series(["After the storm the roof was damaged and some of the tiles are missing"])
s2 = pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"])
s3 = pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"])
df = pd.DataFrame([list(s1), list(s2), list(s3)], columns=["DESCRIPTION"])
df
The result I am after should look like this (in data frame format):
1. After the storm the roof was damaged and some of the rooftiles are missing
2. I dropped the saw and it fell on the floor and damaged some of the floortiles
3. the roof was leaking and when I checked I saw that some of the tiles were cracked
Here I tried using a regex pattern to replace the word "tiles", but it is completely wrong... Is there a way to do what I am trying to do? I am new to Python...
regex=r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*tiles)"
replacedString=re.sub(regex, r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*rooftiles)", df['DESCRIPTION'])
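For reference, two things appear to go wrong in the attempt above: `re.sub` works on a single string, so passing the whole `df['DESCRIPTION']` column raises an error, and its second argument is a replacement template, in which captured groups are referred to as `\1`, `\2`, ... rather than by repeating the pattern. The sketch below shows the idea on one string only. The pattern is an illustrative assumption (it reads "within 6 words" as at most six words in between), not the asker's or any answerer's exact regex.

```python
import re

# Assumed pattern: "roof", then up to six intervening words, then tile/tiles.
pattern = re.compile(r"\b(roof)\b((?:\W+\w+){0,6}?\W+)(tiles?)\b")

text = "After the storm the roof was damaged and some of the tiles are missing"

# In the template, \1 is "roof", \2 is the text in between, \3 is "tile"/"tiles".
# Emitting \1 again right before \3 produces "rooftiles".
result = pattern.sub(r"\1\2\1\3", text)
print(result)
```

To run this over a whole column, the same call can be wrapped in a function and passed to the column's `apply` method.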
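Update: Solution
Thanks for the help, everyone! I managed to get it working using Jan's code with some additions/tweaks. The final working code is below (using the real file and data rather than the examples):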
import re
import pandas as pd

# project_path, claims_filename and output_filename are defined elsewhere (not shown)
claims_file = pd.read_csv(project_path + claims_filename)  # read input file
# Replace NaN with 'NA': some text was just 'NA' and was read in as NaN, which caused errors
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].fillna('NA')
# create the regex
rx = re.compile(r'''
    (                        # outer group
        \b(floor|roof)       # floor or roof
        (?:\W+\w+){0,6}\s*   # up to six "words"
    )
    \b(tiles?)\b             # tile or tiles
    ''', re.VERBOSE)
# create the reverse regex
rx2 = re.compile(r'''
    (                        # outer group
        \b(tiles?)           # tile or tiles
        (?:\W+\w+){0,6}\s*   # up to six "words"
    )
    \b(floor|roof)\b         # floor or roof
    ''', re.VERBOSE)
# apply it to every row of LOSS_DESCRIPTION:
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x))
# apply the reverse regex; \3\1\3 keeps the trailing floor/roof
# (\3\1\2 would turn "tiles on the roof" into "rooftiles on the tiles"):
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx2.sub(r'\3\1\3', x))
# write results to a CSV file and check them
claims_file.to_csv(project_path + output_filename, index=False, encoding='utf-8')
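To check the behaviour without the real claims file, the approach can be tried on the three sample sentences from the question. The following is a minimal sketch using only the standard `re` module. The helper name `tag_tiles`, the constant names, and the exact patterns are assumptions made for illustration: the patterns capture the in-between words as their own group instead of using a nested outer group, and treat "within 6 words" as at most six words in between.

```python
import re

# Forward: floor/roof, up to six intervening words, then tile/tiles.
FORWARD = re.compile(r"\b(floor|roof)\b((?:\W+\w+){0,6}?\W+)(tiles?)\b")
# Reverse: tile/tiles, up to six intervening words, then floor/roof.
REVERSE = re.compile(r"\b(tiles?)\b((?:\W+\w+){0,6}?\W+)(floor|roof)\b")


def tag_tiles(text):
    """Hypothetical helper: prefix tile/tiles with a nearby floor/roof."""
    # Forward: keep "roof ... " unchanged, then emit roof + tiles.
    text = FORWARD.sub(r"\1\2\1\3", text)
    # Reverse: emit roof + tiles first, then keep " ... roof" unchanged.
    text = REVERSE.sub(r"\3\1\2\3", text)
    return text


samples = [
    "After the storm the roof was damaged and some of the tiles are missing",
    "I dropped the saw and it fell on the floor and damaged some of the tiles",
    "the roof was leaking and when I checked I saw that some of the tiles were cracked",
]

for sentence in samples:
    print(tag_tiles(sentence))
```

The third sentence is left unchanged because more than six words separate "roof" from "tiles". On a data frame, the helper could be applied with something like `df["DESCRIPTION"].apply(tag_tiles)`. Real data may also need `re.IGNORECASE` if capitalisation varies, which this sketch does not handle.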
【Comments】:
- Can you post your desired output?
Tags: python regex conditional