【Posted】: 2022-12-18 09:11:29
【Question】:
Let me try to explain my problem with an example. I have a large corpus and a substring, as shown below:
corpus = """very quick service, polite workers(cory, i think that's his name), i basically just drove there and got a quote(which seems to be very fair priced), then dropped off my car 4 days later(because they were fully booked until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now."""
substring = """until then then i dropped off my car on my appointment day then the same day the shop called me and notified me that the the job is done i can go pickup my car when i go checked out my car i was amazed by the job they ve done to it and they even gave that dirty car a wash prob even waxed it or coated it cuz it was shiny as hell tires shine mats were vacuumed too i gave them a dirty broken car they gave me back a what seems like a brand new car i m happy with the result and i will def have all my car s work done by this place from now"""
The substring and the corresponding corpus text are very similar, but not identical.
If I do something like this:
import re
re.search(substring, corpus, flags=re.I) # returns None: the substring is very similar to the corpus text, but not an exact match
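In the corpus, the matching text looks like the excerpt below. It differs slightly from my substring, which is why the regex search fails. Can someone suggest a good alternative for finding a similar substring?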
until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now
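I did try the difflib library, but it did not meet my use case.
Some background:
The substring I have now was obtained some time ago from a preprocessed version of the corpus, produced with the regular expression re.sub("[^a-zA-Z]", " ", corpus).
Now I need to take that substring, do a reverse lookup in the original corpus text, and find its start and end indices there.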
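To make the mismatch concrete, here is a minimal sketch of how such a substring could arise. The re.sub call is the one quoted above; the whitespace-collapsing step is an assumption (the post does not show it, but the substring has single spaces where the corpus has runs of punctuation), and sample_corpus is a shortened, made-up excerpt.

```python
import re

# Shortened excerpt standing in for the real corpus (illustrative only).
sample_corpus = "until then), then i dropped off my car. i'm happy"

# The step quoted in the post: every non-letter becomes a single space,
# so the result has the same length as the input.
step1 = re.sub("[^a-zA-Z]", " ", sample_corpus)

# Assumed extra step (not shown in the post): collapse runs of whitespace.
# After this, character positions no longer line up with the corpus.
sample_substring = " ".join(step1.split())

print(len(step1) == len(sample_corpus))   # True
print(sample_substring in sample_corpus)  # False
```

Because the positions shift once whitespace is collapsed, an exact search against the original text cannot succeed, and the indices have to be recovered some other way.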
【Comments】:
-
If they differ only in special characters, you can strip those characters and match afterwards: reduced_string = re.sub("[^A-Z]", "", corpus, flags=re.IGNORECASE)
-
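@Chris My use case requires finding the substring in the corpus without removing the special characters from the corpus text. The substring I have was obtained from the corpus preprocessed with re.sub("[^a-zA-Z]", " ", corpus); what I need is a reverse lookup.
-
You don't need to remove the special characters. You can build a mapping of those characters and their indices, replace them the same way you did when you obtained the substring, search for the substring, get the start and end indices, and then use the mapping to translate back to the original text with its special characters.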
-
@IgorMoraru Could you give an example of how to implement this with my data?
-
@user_12 I updated my answer to reflect your edited question
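The index-mapping idea from the comments could be sketched roughly as follows. This is not the answer referred to above, only one possible reading of it: normalize the corpus the same way the substring was produced while recording, for each kept character, its position in the original text. The helper names normalize_with_map and find_span are hypothetical, and the sketch assumes that whitespace runs were collapsed and that case should be ignored.

```python
def normalize_with_map(text):
    """Lowercase ASCII letters, turn every run of other characters into one
    space, and record the original index of each character that is kept."""
    chars, index_map = [], []
    for i, ch in enumerate(text):
        if ch.isascii() and ch.isalpha():
            chars.append(ch.lower())
            index_map.append(i)
        elif chars and chars[-1] != " ":
            # First non-letter after a letter: emit a single separator space.
            chars.append(" ")
            index_map.append(i)
    return "".join(chars), index_map


def find_span(corpus, substring):
    """Return (start, end) indices into the original corpus, or None."""
    norm_corpus, index_map = normalize_with_map(corpus)
    norm_sub = normalize_with_map(substring)[0].strip()
    if not norm_sub:
        return None
    pos = norm_corpus.find(norm_sub)
    if pos == -1:
        return None
    start = index_map[pos]
    # Map the last matched character back, then make the end exclusive.
    end = index_map[pos + len(norm_sub) - 1] + 1
    return start, end


# Made-up demo data in the spirit of the question.
demo_corpus = "Quick service (Cory, I think). Dropped off my car; it's done!"
demo_substring = "cory i think dropped off my car it s done"

span = find_span(demo_corpus, demo_substring)
if span is not None:
    start, end = span
    print(start, end, repr(demo_corpus[start:end]))
```

The returned slice runs from the first matched letter to the last matched letter, so trailing punctuation such as the final "!" is not included. Note that str.find matches the first occurrence and may match in the middle of a word; if that matters, the normalized strings can be padded with spaces before searching.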
Tags: python python-3.x string