【Posted】: 2015-07-18 07:43:46
【Question】:
Suppose I have a string:
string="""<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from other scripts. Both of these typically flow left-to-right within the overall right-to-left context. </p> <p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>"""
I have the positions of a phrase within this string, for example:
>>> pos = [m.start() for m in re.finditer("tells you", string)]
>>> pos
[263, 588]
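From each position, I need to extract a few words before it and a few words after it. How can I do this with Python and regular expressions?
For example: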
def look_through(d, s):
    r = []
    content = readFile(d["path"])
    content = BeautifulSoup(content)
    content = content.getText()
    pos = [m.start() for m in re.finditer(s, content)]
    if pos:
        if "phrase" not in d:
            d["phrase"] = [s]
        else:
            d["phrase"].append(s)
        for p in pos:
            r.append({"content": content, "phrase": d["phrase"], "name": d["name"]})
    for b in d["decendent"] or []:
        r += look_through(b, s)
    return r
>>> dict = {
    "content": """<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from other scripts. Both of these typically flow left-to-right within the overall right-to-left context. </p>""",
    "name": "directory",
    "decendent": [
        {
            "content": """<p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>""",
            "name": "subdirectory",
            "decendent": None
        },
        {
            "content": """It tells you how to use HTML markup for elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)""",
            "name": "subdirectory_two",
            "decendent": [
                {
                    "content": "Name 4",
                    "name": "subsubdirectory",
                    "decendent": None
                }
            ]
        }
    ]
}
So that:
>>> look_through(dict, "tells you")
[
    { "content": "This article tells you how to", "phrase": "tells you", "name": "subdirectory" },
    { "content": "It tells you how to use", "phrase": "tells you", "name": "subdirectory_two" }
]
Thanks!
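【Comments】:
-
Could you add a small example to your question?
-
Have you tried writing this code? You may get better responses when we can see what you have already tried or how you are thinking about solving the problem.
-
It is still unclear how you get "This article tells you how to".
-
I think @Kasra and I are both curious about what you have tried in implementing look_through.
-
@amccormack, I added an example of how to get the positions of the text. Now I want to extract the part of the content around exactly the place where the phrase was found. I see two solutions here: using the positions or using a regular expression.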
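The two routes mentioned in the last comment (slicing around the match positions, or one regular expression that captures the neighbouring words) might look roughly like the sketch below. It is only an illustration under assumptions, not code from the thread: the helper names strip_tags, context_by_regex and context_by_position are hypothetical, the before/after window sizes are invented parameters, and the standard library's html.parser stands in for the BeautifulSoup call in the question.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Stand-in for BeautifulSoup(...).getText(): drop tags, keep text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def context_by_regex(text, phrase, before=2, after=2):
    """Regex route: up to `before` words, the phrase, then up to `after` words."""
    pattern = re.compile(
        r"(?:\S+\s+){0,%d}%s(?:\s+\S+){0,%d}" % (before, re.escape(phrase), after)
    )
    return [m.group(0) for m in pattern.finditer(text)]


def context_by_position(text, phrase, before=2, after=2):
    """Position route: locate each match, then split the text on either side."""
    snippets = []
    for m in re.finditer(re.escape(phrase), text):
        # A slice of [-0:] would return every word, so handle before == 0 separately.
        left = text[:m.start()].split()[-before:] if before else []
        right = text[m.end():].split()[:after]
        snippets.append(" ".join(left + [phrase] + right))
    return snippets


if __name__ == "__main__":
    html = "<p>This article tells you how to write <em>HTML</em></p>"
    text = strip_tags(html)
    print(context_by_regex(text, "tells you"))     # ['This article tells you how to']
    print(context_by_position(text, "tells you"))  # ['This article tells you how to']
```

One difference between the two: re.finditer returns non-overlapping matches, so the regex route can skip an occurrence whose surrounding words were already consumed by the previous match, whereas the position route reports every occurrence. Inside look_through, either helper could supply the "content" value in place of the full text.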
Tags: python regex string search