How to remove text between two double brackets in Python: answers

【Question title】: How to remove text between two double brackets in Python
【Posted】: 2020-07-12 00:30:09
【Question description】:

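I'm taking some Markdown, converting it to HTML, and then parsing out the text without tags, so that I'm left with a clean set of alphanumeric characters.

The problem is that the Markdown contains some custom components that I can't parse out.

Here is an example: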

{{< custom type="phase1" >}}
    Some Text in here (I want to keep this)
{{< /custom >}}

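I want to remove everything between the {{ and }} brackets (including the brackets themselves), while keeping the text that sits between the first instance and the second instance. Essentially, I just want to remove every {{ *? }} in the file. A given file can contain any number of them.

Here is what I've tried: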

def clean_markdown(self, text_string):
  html = markdown.markdown(text_string)
  soup = BeautifulSoup(html, features="html.parser")
  # to_extract = soup.findAll('script') //Tried to extract via soup but no joy as not tags
  cleaned = re.sub(r'([^-.\s\w])+', '', soup.text)
  return cleaned

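This works for everything in the Markdown, except that it keeps the values found between {{ and }}. So in this case the word "custom" ends up in my cleaned text, and I don't want it there.

As you can see, I tried extracting with Beautiful Soup, but that doesn't work because these are not tags and the opening value ({{) differs from the closing value (}}).

Does anyone know how to efficiently implement a parser in Python that cleans this up?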

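【Comments on the question】:

Are the snippets you want to clean always in the same three-line format as in the question?
Get a library that is dedicated to Markdown, or write your own to handle each custom component. Regarding your attempt with BeautifulSoup and regex, see stackoverflow.com/questions/1732348/….
@JackFleeting, no, they are not. There can be instances of brackets within the brackets. E.g.: {{ custom }}Hello, how are you today {{}}? {{/custom}}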

Tags: python string parsing beautifulsoup markdown

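【Solution 1】:

If I understand correctly what you're trying to do, you should be able to use re.sub to replace all {{...}} patterns with an empty string, directly on the text_string parameter: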

def clean_markdown(self, text_string):
    # Greedy: .* runs to the last }} on a line, so this assumes one {{...}} per line.
    return re.sub(r"{{.*}}", "", text_string)

【Comments】:

@JonClements you're right (I assumed there would be only one per line, but that may not actually be the case)
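
The limitation conceded in the comment above can be seen in a short sketch. The sample string and the variable names `greedy` and `lazy` are made up for illustration; only the use of re.sub comes from the answer.

```python
import re

# Hypothetical sample with two shortcodes on the same line.
sample = "{{ custom }}Hello, how are you today?{{ /custom }}"

# Greedy: .* runs to the LAST }} on the line, so the text in between is lost.
greedy = re.sub(r"\{\{.*\}\}", "", sample)

# Non-greedy: .*? stops at the FIRST }}, so each shortcode is removed on its own.
lazy = re.sub(r"\{\{.*?\}\}", "", sample)

print(repr(greedy))  # ''
print(repr(lazy))    # 'Hello, how are you today?'
```

With a single shortcode per line both patterns behave the same, which is why the greedy version works on the three-line example from the question.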

【Solution 2】:

Using a regex match should work fine:

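def clean_markdown(self, text_string):
    html = markdown.markdown(text_string)
    soup = BeautifulSoup(html, features="html.parser")
    # to_extract = soup.findAll('script') //Tried to extract via soup but no joy as not tags
    # Capture the single line sitting between an opening and a closing {{...}} line.
    match = re.match(r"{{.+}}\n(?P<text>.*)\n{{.+}}", soup.text, re.MULTILINE)
    # re.match returns None when the text does not start with exactly this shape;
    # in that case fall back to the unmodified text instead of raising.
    cleaned = match.group('text') if match else soup.text
    return cleaned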
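
One caveat with this approach: re.match only matches at the very beginning of the string, even with re.MULTILINE, so it finds nothing if any text precedes the first shortcode. The following is a minimal sketch of the same idea using search instead. The helper name `extract_inner`, the constant `PATTERN`, and the sample input are hypothetical, and it still assumes the inner text is a single line.

```python
import re

# Same shape as the pattern in the answer, with the braces escaped explicitly.
PATTERN = re.compile(r"\{\{.+\}\}\n(?P<text>.*)\n\{\{.+\}\}")

def extract_inner(text):
    """Return the line between an opening and a closing {{...}} line, or None."""
    found = PATTERN.search(text)  # search scans the whole string, unlike match
    return found.group("text").strip() if found else None

# Hypothetical input: the question's example preceded by an extra line.
sample = (
    "Intro line\n"
    '{{< custom type="phase1" >}}\n'
    "    Some Text in here (I want to keep this)\n"
    "{{< /custom >}}"
)

print(PATTERN.match(sample))   # None, because the string starts with "Intro line"
print(extract_inner(sample))   # Some Text in here (I want to keep this)
```

Note that this returns only the first enclosed line and drops everything else, so it suits extracting one component rather than cleaning a whole file.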

【Comments】:

【Solution 3】:

IIUC, try this:

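import re
string = '{{< custom type="phase1" >}}\n    Some Text in here (I want to keep this)\n{{< /custom >}}'
result = re.sub(r"\{\{.*?\}\}", "", string).strip()  # non-greedy: each {{...}} is removed separately
print(result)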

Output:

Some Text in here (I want to keep this)
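
The same non-greedy pattern can be applied to the example from the question's comments, where several shortcodes share one line. This is a sketch under assumptions, not a full parser: the names `SHORTCODE` and `strip_shortcodes` are hypothetical, re.DOTALL is added on the assumption that a shortcode might span lines, and truly nested shortcodes such as {{ a {{ b }} c }} are not handled.

```python
import re

# Non-greedy so each {{...}} is matched on its own; DOTALL lets a shortcode span lines.
SHORTCODE = re.compile(r"\{\{.*?\}\}", re.DOTALL)

def strip_shortcodes(text):
    """Remove every {{...}} span, then collapse the whitespace left behind."""
    without = SHORTCODE.sub("", text)
    return " ".join(without.split())

example = "{{ custom }}Hello, how are you today {{}}? {{/custom}}"
print(strip_shortcodes(example))  # Hello, how are you today ?
```

Collapsing whitespace also removes line breaks, so this helper is meant for the final extracted text rather than for Markdown that still has to be converted.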

【Comments】: