Match word if not followed or preceded by < or >

【Question title】: Match word if not followed or preceded by < or >
【Posted】: 2017-07-26 15:10:26
【Question description】:

I am trying to avoid matching words that come directly after or directly before an XML tag.

import re

strTest = "<random xml>hello this was successful price<random xml>"

for c in re.finditer(r'(?<![<>])(\b\w+\b)(?<!=[<>])(\W+)',strTest):
     c1 = c.group(1)
     c2 = c.group(2)
     if ('<' != c2[0]) and ('<' != c.group(1)[len(c.group(1))-1]):
          print c1

The result is:

xml
this
was
successful
xml

Desired result:

this
was
successful

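I have been experimenting with negative lookahead and negative lookbehind assertions. I am not sure whether this is the right approach, and I would appreciate any help.

【Question comments】:

You don't use regular expressions to parse XML. Ever. Use an XML parser. Python has one built in. Or install lxml.
Don't use Regexp to parse XML. Use an XML parser.
A trick can be: match what you don't want, but capture what you need. \w*\s*<[^>]*>\s*\w*|(\w+)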
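
The trick in the last comment can be sketched in Python roughly as follows. The first alternative consumes each tag together with any word touching it, and only the second alternative captures, so the words to keep are the matches where group 1 is set. The function name `words_not_touching_tags` is made up for this illustration; `finditer` is used rather than `findall` because `findall` would also return an empty string for every tag match.

```python
import re

# Pattern taken from the comment: tags (plus adjacent words) are matched
# but not captured; free-standing words land in group 1.
TAG_OR_WORD = re.compile(r'\w*\s*<[^>]*>\s*\w*|(\w+)')


def words_not_touching_tags(text):
    """Return the words that were captured, skipping the tag matches."""
    return [m.group(1) for m in TAG_OR_WORD.finditer(text) if m.group(1)]


strTest = "<random xml>hello this was successful price<random xml>"
print(words_not_touching_tags(strTest))  # ['this', 'was', 'successful']
```

Note that the `\s*` parts of the pattern also discard a word that is separated from a tag only by whitespace, which may or may not be what is wanted.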

Tags: python regex

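【Solution 1】:

First, to answer your question directly:

I do this by examining each "word", where a word is a sequence of characters made up of (mainly) letters, '<', or '>'. When the regex passes each one to some_only, I look for either of the latter two characters. If neither appears, I print the "word".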

>>> import re
>>> strTest = "<random xml>hello this was successful price<random xml>"
>>> def some_only(matchobj):
...     if '<' in matchobj.group() or '>' in matchobj.group():
...         pass
...     else:
...         print (matchobj.group())
...         pass
... 
>>> ignore = re.sub(r'[<>\w]+', some_only, strTest)
this
was
successful

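This works for your test string; however, as others have already mentioned, using regex on XML will usually lead to a lot of trouble.

To use a more conventional approach, I had to fix a couple of errors in that XML string, namely changing random xml to random_xml and using a proper end tag.

I prefer to use the lxml library.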

>>> strTest = "<random_xml>hello this was successful price</random_xml>"
>>> from lxml import etree
>>> tree = etree.fromstring(strTest)
>>> tree.text
'hello this was successful price'
>>> tree.text.split(' ')
['hello', 'this', 'was', 'successful', 'price']
>>> tree.text.split(' ')[1:-1]
['this', 'was', 'successful']

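【Comments】:

I really like this solution, but I want to use only the stdlib. How can I do this with xml.etree.ElementTree? By the way, I am running Python 2.7.
@Bman425, it's basically the same. import xml.etree.ElementTree as ET; tree = ET.fromstring(strTest); print tree.text.split(' ')[1:-1]
By the way, some work may be needed here to make this answer more widely applicable, for example walking down the elements and combining .tail and .text; the OP's sample input clearly does not match their actual intent.
Agreed. I was worried that this could easily go beyond the OP's skill level. Indeed: simple question, simple answer.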
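
The suggestion about walking down the elements and combining .text and .tail could look something like the sketch below, using only the standard library. It is written for Python 3 (it relies on `yield from`), and it assumes that a word "touches" a tag when no whitespace separates the two. The helper names `text_runs`, `inner_words` and `words_between_tags`, as well as the nested sample input, are invented for this example and do not come from the question.

```python
import xml.etree.ElementTree as ET


def text_runs(elem):
    """Yield every .text and .tail string under elem, in document order."""
    yield elem.text
    for child in elem:
        yield from text_runs(child)
        yield child.tail


def inner_words(run):
    """Split one run of text and drop the words that touch a tag."""
    if not run:
        return []
    words = run.split()
    # No leading whitespace: the first word sits right after a tag.
    if words and not run[0].isspace():
        words = words[1:]
    # No trailing whitespace: the last word sits right before a tag.
    if words and not run[-1].isspace():
        words = words[:-1]
    return words


def words_between_tags(xml_string):
    root = ET.fromstring(xml_string)
    return [w for run in text_runs(root) for w in inner_words(run)]


# Hypothetical input with a nested element, not taken from the question.
sample = "<doc>hello this was <b>bold text here</b> and more price</doc>"
print(words_between_tags(sample))  # ['this', 'was', 'text', 'and', 'more']
```

Unlike the `[1:-1]` slice, this treats each run of text between two tags separately, so it keeps working when the document contains more than one element.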

【Solution 2】:

I'll give it a shot. Since we are already doing more than just a regex, put the matches in a list and drop the first and last items:

import re

strTest = "<random xml>hello this was successful price<random xml>"

thelist = []

for c in re.finditer(r'(?<![<>])(\b\w+\b)(?<!=[<>])(\W+)',strTest):
     c1 = c.group(1)
     c2 = c.group(2)
     if ('<' != c2[0]) and ('<' != c.group(1)[len(c.group(1))-1]):
          thelist.append(c1)

thelist = thelist[1:-1]

print (thelist)

Result:

['this', 'was', 'successful']

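Personally I would try to parse the XML, but since you already have this code, a slight modification does the trick.

【Comments】:

This works for the example I posed, but I worry it won't scale well. I agree that I should try using an XML parser.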
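
Dropping the first and last items relies on exactly one stray word leaking through at each end. One possible way to avoid the slice is to let the lookarounds do all the filtering. The sketch below assumes that `(?<!=[<>])` in the question was intended as the negative lookahead `(?![<>])`; that reading is a guess, and the name `match_inner_words` and the second test string are made up for illustration.

```python
import re

# Assumption: (?<!=[<>]) in the question was meant as the lookahead (?![<>]).
# A word matches only if it is not directly preceded or followed by < or >.
INNER_WORD = re.compile(r'(?<![<>])\b\w+\b(?![<>])')


def match_inner_words(text):
    return INNER_WORD.findall(text)


strTest = "<random xml>hello this was successful price<random xml>"
print(match_inner_words(strTest))  # ['this', 'was', 'successful']

# Hypothetical string with a tag in the middle, where slicing would fail.
several = "<a>hello this was price<b>more words here end<c>"
print(match_inner_words(several))  # ['this', 'was', 'words', 'here']
```

This is still a regex applied to markup, so the earlier warnings apply: a word inside a tag that touches neither bracket, or a word separated from a tag by whitespace, would be returned as well.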

【Solution 3】:

A simple approach using a list, but I am assuming that a word directly after or before an XML tag is not separated from that tag by a space:

test = "<random xml>hello this was successful price<random xml>"

test = test.split()

new_test = []
for val in test:
    if "<" not in val and ">" not in val:
        new_test.append(val)

print(new_test)

The result will be:

['this', 'was', 'successful']
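
To make the stated assumption concrete, here is a small sketch of the same filter wrapped in a function (the name `words_without_brackets` and the `spaced` input are invented for this example). It shows what happens when a space does separate the tags from their neighbouring words: those neighbours no longer contain a bracket after splitting, so they are kept.

```python
def words_without_brackets(text):
    """Keep the whitespace-separated tokens that contain neither < nor >."""
    return [w for w in text.split() if "<" not in w and ">" not in w]


glued = "<random xml>hello this was successful price<random xml>"
print(words_without_brackets(glued))
# ['this', 'was', 'successful']

# Hypothetical variant with spaces between the tags and the text.
spaced = "<random xml> hello this was successful price <random xml>"
print(words_without_brackets(spaced))
# ['hello', 'this', 'was', 'successful', 'price']
```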

【Comments】:

【Solution 4】:

My solution...

I don't think you need regex at all; you can solve it with a one-line list comprehension:

words = [w for w in test.split() if "<" not in w and ">" not in w]

【Comments】: