Unbalanced parenthesis error with Python and the re library

【Question title】: Unbalanced parenthesis error with Python and the library re
【Posted】: 2021-04-10 14:41:51
【Question】:

I want to remove the hrefs from my dataset, but I get this error: "unbalanced parenthesis"! To remove the "href", I use the following Python code:

data = data.apply(lambda x: re.sub(re.findall(r'\<a(.*?)\>', x)[0], '', x) if (len(re.findall(r'\<a (.*?)\>', x))>0) and ('href' in re.findall(r'\<a (.*?)\>', x)[0]) else x)

After applying this, I get the following error:

/usr/local/lib/python3.6/dist-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   4211             else:
   4212                 values = self.astype(object)._values
-> 4213                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4214 
   4215         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-25-55819437c264> in <lambda>(x)
----> 1 data = data.apply(lambda x: re.sub(re.findall(r'\<a(.*?)\>', x)[0], '', x) if (len(re.findall(r'\<a (.*?)\>', x))>0) and ('href' in re.findall(r'\<a (.*?)\>', x)[0]) else x)
      2 if verbose: print('#'*10 ,'Step - Remove hrefs:'); check_vocab(data, local_vocab)

/usr/lib/python3.6/re.py in sub(pattern, repl, string, count, flags)
    189     a callable, it's passed the match object and must return
    190     a replacement string to be used."""
--> 191     return _compile(pattern, flags).sub(repl, string, count)
    192 
    193 def subn(pattern, repl, string, count=0, flags=0):

/usr/lib/python3.6/re.py in _compile(pattern, flags)
    299     if not sre_compile.isstring(pattern):
    300         raise TypeError("first argument must be string or compiled pattern")
--> 301     p = sre_compile.compile(pattern, flags)
    302     if not (flags & DEBUG):
    303         if len(_cache) >= _MAXCACHE:

/usr/lib/python3.6/sre_compile.py in compile(p, flags)
    560     if isstring(p):
    561         pattern = p
--> 562         p = sre_parse.parse(p, flags)
    563     else:
    564         pattern = None

/usr/lib/python3.6/sre_parse.py in parse(str, flags, pattern)
    867     if source.next is not None:
    868         assert source.next == ")"
--> 869         raise source.error("unbalanced parenthesis")
    870 
    871     if flags & SRE_FLAG_DEBUG:

error: unbalanced parenthesis at position 36

After several hours of trying, I still have no idea how to solve this problem.

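【Comments】:

Independently of the regex, try data.str.replace() instead of the .apply() pattern.
That is far too much for a lambda. Extract the logic into a proper function so we can look at it.
FYI, you don't need to escape < and > in a regular expression.

Tags: python href re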

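【Solution 1】:

The first argument to re.sub() is a regular expression. The strings returned by re.findall() are not regular expressions; they are strings that were found in x. It would be quite a coincidence if one of them happened to be a valid regular expression that also did what you want.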
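To make the failure concrete, here is a minimal sketch using a made-up row (the sample text and URL are assumptions, not data from the question). The text captured by re.findall() contains a lone `)`, so using it as a pattern fails to compile. Wrapping it in re.escape() avoids the error, although it only deletes the attribute text and leaves an empty `<a >` behind.

```python
import re

# Hypothetical sample row: the href happens to contain an unmatched ")".
text = 'Read <a href="https://example.com/intro)">the intro</a> first.'

# With one capture group, findall returns the captured text, not a pattern.
found = re.findall(r'<a (.*?)>', text)[0]

# Using that text as a regex fails, because ")" has no matching "(".
try:
    re.sub(found, '', text)
    outcome = 'compiled'
except re.error as exc:
    outcome = 're.error: ' + str(exc)
print(outcome)

# re.escape() makes every character literal, so the pattern compiles,
# but only the attribute text is removed; the "<a >" shell remains.
escaped_result = re.sub(re.escape(found), '', text)
print(escaped_result)
```

This is why matching the tag directly in re.sub() is the simpler route than feeding findall() results back in.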

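If you want to replace all <a ...>, just use that as the regular expression argument to re.sub(). Then there is no need for the conditional that checks whether the expression matches; if it doesn't, re.sub() simply returns the string unchanged.

You should also check for whitespace after <a, otherwise you will match any tag whose name begins with a.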

data = data.apply(lambda x: re.sub(r'<a\s.*?>', '', x, flags=re.IGNORECASE))

But as mentioned in the comments, pandas has a built-in regex replace operation.

data = data.str.replace(r'<a\s.*?>', '', flags=re.IGNORECASE, regex=True)
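The two claims above (an unmatched string comes back unchanged, and `\s` keeps other tags starting with "a" safe) can be checked in plain Python without pandas. This is only a sketch: the sample strings and the name `ANCHOR_OPEN` are made up for illustration.

```python
import re

# The pattern from the answer, compiled once for reuse.
ANCHOR_OPEN = re.compile(r'<a\s.*?>', flags=re.IGNORECASE)

# Hypothetical sample rows.
samples = [
    'Visit <A HREF="https://example.com">our site</A> today',
    'The <abbr title="World Wide Web">WWW</abbr> is big',
    'no markup here',
]

# Rows without an <a ...> tag pass through untouched, so no condition is needed.
cleaned = [ANCHOR_OPEN.sub('', s) for s in samples]
for row in cleaned:
    print(row)

# Without \s, "<a" followed by anything also swallows the <abbr ...> tag.
loose = re.sub(r'<a.*?>', '', samples[1], flags=re.IGNORECASE)
print(loose)
```

Note that the closing `</A>` stays in the first row, since the pattern only targets the opening tag.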

【Comments】:

【Solution 2】:

Your code contains

re.sub(re.findall(...))

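According to the documentation, re.findall does the following:

Return a list of all non-overlapping matches in the string.

re.sub, however, expects a pattern as its first argument and instead receives a string containing HTML. That is why it raises a regex compile error.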
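As a quick illustration of what re.findall actually hands back, here is a small sketch with an invented HTML snippet. With one capture group the list holds the captured text; with no group it holds the full matches. In both cases the elements are ordinary strings copied from the input, not patterns.

```python
import re

# Hypothetical HTML snippet for illustration.
html = '<p>Hi</p> <a href="/x">x</a> and <a class="c" href="/y">y</a>'

# One capture group: each element is the text captured by the group.
attrs = re.findall(r'<a (.*?)>', html)
print(attrs)

# No capture group: each element is the whole matched tag.
tags = re.findall(r'<a .*?>', html)
print(tags)

# Either way these are plain str objects taken from the input.
print(all(isinstance(item, str) for item in attrs + tags))
```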

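【Comments】:

They use [0] to get the first element of the list.
@Barmar Yes, you're right. Thanks. But it is still a string containing HTML, so it won't work either. I have changed my answer, though.
Hi, 飞天. Are you chasing reputation?
Isn't everyone chasing reputation in some way? :D Is there a problem?
No. People are not here for reputation (most of them). See meta.stackexchange.com/questions/126987/what-is-rep-farming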