Python TypeError on regex [duplicate]: answers

【Question title】: Python TypeError on regex [duplicate]
【Posted】: 2011-07-08 06:07:51
【Question】:

So, I have this code:

url = 'http://google.com'
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read()
links = linkregex.findall(msg)

Python then raises this error:

links = linkregex.findall(msg)
TypeError: can't use a string pattern on a bytes-like object

What am I doing wrong?

【Comments】:

Which version of Python are you running?

Tags: python regex python-3.x typeerror

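【Solution 1】:

Well, my version of Python doesn't have a urllib with a request attribute, but if I use "urllib.urlopen(url)" I don't get a string back, I get an object instead. That is the cause of the type error.

【Discussion】:

Here is a link to the documentation that supports this: docs.python.org/library/urllib.html#urllib.urlopen
Those are the docs for 2.7. The OP said in the comments that he is using 3.1.3.
John, read the docs. The API is still the same.
What I mean is that your version doesn't have the request attribute, but the OP's version does. You are right about the cause of the type error.
Yes, the version was mentioned after I posted my answer. ;)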

【Solution 2】:

If you are running Python 2.6, there is no "request" in "urllib". So the third line becomes:

m = urllib.urlopen(url)

In version 3 you should use this:

links = linkregex.findall(str(msg))

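because 'msg' is a bytes object, not the string that findall() expects. Alternatively, you can decode it using the correct encoding. For example, if "latin1" is the encoding, then: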

links = linkregex.findall(msg.decode("latin1"))
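A minimal offline sketch of the decoding approach, for anyone who wants to try it without a network connection. The hard-coded bytes value is a hypothetical stand-in for m.read(), and UTF-8 is an assumed encoding; a real page's encoding would normally come from its Content-Type header or meta tag.

```python
import re

# Same pattern as in the question, written as a raw string.
linkregex = re.compile(r'<a\s*href=[\'|"](.*?)[\'"].*?>')

# Hypothetical stand-in for m.read(); in Python 3, urlopen(...).read() returns bytes.
msg = b'<a href="http://example.com/a">A</a> <a href=\'/b\'>B</a>'

# Decode the bytes to str first (UTF-8 is assumed here), then apply the str pattern.
links = linkregex.findall(msg.decode("utf-8"))
print(links)
```

With a str pattern and a decoded str, findall() returns a list of str values.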

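【Discussion】:

He said in the comments that he is running 3.1.3, so request is there.
Indeed, I saw that later. So I added a solution for version 3 as well.

【Solution 3】:

The URL you gave for Google didn't work for me, so I replaced it with http://www.google.com/ig?hl=en, which does work for me.

Try this: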

import re
import urllib.request

url="http://www.google.com/ig?hl=en"
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read()
links = linkregex.findall(str(msg))
print(links)

Hope this helps.

【Discussion】:

This only works if your system's default Python encoding is the same as the web page's encoding.
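One more thing worth checking here: in Python 3, str() called on a bytes object without an encoding argument returns the object's repr (text starting with b'), so non-ASCII characters show up as escape sequences. A small sketch with a made-up page body (html_bytes is hypothetical, not fetched from any site):

```python
import re

# Hypothetical page body, encoded as UTF-8 bytes the way read() would return it.
html_bytes = "<title>Café</title>".encode("utf-8")

as_repr = str(html_bytes)             # repr of the bytes object, not decoded text
as_text = html_bytes.decode("utf-8")  # actual decoded text

pattern = re.compile(r"<title>(.+?)</title>")
repr_title = pattern.search(as_repr).group(1)
text_title = pattern.search(as_text).group(1)
print(repr_title)
print(text_title)
```

The regex still matches the str() result for ASCII-only markup, but anything outside ASCII comes back as literal backslash escapes, which is why decoding with the page's actual encoding is the safer route.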

【Solution 4】:

TypeError: can't use a string pattern on a bytes-like object

What am I doing wrong??

You used a string pattern on a bytes object. Use a bytes pattern instead:

linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
                       ^
            Add the b there, it makes it into a bytes object

(P.S.:

 >>> from disclaimer include dont_use_regexp_on_html
 "Use BeautifulSoup or lxml instead."

)

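【Discussion】:

Does it break under Python 2?

【Solution 5】:

The regular expression pattern and the string must be of the same type. If you are matching a regular string, you need a string pattern. If you are matching a byte string, you need a bytes pattern.

In this case m.read() returns a byte string, so you need a bytes pattern. In Python 3, regular strings are unicode strings, and you need the b prefix to write a byte string literal: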

linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
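To make the type-matching rule concrete, here is a short sketch that tries both combinations on the same data. The bytes value is a hypothetical stand-in for m.read(), and the rb prefix (raw bytes literal) is used so the backslashes reach the regex engine unchanged.

```python
import re

text_pattern = re.compile(r'<a\s*href=[\'|"](.*?)[\'"].*?>')    # str pattern
bytes_pattern = re.compile(rb'<a\s*href=[\'|"](.*?)[\'"].*?>')  # bytes pattern

# Hypothetical stand-in for m.read(), which returns bytes in Python 3.
msg = b'<a href="http://example.com/">x</a>'

# Mismatched types: a str pattern applied to bytes raises TypeError.
try:
    text_pattern.findall(msg)
    mismatch_error = None
except TypeError as exc:
    mismatch_error = str(exc)
print(mismatch_error)

# Matching types: bytes pattern on bytes, or str pattern on decoded text.
bytes_links = bytes_pattern.findall(msg)
text_links = text_pattern.findall(msg.decode("ascii"))
print(bytes_links, text_links)
```

Note that the bytes pattern returns bytes results, so they still need decoding if the links are going to be used as text later.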

【Discussion】:

【Solution 6】:

This worked for me in Python 3. Hope this helps.

import urllib.request
import re
urls = ["https://google.com","https://nytimes.com","http://CNN.com"]
i = 0
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)

while i < len(urls) :
    htmlfile = urllib.request.urlopen(urls[i])
    htmltext = htmlfile.read()
    titles = re.search(pattern, str(htmltext))
    print(titles)
    i+=1

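Alternatively, here I added b before the regex to turn it into a bytes object, so it can be matched against the raw bytes directly: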

import urllib.request
import re
urls = ["https://google.com","https://nytimes.com","http://CNN.com"]
i = 0
regex = b'<title>(.+?)</title>'
pattern = re.compile(regex)

while i < len(urls) :
    htmlfile = urllib.request.urlopen(urls[i])
    htmltext = htmlfile.read()
    titles = re.search(pattern, htmltext)
    print(titles)
    i+=1

【Discussion】: