Python 3 HTML parser: answers

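【Question title】: Python 3 HTML parser
【Posted】: 2012-02-10 22:09:31
【Question】:

I'm sure everyone will groan and tell me to look at the documentation (which I have), but I just can't work out how to achieve the same thing as the following: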

curl -s http://www.maxmind.com/app/locate_my_ip | awk '/align="center">/{getline;print}'

All I have so far in python3 is:

import urllib.request

f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')

for lines in f.readlines():
    print(lines)

f.close()

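Seriously, any suggestions are welcome (please don't tell me to read http://docs.python.org/release/3.0.1/library/html.parser.html; I have been learning Python for one day and get confused easily). A simple example would be great!!!

【Question discussion】:

You might prefer this site for getting your IP: you don't need to dig through HTML to find it.
The code you posted is wrong because you lost the indentation (the print(lines) line should be indented).
I know, it kept disappearing when I marked it as code while posting. It is correct in the file.
The code I'm running also fetches the geolocation etc. (in general).

Tags: python bash parsing scraper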

【Solution 1】:

# no need for .readlines here
for ln in f:
    if 'align="center">' in ln:
        print(ln)

But do read the Python tutorial.

【Discussion】:

TypeError: Type str does not support the buffer API
File "ip.py", line 7, in if 'align="center">' in ln: TypeError: Type str does not support the buffer API
@user969617, change 'align="center">' to b'align="center">'.
@Rob Yes, I picked that up from mlfavor's suggestion.
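
To make the fix from the comments concrete, here is a minimal sketch that runs without network access. The sample page and the helper name matching_lines are made up for illustration; io.BytesIO stands in for the object returned by urlopen, since both yield bytes lines when iterated.

```python
import io

# Hypothetical stand-in for the HTTP response: like the object returned by
# urlopen(), io.BytesIO yields bytes lines when you iterate over it.
SAMPLE_PAGE = (
    b'<table>\n'
    b'<td align="center">\n'
    b'203.0.113.7\n'
    b'</td>\n'
    b'</table>\n'
)

def matching_lines(stream, marker=b'align="center">'):
    """Return the decoded lines of a binary stream that contain marker."""
    # The marker is a bytes literal, so it can be searched for in bytes lines.
    return [ln.decode().rstrip() for ln in stream if marker in ln]

print(matching_lines(io.BytesIO(SAMPLE_PAGE)))
```

With a real response you would pass the urlopen result instead of the BytesIO object.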

【Solution 2】:

This builds on larsmans's answer above.

import urllib.request

f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
for line in f:
    if b'align="center">' in line:
        print(next(f).decode().rstrip())
f.close()

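Explanation:

for line in f iterates over the lines of the file-like object f. Python lets you iterate over the lines of a file just as you would over the items of a list.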

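if b'align="center">' in line looks for the byte sequence align="center"> in the current line. The b prefix marks the literal as a bytes object rather than a str. urllib.request.urlopen returns the response body as binary data, not as a Unicode string, whereas an unadorned 'align="center">' is a Unicode string. (That is where the TypeError above comes from.)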
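
The bytes/str mismatch is easy to reproduce on its own. The following sketch uses a made-up line of the kind the response yields; it does not check the exact wording of the error message, which differs between Python 3 versions.

```python
# A made-up line of the kind you get when iterating over the response.
line = b'<td align="center">\n'

# bytes pattern against bytes data: works
assert b'align="center">' in line

# str pattern against bytes data: raises TypeError on Python 3
try:
    'align="center">' in line
except TypeError as exc:
    error = exc
else:
    error = None
print(type(error).__name__)

# Decoding first turns the data into str, so a str pattern works again.
assert 'align="center">' in line.decode()
```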

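next(f) fetches the next line of the file, because your original awk script prints the line after the one containing 'align="center">' rather than the matching line itself. The decode method (bytes objects, like strings, have methods in Python) takes the binary data and converts it to a printable Unicode string; with no arguments it assumes UTF-8. rstrip() strips any trailing whitespace (that is, the newline at the end of each line).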
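
The same steps can be wrapped in a small function and tried offline. This is only a sketch: line_after is a hypothetical helper name, the sample data is invented, and io.BytesIO stands in for the urlopen response. Unlike the snippet above, it passes a default to next() so that a match on the very last line does not raise StopIteration.

```python
import io

def line_after(stream, marker):
    """Return the text of the line following the first line that contains marker."""
    for line in stream:
        if marker in line:
            # next() advances the same iterator the for loop is using;
            # the b'' default avoids StopIteration if the match is the last line.
            return next(stream, b'').decode().rstrip()
    return None  # marker never found

# Made-up sample data standing in for the downloaded page.
page = io.BytesIO(b'<td align="center">\n203.0.113.7\n</td>\n')
print(line_after(page, b'align="center">'))
```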

【Discussion】:

【Solution 3】:

I would probably use a regular expression to get the IP itself:

import re
import urllib.request

f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
html_text = f.read()  # returns bytes on Python 3; see the discussion below
re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', html_text)[0]

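This gives the first string of the form: 1-3 digits, period, 1-3 digits, ...

I think you are after the whole line; you can simply extend the string in the findall() expression to handle that. (See the Python documentation for more details.) By the way, the r in front of the pattern string makes it a raw string, so you don't need to escape Python's escape characters in it (but you still need to escape RE escape characters).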
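
A short sketch of what the raw-string prefix changes and what it does not. The sample text is made up; the pattern is the one from the answer.

```python
import re

# A raw string keeps backslashes literally; a normal string needs them doubled.
assert r'\d{1,3}\.' == '\\d{1,3}\\.'
assert len(r'\n') == 2      # a backslash followed by 'n'
assert len('\n') == 1       # a single newline character

# The dots are still escaped for the regex engine, so only literal dots match.
ip_pattern = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
print(ip_pattern.findall('addr 203.0.113.7 and 1x2x3x4'))
```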

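Hope that helps.

【Discussion】:

Your code gives me: TypeError: can't use a string pattern on a bytes-like object
This is another symptom of the unicode/bytes problem. You need html_text=f.read().decode().
Interesting, is this a Python 2.7 vs. Python 3 effect? I ran the code (on Python 2.7) and it worked. Thanks to mlfavor for pointing out the solution.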
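
Putting the decode fix from the comments together with the regular expression, a minimal offline sketch might look like this. first_ip is a hypothetical helper name and the sample bytes are invented; with a real response you would pass f.read(). The sketch assumes the page is UTF-8, which is what decode() uses by default.

```python
import re

IP_PATTERN = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

def first_ip(raw_html):
    """Return the first dotted-quad string found in raw_html (bytes), or None."""
    html_text = raw_html.decode()          # bytes -> str, UTF-8 by default
    matches = re.findall(IP_PATTERN, html_text)
    return matches[0] if matches else None

# Made-up sample standing in for the result of f.read().
sample = b'<td align="center">\n203.0.113.7\n</td>\n'
print(first_ip(sample))
```

An alternative that avoids decoding is to keep the data as bytes and use a bytes pattern (an rb'...' literal), in which case the matches are bytes as well.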