How to isolate a part of an HTML page in Python 3

【Question title】: How to isolate a part of HTML page in Python 3
【Posted】: 2016-07-17 11:02:11
【Question】:

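I wrote a simple script that retrieves a page's source code, but I would like to "isolate" just the IPs so that I can save them to a proxy.txt file. Any suggestions?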

import urllib.request

url = "https://www.inforge.net/xi/threads/dichvusocks-us-15h10-pm-update-24-24-good-socks.455588/"
# Decode the response bytes (UTF-8 assumed); str() on bytes would write the b'...' repr instead.
sourcecode = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
with open("proxy.txt", "w", encoding="utf-8") as out_file:
    out_file.write(sourcecode)

【Comments】:

Tags: html python-3.x text

【Solution 1】:

I added a few lines to your code. The only problem is that the UI version (check the page source) also gets picked up as an IP address.

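import urllib.request
import re

url = "https://www.inforge.net/xi/threads/dichvusocks-us-15h10-pm-update-24-24-good-socks.455588/"
# Decode the response bytes (UTF-8 assumed); str() on bytes would write the b'...' repr instead.
sourcecode = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
with open("proxy.txt", "w", encoding="utf-8") as out_file:
    out_file.write(sourcecode)

ip = []
with open("proxy.txt", encoding="utf-8") as fp:
    for line in fp:
        # Collect matches from every line instead of keeping only the last line's matches.
        ip.extend(re.findall(r'(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})', line))

for addr in ip:
    print(addr)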

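Update: This is what you are looking for. BeautifulSoup can extract just the data we need from the page by using CSS classes, but it has to be installed with pip. You don't need to save the page to a file.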

from bs4 import BeautifulSoup
import urllib.request
import re

# Download the raw page; BeautifulSoup accepts the bytes directly.
html = urllib.request.urlopen('https://www.inforge.net/xi/threads/dichvusocks-us-15h10-pm-update-24-24-good-socks.455588/').read()
soup = BeautifulSoup(html, "html.parser")

# Find the post bodies by their CSS class name
msg_content = soup.find_all("div", class_="messageContent")

# Raw string so that \d and \. reach the regex engine unchanged
ips = re.findall(r'(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})', str(msg_content))

for addr in ips:
    print(addr)
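
Since the question asks for the IPs to end up in proxy.txt, here is a minimal sketch of that last step. It uses only the standard library and works on any text, for example the string built from the message content above. The helper names `extract_ips` and `save_ips` are hypothetical, not part of the original answer, and the stricter pattern plus the `ipaddress` check are assumptions about what counts as a usable address.

```python
import ipaddress
import re

# Four dot-separated groups of 1-3 digits that are not part of a longer dotted number.
IP_PATTERN = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?!\d|\.\d)")


def extract_ips(text):
    """Return the unique, valid IPv4 addresses found in text, in order of appearance."""
    found = []
    for candidate in IP_PATTERN.findall(text):
        try:
            ipaddress.IPv4Address(candidate)  # rejects out-of-range octets such as 999
        except ValueError:
            continue
        if candidate not in found:
            found.append(candidate)
    return found


def save_ips(ips, path="proxy.txt"):
    """Write one address per line to path (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write("\n".join(ips) + "\n")
```

Calling `save_ips(extract_ips(str(msg_content)))` would then write only the addresses to the file rather than the whole page. Note that a four-part version number still looks like a valid address to this check.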

【Comments】:

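Thank you very much! That's a starting point! But maybe it's possible to focus on just one part of the HTML page (in this case
), so that the script prints only the IPs? Thanks again anyway.
Silly me. "ip" is a list, so I can just remove the items I don't want from it.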
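
The idea in the last comment, removing unwanted entries from the list of matches, could look like the sketch below. The function name `drop_unwanted` and the sample values are made up for illustration; the real list would come from `re.findall` as in the answer.

```python
def drop_unwanted(addresses, unwanted):
    """Return addresses without the entries listed in unwanted, keeping the order."""
    blocked = set(unwanted)
    return [addr for addr in addresses if addr not in blocked]


# Made-up sample data: the first entry stands in for a version number
# that the regex mistook for an IP address.
ip = ["1.5.2.0", "203.0.113.7", "198.51.100.23"]
print(drop_unwanted(ip, ["1.5.2.0"]))  # ['203.0.113.7', '198.51.100.23']
```

Building a new list avoids the pitfalls of deleting items from a list while looping over it.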

【Solution 2】:

Why don't you use re? I would need the page source to describe a specific approach.

【Comments】:
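
Without the page source, only a generic sketch of the re approach is possible. The one below assumes the proxies appear as plain text in the form `ip` or `ip:port`; that format is an assumption, not something confirmed by the question. The names `PROXY_PATTERN` and `find_proxies` are hypothetical.

```python
import re

# An IPv4-looking address, optionally followed by ":port" (assumed format).
PROXY_PATTERN = re.compile(
    r"(?<![\d.])(?P<ip>(?:\d{1,3}\.){3}\d{1,3})(?::(?P<port>\d{1,5}))?(?!\d|\.\d)"
)


def find_proxies(text):
    """Return (ip, port) tuples; port is None when no ':port' suffix follows."""
    return [(m.group("ip"), m.group("port")) for m in PROXY_PATTERN.finditer(text)]
```

The pattern does not check octet ranges, so the results may still need filtering before they are saved.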