Cannot open html file in Python: answers

【Question title】: Cannot open html file in Python
【Posted】: 2015-04-02 05:02:36
【Question description】:

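I'm trying to count how many hyperlinks an html file contains. To do this, I want to read the html file in Python and search for all the </a> closing anchor tags. However, when I try to pass an html file to my Python code, I get this error message:

"UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1819: ordinal not in range(128)"

However, if I copy and paste the same text into a txt file, my code works. My code is as follows: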

def links(filename):
    infile = open(filename)
    content = infile.read()
    infile.close()
    anchorTagEnd = content.count("</a>")
    return anchorTagEnd

print(links("DePaul CDM - College of Computing and Digital Media.html"))

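【Question comments】:

Which Python version are you using? Unicode handling works somewhat differently in Python 3 than in Python 2. And how are you getting the HTML? There are several ways to do that in Python, and it's hard to help you fix your code without knowing what it is doing.
I'm using 3.4.2. I'm getting the HTML simply by using the function written above, passing the html file to it inside the print call. That's all the code I have so far.
Sorry, I didn't realize the HTML file was already on your hard drive: I assumed you were using Python to download the HTML from a website. My mistake.
Your HTML file appears to contain Unicode. When you open the file, you should tell the open() function which specific Unicode encoding the file uses; it's probably utf-8, but the encoding should be mentioned near the top of the HTML file. See the official Python docs for more information. An HTML-aware file opener can read that information itself, but the generic open() does not.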
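
The advice in the last comment can be sketched with nothing but the built-in open(). This is a minimal sketch, assuming the file is encoded as UTF-8 (an assumption; the charset declared near the top of the HTML file is the one to use). The name count_links and its encoding parameter are illustrative and not part of the original question.

```python
def count_links(filename, encoding="utf-8"):
    """Count closing </a> tags in a file read with an explicit encoding."""
    # encoding="utf-8" is an assumption; use the charset the HTML file declares.
    # Without it, open() falls back to the platform default, which is what
    # produced the 'ascii' codec error in the question.
    with open(filename, encoding=encoding) as infile:
        content = infile.read()
    return content.count("</a>")
```

Calling count_links("DePaul CDM - College of Computing and Digital Media.html") would then read the file as UTF-8 instead of ASCII. If the file declares a different charset, pass it as the second argument.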

Tags: python html python-3.x html-parsing

【Solution 1】:

Why not use an HTML parser to count the links in the HTML file?

Using BeautifulSoup:

from bs4 import BeautifulSoup

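def links(filename):
    # Open in binary mode so BeautifulSoup detects the encoding itself,
    # and name the parser explicitly (html.parser ships with Python).
    with open(filename, "rb") as infile:
        soup = BeautifulSoup(infile, "html.parser")
    return len(soup.find_all('a'))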

print(links("DePaul CDM - College of Computing and Digital Media.html"))

Using lxml.html:

import lxml.html

def links(filename):
    tree = lxml.html.parse(filename)
    # XPath count() returns a single float, not a list, so convert it directly.
    return int(tree.xpath('count(//a)'))

print(links("DePaul CDM - College of Computing and Digital Media.html"))
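
If installing third-party packages is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. This is only a sketch under assumptions: the file is taken to be UTF-8, and the names AnchorCounter and count_anchor_tags are made up for illustration. It counts opening <a> tags, not closing ones.

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Counts opening <a> tags as the parser encounters them."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag == "a":
            self.count += 1


def count_anchor_tags(filename, encoding="utf-8"):
    # encoding="utf-8" is an assumption about the file, not stated in the question.
    parser = AnchorCounter()
    with open(filename, encoding=encoding) as infile:
        parser.feed(infile.read())
    parser.close()
    return parser.count
```

Unlike a plain string count of "</a>", this also matches tags written in upper case, such as <A HREF="...">.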

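【Comments】:

Unfortunately, I'm limited to the functions we've learned in class. We haven't learned any HTML parsing functions, so if an html file can't be parsed without a special parser, I may just have to save everything in a txt file and run my code on that.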