How to open an .html file? Answers

【Question title】: How to open html file?
【Posted】: 2015-01-30 07:52:46
【Question】:

I have an HTML file named test.html that contains a single word, בדיקה.

I open test.html and print its contents with this code:

file = open("test.html", "r")
print file.read()

But it prints ??????. Why does this happen, and how can I fix it?

By the way, it works fine when I open a text file.

Edit: I tried this:

>>> import codecs
>>> f = codecs.open("test.html",'r')
>>> print f.read()
?????

【Question comments】:

Read up on Unicode and UTF-8.
You need to open the file as UTF-8. stackoverflow.com/questions/491921/…
If it still doesn't work, please post the page you are trying to process.

Tags: python python-2.7 character-encoding

【Solution 1】:

You can use 'urllib' to read an HTML page.

# Python 2.x
import urllib

page = urllib.urlopen("your path ").read()
print page

【Comments】:

How do I work with page? For example, reading specific words from it. Can I use page like a string?
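To illustrate the question in the comment above, here is a minimal Python 3 sketch. In Python 2 the value returned by read() is already a byte string that supports the usual string methods; in Python 3 it is a bytes object and is normally decoded first. The sample content and the "utf-8" encoding are assumptions made for illustration, not something taken from the answer.

```python
# Stand-in for the bytes returned by urlopen(...).read(); the content is hypothetical.
page = "<html><body>hello בדיקה world</body></html>".encode("utf-8")

# Decode to text first. "utf-8" is an assumption: use the page's real encoding.
text = page.decode("utf-8")

# After decoding, the ordinary str operations apply.
has_word = "בדיקה" in text      # substring test
position = text.find("hello")   # index of the first match, or -1 if absent

print(has_word, position)
```

If the encoding is not known in advance, decoding with the wrong one either raises UnicodeDecodeError or silently produces the wrong characters, so it is worth checking it first.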

【Solution 2】:

import codecs
f=codecs.open("test.html", 'r')
print f.read()

Try something like this.

【Comments】:

I also tried codecs.open("test.html",'r','utf-8'), but I get a Unicode decode error when I print f.read()!
I'm using a terminal!!
I got this error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0: invalid continuation byte
>>> import sys >>> print sys.stdout.encoding UTF-8
The file is not encoded as UTF-8; it is windows-1255!
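The last comment points at the likely cause: the bytes on disk are windows-1255, so they have to be decoded as windows-1255 rather than UTF-8. The sketch below reproduces that situation with a temporary file. The file content and the temporary path are assumptions for illustration; io.open is used because it accepts an encoding argument on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile

# Build a small stand-in for test.html, saved as windows-1255 (hypothetical content).
path = os.path.join(tempfile.mkdtemp(), "test.html")
with io.open(path, "w", encoding="windows-1255") as f:
    f.write(u"<p>בדיקה</p>")

# Reading with the encoding the file was written in gives the text back.
with io.open(path, "r", encoding="windows-1255") as f:
    text = f.read()

# Reading the same bytes as UTF-8 fails, like the traceback quoted in the comments.
utf8_error = None
try:
    with io.open(path, "r", encoding="utf-8") as f:
        f.read()
except UnicodeDecodeError as exc:
    utf8_error = exc.reason

print(utf8_error)
```

Whether the decoded text then displays correctly still depends on the terminal being able to show Hebrew characters.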

【Solution 3】:

Use codecs.open with an encoding argument.

import codecs
f = codecs.open("test.html", 'r', 'utf-8')

【Comments】:

【Solution 4】:

You can use the following code:

from __future__ import division, unicode_literals 
import codecs
from bs4 import BeautifulSoup

f=codecs.open("test.html", 'r', 'utf-8')
# Name the parser explicitly; without it bs4 picks one itself and emits a warning
document = BeautifulSoup(f.read(), "html.parser").get_text()
print(document)

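If you want to drop all the blank lines in between and collect the words into a single string (also skipping special characters and numbers), then also include:

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumption: stop is NLTK's English stop-word list; the original never defines it
stop = set(stopwords.words("english"))
st = ""  # accumulates the kept words
docwords = word_tokenize(document)
for line in docwords:
    line = line.rstrip()
    # keep only purely alphabetic (A-Z, a-z) tokens
    if line and re.match("^[A-Za-z]*$", line):
        if line not in stop and len(line) > 1:
            st = st + " " + line
print(st)

*st has to start out as an empty string, i.e. st = "".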
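For readers without bs4 or nltk installed, a rough standard-library approximation of the same idea (extract the text, keep only the words) might look like the sketch below. This swaps in html.parser and a regular expression in place of BeautifulSoup and word_tokenize, so it is not the answer's method. The names TextExtractor and html_to_words are hypothetical, and no stop-word filtering is done.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of a document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_words(html):
    """Return the words of two or more characters found in the HTML text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # [^\W\d_] matches word characters other than decimal digits and
    # underscores, so Hebrew words are kept as well as ASCII ones.
    return [w for w in re.findall(r"[^\W\d_]+", text) if len(w) > 1]


sample = "<html><body><p>בדיקה 123</p>\n\n<p>a test page</p></body></html>"
words = html_to_words(sample)
print(len(words), "words kept")
```

Unlike the pattern ^[A-Za-z]*$ used above, this one does not discard non-ASCII words, which matters for a file whose only word is Hebrew. Note that handle_data also receives the contents of script and style elements, which a real page would probably want to skip.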

【Comments】:

【Solution 5】:

You can also use 'urllib' in Python 3, the same way as in https://stackoverflow.com/a/27243244/4815313, with only small changes.

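# Python 3
# Import the submodule explicitly: a bare "import urllib" does not reliably
# make urllib.request available
import urllib.request

page = urllib.request.urlopen("/path/").read()
print(page)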

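【Comments】:

AttributeError: 'module' object has no attribute 'request'
@tommy.carstensen Maybe you should take a look at this: urllib python3
Thanks. I'm familiar with that documentation. The indentation is wrong, and it should be import urllib.request.

【Solution 6】:

I ran into this problem today as well. I am on Windows, and the system language defaults to Chinese, so other people may hit a similar Unicode error. Just add encoding='utf-8':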

with open("test.html", "r", encoding='utf-8') as f:
    text= f.read()
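The reason the system language matters is that, in Python 3, open() without an encoding argument falls back to a locale-dependent default. A small sketch to check what that default is on a given machine (the helper name default_text_encoding is hypothetical):

```python
import locale


def default_text_encoding():
    """Encoding that open() uses when no encoding argument is passed."""
    return locale.getpreferredencoding(False)


print(default_text_encoding())
```

On a Windows system set to Chinese this typically reports a legacy code page such as cp936 rather than UTF-8, which would explain the error described above; passing encoding explicitly makes the result independent of the machine.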

【Comments】:

【Solution 7】:

Code:

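import codecs

path = "D:\\Users\\html\\abc.html"
# "rb" with no encoding argument returns the raw bytes of the file
file = codecs.open(path, "rb")
file1 = file.read()
file.close()
# Assumption: the file is UTF-8. decode() gives text, whereas str() on bytes
# yields the "b'...'" repr under Python 3
file1 = file1.decode("utf-8")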

【Comments】:

【Solution 8】:

You can simply use this:

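import requests

# url is a placeholder for the address of the page
response = requests.get(url)
print(response.text)  # body decoded to text by requests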

【Comments】: