【Question Title】:UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte in my code
【Posted】:2018-10-15 01:32:24
【Question】:

When I try to run the following code, it gives me an error.
I have installed every Python module, including nltk. I also added lxml and numpy, but it still doesn't work. I am using Python 3, so I have changed urllib2 to urllib.request.
Please help me find a solution.
I run it as:

python index.py

My index file looks like this. Here is the code:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import ssl
import os 
import nltk
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
import codecs 


def checkChar(token):
    for char in token:
        if(0 <= ord(char) and ord(char) <= 64) or (91 <= ord(char) and ord(char) <= 96) or (123 <= ord(char)):
            return False 
        else:
            continue

    return True 

def cleanMe(html):
    soup = BeautifulSoup(html, "html.parser")
    for script in soup(["script", "style"]):
        script.extract()

    text = soup.get_text()

    lines = (line.strip() for line in text.splitlines())

    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))

    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text

path = 'crawled_html_pages/'
index = {}
docNum = 0 
stop_words = set(stopwords.words('english'))

for filename in os.listdir(path):

    collection = {}

    docNum += 1

    file = codecs.open('crawled_html_pages/' + filename, 'r', 'utf-8')

    page_text = cleanMe(file)

    tokens = nltk.word_tokenize(page_text)

    filtered_sentence = []

    breakWord = ''

    for w in tokens:
        if w not in stop_words:
            filtered_sentence.append(w.lower())

    for token in filtered_sentence:
        if len(token) == 1 or token == 'and':
            continue
        if checkChar(token) == False:
            continue
        if token == 'giants':
            breakWord = token
            continue
        if token == 'brady' and breakWord == 'giants':
            break
        if token not in collection:
            collection[token] = 0
        collection[token] += 1

    for token in collection:
        if token not in index:
            index[token] = ''
        index[token] = index[token] + '(' + str(docNum) + ', ' + str(collection[token]) + ")"

    if docNum == 500:
        print(index)
        break
    else:
        continue

f = open('index.txt', 'w')
vocab = open('uniqueWords.txt', 'w')
for term in index:
    f.write(term + ' =>' + index[term])
    vocab.write(term + '\n')
    f.write('\n')
f.close()
vocab.close()

print('Finished...')

These are the errors I get:

> C:\Users\myworld>python index.py
Traceback (most recent call last):
  File "index.py", line 49, in <module>
    page_text = cleanMe(file)
  File "index.py", line 22, in cleanMe
    soup = BeautifulSoup(html, "html.parser")
  File "C:\Users\furqa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\beautifulsoup4-4.6.0-py3.6.egg\bs4\__init__.py", line 191, in __init__
  File "C:\Users\furqa\AppData\Local\Programs\Python\Python36-32\lib\codecs.py", line 700, in read
    return self.reader.read(size)
  File "C:\Users\furqa\AppData\Local\Programs\Python\Python36-32\lib\codecs.py", line 503, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

【Comments】:

  • There are dozens of similar questions on here. Have you read them?

Tags: python html beautifulsoup html-parsing python-unicode


【Solution 1】:

You can change the encoding BeautifulSoup uses via the from_encoding parameter. Note that from_encoding is only honored when BeautifulSoup is given bytes (open the file in binary mode), and the parser argument should be kept:

soup = BeautifulSoup(html, "html.parser", from_encoding="iso-8859-8")
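To see why the byte fails and what the two common fixes look like, here is a minimal stdlib-only sketch (assuming the crawled pages are really in a Windows-1252-style encoding rather than UTF-8 — byte 0x80 is the euro sign in cp1252; the sample bytes below are hypothetical):

```python
# Byte 0x80 is not a valid UTF-8 start byte, which is exactly what the
# traceback reports. Hypothetical sample bytes for illustration:
raw = b'price \x80 100'

# Option 1: decode with a single-byte encoding such as cp1252,
# where 0x80 maps to the euro sign.
text_cp1252 = raw.decode('cp1252')
print(text_cp1252)              # price € 100

# Option 2: keep UTF-8 but substitute undecodable bytes instead of raising.
text_replaced = raw.decode('utf-8', errors='replace')
print(text_replaced)            # price � 100
```

Applied to the question's code, the same idea means either opening the page in binary mode so from_encoding can take effect, e.g. `BeautifulSoup(open(path, 'rb'), 'html.parser', from_encoding='cp1252')`, or loosening the strict `codecs.open` call to `codecs.open(path, 'r', 'utf-8', errors='replace')` so bad bytes no longer raise. Which source encoding is correct depends on the crawled pages, so cp1252 here is an assumption.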

【Discussion】:
