【Question Title】:UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte in my code
【Posted】:2018-10-15 01:32:24
【Question】:

When I try to run the following code, it gives me an error.
I have installed every Python module, including nltk. I also added lxml and numpy, but it still doesn't work. I am using Python 3, so I have changed urllib2 to urllib.request.
Please help me find a solution.
I run it as:

python index.py

My index file looks like this. Here is the code:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import ssl
import os 
import nltk
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
import codecs 


def checkChar(token):
    for char in token:
        if(0 <= ord(char) and ord(char) <= 64) or (91 <= ord(char) and ord(char) <= 96) or (123 <= ord(char)):
            return False 
        else:
            continue

    return True 

def cleanMe(html):
    soup = BeautifulSoup(html, "html.parser")
    for script in soup(["script", "style"]):
        script.extract()

    text = soup.get_text()

    lines = (line.strip() for line in text.splitlines())

    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))

    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text

path = 'crawled_html_pages/'
index = {}
docNum = 0 
stop_words = set(stopwords.words('english'))

for filename in os.listdir(path):

    collection = {}

    docNum += 1

    file = codecs.open('crawled_html_pages/' + filename, 'r', 'utf-8')

    page_text = cleanMe(file)

    tokens = nltk.word_tokenize(page_text)

    filtered_sentence = []

    breakWord = ''

    for w in tokens:
        if w not in stop_words:
            filtered_sentence.append(w.lower())

    for token in filtered_sentence:
        if len(token) == 1 or token == 'and':
            continue
        if checkChar(token) == False:
            continue
        if token == 'giants':
            breakWord = token
            continue
        if token == 'brady' and breakWord == 'giants':
            break
        if token not in collection:
            collection[token] = 0
        collection[token] += 1

    for token in collection:
        if token not in index:
            index[token] = ''
        index[token] = index[token] + '(' + str(docNum) + ', ' + str(collection[token]) + ")"

    if docNum == 500:
        print(index)
        break
    else:
        continue

f = open('index.txt', 'w')
vocab = open('uniqueWords.txt', 'w')
for term in index:
    f.write(term + ' =>' + index[term])
    vocab.write(term + '\n')
    f.write('\n')
f.close()
vocab.close()

print('Finished...')

These are the errors I get:

> C:\Users\myworld>python index.py
Traceback (most recent call last):
  File "index.py", line 49, in <module>
    page_text = cleanMe(file)
  File "index.py", line 22, in cleanMe
    soup = BeautifulSoup(html, "html.parser")
  File "C:\Users\furqa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\beautifulsoup4-4.6.0-py3.6.egg\bs4\__init__.py", line 191, in __init__
  File "C:\Users\furqa\AppData\Local\Programs\Python\Python36-32\lib\codecs.py", line 700, in read
    return self.reader.read(size)
  File "C:\Users\furqa\AppData\Local\Programs\Python\Python36-32\lib\codecs.py", line 503, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

【Comments】:

  • There are dozens of similar questions on here. Have you read them?

Tags: python html beautifulsoup html-parsing python-unicode


【Solution 1】:

You can change the encoding BeautifulSoup uses via the from_encoding parameter. Note that from_encoding is only honored when BeautifulSoup is given bytes (open the file in binary mode), and the parser argument should be kept:

soup = BeautifulSoup(html, "html.parser", from_encoding="iso-8859-8")
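To see why the byte fails and what the two common fixes look like, here is a minimal stdlib-only sketch (assuming the crawled pages are really in a Windows-1252-style encoding rather than UTF-8 — byte 0x80 is the euro sign in cp1252; the sample bytes below are hypothetical):

```python
# Byte 0x80 is not a valid UTF-8 start byte, which is exactly what the
# traceback reports. Hypothetical sample bytes for illustration:
raw = b'price \x80 100'

# Option 1: decode with a single-byte encoding such as cp1252,
# where 0x80 maps to the euro sign.
text_cp1252 = raw.decode('cp1252')
print(text_cp1252)              # price € 100

# Option 2: keep UTF-8 but substitute undecodable bytes instead of raising.
text_replaced = raw.decode('utf-8', errors='replace')
print(text_replaced)            # price � 100
```

Applied to the question's code, the same idea means either opening the page in binary mode so from_encoding can take effect, e.g. `BeautifulSoup(open(path, 'rb'), 'html.parser', from_encoding='cp1252')`, or loosening the strict `codecs.open` call to `codecs.open(path, 'r', 'utf-8', errors='replace')` so bad bytes no longer raise. Which source encoding is correct depends on the crawled pages, so cp1252 here is an assumption.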

【Discussion】:
