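【Posted】: 2021-08-19 19:56:31
【Question】:
I am trying to preprocess a corpus so that it returns a list of cleaned strings, but I keep getting the error "expected string or bytes-like object".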
import nltk
from nltk import word_tokenize, sent_tokenize
nltk.download('punkt')
import time
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
def load_sentences(file):
    """
    Reads a COHA sample file and pre-processes it into a list of strings.
    """
    sentences = []
    with open(file) as f:
        for line in f:
            sentences.append(line)
    return sentences
corpus = load_sentences('1800_sample.txt')
corpus
def preprocessing(corpus):
    """
    Takes a collection of sentences and returns a cleaned version.
    Complete this function by applying techniques like tokenisation,
    non-word filtering, stop-word removal and stemming to clean the input.
    :return : a list of strings containing cleaned sentences
    :rtype : list(str)
    """
    clean_text = []
    # TODO: Pre-process corpus and add cleaned sentences to clean_text
    # word tokenisation
    # separate out words and strings of punctuation into separate white spaced words
    corpus = re.sub(r"(\w)([.,;:!?'\"”\)])", r"\1 \2", corpus)
    corpus = re.sub(r"([.,;:!?'\"“\(])(\w)", r"\1 \2", corpus)
    #print("tokenising:", text)
    # no other spelling normalization done for now
    tokens = re.split(r"\s+",corpus)
    tokens = clean_text
    return clean_text
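A likely cause, judging only from the code shown: load_sentences returns a list of strings, while re.sub and re.split expect a single string, so passing the whole list raises the TypeError quoted above. The last two lines also assign the empty clean_text over tokens, so the function would return an empty list even without the error. Below is a minimal sketch of one way to restructure it, applying the same two regexes to each sentence in turn. The sample list is a hypothetical stand-in for the contents of 1800_sample.txt, and stop-word removal and stemming are left out.

```python
import re

def preprocessing(corpus):
    """Return one cleaned string per input sentence (regex tokenisation only)."""
    clean_text = []
    for sentence in corpus:
        # re.sub needs a str, so work on one sentence at a time
        sentence = re.sub(r"(\w)([.,;:!?'\"”\)])", r"\1 \2", sentence)
        sentence = re.sub(r"([.,;:!?'\"“\(])(\w)", r"\1 \2", sentence)
        # split() with no arguments splits on whitespace and drops empty tokens
        tokens = sentence.split()
        clean_text.append(" ".join(tokens))
    return clean_text

# Hypothetical stand-in for the lines read from the COHA sample file
sample = ["Hello, world!\n", "It was (very) cold.\n"]
print(preprocessing(sample))
```

Each returned string has punctuation separated from words by single spaces, which matches the list(str) return type described in the docstring.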
【Discussion】:
Tags: python nlp data-science nltk