【Question Title】:Corpus preprocessing
【Posted】:2021-08-19 19:56:31
【Question Description】:

I am trying to preprocess a corpus so that it returns a list of cleaned strings, but I keep getting the error "expected string or bytes-like object".

import nltk
from nltk import word_tokenize, sent_tokenize
nltk.download('punkt')
import time
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
def load_sentences(file):
    """
    Reads a COHA sample file and pre-processes it into a list of strings.
    """
    sentences = []
    with open(file) as f:
        for line in f:
            sentences.append(line)          
    return sentences
corpus = load_sentences('1800_sample.txt')
corpus

(screenshot of the resulting sentence list)

def preprocessing(corpus):
    """
    Takes a collection of sentences and returns a cleaned version. 
    
    Complete this function by applying techniques like tokenisation, 
    non-word filtering, stop-word removal and stemming to clean the input.
    
    :return : a list of strings containing cleaned sentences
    :rtype : list(str)
    """
    clean_text = []
    # TODO: Pre-process corpus and add cleaned sentences to clean_text
    # word tokenisation
    # separate out words and strings of punctuation into separate white spaced words
    corpus = re.sub(r"(\w)([.,;:!?'\"”\)])", r"\1 \2", corpus)
    corpus = re.sub(r"([.,;:!?'\"“\(])(\w)", r"\1 \2", corpus)
    #print("tokenising:", text)
    # no other spelling normalization done for now
    tokens = re.split(r"\s+",corpus)
    tokens = clean_text   
    return clean_text

(screenshot of the error traceback)

【Comments】:

    Tags: python nlp data-science nltk


    【Solution 1】:

    Your preprocessing function sets clean_text to an empty list and then returns it. An empty list is neither a string nor a bytes-like object.

    You probably meant that assignment the other way round, assigning the processed tokens to clean_text rather than overwriting tokens with the empty list. Just make sure you rebuild the strings before returning.
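    As an illustration, here is a minimal sketch of how the function could be completed. It reuses the question's regexes, iterates over the sentences one at a time (re.sub expects a string, not a list), and rebuilds a string per sentence before appending. The small STOP_WORDS set is a placeholder assumption; in practice you would likely use nltk.corpus.stopwords and a stemmer such as nltk.stem.PorterStemmer.

    ```python
    import re

    # Illustrative placeholder; NLTK's English stop-word list would normally be used.
    STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

    def preprocessing(corpus):
        """Takes a list of sentences and returns a cleaned list of strings."""
        clean_text = []
        for sentence in corpus:  # process one string at a time
            # separate words from adjacent punctuation, as in the question
            sentence = re.sub(r"(\w)([.,;:!?'\"”\)])", r"\1 \2", sentence)
            sentence = re.sub(r"([.,;:!?'\"“\(])(\w)", r"\1 \2", sentence)
            tokens = re.split(r"\s+", sentence.strip())
            # keep lower-cased alphabetic tokens that are not stop words
            tokens = [t.lower() for t in tokens
                      if t.isalpha() and t.lower() not in STOP_WORDS]
            clean_text.append(" ".join(tokens))  # rebuild a string per sentence
        return clean_text
    ```

    Because each element of the returned list is a plain string, downstream calls such as CountVectorizer.fit_transform will no longer raise "expected string or bytes-like object".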

    【Discussion】:
