【Question Title】:Corpus preprocessing
【Posted】:2021-08-19 19:56:31
【Question Description】:

I am trying to preprocess a corpus so that it returns a list of cleaned strings, but I keep getting the error "expected string or bytes-like object".

import nltk
from nltk import word_tokenize, sent_tokenize
nltk.download('punkt')
import time
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
def load_sentences(file):
    """
    Reads a COHA sample file and pre-processes it into a list of strings.
    """
    sentences = []
    with open(file) as f:
        for line in f:
            sentences.append(line)          
    return sentences
corpus = load_sentences('1800_sample.txt')
corpus

(screenshot of the resulting sentence list)

def preprocessing(corpus):
    """
    Takes a collection of sentences and returns a cleaned version. 
    
    Complete this function by applying techniques like tokenisation, 
    non-word filtering, stop-word removal and stemming to clean the input.
    
    :return : a list of strings containing cleaned sentences
    :rtype : list(str)
    """
    clean_text = []
    # TODO: Pre-process corpus and add cleaned sentences to clean_text
    # word tokenisation
    # separate out words and strings of punctuation into separate white spaced words
    corpus = re.sub(r"(\w)([.,;:!?'\"”\)])", r"\1 \2", corpus)
    corpus = re.sub(r"([.,;:!?'\"“\(])(\w)", r"\1 \2", corpus)
    #print("tokenising:", text)
    # no other spelling normalization done for now
    tokens = re.split(r"\s+",corpus)
    tokens = clean_text   
    return clean_text

(screenshot of the error traceback)

【Comments】:

    Tags: python nlp data-science nltk


    【Solution 1】:

    Your preprocessing function sets clean_text to an empty list and then returns it. An empty list is neither a string nor a bytes-like object.

    You probably meant that assignment the other way round, assigning the processed tokens to clean_text rather than overwriting tokens with the empty list. Just make sure you rebuild the strings before returning.
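    As an illustration, here is a minimal sketch of how the function could be completed. It reuses the question's regexes, iterates over the sentences one at a time (re.sub expects a string, not a list), and rebuilds a string per sentence before appending. The small STOP_WORDS set is a placeholder assumption; in practice you would likely use nltk.corpus.stopwords and a stemmer such as nltk.stem.PorterStemmer.

    ```python
    import re

    # Illustrative placeholder; NLTK's English stop-word list would normally be used.
    STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

    def preprocessing(corpus):
        """Takes a list of sentences and returns a cleaned list of strings."""
        clean_text = []
        for sentence in corpus:  # process one string at a time
            # separate words from adjacent punctuation, as in the question
            sentence = re.sub(r"(\w)([.,;:!?'\"”\)])", r"\1 \2", sentence)
            sentence = re.sub(r"([.,;:!?'\"“\(])(\w)", r"\1 \2", sentence)
            tokens = re.split(r"\s+", sentence.strip())
            # keep lower-cased alphabetic tokens that are not stop words
            tokens = [t.lower() for t in tokens
                      if t.isalpha() and t.lower() not in STOP_WORDS]
            clean_text.append(" ".join(tokens))  # rebuild a string per sentence
        return clean_text
    ```

    Because each element of the returned list is a plain string, downstream calls such as CountVectorizer.fit_transform will no longer raise "expected string or bytes-like object".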

    【Discussion】:
