[Question Title]: How to filter out non-English data from csv using pandas
[Posted]: 2018-12-27 09:00:19
[Question Description]:

I am currently writing code to extract the most frequent words from my csv file. It worked fine until I plotted a bar chart and it was full of strange words. I don't know why; perhaps some foreign words are involved, but I don't know how to fix this.

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, KFold
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import matplotlib
from matplotlib import pyplot as plt
import sys
sys.setrecursionlimit(100000)
# import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

data = pd.read_csv("C:\\Users\\Administrator\\Desktop\\nlp_dataset\\commitment.csv", encoding='cp1252',na_values=" NaN")

data.shape
data['text'] = data['text'].fillna('none')
def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    #replacing the punctuations with no space,
    #which in effect deletes the punctuation marks
    translator = str.maketrans('', '', string.punctuation)
    #return the text stripped of punctuation marks
    return text.translate(translator)

#Apply the function to each example
data['text'] = data['text'].apply(remove_punctuation)
data.head(10)

#Removing stopwords -- extract the stopwords
#extracting the stopwords from nltk library
sw= stopwords.words('english')
#displaying the stopwords
np.array(sw)

# function to remove stopwords
def remove_stopwords(text):
    '''a function for removing stopwords'''
    #removing the stop words and lowercasing the selected words
    text = [word.lower() for word in text.split() if word.lower() not in sw]
    #joining the list of words with space separator
    return " ".join(text)

# Apply the function to each example
data['text'] = data['text'].apply(remove_stopwords)
data.head(10)

# Top words before stemming  
# create a count vectorizer object
count_vectorizer = CountVectorizer()
# fit the count vectorizer using the text data
count_vectorizer.fit(data['text'])
# collect the vocabulary items used in the vectorizer
dictionary = count_vectorizer.vocabulary_.items() 

#store the vocab and counts in a pandas dataframe
vocab = []
count = []
#iterate through each vocab item and append the key and value to the designated lists
for key, value in dictionary:
    vocab.append(key)
    count.append(value)
#store the counts in a pandas Series with vocab as index
vocab_bef_stem = pd.Series(count, index=vocab)
#sort the dataframe
vocab_bef_stem = vocab_bef_stem.sort_values(ascending = False)

# Bar plot of top words before stemming
top_vocab = vocab_bef_stem.head(20)
top_vocab.plot(kind = 'barh', figsize=(5,10), xlim = (1000, 5000))

What I want is a bar chart of the most frequent words sorted by count, but right now it only shows non-English words that all have the same frequency. Please help me.

[Question Discussion]:

Tags: python pandas nlp jupyter-notebook


[Solution 1]:

The problem is that you are not sorting the vocabulary by the counts, but by the unique IDs that the count vectorizer assigns to each term.

count_vectorizer.vocabulary_.items() 

This does not contain the count of each feature. The count_vectorizer does not store per-feature counts at all.

That is why the plot shows the rarest / misspelled words from your corpus (those words happen to receive the larger unique IDs). The way to get the word counts is to apply the transform to your text data and then sum the counts of each word over all documents.
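To make the distinction concrete, here is a minimal sketch with a made-up two-document corpus (the documents are purely illustrative, not from the question's data): vocabulary_ only maps each term to its column index in the document-term matrix, while the actual frequencies only appear after summing that matrix over the documents.

from sklearn.feature_extraction.text import CountVectorizer

# two illustrative documents, not from the question's data
docs = ["apple banana apple", "banana cherry"]

cv = CountVectorizer()
X = cv.fit_transform(docs)

# vocabulary_ maps each term to its column index in X, not to a count
print(cv.vocabulary_)           # {'apple': 0, 'banana': 1, 'cherry': 2}

# the actual frequencies come from summing the document-term matrix over rows
print(X.toarray().sum(axis=0))  # [2 2 1]  -> apple=2, banana=2, cherry=1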

By default, tf-idf removes punctuation, and you can also pass a list of stop words for the vectorizer to remove. Your code can be reduced to the following.

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document ?',
]

sw= stopwords.words('english')

count_vectorizer = CountVectorizer(stop_words=sw)
X = count_vectorizer.fit_transform(corpus)
vocab = pd.Series(X.toarray().sum(axis=0), index=count_vectorizer.get_feature_names())
vocab.sort_values(ascending=False).plot.bar(figsize=(5,5), xlim = (0, 7))

Plug in your text data column instead of corpus. The output of the above snippet will be a bar plot of the word counts.
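As a rough sketch of how this could be applied back to the question's DataFrame (assuming the column is still data['text']; the keep_english_tokens helper and the "ASCII letters only" reading of "non-English" are assumptions added here, not part of the original answer):

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

sw = stopwords.words('english')

# hypothetical helper: keep only tokens made of ASCII letters,
# one possible way to drop "non-English" words as the title asks
def keep_english_tokens(text):
    return " ".join(tok for tok in text.split() if tok.isascii() and tok.isalpha())

data['text'] = data['text'].fillna('none').apply(keep_english_tokens)

count_vectorizer = CountVectorizer(stop_words=sw)
X = count_vectorizer.fit_transform(data['text'])

# sum counts over all documents and plot the 20 most frequent words
# (newer scikit-learn versions name this method get_feature_names_out())
vocab = pd.Series(X.toarray().sum(axis=0), index=count_vectorizer.get_feature_names())
vocab.sort_values(ascending=False).head(20).plot(kind='barh', figsize=(5, 10))

Here data is the DataFrame loaded in the question; for very large vocabularies, X.toarray() can be memory-hungry, in which case summing the sparse matrix directly with X.sum(axis=0) avoids materializing the dense array.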

[Discussion]:
