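[Posted]: 2020-03-02 00:07:09
[Question]:
I am trying to build a text classifier that determines whether an abstract describes an access-to-care research project. I am importing from a dataset with two fields: Abstract and Accessclass. Abstract is a 500-word description of the project; Accessclass is 0 for not access-related and 1 for access-related. I am still in the development phase, but when I look at the unigrams and bigrams for the 0 and 1 labels, they are identical, even though the tone of the texts is very different. Is something missing from my code? For example, am I accidentally double-counting the negatives or the positives? Any help is appreciated.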
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
df = pd.read_excel("accessclasses.xlsx")
df.head()
from io import StringIO
col = ['accessclass', 'abstract']
df = df[col]
df = df[pd.notnull(df['abstract'])]
df.columns = ['accessclass', 'abstract']
df['category_id'] = df['accessclass'].factorize()[0]
category_id_df = df[['accessclass', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'accessclass']].values)
df.head()
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=4, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df.abstract).toarray()
labels = df.category_id
print(features.shape)
from sklearn.feature_selection import chi2
N = 2
# For each class, rank the features by their chi-squared score against that class.
for accessclass, category_id in sorted(category_to_id.items()):
    features_chi2 = chi2(features, labels == category_id)
    indices = np.argsort(features_chi2[0])
    # scikit-learn >= 1.0; older releases used tfidf.get_feature_names()
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(accessclass))
    print(" . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-N:])))
    print(" . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-N:])))
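One possible explanation for the identical lists, offered as a sketch rather than a confirmed diagnosis: with only two classes, `labels == 0` and `labels == 1` are exact complements, and scikit-learn's `chi2` sums its statistic over both the "in class" and "not in class" groups, so the two calls return the same score for every feature. The toy matrix `X`, the labels `y`, and the mean-difference step below are assumptions made up for illustration; they are not part of the original code or data.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy non-negative feature matrix (rows: documents, columns: terms).
# The values are invented for illustration only.
X = np.array([
    [3.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
    [0.0, 4.0, 1.0],
    [0.0, 3.0, 2.0],
])
y = np.array([0, 0, 1, 1])

# Score every feature against "is class 0" and against "is class 1".
scores_for_0, _ = chi2(X, y == 0)
scores_for_1, _ = chi2(X, y == 1)

# With two classes the targets are complements, so the scores coincide.
same_scores = np.allclose(scores_for_0, scores_for_1)
print(same_scores)

# chi2 gives the strength of the association, not its direction.
# One assumed way to recover direction: compare the mean feature value per class.
mean_diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
leans_to_1 = mean_diff > 0
print(leans_to_1)
```

If this is what is happening, the ranking itself is not a bug: the top terms are the ones that best separate the two classes, and a direction check such as `leans_to_1` indicates which class each term favours.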
[Comments]:
-
Try setting the ngram_range parameter of TfidfVectorizer to (1, 2). So your vectorizer should be tfidf = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, min_df=4, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
-
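I get a keyword argument error because ngram_range=(1, 2) appears twice in that call. However, I think ngram_range in my code above is already set to (1, 2). Maybe I am missing something?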
-
Sorry, I didn't see that.
-
Can you share a sample of the data?
-
Sure... I have added it here: github.com/inthetoast/pythonstuff/blob/master/…
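On the keyword argument error raised in the comments: Python rejects a call that passes the same keyword twice before the call ever runs, so the suggested constructor fails regardless of the data. The sketch below compiles the suggested call from a string to show the rejection, then builds the vectorizer with `ngram_range` given once, which matches the settings already used in the question. The variable names `duplicated_call`, `duplicate_rejected`, and `configured_range` are hypothetical and exist only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The call suggested in the comment, kept as a string so it can be
# compiled (checked for syntax) without being executed.
duplicated_call = (
    "TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, min_df=4, "
    "norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')"
)

try:
    compile(duplicated_call, "<comment>", "eval")
    duplicate_rejected = False
except SyntaxError:
    # Python refuses a repeated keyword argument at compile time.
    duplicate_rejected = True

# The same settings with ngram_range passed once, as in the question.
tfidf = TfidfVectorizer(
    sublinear_tf=True,
    min_df=4,
    norm='l2',
    encoding='latin-1',
    ngram_range=(1, 2),
    stop_words='english',
)
configured_range = tfidf.get_params()["ngram_range"]
print(duplicate_rejected, configured_range)
```

Since the question's vectorizer already extracts unigrams and bigrams, the `ngram_range` setting is unlikely to be the cause of the identical lists.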
Tags: python scikit-learn nlp text-classification sklearn-pandas