【Posted】: 2018-10-20 17:05:22
【Question】:
I have a dataset like this:
| COD| COMPDESC| CDESCR
0| 10| STRUCTURE:BODY:DOOR| AUTOMATIC DOOR LOCKS WHEN USED, WILL NOT RELEA...
1| 18| VEHICLE SPEED CONTROL| VEHICLE SUDDENLY ACCELERATED OUT OF CONTROL, B...
2| 24| STEERING:WHEEL AND HANDLE BAR| STEERING WHEEL BOLTS LOOSENEDAND ROCKED BACK A...
3| 40| SUSPENSION:FRONT:MACPHERSON STRUT| MISALIGNMENT, CAUSING VEHICLE TO PULL TO THE R...
4| 55| STEERING:WHEEL AND HANDLE BAR| DUE TO DEFECT STEERING BOLTS, STEERING WHEEL I...
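After stemming with NLTK and applying CountVectorizer, I tried predicting with Naive Bayes and an SVM, but the accuracy is far lower than in the article, which used a dataset of only 20,000 rows (mine has 1 million rows, but because of memory limits I can only use 100,000 rows at a time).
I tried ngram_range=(1,1) and ngram_range=(1,2), with almost identical results. The latter needs more memory, so I had to reduce the number of rows being processed.
What can I do to improve the accuracy? Better data cleaning may be one approach, but what else would help, given that I am already stemming and removing stop words (including digits)?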
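import csv
import random
import pandas as pd
from sklearn.model_selection import train_test_split
# num_lines (total lines in the file) and size (rows to keep) are defined earlier (not shown).
# The row indices to skip - make sure 0 is not included to keep the header!
skip_idx = random.sample(range(1, num_lines), num_lines - size)
dataset = pd.read_csv('SIMPLE_CMPL.txt', skiprows=skip_idx,
                      delimiter=',', quoting=csv.QUOTE_ALL, header=0, encoding="ISO-8859-1",
                      skip_blank_lines=True)  # csv.QUOTE_ALL == 1, the value the original quoting=True stood for
train_data, test_data = train_test_split(dataset, test_size=0.3)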
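The snippet above uses num_lines and size without showing how they are obtained. A minimal sketch of one way to build the skip list, assuming the file has a single header line and no quoted fields containing line breaks; sample_skip_indices is a hypothetical helper, not part of pandas:

```python
import random


def sample_skip_indices(path, size, encoding="ISO-8859-1", seed=None):
    """Return sorted row indices to skip so that `size` data rows remain.

    Hypothetical helper: line 0 is assumed to be the header and is never skipped.
    """
    # Count physical lines by streaming, without loading the file into memory.
    with open(path, encoding=encoding) as fh:
        num_lines = sum(1 for _ in fh)
    data_rows = num_lines - 1  # exclude the header line
    if size >= data_rows:
        return []  # the file is already small enough, skip nothing
    rng = random.Random(seed)
    # Sample from 1..num_lines-1 so index 0 (the header) is always kept.
    return sorted(rng.sample(range(1, num_lines), data_rows - size))
```

The returned list could be passed as skiprows to pd.read_csv. Passing a seed makes the sample repeatable, which helps when comparing models across runs.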
from sklearn.feature_extraction import text
import string
# Converted to a list because recent scikit-learn versions validate stop_words as a list.
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(
    ["tt", 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']).union(string.punctuation))
# Stemming Code
import numpy as np
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
stemmer = SnowballStemmer("english", ignore_stopwords=True)
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        # Stem each item the default analyzer returns; with ngram_range=(1,2)
        # that includes bigram strings, which are stemmed as a single token.
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]
stemmed_count_vect = StemmedCountVectorizer(stop_words=my_stop_words, ngram_range=(1,2))
text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect), ('tfidf', TfidfTransformer(use_idf=False)),
                             ('mnb', MultinomialNB(fit_prior=False, alpha=0.01))])
text_mnb_stemmed = text_mnb_stemmed.fit(train_data['CDESCR'], train_data['COMPID'])
predicted_mnb_stemmed = text_mnb_stemmed.predict(test_data['CDESCR'])
np.mean(predicted_mnb_stemmed == test_data['COMPID'])
# 0.6255
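A single accuracy figure does not show which component classes are being confused. A minimal sketch of a per-class breakdown, assuming the true and predicted labels are plain iterables of equal length; per_class_accuracy is a hypothetical helper and the toy labels are invented for illustration:

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Map each true label to (accuracy on that label, number of samples)."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: (hits[label] / n, n) for label, n in totals.items()}


# Toy labels, invented for illustration only.
y_true = ["DOOR", "DOOR", "STEERING", "STEERING", "STEERING", "SPEED"]
y_pred = ["DOOR", "STEERING", "STEERING", "STEERING", "DOOR", "SPEED"]
result = per_class_accuracy(y_true, y_pred)
for label, (acc, n) in sorted(result.items()):
    print(f"{label:10s} accuracy={acc:.2f} support={n}")
```

With the variables from this question, the call would be per_class_accuracy(test_data['COMPID'], predicted_mnb_stemmed). scikit-learn's classification_report gives a similar view with precision and recall per class.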
# Stemming Code
import numpy as np
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
stemmer = SnowballStemmer("english", ignore_stopwords=True)
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        # Stem each token produced by the default analyzer.
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]
stemmed_count_vect = StemmedCountVectorizer(stop_words=my_stop_words, ngram_range=(1,1))
# n_iter was replaced by max_iter in later scikit-learn releases, and the value must be an int.
text_svm_stemmed = Pipeline([('vect', stemmed_count_vect), ('tfidf', TfidfTransformer(use_idf=True)),
                             ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=0.001,
                                                       max_iter=int(np.ceil(10**6 / train_data['COD'].count())),
                                                       random_state=60))])
text_svm_stemmed = text_svm_stemmed.fit(train_data['CDESCR'], train_data['COMPID'])
predicted_svm_stemmed = text_svm_stemmed.predict(test_data['CDESCR'])
np.mean(predicted_svm_stemmed == test_data['COMPID'])
# 0.6299
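Since memory is what limits the run to 100,000 rows, one direction that may be worth trying is out-of-core training, where the model sees the data in batches. A minimal sketch, assuming scikit-learn's HashingVectorizer (which keeps no vocabulary in memory) and SGDClassifier.partial_fit; train_out_of_core is a hypothetical helper, stemming is left out for brevity, and the toy batches are invented for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def train_out_of_core(batches, classes, n_features=2**18):
    """Fit a linear SVM incrementally over an iterable of (texts, labels) batches."""
    # HashingVectorizer is stateless, so transform() needs no prior fit().
    vectorizer = HashingVectorizer(n_features=n_features, ngram_range=(1, 2),
                                   stop_words="english", alternate_sign=False)
    clf = SGDClassifier(loss="hinge", penalty="l2", alpha=0.001, random_state=60)
    for texts, labels in batches:
        X = vectorizer.transform(texts)
        # Every label must be declared up front, because a batch may miss some.
        clf.partial_fit(X, labels, classes=classes)
    return vectorizer, clf


# Toy batches, invented for illustration only.
batches = [
    (["door lock will not release", "steering wheel bolts loosened"], [10, 24]),
    (["automatic door locks jammed", "steering wheel rocked back"], [10, 24]),
]
vectorizer, clf = train_out_of_core(batches, classes=np.array([10, 24]))
predicted = clf.predict(vectorizer.transform(["door lock stuck"]))
```

On the real file, the batches could come from pd.read_csv with the chunksize parameter, yielding each chunk's CDESCR and COMPID columns. Whether this improves accuracy is untested here; it only removes the need to hold the whole count matrix in memory.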
【Discussion】:
Tags: python machine-learning scikit-learn nltk text-classification