【发布时间】:2015-12-18 21:39:29
【问题描述】:
我想使用一些特征来训练朴素贝叶斯分类器来分类“A”或“非A”。
我具有不同值类型的三个特征: 1) total_length - 正整数 2) 元音比率 - 小数/分数 3) twoLetters_lastName - 一个包含多个双字母字符串的数组
# coding=utf-8
from nltk.corpus import names
import nltk
import random
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from sklearn.naive_bayes import GaussianNB
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# Import data into pandas
data = pd.read_csv('XYZ.csv', header=0, encoding='utf-8',
low_memory=False)
df = DataFrame(data)
# Randomize records
df = df.reindex(np.random.permutation(df.index))
# Assign column into label Y
df_Y = df[df.AScan.notnull()][['AScan']].values # Labels are 'A' or 'non-A'
#print df_Y
# Assign column vector into attribute X
df_X = df[df.AScan.notnull()][['total_length', 'vowel_ratio', 'twoLetters_lastName']].values
#print df_X[0:10]
# Incorporate X and Y into ML algorithms
clf = GaussianNB()
clf.fit(df_X, df_Y)
df_Y 如下:
[[u'non-A']
[u'A']
[u'non-A']
...,
[u'A']
[u'non-A']
[u'non-A']]
df_X 如下:
[[9L 0.222222222 u"[u'ke', u'el', u'll', u'ly']"]
[17L 0.41176470600000004
u"[u'ma', u'ar', u'rg', u'ga', u'ar', u'ri', u'is']"]
[11L 0.454545455 u"[u'du', u'ub', u'bu', u'uc']"]
[11L 0.454545455 u"[u'ma', u'ah', u'he', u'er']"]
[15L 0.333333333 u"[u'ma', u'ag', u'ge', u'ee']"]
[13L 0.307692308 u"[u'jo', u'on', u'ne', u'es']"]
[12L 0.41666666700000005
u"[u'le', u'ef', u'f\\xe8', u'\\xe8v', u'vr', u're']"]
[15L 0.26666666699999997 u"[u'ni', u'ib', u'bl', u'le', u'et', u'tt']"]
[15L 0.333333333 u"[u'ki', u'in', u'ns', u'sa', u'al', u'll', u'la']"]
[11L 0.363636364 u"[u'mc', u'cn', u'ne', u'ei', u'il']"]]
我收到此错误:
E:\Program Files Extra\Python27\lib\site-packages\sklearn\naive_bayes.py:150: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Traceback (most recent call last):
File "C:werwer\wer\wer.py", line 32, in <module>
clf.fit(df_X, df_Y)
File "E:\Program Files Extra\Python27\lib\site-packages\sklearn\naive_bayes.py", line 163, in fit
self.theta_[i, :] = np.mean(Xi, axis=0)
File "E:\Program Files Extra\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 2727, in mean
out=out, keepdims=keepdims)
File "E:\Program Files Extra\Python27\lib\site-packages\numpy\core\_methods.py", line 69, in _mean
ret, rcount, out=ret, casting='unsafe', subok=False)
TypeError: unsupported operand type(s) for /: 'unicode' and 'long'
我的理解是我需要将特征转换为一个 numpy 数组作为特征向量,但我不认为我是否正在准备这个 X 向量,因为它包含非常不同的值类型。
【问题讨论】:
-
让我们从顶部的错误开始。在回溯之前,错误提示您需要重塑您的 df_Y。你试过搞砸吗?
-
我不确定它想要什么最终格式。我唯一能想到的就是在每行之间添加一个“,”。那是问题吗?明天早上我会尝试编码它,因为它已经很晚了
标签: python-2.7 pandas machine-learning scikit-learn nltk