【Question Title】: ValueError: setting an array element with a sequence. Decision Tree
【Posted】: 2018-06-26 13:43:57
【Question】:

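I think the problem is with my variable "info.venue". It holds string values, which I encoded with LabelEncoder and OneHotEncoder. When I fit the decision tree with only the other two variables, it works like a charm. But when I include "info.venue" after one-hot encoding it, I get the error below.

The error is "ValueError: setting an array element with a sequence."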

info.toss.decision info.toss.winner  info.venue
        field            Australia  Shere Bangla National Stadium
        field            Australia  Adelaide Oval
        field            Australia  Melbourne Cricket Ground
        bat              Australia  Brabourne Stadium
        bat              Australia  Melbourne Cricket Ground
        bat              Australia  Sydney Cricket Ground
        bat              Australia  Punjab Cricket Association 
        field            India      Kensington Oval, Bridgetown
        field            India      Stadium Australia
        field            India      Saurashtra Cricket Association Stadium
        bat              India      Kingsmead
        bat              India      Melbourne Cricket Ground
        bat              India      R Premadasa Stadium

The code is as follows:

Encoding the data with LabelEncoder and OneHotEncoder

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
onehotencoder = OneHotEncoder()
df['info.toss.decision'] = labelencoder.fit_transform(df['info.toss.decision'])
df['info.toss.winner'] = labelencoder.fit_transform(df['info.toss.winner'])
df['info.outcome.winner'] = labelencoder.fit_transform(df['info.outcome.winner'])
df['info.venue'] = labelencoder.fit_transform(df['info.venue'])
df['info.venue'] = onehotencoder.fit_transform(df[['info.venue']])

Selecting specific columns from the data frame

X = df[['info.venue','info.toss.decision','info.toss.winner']]
Y = df[['info.outcome.winner']]

Splitting the dataset into training and test sets

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25)

Fitting the decision tree classifier to the training set

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'gini', random_state = 0)
classifier.fit(X_train, y_train)

The "info.venue" column looks like this:

info.venue

Kingsmead
Melbourne Cricket Ground
Brabourne Stadium
Kensington Oval, Bridgetown
Stadium Australia
Melbourne Cricket Ground
R Premadasa Stadium
Saurashtra Cricket Association Stadium
Shere Bangla National Stadium
Adelaide Oval
Melbourne Cricket Ground
Sydney Cricket Ground
Punjab Cricket Association IS Bindra Stadium, Mohali

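【Comments】:

Could you please post your program's actual input and output?
Please check the update.
Please focus on the variable 'info.venue', because I think that is where I am going wrong.
@Dark Could you write the code and show me where I should make the change?

Tags: python pandas scikit-learn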

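【Solution 1】:

This error occurs because you are trying to assign a 2D array to a single column in pandas.

OneHotEncoder returns a sparse matrix by default, which pandas interprets as an object array. pandas therefore accepts the assignment and broadcasts the complete 2D object to every row of the data frame. The error is then raised when the DecisionTree is fitted.

So you need to change it to: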

# Convert the sparse one-hot output to a dense 2D array
ohe_data = onehotencoder.fit_transform(df[['info.venue']]).toarray()
# Add one new column per encoded category
for i in range(ohe_data.shape[1]):
    df['infovenue_one_coded_' + str(i)] = ohe_data[:, i]

Then drop your original column from the data frame:

new_df = df.drop('info.venue', axis=1)

Then pass this new_df to the decision tree.
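The steps above can be put together as a minimal end-to-end sketch. The toy rows, the `venue_` column prefix, and the variable names `encoder`, `ohe_cols`, and `ohe_df` are assumptions made for illustration; they are not from the question's real dataset. The column names are taken from `categories_`, which is an alternative to numbering the columns in a loop.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data mirroring the question's columns
# (the non-venue columns are assumed to be label-encoded already).
df = pd.DataFrame({
    "info.venue": ["Kingsmead", "Adelaide Oval", "Kingsmead", "Stadium Australia"],
    "info.toss.decision": [0, 1, 1, 0],
    "info.toss.winner": [0, 0, 1, 1],
    "info.outcome.winner": [0, 1, 1, 0],
})

encoder = OneHotEncoder()
# .toarray() turns the sparse result into a dense (n_rows, n_categories) array.
ohe_data = encoder.fit_transform(df[["info.venue"]]).toarray()

# One named column per venue; categories_[0] lists the venues in column order.
ohe_cols = ["venue_" + str(name) for name in encoder.categories_[0]]
ohe_df = pd.DataFrame(ohe_data, columns=ohe_cols, index=df.index)

# Swap the original string column for the one-hot columns.
new_df = pd.concat([df.drop("info.venue", axis=1), ohe_df], axis=1)

X = new_df.drop("info.outcome.winner", axis=1)
y = new_df["info.outcome.winner"]

classifier = DecisionTreeClassifier(criterion="gini", random_state=0)
classifier.fit(X, y)
```

Every cell of `X` is now a single number, so the tree receives an ordinary 2D table instead of a column whose cells each hold a whole matrix.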

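Update:

Since you first one-hot encode the data and only then split it into training and test sets, I suggest using pd.get_dummies(), which replaces both LabelEncoder and OneHotEncoder in your code.

Replace these lines: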

df['info.venue']=labelencoder.fit_transform(df['info.venue'])
df['info.venue']=onehotencoder.fit_transform(df[['info.venue']])

with

new_df = pd.concat([df, pd.get_dummies(df['info.venue'])], axis=1)
new_df = new_df.drop('info.venue', axis=1)
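As a rough sketch of the encode-then-split order described in this update, `pd.get_dummies()` can also encode several string columns in one call before the data is split. The toy rows below are assumptions for illustration. `train_test_split` is imported from `sklearn.model_selection` here, because the `sklearn.cross_validation` module used in the question was removed in later scikit-learn releases.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with raw string values, as in the question.
df = pd.DataFrame({
    "info.venue": ["Kingsmead", "Adelaide Oval", "Kingsmead", "Stadium Australia"],
    "info.toss.decision": ["bat", "field", "field", "bat"],
    "info.outcome.winner": ["India", "Australia", "Australia", "India"],
})

# Encode all string feature columns at once; each new column is named
# "<original column>_<category>".
features = pd.get_dummies(df[["info.venue", "info.toss.decision"]], dtype=int)
target = df["info.outcome.winner"]

# Split only after encoding, so train and test share the same columns.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=0
)
print(features.columns.tolist())
```

Encoding before the split keeps the dummy columns identical in both subsets, even when a venue appears in only one of them.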

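【Comments】:

Can you write the code and show it, @Dark? I don't follow the code above because it has some unknown variables, such as "enc.n_values_". Could you use concat and get_dummies and help me?
What are enc.n_values_ and 'infovenue_one_coded_'? @Vivek
@MayurMahajan I have corrected the code. `infovenue_one_coded_` is just the new name given to the columns that contain the one-hot encoded data.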

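【Solution 2】:

This happens because the X values look like [[0,0,1],0,2], which is not proper 2D data, and that causes "setting an array element with a sequence". As an alternative to scikit-learn's OneHotEncoder, you can use get_dummies from pandas and concatenate the result to the data frame, i.e.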

dummies = df['info.venue'].str.get_dummies()
ndf = pd.concat([df.drop(['info.venue'], axis=1), dummies], axis=1)

Later you can split ndf into X and Y, i.e.

mask = ndf.columns.isin(['info.outcome.winner'])
# We use isin here because get_dummies generates a large number of columns.
X = ndf[ndf.columns[~mask]].values  # features: every column except the target
Y = ndf[ndf.columns[mask]].values   # target: 'info.outcome.winner'
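The ragged-row explanation at the start of this answer can be checked with NumPy alone. The sketch below uses made-up rows shaped like [[0,0,1],0,2] (an assumption for illustration, not the question's data) and shows that converting them to a numeric array fails, while flattening each row first gives a regular 2D array.

```python
import numpy as np

# Each row mixes a nested one-hot vector with plain scalars,
# which is what a column holding whole arrays looks like to NumPy.
ragged_rows = [[[0, 0, 1], 0, 2],
               [[1, 0, 0], 1, 0]]

try:
    np.array(ragged_rows, dtype=float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Flatten each row so that every cell is a single number.
flat_rows = [row[0] + row[1:] for row in ragged_rows]
X = np.array(flat_rows, dtype=float)
print(X.shape)
```

Spreading the one-hot values across separate columns, as get_dummies does, is the data frame equivalent of this flattening step.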

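【Comments】:

TypeError: bad operand type for unary ~: 'Index'
If I want to use the venue, which algorithm would you recommend? I basically have n stadiums where matches are played in particular countries. You can look at the data above.
Since I can't use a sparse matrix in the decision tree, I have basically given up on using the venue as a field for now. But it did work.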