【Question title】: ValueError: array length does not match index length
【Posted】: 2016-08-31 23:51:33
【Question】:

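I am practicing for Kaggle-style competitions. I have been trying out XGBoost and getting familiar with third-party Python libraries such as pandas and numpy.

I have been reviewing scripts from one particular competition, the Santander Customer Satisfaction classification challenge, and modifying various forked scripts to experiment with them.

Here is a modified script in which I am trying to use XGBoost: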

import pandas as pd
from sklearn import cross_validation as cv
import xgboost as xgb

df_train = pd.read_csv("/Users/pavan7vasan/Desktop/Machine_Learning/Project Datasets/Santander_Customer_Satisfaction/train.csv")
df_test  = pd.read_csv("/Users/pavan7vasan/Desktop/Machine_Learning/Project Datasets/Santander_Customer_Satisfaction/test.csv")   

df_train = df_train.replace(-999999,2)

id_test = df_test['ID']
y_train = df_train['TARGET'].values
X_train = df_train.drop(['ID','TARGET'], axis=1).values
X_test = df_test.drop(['ID'], axis=1).values

X_train, X_test, y_train, y_test = cv.train_test_split(X_train, y_train, random_state=1301, test_size=0.4)

clf = xgb.XGBClassifier(objective='binary:logistic',
                missing=9999999999,
                max_depth = 7,
                n_estimators=200,
                learning_rate=0.1, 
                nthread=4,
                subsample=1.0,
                colsample_bytree=0.5,
                min_child_weight = 3,
                reg_alpha=0.01,
                seed=7)

clf.fit(X_train, y_train, early_stopping_rounds=50, eval_metric="auc", eval_set=[(X_train, y_train), (X_test, y_test)])
y_pred = clf.predict_proba(X_test)

print("Cross validating and checking the score...")
scores = cv.cross_val_score(clf, X_train, y_train) 
'''
test = []
result = []
for each in id_test:
    test.append(each)
for each in y_pred[:,1]:
    result.append(each)

print len(test)
print len(result)
'''
submission = pd.DataFrame({"ID":id_test, "TARGET":y_pred[:,1]})
#submission = pd.DataFrame({"ID":test, "TARGET":result})
submission.to_csv("submission_XGB_Pavan.csv", index=False)

Here is the stack trace:

Traceback (most recent call last):
  File "/Users/pavan7vasan/Documents/workspace/Machine_Learning_Project/Kaggle/XG_Boost.py", line 45, in <module>
submission = pd.DataFrame({"ID":id_test, "TARGET":y_pred[:,1]})
  File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 214, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 341, in _init_dict
dtype=dtype)
  File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 4798, in _arrays_to_mgr
index = extract_index(arrays)
  File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 4856, in extract_index
raise ValueError(msg)
ValueError: array length 30408 does not match index length 75818

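I have tried fixes based on solutions I found while searching, but I cannot figure out what the error is. What am I doing wrong? Please let me know.

【Comments】:

  • Where is your traceback? Provide a minimal script that anyone else can run (e.g. without external CSV data) and we can help you better.
  • You define X_test twice, which may be causing the problem.
  • @tdihp: Oh, I completely forgot!!! Thanks for the reminder, I'll update right away.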

Tags: python pandas numpy kaggle


【Answer 1】:

The problem is that you define X_test twice, as @maxymoo mentioned. First you define it as

X_test = df_test.drop(['ID'], axis=1).values

and then you redefine it:

X_train, X_test, y_train, y_test = cv.train_test_split(X_train, y_train, random_state=1301, test_size=0.4)

This means that X_test now has a size of 0.4*len(X_train), which is the 30408 reported in the error message. Then:

y_pred = clf.predict_proba(X_test)

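So you have made predictions for that held-out part of X_train, and you then try to build a DataFrame from those predictions together with the original id_test, which has the same length as the original X_test (the 75818 in the error message).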
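The mismatch can be reproduced without the competition data. The following is only a minimal sketch: the sizes (10 IDs, 4 predictions) and the random values are made up for illustration, and the names simply mirror the ones in the question.

```python
import numpy as np
import pandas as pd

# Stand-in for df_test['ID']: a Series with 10 entries (hypothetical size).
id_test = pd.Series(np.arange(10), name="ID")

# Stand-in for clf.predict_proba(...) on a 4-row holdout (hypothetical size).
y_pred = np.random.rand(4, 2)

message = None
try:
    # A Series brings its own index (length 10); the raw array has length 4.
    pd.DataFrame({"ID": id_test, "TARGET": y_pred[:, 1]})
except ValueError as exc:
    message = str(exc)
    print(message)
```

Because one column is a Series, pandas takes the index from it and then rejects the plain array whose length differs, which is the same situation as in the traceback.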
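You could use X_fit and X_eval as the names in the train_test_split call instead of shadowing the original X_train and X_test. Note that your cross_validation call also sees a different X_train after the split, which means you will not get the right answer, or your cross-validation score will be inconsistent with the public/private score.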
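A sketch of that renaming, under several assumptions: the data is synthetic, `LogisticRegression` is swapped in for `XGBClassifier` so the example runs without xgboost, and `sklearn.model_selection` is used because the `sklearn.cross_validation` module from the question was removed in later scikit-learn releases. The names `X_fit`, `X_eval`, `y_fit` and `y_eval` follow the suggestion above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1301)

# Synthetic stand-ins for train.csv and test.csv (hypothetical data).
df_train = pd.DataFrame(rng.rand(100, 3), columns=["f1", "f2", "f3"])
df_train.insert(0, "ID", np.arange(100))
df_train["TARGET"] = (df_train["f1"] + df_train["f2"] > 1.0).astype(int)
df_test = pd.DataFrame(rng.rand(25, 3), columns=["f1", "f2", "f3"])
df_test.insert(0, "ID", np.arange(1000, 1025))

id_test = df_test["ID"]
y_train = df_train["TARGET"].values
X_train = df_train.drop(["ID", "TARGET"], axis=1).values
X_test = df_test.drop(["ID"], axis=1).values

# Split into fit/eval parts under NEW names, so X_train and X_test survive.
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_train, y_train, random_state=1301, test_size=0.4
)

# Stand-in model; the question uses xgb.XGBClassifier here.
clf = LogisticRegression()
clf.fit(X_fit, y_fit)

# Predict on the real test features, which still line up with id_test.
y_pred = clf.predict_proba(X_test)
assert len(y_pred) == len(id_test)

submission = pd.DataFrame({"ID": id_test, "TARGET": y_pred[:, 1]})
print(submission.shape)
```

With the split stored under separate names, the evaluation set can still be passed to the model for monitoring, while cross-validation and the final prediction use the full X_train and the original X_test respectively.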
