【Posted】: 2020-08-10 03:29:40
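【Question】:
I have a dict whose keys are machine learning classifier names and whose values are pandas DataFrames holding that classifier's feature importances. For example: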
for k,v in clf_importances.items():
    print("Classifier: {} | Top 3 Features: {}".format(k,v.head(n=3)))
This yields:
Classifier: XGBClassifier | Top 3 Features: importance
feature
LIMIT_BAL 0.024073
PAY_AMT3 0.025030
BILL_AMT1 0.025860
Classifier: LGBMClassifier | Top 3 Features: importance
feature
PAY_AMT5 155
BILL_AMT3 162
PAY_AMT6 179
Their types are:
print("Key Type: {} | Value Type: {}".format(type(k), type(v)))
Key Type: <class 'str'> | Value Type: <class 'pandas.core.frame.DataFrame'>
What I'd like to do is construct a final_df with the columns:
classifier, feature_1, feature_2...feature_n
where the values are the importances (sometimes 0).
Ideally, I'd end up with a dataframe that looks like this:
| Classifier | Feature_1 | Feature_2 | Feature_3 | Feature_4 | Feature_5 | …n |
|:----------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---:|
| A | 0.062 | 0.298 | 0.000 | 0.215 | 0.000 | foo |
| B | 0.001 | 0.000 | 0.005 | 0.121 | 0.314 | foo |
| C | 0.005 | 0.054 | 0.015 | 0.000 | 0.587 | foo |
| D | 0.315 | 0.547 | 0.870 | 0.003 | 0.000 | foo |
| …n | foo | foo | foo | foo | foo | foo |
The script I used to generate that dict is as follows:
# Libraries Used
import pandas as pd, numpy as np
# Data Manipulation
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
# Classifiers Used
# https://www.kaggle.com/grfiv4/plotting-feature-importances
from xgboost import XGBClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
# Graphing Libraries
import matplotlib.pyplot as plt
# Other Configuration Settings
import warnings
warnings.filterwarnings('ignore')
# Read in the dataset
df = pd.read_csv('credit.csv')
# Take labels
labels = df['class']
# Drop that from the dataset
df.drop('class', axis=1, inplace=True)
# Remove nan values, then keep only the labels of the rows that survive
df.dropna(inplace=True)
labels = labels.loc[df.index]
# Print new size
print(df.size)
# Scale the dataset between 0 and 1
scaler = MinMaxScaler()
data = pd.DataFrame(scaler.fit_transform(df.values), columns=df.columns, index=df.index)
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.33, random_state=np.random.randint(1,100))
# Instantiate a list of classifiers
clfs = [XGBClassifier(), LGBMClassifier(),
        ExtraTreesClassifier(), ExtraTreeClassifier(),
        AdaBoostClassifier(), DecisionTreeClassifier(),
        GradientBoostingClassifier(), RandomForestClassifier()]
clf_accuracy = {}
clf_importances = {}
for clf in clfs:
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    # NOTE: get_accuracy is not defined anywhere in this post; it is presumably
    # a helper that returns the accuracy as a percentage
    accuracy = get_accuracy(preds, y_test)
    clf_accuracy[clf.__class__.__name__] = accuracy
    title = "Top 10 Feature Importances For {}".format(clf.__class__.__name__)
    temp_df = pd.DataFrame({'importance':clf.feature_importances_})
    temp_df['feature'] = X_train.columns
    temp_df.sort_values(by='importance', ascending=False, inplace=True)
    #temp_df = temp_df.head(n=10)
    temp_df.sort_values(by='importance', inplace=True)
    temp_df = temp_df.set_index('feature', drop=True)
    clf_importances[clf.__class__.__name__] = temp_df
    print("{} had an accuracy of : {}%".format(clf.__class__.__name__,accuracy))
    temp_df.plot.barh(title=title, figsize=(8,11))
for k,v in clf_importances.items():
    print("Classifier: {} | Top 3 Features: {}".format(k,v.head(n=3)))
print("Key Type: {} | Value Type: {}".format(type(k), type(v)))
How can I convert a dict of dataframes into a single dataframe?
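One possible way to get the wide layout described above is to concatenate the `importance` columns side by side and transpose. The following is only a minimal sketch: the toy `clf_importances` dict, the names `ClfA` and `ClfB`, and all values are made up for illustration, and it assumes every frame is indexed by `feature` with a single `importance` column, as in the script.

```python
import pandas as pd

# Toy stand-in for clf_importances (hypothetical names and values):
# classifier name -> DataFrame indexed by 'feature' with one 'importance' column.
clf_importances = {
    "ClfA": pd.DataFrame(
        {"importance": [0.2, 0.8]},
        index=pd.Index(["PAY_AMT3", "LIMIT_BAL"], name="feature"),
    ),
    "ClfB": pd.DataFrame(
        {"importance": [5, 3]},
        index=pd.Index(["LIMIT_BAL", "BILL_AMT1"], name="feature"),
    ),
}

# Put each classifier's 'importance' Series side by side (one column per
# classifier, aligned on the feature index), then transpose so that each
# classifier becomes a row. Features a classifier never reported become 0.
wide = pd.concat(
    {name: frame["importance"] for name, frame in clf_importances.items()},
    axis=1,
).T.fillna(0)

# Tidy up the axis names and turn the classifier index into a regular column.
wide.index.name = "classifier"
wide.columns.name = None
final_df = wide.reset_index()

print(final_df)
```

Because the frames are aligned on their index, the features do not need to appear in the same order, or at all, in every classifier's frame. Note that the importances in the output above are on different scales (fractions for `XGBClassifier`, integer counts for `LGBMClassifier`), so the rows of such a frame are not directly comparable without rescaling.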
【Comments】:
- Hi: have you tried pd.merge()? This post may be useful: shanelynn.ie/merge-join-dataframes-python-pandas-index-1
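The `pd.merge()` suggestion in the comment could be applied roughly as follows. This is a sketch under assumptions, not the commenter's actual code: the toy dict and its values are invented, and each `importance` column is first renamed to its classifier name so the merged columns stay distinct.

```python
from functools import reduce

import pandas as pd

# Toy stand-in for clf_importances (hypothetical names and values).
clf_importances = {
    "ClfA": pd.DataFrame(
        {"importance": [0.2, 0.8]},
        index=pd.Index(["PAY_AMT3", "LIMIT_BAL"], name="feature"),
    ),
    "ClfB": pd.DataFrame(
        {"importance": [5, 3]},
        index=pd.Index(["LIMIT_BAL", "BILL_AMT1"], name="feature"),
    ),
}

# Rename each 'importance' column to the classifier's name.
renamed = [
    frame.rename(columns={"importance": name})
    for name, frame in clf_importances.items()
]

# Outer-join the frames pairwise on their shared 'feature' index.
merged = reduce(
    lambda left, right: pd.merge(
        left, right, left_index=True, right_index=True, how="outer"
    ),
    renamed,
)

# Features missing for a classifier become 0; transpose so classifiers are rows.
final_df = merged.fillna(0).T
final_df.index.name = "classifier"

print(final_df)
```

With many classifiers, the repeated pairwise merge does the same job as a single `pd.concat(..., axis=1)`, so the merge route is mainly useful when the join needs options that `concat` does not offer.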
Tags: python pandas dataframe scikit-learn