Multi-label classification answers

【Question title】: Multi-label classification
【Posted】: 2019-04-01 11:39:04
【Question description】:

I have a dataset that looks like this:

      A         B         C         D       sex        weight
  0.955136  0.802256  0.317182 -0.708615  female       normal
  0.463615 -0.860053 -0.136408 -0.892888    male        obese
 -0.855532 -0.181905 -1.175605  1.396793  female   overweight
 -1.236216 -1.329982  0.531241  2.064822    male  underweight
 -0.970420 -0.481791 -0.995313  0.672131    male        obese

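Given features X = [A, B, C, D] and labels y = [sex, weight], I want to train a machine learning model that predicts a person's sex and weight category from the features A, B, C and D. How can I do this? Could you suggest any libraries or reading material that would help me achieve this? For easy testing, the dataset can be generated artificially with the following code: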

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df['sex']  = [np.random.choice(['male', 'female']) for x in range(len(df))]
df['weight'] = [np.random.choice(['underweight', 
        'normal', 'overweight', 'obese']) for x in range(len(df)) ]

【Comments】:

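This is a multi-output multi-class task, not a multi-label one; there is a subtle difference between the two. You can either train a separate model for each y (one model for sex and another for weight, as in the answer below) or use a classifier that supports this kind of task. See "Support multiclass-multioutput" in the scikit-learn documentation.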
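The second option in the comment above (a single classifier that handles both targets) can be sketched as follows. This is a minimal illustration, not the commenter's code: it assumes `RandomForestClassifier` as the multi-output-capable estimator, encodes the string labels as integer codes before fitting, and uses random data, so the predictions carry no real signal.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data shaped like the question's example (purely random).
df = pd.DataFrame(rng.standard_normal((100, 4)), columns=list("ABCD"))
df["sex"] = rng.choice(["male", "female"], size=len(df))
df["weight"] = rng.choice(
    ["underweight", "normal", "overweight", "obese"], size=len(df)
)

X = df[list("ABCD")]

# Encode each target as integer codes; keep the categories for decoding later.
sex_cat = pd.Categorical(df["sex"])
weight_cat = pd.Categorical(df["weight"])
Y = np.column_stack([sex_cat.codes, weight_cat.codes]).astype(int)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.4, random_state=0
)

# One forest fitted on a two-column target predicts both outputs at once.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, Y_train)
pred = clf.predict(X_test).astype(int)  # shape: (n_test_samples, 2)

# Decode the integer predictions back into the original string labels.
pred_df = pd.DataFrame({
    "sex": np.asarray(sex_cat.categories)[pred[:, 0]],
    "weight": np.asarray(weight_cat.categories)[pred[:, 1]],
})
print(pred_df.head())
```

Each row of `pred_df` holds one predicted sex and one predicted weight category for the corresponding test sample.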

Tags: python scikit-learn multilabel-classification

【Solution 1】:

You need a fixed mapping from the string label values to integers:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
#fixed integer labels: sex in {0, 1}, weight in {0, 1, 2, 3}
df['sex'] = np.random.choice([0, 1], size=len(df))
df['weight'] = np.random.choice(list(range(4)), size=len(df))
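The snippet above generates integer labels directly. For a dataset that already contains string labels, as in the question, a fixed mapping could be applied along these lines. The dictionaries `SEX_MAP` and `WEIGHT_MAP` are hypothetical names introduced for illustration, and the particular integer assigned to each label is an arbitrary choice.

```python
import pandas as pd

# Hypothetical fixed mappings; which integer stands for which label is arbitrary.
SEX_MAP = {"female": 0, "male": 1}
WEIGHT_MAP = {"underweight": 0, "normal": 1, "overweight": 2, "obese": 3}

# The label columns from the five sample rows shown in the question.
df = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "male"],
    "weight": ["normal", "obese", "overweight", "underweight", "obese"],
})

# Apply the mappings column by column.
encoded = pd.DataFrame({
    "sex": df["sex"].map(SEX_MAP),
    "weight": df["weight"].map(WEIGHT_MAP),
})

# Invert a mapping to turn integer predictions back into strings.
INV_WEIGHT_MAP = {code: label for label, code in WEIGHT_MAP.items()}
decoded_weight = encoded["weight"].map(INV_WEIGHT_MAP)

print(encoded)
```

Keeping the mapping in an explicit dictionary means the same codes are used at training and prediction time, and the inverse dictionary recovers readable labels from model output.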

#in a Jupyter notebook, run `%matplotlib inline` first to display plots inline
from pandas import DataFrame
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
#sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
trg = df[['sex','weight']]
trn = df.drop(['sex','weight'], axis=1)
#list of different models
models = [LinearRegression(),
          RandomForestRegressor(n_estimators=100, max_features ='sqrt'),
          SVR(kernel='linear'),
          LogisticRegression()
          ]

Xtrn, Xtest, Ytrn, Ytest = train_test_split(trn, trg, test_size=0.4)
rows = []
#for each model in list
for model in models:
    #get the model name
    row = {'Model': type(model).__name__}
    #for each target column
    for i in range(Ytrn.shape[1]):
        #fit the model on the i-th target
        model.fit(Xtrn, Ytrn.iloc[:,i])
        #calculate coefficient of determination against the same target
        row['R2_Y%s'%str(i+1)] = r2_score(Ytest.iloc[:,i], model.predict(Xtest))
    rows.append(row)
#build the final dataframe, indexed by model name
#(DataFrame.append was removed in pandas 2.0)
TestModels = DataFrame(rows).set_index('Model')

fig, axes = plt.subplots(ncols=2, figsize=(10,4))
TestModels.R2_Y1.plot(ax=axes[0], kind='bar', title='R2_Y1')
TestModels.R2_Y2.plot(ax=axes[1], kind='bar', color='green', title='R2_Y2')
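The answer above scores each model with R², a regression metric. Because both targets are categorical, a variant that fits one classifier per target and reports accuracy may be easier to interpret. The following is a sketch under assumptions, not the answerer's method: `LogisticRegression` is picked only as an example classifier, and the data is random, so the scores should hover around chance level.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Random features and integer-coded labels, as in the answer's setup.
df = pd.DataFrame(rng.standard_normal((100, 4)), columns=list("ABCD"))
df["sex"] = rng.integers(0, 2, size=len(df))
df["weight"] = rng.integers(0, 4, size=len(df))

X = df[list("ABCD")]
Y = df[["sex", "weight"]]
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.4, random_state=0
)

# Fit an independent classifier for each target column and score it.
scores = {}
for target in Y.columns:
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, Y_train[target])
    scores[target] = accuracy_score(Y_test[target], clf.predict(X_test))

print(scores)
```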

【Comments】: