Select rows from a dataframe based on the unique values of another column? Answers

[Question title]: Select rows from a dataframe based on the unique values of another column?
[Posted]: 2019-06-07 13:09:53
[Question description]:

One column of my dataframe has the values shown below:

air_voice_no_null.loc[:,"host_has_profile_pic"].value_counts(normalize = True)*100

1.0    99.694276
0.0     0.305724
Name: host_has_profile_pic, dtype: float64

The two unique values in this column occur in a ratio of roughly 99:1.

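I now want to create a new dataframe from this one in which 60% of the rows have the value 1.0 and 40% have the value 0.0, keeping all of the columns (with fewer rows, of course).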

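I tried to split it using the stratify parameter of train_test_split from sklearn.model_selection, as shown below, but I did not get a dataframe with the proportions I wanted for each unique value.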

from sklearn.model_selection import train_test_split

profile_train_x, profile_test_x, profile_train_y, profile_test_y = train_test_split(air_voice_no_null.loc[:,['log_price', 'accommodates', 'bathrooms','host_response_rate', 'number_of_reviews', 'review_scores_rating','bedrooms', 'beds', 'cleaning_fee', 'instant_bookable']],
                                                                                   air_voice_no_null.loc[:,"host_has_profile_pic"],
                                                                                   random_state=42, stratify=air_voice_no_null.loc[:,"host_has_profile_pic"])

This is the result of the code above; the total number of rows is unchanged.

print(profile_train_x.shape)
print(profile_test_x.shape)
print(profile_train_y.shape)
print(profile_test_y.shape)

(55442, 10)
(18481, 10)
(55442,)
(18481,)

How can I select a subset of the dataset with fewer rows while keeping the desired proportion of each class of the host_has_profile_pic variable?

Link to the full dataset: https://www.kaggle.com/stevezhenghp/airbnb-price-prediction

[Question discussion]:

Tags: python pandas dataframe scikit-learn data-transform

[Solution 1]:

Consider the following approach:

import pandas as pd

# create some data
df = pd.DataFrame({'a': [0] * 10 + [1] * 90})

print('original proportion:')
print(df['a'].value_counts(normalize=True))

# take samples for every unique value separately
df_new = pd.concat([
    df[df['a'] == 0].sample(frac=.4),
    df[df['a'] == 1].sample(frac=.07)])

print('\nsample proportion:')
print(df_new['a'].value_counts(normalize=True))

Output:

original proportion:
1    0.9
0    0.1
Name: a, dtype: float64

sample proportion:
1    0.6
0    0.4
Name: a, dtype: float64
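
One way to see where a value such as frac=.07 comes from is to work backwards from the target shares. The following is only a sketch under assumptions that are not part of the answer: the names target_share and frac_minority are made up for illustration, and it fixes the fraction of class-0 rows to keep and derives the class-1 row count from that.

```python
import pandas as pd

# same toy data as in the answer: 10 zeros and 90 ones
df = pd.DataFrame({'a': [0] * 10 + [1] * 90})

# hypothetical settings, not from the original answer
target_share = {0: 0.4, 1: 0.6}  # desired class shares in the sample
frac_minority = 0.4              # fraction of the class-0 rows to keep

counts = df['a'].value_counts()

# rows of class 0 to keep, then the class-1 rows needed to hit the target shares
n_0 = int(round(counts.loc[0] * frac_minority))
n_1 = int(round(n_0 * target_share[1] / target_share[0]))

# the equivalent frac for class 1 (n_1 divided by the class-1 row count)
frac_1 = n_1 / counts.loc[1]

# sample by exact row counts so the shares do not depend on rounding of frac
df_new = pd.concat([
    df[df['a'] == 0].sample(n=n_0, random_state=42),
    df[df['a'] == 1].sample(n=n_1, random_state=42),
])

print(n_0, n_1, round(frac_1, 3))
print(df_new['a'].value_counts(normalize=True))
```

With the toy data this keeps 4 rows of class 0 and 6 rows of class 1, which corresponds to a frac of about 0.067 for class 1. The sketch assumes n_1 does not exceed the number of class-1 rows available; otherwise sample would raise an error unless replace=True is passed.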

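[Discussion]:

Thanks. I had to tune frac in sample to get a fair proportion for my highly imbalanced dataset. Also, I could not find out how frac works, and the documentation does not say anything about it. If you could point me to any resource that explains how frac works, that would really help me.
Or tell me how to tune frac so that I know how many rows of each class my new dataframe will get.
@Sharmas frac is a fraction. There is no magic behind it. Suppose you have X rows with 1s and Y rows with 0s. After sampling you will have X * frac_x and Y * frac_y rows respectively. You can use n instead of frac to fix the exact number of rows in the sample. For example, n_x and n_y could both be min(X, Y) to get an equal number of 0s and 1s in the sample. It is very simple math that is easy to tune and implement with Python and pandas.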
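
The min(X, Y) suggestion in the last comment could look roughly like the sketch below. It reuses the toy data from the answer; the names n_per_class and balanced, and the fixed random_state, are assumptions added for illustration.

```python
import pandas as pd

# toy data: Y = 10 rows with 0 and X = 90 rows with 1
df = pd.DataFrame({'a': [0] * 10 + [1] * 90})

counts = df['a'].value_counts()

# n_x = n_y = min(X, Y): the size of the smallest class
n_per_class = int(counts.min())

# draw exactly n_per_class rows from every class and stack the pieces
balanced = pd.concat([
    df[df['a'] == value].sample(n=n_per_class, random_state=42)
    for value in counts.index
])

print(balanced['a'].value_counts())
```

Because n is an exact row count, each class ends up with the same number of rows (10 here), whereas frac only fixes the share taken from each class.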