Feature Engineering in Python with Pandas Using Multiple Rows Per Calculation: Answers

【Question title】: Feature Engineering in Python with Pandas Using Multiple Rows Per Calculation
【Posted】: 2017-12-07 02:45:20
【Question】:

I have CSV data in the following format:

+-----------------+--------+-------------+
| reservation_num |  rate  | guest_name  |
+-----------------+--------+-------------+
| B874576         | 169.95 | Bob Smith   |
| H786234         | 258.95 | Jane Doe    |
| H786234         | 258.95 | John Doe    |
| F987354         | 385.95 | David Jones |
| N097897         | 449.95 | Mark Davis  |
| H567349         | 482.95 | Larry Stein |
| N097897         | 449.95 | Sue Miller  |
+-----------------+--------+-------------+

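I want to add a feature (column) named "rate_per_person" to the DataFrame. It is calculated by dividing the rate for a given reservation number by the total number of guests who share that reservation number.

Here is my code: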

#Importing Libraries
import pandas as pd

# Importing the Dataset
ds = pd.read_csv('hotels.csv')

for index, row in ds.iterrows():
    row['rate_per_person'] = row['rate'] / ds[row['reservation_num']].count

And the error message:

Traceback (most recent call last):

  File "<ipython-input-3-0668a3165e76>", line 2, in <module>
    row['rate_per_person'] = row['rate'] / ds[row['reservation_num']].count

  File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/frame.py", line 2062, in __getitem__
    return self._getitem_column(key)

  File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/frame.py", line 2069, in _getitem_column
    return self._get_item_cache(key)

  File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 1534, in _get_item_cache
    values = self._data.get(item)

  File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3590, in get
    loc = self.items.get_loc(item)

  File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2395, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))

  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)

  File "pandas/_libs/index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)

  File "pandas/_libs/hashtable_class_helper.pxi", line 1207, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)

  File "pandas/_libs/hashtable_class_helper.pxi", line 1215, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)

KeyError: 'B874576'

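From the error message, it is clear that the ds[row['reservation_num']].count part of the last line of code is the problem. However, I am not sure of the right way to get the number of guests per reservation in a way that lets me create the new feature programmatically.

【Comments】:

Tags: python python-3.x pandas machine-learning data-science

【Solution 1】:

Option 1
pd.Series.value_counts and map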

df.rate / df.reservation_num.map(df.reservation_num.value_counts())

0    169.950
1    129.475
2    129.475
3    385.950
4    224.975
5    482.950
6    224.975
dtype: float64
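
To make Option 1 reproducible end to end, here is a minimal sketch that rebuilds the sample rows from the question in memory (instead of reading hotels.csv) and stores the result in the requested rate_per_person column. The variable name guests_per_reservation is only illustrative.

```python
import pandas as pd

# Sample rows copied from the question's table
df = pd.DataFrame({
    "reservation_num": ["B874576", "H786234", "H786234", "F987354",
                        "N097897", "H567349", "N097897"],
    "rate": [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
    "guest_name": ["Bob Smith", "Jane Doe", "John Doe", "David Jones",
                   "Mark Davis", "Larry Stein", "Sue Miller"],
})

# value_counts() returns a Series indexed by reservation_num whose values
# are the number of rows (guests) for that reservation.
guests_per_reservation = df["reservation_num"].value_counts()

# map() looks up each row's reservation_num in that Series, giving a
# per-row guest count aligned with df's index.
df["rate_per_person"] = df["rate"] / df["reservation_num"].map(guests_per_reservation)

print(df)
```

Assigning the result to df["rate_per_person"] is the only step beyond the one-liner above; the division itself is unchanged.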

Option 2
groupby, transform and size

df.rate / df.groupby('reservation_num').rate.transform('size')

0    169.950
1    129.475
2    129.475
3    385.950
4    224.975
5    482.950
6    224.975
dtype: float64
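
Option 2 depends on transform returning one value per original row rather than one per group. The sketch below contrasts size() with transform('size') on the question's sample rows, rebuilt in memory; the names per_group and per_row are illustrative.

```python
import pandas as pd

# Sample rows copied from the question's table
df = pd.DataFrame({
    "reservation_num": ["B874576", "H786234", "H786234", "F987354",
                        "N097897", "H567349", "N097897"],
    "rate": [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
})

grouped = df.groupby("reservation_num")["rate"]

# size() aggregates: one entry per reservation, indexed by reservation_num
per_group = grouped.size()

# transform("size") broadcasts the group size back to every row,
# so the result shares df's index and can be used in row-wise arithmetic
per_row = grouped.transform("size")

rate_per_person = df["rate"] / per_row

print(per_group)
print(per_row)
```

Because per_row is aligned with df.index, the division needs no explicit join or lookup.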

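Option 3
np.unique and np.bincount

import numpy as np
# f maps each row to the position of its reservation_num among the sorted unique values
u, f = np.unique(df.reservation_num.values, return_inverse=True)
df.rate / np.bincount(f)[f]  # counts per unique value, mapped back to rows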

0    169.950
1    129.475
2    129.475
3    385.950
4    224.975
5    482.950
6    224.975
dtype: float64
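
The intermediate arrays in Option 3 are easier to follow when printed. This sketch runs the same two NumPy calls on the reservation numbers from the question; the names uniques, inverse, counts and per_row are illustrative.

```python
import numpy as np

# Reservation numbers in the same order as the question's table
reservation_num = np.array(["B874576", "H786234", "H786234", "F987354",
                            "N097897", "H567349", "N097897"])

# uniques is sorted; inverse[i] is the position of reservation_num[i] in uniques
uniques, inverse = np.unique(reservation_num, return_inverse=True)
print(inverse)   # [0 3 3 1 4 2 4]

# bincount tallies how often each integer code occurs
counts = np.bincount(inverse)
print(counts)    # [1 1 1 2 2]

# Indexing counts by inverse maps each tally back to the original rows
per_row = counts[inverse]
print(per_row)   # [1 2 2 1 2 1 2]
```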

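Option 3.5
np.unique sorts, so it does not scale as well as pd.factorize. In the context where I use them here, the two do the same thing. So I use a function that switches between them at an anecdotal array-length threshold, the point at which one becomes faster than the other. It is numbered 3.5 because it is more or less the same answer as Option 3.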

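def factor(a):
    # 10000 is an anecdotal cutoff: above it, hash-based pd.factorize tends to win
    if len(a) > 10000:
        return pd.factorize(a)[0]
    else:
        return np.unique(a, return_inverse=True)[1]

def count(a):
    # Number of occurrences of each element, aligned with the input order
    f = factor(a)
    return np.bincount(f)[f]

df.rate / count(df.reservation_num.values)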

0    169.950
1    129.475
2    129.475
3    385.950
4    224.975
5    482.950
6    224.975
dtype: float64
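
The claim that np.unique and pd.factorize "do the same thing" in this context can be checked directly. In the sketch below, the two functions assign different integer codes (sorted order versus order of first appearance), yet the per-row counts agree. Variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Reservation numbers in the same order as the question's table
reservation_num = np.array(["B874576", "H786234", "H786234", "F987354",
                            "N097897", "H567349", "N097897"], dtype=object)

# np.unique numbers values by sorted position
codes_unique = np.unique(reservation_num, return_inverse=True)[1]

# pd.factorize numbers values by order of first appearance (no sort)
codes_factorize = pd.factorize(reservation_num)[0]

print(codes_unique)      # [0 3 3 1 4 2 4]
print(codes_factorize)   # [0 1 1 2 3 4 3]

# The labels differ, but the occurrence count attached to each row is identical
counts_unique = np.bincount(codes_unique)[codes_unique]
counts_factorize = np.bincount(codes_factorize)[codes_factorize]
print(counts_unique)     # [1 2 2 1 2 1 2]
```

Only the grouping matters for counting, which is why either function can sit behind factor().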

Timing

%timeit df.rate / df.reservation_num.map(df.reservation_num.value_counts())
%timeit df.rate / df.groupby('reservation_num').rate.transform('size')

1000 loops, best of 3: 650 µs per loop
1000 loops, best of 3: 768 µs per loop

%%timeit
u, f = np.unique(df.reservation_num.values, return_inverse=True)
df.rate / np.bincount(f)[f]

10000 loops, best of 3: 131 µs per loop
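
The timings above were taken on the 7-row sample, so they mostly reflect fixed overhead. To compare the approaches outside IPython and at a larger size, something like the following sketch could be used. The synthetic frame, the 100,000-row size, and the function names are all assumptions made for illustration; no timings are quoted here because they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data: random reservation ids with repeats
rng = np.random.default_rng(0)
n_rows = 100_000
big = pd.DataFrame({
    "reservation_num": rng.integers(0, n_rows // 2, size=n_rows).astype(str),
    "rate": rng.uniform(100, 500, size=n_rows),
})

def via_map(d):
    # Option 1: value_counts + map
    return d["rate"] / d["reservation_num"].map(d["reservation_num"].value_counts())

def via_transform(d):
    # Option 2: groupby + transform('size')
    return d["rate"] / d.groupby("reservation_num")["rate"].transform("size")

def via_bincount(d):
    # Options 3/3.5: integer codes + np.bincount (pd.factorize variant)
    codes = pd.factorize(d["reservation_num"].values)[0]
    return d["rate"] / np.bincount(codes)[codes]

for fn in (via_map, via_transform, via_bincount):
    # Best of 3 repeats, 3 calls each, reported per call
    seconds = min(timeit.repeat(lambda: fn(big), number=3, repeat=3)) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```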

【Comments】:

【Solution 2】:

You can do this with groupby and transform:

df['rate_per_person'] = df.groupby('reservation_num')['rate'].transform(lambda x: x.iloc[0] / x.size)

Output:

  reservation_num    rate   guest_name  rate_per_person
0         B874576  169.95    Bob Smith          169.950
1         H786234  258.95     Jane Doe          129.475
2         H786234  258.95     John Doe          129.475
3         F987354  385.95  David Jones          385.950
4         N097897  449.95   Mark Davis          224.975
5         H567349  482.95  Larry Stein          482.950
6         N097897  449.95   Sue Miller          224.975
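
Note that x.iloc[0] takes the first rate in each group, so this solution assumes every row of a reservation carries the same rate, as in the sample data. A minimal sketch of checking that assumption before applying the transform follows; the names distinct_rates and inconsistent are illustrative.

```python
import pandas as pd

# Sample rows copied from the question's table
df = pd.DataFrame({
    "reservation_num": ["B874576", "H786234", "H786234", "F987354",
                        "N097897", "H567349", "N097897"],
    "rate": [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
    "guest_name": ["Bob Smith", "Jane Doe", "John Doe", "David Jones",
                   "Mark Davis", "Larry Stein", "Sue Miller"],
})

# Count distinct rates per reservation; more than one means the
# "first rate in the group" shortcut would silently pick one of them
distinct_rates = df.groupby("reservation_num")["rate"].nunique()
inconsistent = distinct_rates[distinct_rates > 1]

if inconsistent.empty:
    # The lambda returns a scalar per group, which transform broadcasts to each row
    df["rate_per_person"] = (
        df.groupby("reservation_num")["rate"]
          .transform(lambda x: x.iloc[0] / x.size)
    )
else:
    print("Reservations with conflicting rates:", list(inconsistent.index))
```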

【Comments】: