【Question title】: In a pandas column, how to find the max number of consecutive rows that a particular value occurs?
【Posted】: 2021-03-04 21:32:49
【Question】:

Suppose we have the following df with a column called names.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'names':['Alan', 'Alan', 'John', 'John', 'Alan', 'Alan','Alan', np.nan, np.nan, np.nan, np.nan, np.nan, 'Christy', 'Christy','John']})

>>> df
      names
0      Alan
1      Alan
2      John
3      John
4      Alan
5      Alan
6      Alan
7       NaN
8       NaN
9       NaN
10      NaN
11      NaN
12  Christy
13  Christy
14     John

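I want to run an apply function on the column that returns the maximum number of consecutive times a particular value occurs. At first I want to do this for NaN, but by extension I would like to be able to switch to any other value in the column.

Explanation: if we run the apply for NaN, the result would be 5, since 5 is the highest number of times NaN occurs consecutively. If, further down the column and after other values, NaN occurred more than 5 times in a row, that longer run would be the result instead.

If we run the apply for Alan, the result would be 3, since 3 supersedes the 2 from the first run of consecutive Alans.

【Question comments】:

Tags: python pandas dataframe feature-engineering

【Solution 1】: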

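df_counts = df.copy()  # copy so the original df is left untouched
df_counts['names'] = df_counts['names'].fillna("NaN")  # replace np.nan with a string so consecutive NaNs compare equal
# label each run of consecutive equal names, then attach the run length to every row
df_counts['counts'] = df_counts['names'].groupby((df_counts['names'] != df_counts['names'].shift()).cumsum()).transform('size')
# keep only the row with the highest count for each name
df_counts = df_counts.sort_values('counts').drop_duplicates('names', keep='last')

def get_counts(name):
    # look the name up in df_counts itself, which now has one row per name
    return df_counts.loc[df_counts['names'] == name, 'counts'].item()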

Then get_counts("Alan") returns 3 and get_counts("NaN") returns 5.

【Comments】:

This is actually incorrect, since it returns "6" for "John", "4" for "NaN", and "5" for "Christy".
Thanks, fixed.

【Solution 2】:

Here is a solution you can use with groupby:

# convert nans to str
df["names"] = df["names"].fillna("NaN")

# assign a subgroup to each set of consecutive rows
df["subgroup"] = df["names"].ne(df["names"].shift()).cumsum()

# take the max length of any subgroup that belongs to "name"
def get_max_consecutive(name):
    return df.groupby(["names", "subgroup"]).apply(len)[name].max()

for name in df.names.unique():
    print(f"{name}: {get_max_consecutive(name)}")

Output:

Alan: 3
John: 2
NaN: 5
Christy: 2

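Explanation:

pandas.Series.ne compares two Series element-wise and returns a new boolean Series that is True where the elements in a row are not equal and False where they are equal.

We can take df["names"] and compare it with itself shifted by 1 (df["names"].shift()). This returns True whenever the name changes from the previous value.

So this gives us a boolean Series in which each True marks a change of name: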

df["names"].ne(df["names"].shift())

0      True
1     False
2      True
3     False
4      True
5     False
6     False
7      True
8     False
9     False
10    False
11    False
12     True
13    False
14     True
Name: names, dtype: bool
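
As a small sketch of why the fillna step at the top of this answer matters: NaN never compares equal to NaN, so without the fill every NaN row is flagged as a change and each NaN lands in its own run. The Series s below is made-up toy data used only for illustration, not data from the question.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, chosen only to illustrate the NaN behaviour
s = pd.Series(["Alan", np.nan, np.nan, np.nan, "John"])

# without filling: NaN != NaN, so every NaN row is flagged as a change
raw_change = s.ne(s.shift())
print(raw_change.tolist())      # [True, True, True, True, True]

# after filling with a placeholder string, consecutive NaNs compare equal
filled = s.fillna("NaN")
filled_change = filled.ne(filled.shift())
print(filled_change.tolist())   # [True, True, False, False, True]
```

The placeholder string "NaN" is the one used in this answer; any string that cannot collide with a real name would work the same way.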

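Next, .cumsum is simply the cumulative sum of this Series. Here True counts as 1 and False as 0, so we effectively get a new number every time the name changes from the previous value. We can assign it to its own column, subgroup, so that we can use it with groupby later.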

df.names.ne(df.names.shift()).cumsum()

0     1
1     1
2     2
3     2
4     3
5     3
6     3
7     4
8     4
9     4
10    4
11    4
12    5
13    5
14    6
Name: names, dtype: int64

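Finally, we can use .groupby to group the dataframe on the "names" and "subgroup" columns, which gives a result with a multi-index. Now we can apply the len function to get the length of each subgroup.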

df.groupby(["names", "subgroup"]).apply(len)

names    subgroup
Alan     1           2
         3           3
Christy  5           2
John     2           2
         6           1
NaN      4           5
dtype: int64
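
As a hedged variation on this step (not part of the original answer), the same lengths can be obtained with .size(), and a second groupby on the "names" level gives the longest run for every name in one pass. The variable names run_id, run_lengths and longest are illustrative; the data is the sample from the question.

```python
import numpy as np
import pandas as pd

names = pd.Series(
    ["Alan", "Alan", "John", "John", "Alan", "Alan", "Alan",
     np.nan, np.nan, np.nan, np.nan, np.nan,
     "Christy", "Christy", "John"],
    name="names",
).fillna("NaN")  # placeholder string so consecutive NaNs form one run

# run_id increases by 1 each time the value differs from the previous row
run_id = names.ne(names.shift()).cumsum().rename("subgroup")

# length of every (name, run) pair, then the longest run per name
run_lengths = names.groupby([names, run_id]).size()
longest = run_lengths.groupby(level="names").max()

print(longest.to_dict())
```

This sketch works on a standalone Series, so it does not add a subgroup column to df.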

Bonus: if you want to see the len for every name and subgroup, you can use .reset_index to turn the Series returned by .apply into a dataframe:

df_count = df.groupby(["names", "subgroup"]).apply(len).reset_index(name="len")
df_count

Output:

     names  subgroup  len
0     Alan         1    2
1     Alan         3    3
2  Christy         5    2
3     John         2    2
4     John         6    1
5      NaN         4    5

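【Comments】:

Thanks! Neat solution. Could you explain the df["names"].ne(df["names"].shift()).cumsum() command?
@elixir I added an explanation in the edit above. Let me know if it's clear!
@khuyuh OMG! How do I upvote this 10x?!? Thanks for the super detailed explanation! :)

【Solution 3】:

Since np.nan == np.nan is False, you have to check whether the provided value is NaN before counting. To get the consecutive elements, you can use groupby from itertools.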

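from itertools import groupby

def max_consecutives(value):
    # NaN never compares equal to itself, so match it with pd.isna instead of ==
    if pd.isna(value):
        value_equals = lambda x: pd.isna(x)
    else:
        value_equals = lambda x: x == value

    def max_consecutive_values(col):
        # group consecutive elements by whether they match value, so runs of
        # NaN stay together even when the NaNs are distinct float objects
        elements_per_group_counter = (
            sum(1 for _ in group)
            for is_match, group in groupby(col, key=value_equals)
            if is_match
        )
        # default=0 covers a column in which value never occurs
        return max(elements_per_group_counter, default=0)
    return max_consecutive_values

df.apply(max_consecutives(np.nan)) # Series with one entry: names -> 5

df.apply(max_consecutives("Alan")) # Series with one entry: names -> 3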
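
For comparison, here is a minimal vectorized sketch of the same NaN-aware idea, without itertools and without replacing NaN by a string. The helper name longest_run is hypothetical and not from any answer in this thread; the data is the sample from the question.

```python
import numpy as np
import pandas as pd

def longest_run(col, value):
    """Longest run of consecutive rows in col equal to value (NaN-aware)."""
    # NaN == NaN is False, so match NaN with isna() rather than ==
    mask = col.isna() if pd.isna(value) else col.eq(value)
    # a new run id starts wherever the mask flips between True and False
    run_id = mask.ne(mask.shift()).cumsum()
    # summing the boolean mask per run counts the matching rows in that run;
    # non-matching runs contribute 0
    return int(mask.groupby(run_id).sum().max())

df = pd.DataFrame({
    "names": ["Alan", "Alan", "John", "John", "Alan", "Alan", "Alan",
              np.nan, np.nan, np.nan, np.nan, np.nan,
              "Christy", "Christy", "John"]})

print(longest_run(df["names"], np.nan))   # 5
print(longest_run(df["names"], "Alan"))   # 3
```

Because the function builds a boolean mask first, it leaves df unchanged and returns a plain int, which may be more convenient than the one-entry Series returned by df.apply.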

【Comments】: