【Title】: Count consecutive elements in a list of a dataframe on cell level in new columns
【Posted】: 2021-10-11 09:52:54
【Question】:

I have the following df:

df6 = pd.DataFrame({'name':['Sara',  'John', 'Jack'],
                   'places': ['UK,UK,UK,UK,US,CA', 'US,US,US,CA,CA,CA', 'Mexico,AUS,AUS,Mexico,Mexico']
                   })

df6

which looks like:

    name    places
0   Sara    UK,UK,UK,UK,US,CA
1   John    US,US,US,CA,CA,CA
2   Jack    Mexico,AUS,AUS,Mexico,Mexico

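The places column only covers 5 countries. What I want to do is find the number of consecutive visits to each country. So the output would basically look like this: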

    name    UK   US   CA   Mexico   AUS    
0   Sara    4    0    0       0      0
1   John    0    3    3       0      0  
2   Jack    0    0    0       2      2

The steps I have taken so far:

from itertools import groupby
from collections import Counter

df6['consecutive'] = df6.places.map(lambda x: [Counter(group[1]) for group in groupby(x.split(','))])

This gives me a list of dicts:

    name    places                        consecutive
0   Sara    UK,UK,UK,UK,US,CA             [{'UK': 4}, {'US': 1}, {'CA': 1}]
1   John    US,US,US,CA,CA,CA             [{'US': 3}, {'CA': 3}]
2   Jack    Mexico,AUS,AUS,Mexico,Mexico  [{'Mexico': 1}, {'AUS': 2}, {'Mexico': 2}]

Now I'm stuck on how to iterate over each cell in the consecutive column to find the values > 1 and reshape df6 into the final output:

    name    UK   US   CA   Mexico   AUS    
0   Sara    4    0    0       0      0
1   John    0    3    3       0      0  
2   Jack    0    0    0       2      2

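【Comments】:

  • Do you take only the maximum consecutive value or the last one? Jack has Mexico 1 and Mexico 2.
  • Values > 1, because in my data a value of 1 means there was only a single visit, so for Jack I pick Mexico 2 and AUS 2.
  • Yes, but if for Jack you had Mexico, Mexico, Mexico, AUS, AUS, Mexico, Mexico, what would you keep?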
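The ambiguity raised in the last comment can be made concrete with a small sketch. It is not taken from the thread: `run_lengths` and `summarize` are hypothetical helper names, and the two combination rules ("sum" and "max") are only assumptions about what the asker might want when a country has more than one run longer than 1.

```python
from itertools import groupby


def run_lengths(places):
    """Return (place, length) for each run of identical consecutive places."""
    return [(place, len(list(run))) for place, run in groupby(places.split(","))]


def summarize(places, how):
    """Combine runs longer than 1 per place, using 'sum' or 'max' (hypothetical helper)."""
    result = {}
    for place, length in run_lengths(places):
        if length > 1:  # a run of 1 is a single visit, so it is ignored
            if how == "sum":
                result[place] = result.get(place, 0) + length
            else:
                result[place] = max(result.get(place, 0), length)
    return result


trip = "Mexico,Mexico,Mexico,AUS,AUS,Mexico,Mexico"
print(run_lengths(trip))       # [('Mexico', 3), ('AUS', 2), ('Mexico', 2)]
print(summarize(trip, "sum"))  # {'Mexico': 5, 'AUS': 2}
print(summarize(trip, "max"))  # {'Mexico': 3, 'AUS': 2}
```

For the sample data in the question both rules give the same table, because no country has two separate runs longer than 1; the choice only matters for inputs like the one in the comment.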

Tags: python-3.x pandas list dataframe


【Solution 1】:

Or you can use pivot_table:

import pandas as pd

df6 = pd.DataFrame({'name':['Sara',  'John', 'Jack'],
                   'places': ['UK,UK,UK,UK,US,CA', 'US,US,US,CA,CA,CA', 'Mexico,AUS,AUS,Mexico,Mexico']
               })

df6['places'] = df6.places.str.split(',')
df6 = df6.explode('places')
df6['lag_places'] = df6.places.shift(1)
df6 = df6.query('places == lag_places').pivot_table(index = 'name', columns = 'places',  aggfunc = 'count')
df6.loc[:, df6.columns != 'places'] = df6.loc[:, df6.columns != 'places'].apply(lambda x: x+1) # add 1 according to your definition
df6.columns = [x[1] for x in df6.columns]
df6.fillna(0, inplace = True)

#      AUS   CA  Mexico   UK   US
#name                            
#Jack  2.0  0.0     2.0  0.0  0.0
#John  0.0  3.0     0.0  0.0  3.0
#Sara  0.0  0.0     0.0  4.0  0.0
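One caveat with the snippet above: `shift(1)` runs over the whole exploded column, so the first place of one person is compared with the last place of the previous person. The sample data happens not to trigger this. The sketch below uses hypothetical data (`Ann`, `Bob`) to show the effect, and assumes a per-name shift via `groupby` is the intended behavior.

```python
import pandas as pd

# Hypothetical data: Ann's last place equals Bob's first place
df = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "places": ["UK,US", "US,CA"],
})

# One row per visited place, with a fresh unique index
exploded = (
    df.assign(places=df["places"].str.split(","))
    .explode("places")
    .reset_index(drop=True)
)

# Shift over the whole column: Bob's first row sees Ann's last place
global_lag = exploded["places"].shift(1)
# Shift within each name: the first row of every name gets NaN
per_name_lag = exploded.groupby("name")["places"].shift(1)

false_hits = int((exploded["places"] == global_lag).sum())
true_hits = int((exploded["places"] == per_name_lag).sum())
print(false_hits, true_hits)  # 1 0
```

Neither person has a repeated visit here, yet the global shift reports one match, which would be counted as a consecutive visit for Bob.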

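【Comments】:

    【Solution 2】:

    We can str.split and explode places. Then use groupby size and unstack to get the consecutive counts, with a loc filter to include only runs of more than 1 consecutive visit. Then groupby sum reduces the result to a single row per name, and join attaches it back to the original DataFrame: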

    places = df6["places"].str.split(',').explode()  # Each place in own row
    
    df7 = df6[['name']].join(
        places.groupby(
            [df6['name'],  # Name
             places,  # Places
             # consecutive duplicates in separate groups
             places.ne(places.shift()).groupby(df6['name']).cumsum()]
        ).size()  # Count how many in each group
            .loc[lambda x: x > 1]  # Filter to include only > 1 visits
            .unstack(1, fill_value=0)  # Make places columns
            .groupby(level=0).sum(),  # Get single row per name
        on='name'  # join back on name column
    )
    

    df7:

       name  AUS  CA  Mexico  UK  US
    0  Sara    0   0       0   4   0
    1  John    0   3       0   0   3
    2  Jack    2   0       2   0   0
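The key step in this answer is the third grouping key, which gives every run of identical places its own id. A minimal sketch of that intermediate step follows, assuming the same sample data; the names `long`, `starts`, `run_id` and `run_sizes` are illustrative and do not appear in the answer, and the index is reset here only to keep the sketch easy to follow.

```python
import pandas as pd

df6 = pd.DataFrame({
    "name": ["Sara", "John", "Jack"],
    "places": ["UK,UK,UK,UK,US,CA", "US,US,US,CA,CA,CA",
               "Mexico,AUS,AUS,Mexico,Mexico"],
})

# One row per visited place
long = (
    df6.assign(places=df6["places"].str.split(","))
    .explode("places")
    .reset_index(drop=True)
)

# True wherever the place differs from the place on the previous row
starts = long["places"].ne(long["places"].shift())
# Running count of run starts within each name = id of the current run
long["run_id"] = starts.astype(int).groupby(long["name"]).cumsum()

# Length of every run, including runs of length 1
run_sizes = long.groupby(["name", "run_id", "places"]).size()
print(long[long["name"] == "Jack"])
print(run_sizes)
```

For Jack the run ids are 1, 2, 2, 3, 3, so the two Mexico runs stay separate until the final groupby sum adds together those that pass the > 1 filter.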
    

    【Comments】:

      【Solution 3】:

      You can use pd.crosstab:

      df6["places"] = df6["places"].apply(lambda x: x.split(","))
      df6 = df6.explode("places")
      
      out = pd.crosstab(df6["name"], df6["places"])
      out.index.name = None
      out.columns.name = None
      print(out)
      

      Prints:

            AUS  CA  Mexico  UK  US
      Jack    2   0       3   0   0
      John    0   3       0   0   3
      Sara    0   1       0   4   1
      

      EDIT: To sum up the consecutive column (for consecutive values > 1), starting again from the original df6:

      from itertools import groupby
      from collections import Counter
      
      df6["consecutive"] = df6.places.map(
          lambda x: [
              {k: v for k, v in Counter(group[1]).items() if v > 1}
              for group in groupby(x.split(","))
          ]
      )
      
      df6 = df6.explode("consecutive").reset_index(drop=True)
      out = (
          pd.concat([df6, pd.DataFrame(df6.pop("consecutive").tolist())], axis=1)
          .groupby("name")
          .sum(numeric_only=True)  # sum the count columns only; skips the string column "places"
      )
      print(out)
      

      Prints:

             UK   US   CA  AUS  Mexico
      name                            
      Jack  0.0  0.0  0.0  2.0     2.0
      John  0.0  3.0  3.0  0.0     0.0
      Sara  4.0  0.0  0.0  0.0     0.0
      

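      【Comments】:

      • Thanks Andrej, your output shows all visits to each country. I'm looking for a way to find only consecutive visits, which is why I used df6['consecutive'] = df6.places.map(lambda x: [Counter(group[1]) for group in groupby(x.split(','))]) to find consecutive values based on the ordered comma-separated list in the column places.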