【Question title】: How to concatenate multiple column values into a single column in a pandas DataFrame based on start and end time
【Posted】: 2021-06-14 17:03:36
【Question】:

I'm new to Python and I'm trying to use pandas to create a database similar to this.

Below is a simplified version of my df:

    Timestamp   A   B   C
0   2013-02-01  1   0   0
1   2013-02-02  2   10  18
2   2013-02-03  3   0   19
3   2013-02-04  4   12  20
4   2013-02-05  0   13  21
5   2013-02-06  6   14  22
6   2013-02-07  7   15  23
7   2013-02-08  0   0   0

The first thing I did was create a new, empty dataframe to store the data, using the following code:

import pandas as pd

# Create frequent pattern source database
df_frequent_pattern = pd.DataFrame(columns = ["Start Time", "End Time", "Active Appliances"])

# Create start_time and end_time series using pd.date_range
df_frequent_pattern["Start Time"] = pd.date_range("2013-02-1", "2013-02-08", freq = "D")
df_frequent_pattern["End Time"] = pd.date_range("2013-02-2", "2013-02-09", freq = "D")

Output:

    Start Time  End Time    Active Appliances
0   2013-02-01  2013-02-02  NaN
1   2013-02-02  2013-02-03  NaN
2   2013-02-03  2013-02-04  NaN
3   2013-02-04  2013-02-05  NaN
4   2013-02-05  2013-02-06  NaN
5   2013-02-06  2013-02-07  NaN
6   2013-02-07  2013-02-08  NaN
7   2013-02-08  2013-02-09  NaN

Based on this and this Stack Overflow post, I wrote the following code to assign the appliances to the correct time resolution:

# Add the data to the correct 'active' period based on interval and merge the active appliances in the "active appliances column"
# Row counter for the loop
rows = 8

for row in range(rows):
  # Check if appliance is active during time resoltuion
  if df_frequent_pattern["Start Time"] <= df["Timestamp"] | df["Timestamp" <= df_frequent_pattern["End Time"]:
    # Add all the appliance active during the time resolution to the column as a string value (e.g. "A, B, C")
     df_frequent_pattern["Active Appliances"] = df["A", "B", "C"].apply(lambda row: '_'.join(row.values.astype(str)), axis = 1)

Unfortunately, the code doesn't work and I get the following error:

df_frequent_pattern["Active Appliances"] = df["A", "B", "C"].apply(lambda row: '_'.join(row.values.astype(str)), axis = 1)
                                         ^
SyntaxError: invalid syntax

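However, the '=' seems to be placed correctly according to the second post. Any ideas on how to use my df to get the expected result?

It should look like this: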

   Start Time   End Time    Active Appliances
0   2013-02-01  2013-02-02  "A"
1   2013-02-02  2013-02-03  "A,B,C"
2   2013-02-03  2013-02-04  "A,C"
3   2013-02-04  2013-02-05  "A,B,C"
4   2013-02-05  2013-02-06  "A,B,C"
5   2013-02-06  2013-02-07  "A,B,C"
6   2013-02-07  2013-02-08  "A,B,C"
7   2013-02-08  2013-02-09  ""

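【Discussion】:

Use df[["A", "B", "C"]]
Solid research for your first post. You almost have a perfect question; you only need to add your expected output. But I think you need df[["A", "B", "C"]].astype(str).agg('_'.join, axis=1)
@Manakin I guess the logic in his loop doesn't seem right either, does it? The OP probably needs to add the expected output..
@ShubhamSharma Without the expected output I'm not sure. From what I can see, any value greater than 1 is an active appliance, but for 3-4
@Manakin An appliance is active if its value is not 0. In my dataset, the values correspond to the total energy consumption of that appliance at a given point in time. I'm trying to build a database in which I can see which appliances were active (consumed energy) within a particular time resolution.

Tags: python pandas dataframe time-series concatenation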
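
The two code-level problems raised in this thread can be reproduced in isolation. The sketch below is an illustration, not the asker's full program. It assumes the SyntaxError comes from the missing `]` in `df["Timestamp" <= ...` on the `if` line: the unclosed bracket lets the parser read on into the next line, which would explain why the caret lands on the `=` there. It then applies the `df[["A", "B", "C"]]` selection suggested above. The `broken` string, the three-row frame, and the names `syntax_ok` and `joined` are made up for the demonstration.

```python
import pandas as pd

# A reduced version of the asker's loop body: the "[" opened before
# "Timestamp" is never closed, so the statement runs on into the next line.
broken = (
    'if a <= b | df["Timestamp" <= c:\n'
    '    out = df["A"]\n'
)
try:
    compile(broken, "<demo>", "exec")
    syntax_ok = True
except SyntaxError:
    syntax_ok = False
print("broken snippet compiles:", syntax_ok)

# Hypothetical three-row sample, taken from the values in the question.
df = pd.DataFrame({"A": [1, 2, 0], "B": [0, 10, 13], "C": [0, 18, 21]})

# Double brackets select several columns; df["A", "B", "C"] would instead
# look for a single column labelled by the tuple ("A", "B", "C").
joined = df[["A", "B", "C"]].astype(str).agg("_".join, axis=1)
print(joined.tolist())
```

Note that this joins the values of each row (for example `2_10_18`), not the column names, so it removes the error but does not by itself produce strings such as "A,B,C".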

【Solution 1】:

Let's do this in a few steps.

First, let's make sure your Timestamp column is a datetime.

df['Timestamp'] = pd.to_datetime(df['Timestamp'])

Then we can create a new dataframe based on the minimum and maximum of your timestamps.

df1 = pd.DataFrame({'start_time' : pd.date_range(df['Timestamp'].min(), df['Timestamp'].max())})

df1['end_time'] = df1['start_time'] + pd.DateOffset(days=1)

 start_time   end_time
0 2013-02-01 2013-02-02
1 2013-02-02 2013-02-03
2 2013-02-03 2013-02-04
3 2013-02-04 2013-02-05
4 2013-02-05 2013-02-06
5 2013-02-06 2013-02-07
6 2013-02-07 2013-02-08
7 2013-02-08 2013-02-09

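Now we need to create a dataframe to merge onto your start_time column.

Let's filter out any values less than or equal to 0 and create a list of active appliances: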

df = df.set_index('Timestamp')
# The remaining columns must be numeric for this to work,
# otherwise you'll need to subselect them.
df2 = df.mask(df.le(0)).stack().reset_index(1).groupby(level=0)\
                 .agg(active_appliances=('level_1',list)).reset_index(0)

# Change .agg(active_appliances=('level_1', list))
# to .agg(active_appliances=('level_1', ','.join))
# if you prefer strings.



    Timestamp active_appliances
0 2013-02-01               [A]
1 2013-02-02         [A, B, C]
2 2013-02-03            [A, C]
3 2013-02-04         [A, B, C]
4 2013-02-05            [B, C]
5 2013-02-06         [A, B, C]
6 2013-02-07         [A, B, C]
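
The chained expression above can be unpacked one step at a time. The following is a minimal sketch that rebuilds the sample data from the question so it runs on its own. The intermediate names `masked`, `stacked` and `names` are introduced here for illustration, and the explicit `dropna()` is an addition, on the assumption that some pandas versions keep NaN rows after `stack()`.

```python
import pandas as pd

# Sample data from the question, indexed by Timestamp.
df = pd.DataFrame(
    {
        "Timestamp": pd.date_range("2013-02-01", "2013-02-08", freq="D"),
        "A": [1, 2, 3, 4, 0, 6, 7, 0],
        "B": [0, 10, 0, 12, 13, 14, 15, 0],
        "C": [0, 18, 19, 20, 21, 22, 23, 0],
    }
).set_index("Timestamp")

# 1. Replace every value <= 0 with NaN so inactive appliances can be dropped.
masked = df.mask(df.le(0))

# 2. Go from wide to long: one entry per (Timestamp, appliance) pair.
#    dropna() is an explicit safeguard in case stack() keeps the NaN entries.
stacked = masked.stack().dropna()

# 3. Move the appliance name out of the index into a column named "level_1".
names = stacked.reset_index(level=1)

# 4. Collect the appliance names per timestamp.
df2 = (
    names.groupby(level=0)
    .agg(active_appliances=("level_1", list))
    .reset_index()
)
print(df2)
```

The last day (2013-02-08) has no active appliance, so it drops out in step 2 and `df2` ends up with seven rows.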

Then we can merge:

final = pd.merge(df1, df2, left_on='start_time', right_on='Timestamp', how='left').drop(columns='Timestamp')


  start_time   end_time active_appliances
0 2013-02-01 2013-02-02               [A]
1 2013-02-02 2013-02-03         [A, B, C]
2 2013-02-03 2013-02-04            [A, C]
3 2013-02-04 2013-02-05         [A, B, C]
4 2013-02-05 2013-02-06            [B, C]
5 2013-02-06 2013-02-07         [A, B, C]
6 2013-02-07 2013-02-08         [A, B, C]
7 2013-02-08 2013-02-09               NaN
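
If comma-separated strings with an empty string for idle days are preferred, as in the question's expected output, a row-wise variant is one option. This is an alternative sketch, not the method used in this answer. It assumes the data has exactly one row per day with no gaps, so no merge is needed, and it treats any non-zero value as active, following the asker's comment. The helper `active_names` and the `appliances` list are hypothetical names.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    {
        "Timestamp": pd.date_range("2013-02-01", "2013-02-08", freq="D"),
        "A": [1, 2, 3, 4, 0, 6, 7, 0],
        "B": [0, 10, 0, 12, 13, 14, 15, 0],
        "C": [0, 18, 19, 20, 21, 22, 23, 0],
    }
)
appliances = ["A", "B", "C"]


def active_names(row):
    # Keep the names of the columns whose value is non-zero in this row.
    return ",".join(name for name in appliances if row[name] != 0)


# Assumes one row per day, so each Timestamp is the start of its interval.
result = pd.DataFrame(
    {
        "Start Time": df["Timestamp"],
        "End Time": df["Timestamp"] + pd.Timedelta(days=1),
        "Active Appliances": df[appliances].apply(active_names, axis=1),
    }
)
print(result)
```

A row-wise `apply` is slower than the vectorised chain on large frames, but it keeps the "non-zero means active" rule visible in one place.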

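【Discussion】:

@ShubhamSharma Thanks, I wrote a similar stored procedure in SQL for this exact problem. Working with strings is much easier in Python.
Wow! Great.. I agree that working with strings is much easier in Python.
@Manakin Thank you so much! I spent a lot of time on this, and you did it effortlessly.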