The approach here is to join the DF back onto itself to get the previous value. Two such examples are provided:
- the previous day
- the previous non-NaN timestamp
The working columns are kept for transparency.
import io
import pandas as pd
df = pd.read_csv(io.StringIO(""" creTimestamp CPULoad instnceId
0 2021-01-22 18:48:00 22.0 instanceA
1 2021-01-23 20:25:00 23.0 instanceA
2 2021-01-22 18:42:00 22.0 instanceA
3 2021-01-22 15:24:00 23.0 instanceB
4 2021-01-24 20:25:00 NaN instanceA
5 2021-01-22 08:53:00 22.0 instanceA
6 2021-01-23 19:43:00 23.0 instanceB
7 2021-01-23 15:24:00 NaN instanceA
8 2021-01-24 18:48:00 NaN instanceA
9 2021-01-24 01:51:00 NaN instanceB
10 2021-01-24 15:24:00 NaN instanceA
"""), sep="\t", index_col=0)
df.creTimestamp = pd.to_datetime(df.creTimestamp)
# literally take previous day value
df2 = (df
.assign(yesterday=lambda dfa: dfa.creTimestamp-pd.Timedelta(days=1))
.merge(df.rename(columns={"creTimestamp":"yesterday"}).loc[:,["yesterday","CPULoad"]]
, on="yesterday", suffixes=("", "_pre"), how="left")
.assign(CPULoad=lambda dfa: dfa.CPULoad.fillna(dfa.CPULoad_pre))
)
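The fill pattern used above can be checked in isolation: merge the frame onto a copy of itself keyed on a shifted timestamp, then `fillna` pulls the previous day's value across. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical two-row frame: the second day is missing a load value.
df = pd.DataFrame({
    "creTimestamp": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "CPULoad": [10.0, None],
})

# Self-join: each row looks up the row whose timestamp is one day earlier.
filled = (df
    .assign(yesterday=lambda d: d.creTimestamp - pd.Timedelta(days=1))
    .merge(df.rename(columns={"creTimestamp": "yesterday"}),
           on="yesterday", suffixes=("", "_pre"), how="left")
    .assign(CPULoad=lambda d: d.CPULoad.fillna(d.CPULoad_pre))
)
print(filled.CPULoad.tolist())  # → [10.0, 10.0]
```

Row 1 has no "yesterday" match, so its original value survives; row 2's NaN is filled from row 1.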
# take timestamp forward, beware if DF has multiple values for same timestamp
df2 = (df
.assign(timestamp=lambda dfa: dfa.creTimestamp.dt.time)
.merge(df.assign(timestamp=lambda dfa: dfa.creTimestamp.dt.time)
.loc[:,["timestamp","CPULoad"]]
.dropna()
, on="timestamp", suffixes=("", "_pre"), how="left")
.assign(CPULoad=lambda dfa: dfa.CPULoad.fillna(dfa.CPULoad_pre))
)
Output
creTimestamp CPULoad instnceId timestamp CPULoad_pre
2021-01-22 18:48:00 22.0 instanceA 18:48:00 22.0
2021-01-23 20:25:00 23.0 instanceA 20:25:00 23.0
2021-01-22 18:42:00 22.0 instanceA 18:42:00 22.0
2021-01-22 15:24:00 23.0 instanceB 15:24:00 23.0
2021-01-24 20:25:00 23.0 instanceA 20:25:00 23.0
2021-01-22 08:53:00 22.0 instanceA 08:53:00 22.0
2021-01-23 19:43:00 23.0 instanceB 19:43:00 23.0
2021-01-23 15:24:00 23.0 instanceA 15:24:00 23.0
2021-01-24 18:48:00 22.0 instanceA 18:48:00 22.0
2021-01-24 01:51:00 NaN instanceB 01:51:00 NaN
2021-01-24 15:24:00 23.0 instanceA 15:24:00 23.0
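The time-of-day join works because `Series.dt.time` strips the date, so rows from different days that share a clock time get the same join key. A small illustration:

```python
import pandas as pd

ts = pd.to_datetime(pd.Series(["2021-01-22 18:48:00", "2021-01-24 18:48:00"]))
# .dt.time keeps only the datetime.time component, dropping the date.
times = ts.dt.time
print(times[0] == times[1])  # → True, both are 18:48:00
```

This is also why the caveat about duplicate timestamps matters: every earlier day with the same clock time becomes a match candidate.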
Update
- in a large dataframe (not this sample), the same timestamp can occur multiple times with different values
- using drop_duplicates() makes timestamp unique, so merge() returns the same number of rows as the original DF
- this means each NaN is filled with the last observed value for that timestamp
- added an extra join key (instnceId)
# take timestamp forward, beware if DF has multiple values for same timestamp
# taking last observed value to prevent merge generating duplicates
# also include instnceId in join key...
df2 = (df
.assign(timestamp=lambda dfa: dfa.creTimestamp.dt.time)
.merge(df.assign(timestamp=lambda dfa: dfa.creTimestamp.dt.time)
.loc[:,["instnceId", "timestamp","CPULoad"]]
.dropna()
.drop_duplicates(subset=["instnceId","timestamp"], keep="last")
, on=["instnceId","timestamp"], suffixes=("", "_pre"), how="left")
.assign(CPULoad=lambda dfa: dfa.CPULoad.fillna(dfa.CPULoad_pre))
.drop(columns=["timestamp","CPULoad_pre"])
)
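The reason for the `drop_duplicates()` safeguard can be shown on a hypothetical frame where the join key repeats: a naive left merge fans out duplicated keys into extra rows, while deduplicating first (with `keep="last"`) preserves the original row count.

```python
import pandas as pd

# Hypothetical lookup table with a repeated key "a".
left = pd.DataFrame({"key": ["a", "b"]})
right = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# Naive merge: the duplicated key "a" fans out into two rows.
naive = left.merge(right, on="key", how="left")
print(len(naive))  # → 3 rows, one more than left

# Deduplicating first (keep="last") preserves the row count of left.
safe = left.merge(right.drop_duplicates(subset="key", keep="last"),
                  on="key", how="left")
print(len(safe))   # → 2 rows; "a" gets the last observed value, 2
```

This mirrors the final snippet above, where `instnceId` plus `timestamp` is deduplicated before the join so the result aligns row-for-row with the original DF.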