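Loop through rows in pandas dataframe

【Question title】: Loop through rows in pandas dataframe
【Posted】: 2014-07-01 22:31:22
【Question】:

I have two dataframes: one has only company names and dates, the other has only timestamps, as shown below.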

    creationdate
0   2012-05-01 18:20:27.167000
1   2012-05-01 19:16:08.070000
2   2012-05-01 19:20:07.880000
3   2012-05-01 19:33:02.200000
4   2012-05-01 19:35:09.173000
5   2012-05-01 20:18:55.610000
6   2012-05-01 20:26:27.577000
7   2012-05-01 20:32:34.343000
8   2012-05-01 20:39:31.257000
9   2012-05-01 21:04:50.357000
10  2012-05-01 21:54:18.983000
11  2012-05-02 02:23:53.250000
12  2012-05-02 02:40:27.643000
13  2012-05-02 08:44:28.260000

and

   sitename        date
0    Google  2012-05-01
1    Google  2012-05-02
2    Google  2012-05-03
3    Google  2012-05-04
4    Google  2012-05-05
5    Google  2012-05-06
6    Google  2012-05-07
7    Google  2012-05-08
8    Google  2012-05-09
9    Google  2012-05-10

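How can I efficiently loop through the second dataframe and extract, from the first dataframe, the timestamps that correspond to each date in the second dataframe?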

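【Comments】:

Have you tried anything? This seems like a very easy job for datetime.
@Cyber: I set the date column of the second df as the index and tried looping over it, checking whether the index equals the date extracted from each element of the first dataframe. But that checks every element of the first dataframe on each iteration, which is why I am asking for an efficient way.
@Cyber: Could you tell me the easy way? I am new to dataframes.
"loop through the second dataframe" and "extract the timestamps from the second dataframe" and "corresponding to each date in the second dataframe" - do you need the first dataframe?
@furas: Actually, I have to compute the average time difference between the timestamps in the first dataframe for a given date, and I want to do this for every date in the second dataframe. To do that, I am trying to get the timestamps that correspond to one day and then do the math.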

Tags: python pandas dataframe

【Solution 1】:

Merging (inner join) the two dataframes should work:

In [96]: df1['date'] = pd.DatetimeIndex(df1.creationdate).date

In [97]: df2['date'] = pd.DatetimeIndex(df2.date).date

In [98]: df = df1.merge(df2, on='date', how='inner')

In [99]: df
Out[99]: 
                 creationdate        date sitename
0  2012-05-01 18:20:27.167000  2012-05-01   Google
1  2012-05-01 19:16:08.070000  2012-05-01   Google
2  2012-05-01 19:20:07.880000  2012-05-01   Google
3  2012-05-01 19:33:02.200000  2012-05-01   Google
4  2012-05-01 19:35:09.173000  2012-05-01   Google
5  2012-05-01 20:18:55.610000  2012-05-01   Google
6  2012-05-01 20:26:27.577000  2012-05-01   Google
7  2012-05-01 20:32:34.343000  2012-05-01   Google
8  2012-05-01 20:39:31.257000  2012-05-01   Google
9  2012-05-01 21:04:50.357000  2012-05-01   Google
10 2012-05-01 21:54:18.983000  2012-05-01   Google
11 2012-05-02 02:23:53.250000  2012-05-02   Google
12 2012-05-02 02:40:27.643000  2012-05-02   Google
13 2012-05-02 08:44:28.260000  2012-05-02   Google
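
For readers who want to run the merge outside an IPython session, here is a minimal self-contained sketch. The frame names `timestamps` and `sites` and the three-row sample are assumptions made for illustration, not the asker's real data. It also shows that an inner join silently drops dates with no timestamps, whereas a left join from the date table keeps them.

```python
import datetime as dt

import pandas as pd

# Hypothetical miniature versions of the two frames from the question.
timestamps = pd.DataFrame({
    "creationdate": pd.to_datetime([
        "2012-05-01 18:20:27.167",
        "2012-05-01 19:16:08.070",
        "2012-05-02 02:23:53.250",
    ])
})
sites = pd.DataFrame({
    "sitename": ["Google"] * 3,
    "date": [dt.date(2012, 5, 1), dt.date(2012, 5, 2), dt.date(2012, 5, 3)],
})

# Derive a calendar-date key so both frames share a column to join on.
timestamps["date"] = timestamps["creationdate"].dt.date

# Inner join: only dates present in both frames survive (2012-05-03 is dropped).
merged = timestamps.merge(sites, on="date", how="inner")

# Left join from the date table: every date is kept, with NaT where
# no timestamp exists for that day.
all_dates = sites.merge(timestamps, on="date", how="left")

print(merged)
print(all_dates)
```

Whether the inner or the left join is the better fit depends on whether days without any timestamps should appear in the result.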

Then you can run your analysis on df:

In [100]: df['time_diff'] = df.creationdate.diff()

In [101]: df.time_diff
Out[101]: 
0                NaT
1    00:55:40.903000
2    00:03:59.810000
3    00:12:54.320000
4    00:02:06.973000
5    00:43:46.437000
6    00:07:31.967000
7    00:06:06.766000
8    00:06:56.914000
9    00:25:19.100000
10   00:49:28.626000
11   04:29:34.267000
12   00:16:34.393000
13   06:04:00.617000
Name: time_diff, dtype: timedelta64[ns]
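
Note that a plain `diff()` over the whole frame also measures the gap across the day boundary (row 11 above is the difference between the last timestamp of 2012-05-01 and the first of 2012-05-02). Since the asker's comment mentions an average time difference per date, one possible way to get that is to take the differences within each date group. This is a sketch on a six-row subset of the question's timestamps, not part of the original answer, and it assumes a reasonably recent pandas version in which `groupby(...).mean()` accepts timedelta columns.

```python
import pandas as pd

# Six of the timestamps from the question, three per day.
df = pd.DataFrame({
    "creationdate": pd.to_datetime([
        "2012-05-01 18:20:27.167",
        "2012-05-01 19:16:08.070",
        "2012-05-01 19:20:07.880",
        "2012-05-02 02:23:53.250",
        "2012-05-02 02:40:27.643",
        "2012-05-02 08:44:28.260",
    ])
})
df["date"] = df["creationdate"].dt.date

# Make sure rows are in time order before differencing.
df = df.sort_values("creationdate")

# diff() within each date: the first row of every day becomes NaT,
# so no gap is measured across midnight.
df["time_diff"] = df.groupby("date")["creationdate"].diff()

# Mean gap per date; NaT values are skipped by mean().
mean_gap = df.groupby("date")["time_diff"].mean()
print(mean_gap)
```

The result is a Series indexed by date with one mean timedelta per day, which can be joined back onto the second dataframe by its `date` column if needed.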

Of course, your creationdate must be datetime64[ns], NOT a string. Otherwise you need to convert it first with pd.DatetimeIndex(df.creationdate).
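
As an illustration of that conversion, the sketch below starts from a hypothetical frame whose `creationdate` column holds plain strings and parses it with `pd.to_datetime`, a commonly used alternative to wrapping the column in `pd.DatetimeIndex`. The sample values are copied from the question; the variable names are assumptions.

```python
import pandas as pd

# Hypothetical input: timestamps stored as strings (object dtype).
df1 = pd.DataFrame({
    "creationdate": [
        "2012-05-01 18:20:27.167000",
        "2012-05-01 19:16:08.070000",
        "2012-05-02 02:23:53.250000",
    ]
})

# Parse the strings into real datetimes; after this the .dt accessor
# and diff() behave as in the answer above.
df1["creationdate"] = pd.to_datetime(df1["creationdate"])

# Calendar-date key for merging, and gaps between consecutive rows.
df1["date"] = df1["creationdate"].dt.date
gaps = df1["creationdate"].diff()

print(df1.dtypes)
print(gaps)
```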

【Comments】: