How to group an HTTP request log using pandas

【Question title】: How to group HTTP requests log using pandas
【Posted】: 2019-05-15 23:02:50
【Question】:

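I have an HTTP request log. Its fields are: capture_time, ip, method, url, content, user_agent.

All of this information is in a single csv file.

I want to group all requests that come from the same IP within a 10-minute interval.

How can I do this with pandas?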

Sample dataset (one record):

date: 2019-04-24 23:16:48.742466
ip: 187.20.211.99
method: POST
url: /delivery/check_location
content: bairro=Vila&cidade=Lima
agent: Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148

I have already tried the groupby method.

I want to merge the content of all the requests that fall into the same ip and time group into a single row.

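【Comments】:

So you only care about the time and the IP, and the rest doesn't matter? Do you need a count of that IP within the same time span?
I want to group them by ip and by the time interval between requests (10 minutes). I want the method, url and content concatenated on the same row. For example: POST url content GET url2 content2 ...
Will the same IP have different methods, URLs and content within the same time span?
Yes. The method and content can differ for each request.
So, if that's the case, do you still want only 1 row for that IP?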

Tags: python pandas dataframe

【Solution 1】:

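# 'date' must already be a datetime column, e.g. via pd.to_datetime
df.set_index('date', inplace=True)

# '10min' replaces the older '10T' alias, which newer pandas versions deprecate
unnesting(df.resample('10min')['ip'].unique().reset_index(), ['ip']).reset_index(drop=True)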

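First, set the date as the index. Next, resample the time in 10-minute increments, look at the ip column and take the unique values for each time span. Finally, un-nest the lists that unique() creates using the following function.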

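# Source: https://stackoverflow.com/questions/53218931/how-to-unnest-explode-a-column-in-a-pandas-dataframe/55839330#55839330

import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each row's index once per element of its list
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten every list column into one long column
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    # Re-attach the remaining columns (drop(explode, 1) no longer works in pandas 2.0+)
    return df1.join(df.drop(columns=explode), how='left')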

After this, you can concatenate whatever you plan to.

Edit:
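
On pandas 0.25 or later, the built-in DataFrame.explode can stand in for the unnesting helper. The following is only a sketch under assumptions: the three rows are made-up sample data, and it groups with pd.Grouper rather than resample, which is a swapped-in but equivalent way to form the 10-minute buckets.

```python
import pandas as pd

# Hypothetical sample log; only 'date' and 'ip' matter here.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-04-24 23:16:48",
        "2019-04-24 23:18:02",
        "2019-04-24 23:21:10",
    ]),
    "ip": ["187.20.211.99", "10.0.0.1", "187.20.211.99"],
}).set_index("date")

# Unique IPs per 10-minute bucket: one array of IPs per bucket.
per_bucket = df.groupby(pd.Grouper(freq="10min"))["ip"].unique().reset_index()

# explode() gives every array element its own row; a bucket with no
# requests would come out as NaN, so such rows are dropped.
pairs = per_bucket.explode("ip").dropna(subset=["ip"]).reset_index(drop=True)
print(pairs)
```

The result has one row per (10-minute bucket, ip) pair, which is the same shape the helper produces.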

# Set the index to the date column
df.set_index('date', inplace=True)

# 10 minutes in nanoseconds
ns10min = 10 * 60 * 1000000000

# Floor each timestamp to the start of its 10-minute window
df.index = pd.to_datetime((df.index.astype(np.int64) // ns10min) * ns10min)

# Group by both the floored index and ip, then take the first row of each group
df.groupby([df.index, df['ip']]).first()
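
The integer arithmetic in the edit amounts to flooring each timestamp to a 10-minute boundary, which DatetimeIndex.floor does directly. A small sketch with hypothetical timestamps that compares the two; note that the arithmetic assumes the index has nanosecond resolution, so the sketch forces it explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical request times, forced to nanosecond resolution because
# the integer approach below counts nanoseconds since the epoch.
idx = pd.DatetimeIndex([
    "2019-04-24 23:16:48.742466",
    "2019-04-24 23:19:59",
    "2019-04-24 23:20:00",
]).astype("datetime64[ns]")

# Integer approach from the answer: truncate to a multiple of 10 minutes.
ns10min = 10 * 60 * 1_000_000_000
by_int = pd.to_datetime((idx.astype(np.int64) // ns10min) * ns10min)

# Built-in equivalent.
by_floor = idx.floor("10min")

print(by_floor)
print(list(by_int) == list(by_floor))
```

With floor, the edit's groupby could be written as df.groupby([df.index.floor('10min'), 'ip']).first() without replacing the index first.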

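【Comments】:

Seems to work very well. I just don't know how to concatenate the other fields, since it creates a new dataframe with only the ip and the time interval. Any ideas?
Glad to see you are using unnest :-)
@WeNYoBen Yes! I read through that thread very thoroughly to understand all the pros and cons of the different approaches, and yours is very convenient :D
@LuccaZenobio So that's why I asked you beforehand: if an IP address shows up twice within 10 minutes with different values in the other columns, you can't concatenate them and still have one row. Unless you want your DF to be very wide, with repeated columns?
I want to merge the content of all the other columns into one. Like in your tip, but with one more column that concatenates all the values. If I could at least get the indexes, I could do it.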

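【Solution 2】:

I used Ben Pap's method to group the IPs by date. That gave me a dataframe with the IPs and the time intervals. To join the other columns and add them to this dataframe, I did this: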

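content = []
for index, row in test.iterrows():
    texto = ""
    # Per the original code, position 0 of the row holds the window start and position 2 the IP
    start = row.iloc[0]
    # Requests from this IP inside [start, start + 10 min); the start is inclusive
    # so a request that lands exactly on the window boundary is not dropped
    resul = df2.loc[(df2[df2.columns[1]] == row.iloc[2]) & (start <= df2.index) & (df2.index < start + pd.Timedelta(minutes=10))]
    for _, current_row in resul.iterrows():
        # Concatenate the fields at positions 2, 3 and 4 of each matching request
        texto += " " + current_row.values[2] + " " + current_row.values[3] + " " + current_row.values[4]
    content.append(texto)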
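
The loop can also be expressed as a single groupby. This is a minimal sketch, assuming the column names from the question's sample (date, ip, method, url, content) and hypothetical rows; it is an alternative formulation, not the code the answer's author ran.

```python
import pandas as pd

# Hypothetical rows shaped like the question's sample data.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-04-24 23:16:48",
        "2019-04-24 23:18:02",
        "2019-04-24 23:18:30",
        "2019-04-24 23:31:10",
    ]),
    "ip": ["187.20.211.99", "187.20.211.99", "10.0.0.1", "187.20.211.99"],
    "method": ["POST", "GET", "GET", "POST"],
    "url": ["/delivery/check_location", "/menu", "/menu", "/order"],
    "content": ["bairro=Vila", "", "", "item=1"],
})

# One string per request: "METHOD url content" (trailing space removed
# when the content is empty).
df["request"] = (df["method"] + " " + df["url"] + " " + df["content"]).str.strip()

# Bucket by 10-minute window and IP, then join the requests of each group
# in their original row order.
merged = (
    df.assign(bucket=df["date"].dt.floor("10min"))
      .groupby(["bucket", "ip"])["request"]
      .agg(lambda s: " ".join(s))
      .reset_index()
)
print(merged)
```

Each output row holds one IP in one 10-minute window together with all of its requests joined into a single string, which avoids scanning the full log once per row as the iterrows version does.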
