【Question title】: Row-wise comparison of two pandas DataFrames
【Posted】: 2021-11-11 15:43:30
【Question】:

I have two pandas DataFrames.

flows:
------
sourceIPAddress     destinationIPAddress    flowStartMicroseconds       flowEndMicroseconds 
163.193.204.92      40.8.121.226            2021-05-01 07:00:00.113     2021-05-01 07:00:00.113962
104.247.103.181     163.193.124.92          2021-05-01 07:00:00.074     2021-05-01 07:00:00.083874
17.254.170.53       163.193.124.133         2021-05-01 07:00:00.077     2021-05-01 07:00:00.098296
18.179.96.152       203.179.250.96          2021-05-01 07:00:00.112     2021-05-01 07:00:00.112013
133.103.144.34      13.154.212.11           2021-05-01 07:00:00.101     2021-05-01 07:00:00.101026

attacks:
--------
datetime                    srcIP           dstIP
2021-05-01 07:00:00.055210  188.67.130.72   133.92.239.153   
2021-05-01 07:00:00.055500  45.100.34.74    203.179.180.153   
2021-05-01 07:00:00.055351  103.113.29.26   163.193.242.75   
2021-05-01 07:00:00.056209  128.215.229.101 163.193.94.194   
2021-05-01 07:00:00.055258  45.111.22.11    163.193.138.139   

I want to check whether each row of flows matches any row of attacks, where a match means:

(attacks[srcIP] == flows[sourceIPAddress] || attacks[srcIP] == flows[destinationIPAddress])
&&
(attacks[dstIP] == flows[sourceIPAddress] || attacks[dstIP] == flows[destinationIPAddress])
&&
attacks[datetime] between flows[flowStartMicroseconds] and flows[flowEndMicroseconds]

Is there a more efficient way to do this than iterating?

Edit: The DataFrames are very large. I have included the head() of each.

from pandas import Timestamp  # needed to evaluate the to_dict() output below

flows = {'sourceIPAddress': {510: '163.193.204.92',
  564: '104.247.103.181',
  590: '17.254.170.53',
  599: '18.179.96.152',
  1149: '133.103.144.34'},
 'destinationIPAddress': {510: '40.8.121.226',
  564: '163.193.124.92',
  590: '163.193.124.133',
  599: '203.179.250.96',
  1149: '13.154.212.11'},
 'flowStartMicroseconds': {510: Timestamp('2021-05-01 07:00:00.113000'),
  564: Timestamp('2021-05-01 07:00:00.074000'),
  590: Timestamp('2021-05-01 07:00:00.077000'),
  599: Timestamp('2021-05-01 07:00:00.112000'),
  1149: Timestamp('2021-05-01 07:00:00.101000')},
 'flowEndMicroseconds': {510: Timestamp('2021-05-01 07:00:00.113962'),
  564: Timestamp('2021-05-01 07:00:00.083874'),
  590: Timestamp('2021-05-01 07:00:00.098296'),
  599: Timestamp('2021-05-01 07:00:00.112013'),
  1149: Timestamp('2021-05-01 07:00:00.101026')}}

attacks = {'datetime': {0: Timestamp('2021-05-01 07:00:00.055210'),
  1: Timestamp('2021-05-01 07:00:00.055500'),
  2: Timestamp('2021-05-01 07:00:00.055351'),
  3: Timestamp('2021-05-01 07:00:00.056209'),
  4: Timestamp('2021-05-01 07:00:00.055258')},
 'srcIP': {0: '188.67.130.72',
  1: '45.100.34.74',
  2: '103.113.29.26',
  3: '128.215.229.101',
  4: '45.111.22.11'},
 'dstIP': {0: '133.92.239.153',
  1: '203.179.180.153',
  2: '163.193.242.75',
  3: '163.193.94.194',
  4: '163.193.138.139'}}

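【Comments】:

  • Could you include attacks.to_dict() and flows.to_dict() for easy copy-pasting?
  • @JoshuaVoskamp Just use pd.read_clipboard(sep='\s\s+')...
  • @MartinPichler Checking whether each row of ... matches each row of ... sounds like a job for merge.
  • Have you tried merging the DataFrames on your conditions? Because of the OR condition I am not sure it would perform any better, but pandas may be optimized well enough that it does.
  • @QuangHoang Yes, I'm working on a merge/join solution.

Tags: python pandas numpy date network-flow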


【Solution 1】:

Merge the two DataFrames with a left join, then look for the intersection of the data.
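
The answer does not say which keys to join on, so the following is only one possible reading of it, not the answerer's code. It assumes the two IP columns can be turned into a direction-independent pair key (the helper `add_pair_key` and the columns `ip_lo`, `ip_hi`, `flow_id`, and `is_attack` are made-up names), so that a single left merge covers both the forward and the backward case. The toy data is hypothetical; only the column names come from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
flows = pd.DataFrame({
    "sourceIPAddress": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "destinationIPAddress": ["10.0.0.9", "10.0.0.8", "10.0.0.7"],
    "flowStartMicroseconds": pd.to_datetime(
        ["2021-05-01 07:00:00.100", "2021-05-01 07:00:00.200", "2021-05-01 07:00:00.300"]),
    "flowEndMicroseconds": pd.to_datetime(
        ["2021-05-01 07:00:00.150", "2021-05-01 07:00:00.250", "2021-05-01 07:00:00.350"]),
})
attacks = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2021-05-01 07:00:00.120", "2021-05-01 07:00:00.500", "2021-05-01 07:00:00.320"]),
    "srcIP": ["10.0.0.1", "10.0.0.2", "10.0.0.7"],
    "dstIP": ["10.0.0.9", "10.0.0.8", "10.0.0.3"],
})


def add_pair_key(df, a, b):
    """Add ip_lo/ip_hi so that (a, b) and (b, a) produce the same key."""
    out = df.copy()
    a_first = out[a] <= out[b]  # plain string ordering; any consistent ordering works
    out["ip_lo"] = out[a].where(a_first, out[b])
    out["ip_hi"] = out[b].where(a_first, out[a])
    return out


# Keep the original flow index as a column so matches can be mapped back.
f = add_pair_key(flows, "sourceIPAddress", "destinationIPAddress")
f = f.rename_axis("flow_id").reset_index()
a = add_pair_key(attacks, "srcIP", "dstIP")[["ip_lo", "ip_hi", "datetime"]]

# Left join keeps every flow; flows without an IP match get NaT in "datetime".
merged = f.merge(a, how="left", on=["ip_lo", "ip_hi"])

# Time condition; comparisons against NaT evaluate to False.
in_window = merged["datetime"].between(
    merged["flowStartMicroseconds"], merged["flowEndMicroseconds"])

matched_ids = merged.loc[in_window, "flow_id"].unique()
flows["is_attack"] = flows.index.isin(matched_ids)
print(flows[["sourceIPAddress", "destinationIPAddress", "is_attack"]])
```

The pair key treats the two IPs as an unordered set, which matches the condition in the question as long as an attack's source and destination differ. The merge still materializes one row per (flow, attack) pair that shares the key, so memory use depends on how often the same IP pair repeats.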

【Comments】:

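    【Solution 2】:

    I am not sure about the performance, but I would go about it as follows.

    1. For this purpose there are only two kinds of IP: attack IPs and flow IPs. So I would restructure both DFs into the following format:

      flow_df : (flow_IPAddress, flowStartMicroseconds, flowEndMicroseconds)

      attack_df: (attack_IP, datetime)

    2. I would then merge them with an inner join (left_on = "flow_IPAddress", right_on = "attack_IP").

    3. I would then query the result to keep only the rows with valid timestamps (e.g., using the condition you wrote above).

    The resulting df would then look something like this (the rows below only illustrate the column layout):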


    flowIPAddress            attack_IP            flowStartMicroseconds            flowEndMicroseconds            datetime  
    163.193.204.92      40.8.121.226            2021-05-01 07:00:00.113     2021-05-01 07:00:00.113962 2021-05-01 07:00:00.055210
    104.247.103.181     163.193.124.92          2021-05-01 07:00:00.074     2021-05-01 07:00:00.101026 2021-05-01 07:00:00.055210
    

    Note: if you want to keep the src and dst IPs, you can still use the approach above, but consider each pair separately.
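
The three steps above could be sketched roughly as follows. This is an illustration under assumptions, not the answerer's code: `melt` is used for the restructuring step, the id and role columns (`flow_id`, `attack_id`, `flow_role`, `attack_role`) are made-up names, and the toy data is hypothetical. The last step is one way to follow the note and require that both IPs of a flow match the same attack.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
flows = pd.DataFrame({
    "sourceIPAddress": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "destinationIPAddress": ["10.0.0.9", "10.0.0.8", "10.0.0.7"],
    "flowStartMicroseconds": pd.to_datetime(
        ["2021-05-01 07:00:00.100", "2021-05-01 07:00:00.200", "2021-05-01 07:00:00.300"]),
    "flowEndMicroseconds": pd.to_datetime(
        ["2021-05-01 07:00:00.150", "2021-05-01 07:00:00.250", "2021-05-01 07:00:00.350"]),
})
attacks = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2021-05-01 07:00:00.120", "2021-05-01 07:00:00.500",
         "2021-05-01 07:00:00.320", "2021-05-01 07:00:00.110"]),
    "srcIP": ["10.0.0.1", "10.0.0.2", "10.0.0.7", "10.0.0.1"],
    "dstIP": ["10.0.0.9", "10.0.0.8", "10.0.0.3", "10.0.0.99"],
})

# Step 1: one IP per row in both frames (long format).
flow_df = flows.rename_axis("flow_id").reset_index().melt(
    id_vars=["flow_id", "flowStartMicroseconds", "flowEndMicroseconds"],
    value_vars=["sourceIPAddress", "destinationIPAddress"],
    var_name="flow_role", value_name="flow_IPAddress")
attack_df = attacks.rename_axis("attack_id").reset_index().melt(
    id_vars=["attack_id", "datetime"],
    value_vars=["srcIP", "dstIP"],
    var_name="attack_role", value_name="attack_IP")

# Step 2: inner join on the single IP column.
merged = flow_df.merge(
    attack_df, how="inner", left_on="flow_IPAddress", right_on="attack_IP")

# Step 3: keep only rows whose attack time falls inside the flow window.
valid = merged[merged["datetime"].between(
    merged["flowStartMicroseconds"], merged["flowEndMicroseconds"])]

# Optional: require that both IPs of the flow matched the same attack.
roles_matched = valid.groupby(["flow_id", "attack_id"])["flow_role"].nunique()
both = roles_matched[roles_matched == 2]
print(both)
```

In this toy data the fourth attack shares only one IP with the first flow, so it survives step 3 but is dropped by the optional last step.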

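    【Comments】:

    • I don't think a join is feasible. Flows has about 600,000 rows and attacks about 2.5 million. The join would produce a huge DataFrame.
    • If you use an inner join, you will end up with far fewer rows, because it only looks for the intersection.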
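
One way to settle the question raised in these comments before running the merge is to count how many rows an inner join on a key would produce. The sketch below relies only on the fact that each key contributes (occurrences on the left) × (occurrences on the right) rows; `inner_join_size` is a made-up helper name and the data is hypothetical.

```python
import pandas as pd


def inner_join_size(left_keys: pd.Series, right_keys: pd.Series) -> int:
    """Number of rows an inner join on this key would produce."""
    left_counts = left_keys.value_counts()
    right_counts = right_keys.value_counts()
    # Keys missing on one side are filled with 0 and contribute no rows.
    return int(left_counts.mul(right_counts, fill_value=0).sum())


# Hypothetical keys: "a" appears 2x/1x, "b" 1x/2x, "c" and "d" on one side only.
left = pd.DataFrame({"k": ["a", "a", "b", "c"]})
right = pd.DataFrame({"k": ["a", "b", "b", "d"]})

estimated = inner_join_size(left["k"], right["k"])
actual = len(left.merge(right, on="k", how="inner"))
print(estimated, actual)
```

For the frames in the question, the same function could be called with, for example, `flows["sourceIPAddress"]` and `attacks["srcIP"]` to see whether a join on that column is small enough to materialize.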
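    【Solution 3】:

    Solution: a database

    My solution was to import both DataFrames into PostgreSQL, create two new tables for the forward and backward IP matches, and then union them together.

    Two simple joins are much faster than one huge join.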

    create table attacks_forward as 
    SELECT
    flows.*, attacks."label", attacks."sublabel"
    FROM
        flows
    JOIN attacks 
        ON flows."sourceIPAddress" = attacks."srcIP" 
        and flows."destinationIPAddress" = attacks."dstIP"
        and attacks."datetime" between flows."flowStartMicroseconds" and flows."flowEndMicroseconds";
        
       
    create table attacks_backward as 
    SELECT
    flows.*, attacks."label", attacks."sublabel"
    FROM
        flows
    JOIN attacks 
        ON flows."sourceIPAddress" = attacks."dstIP" 
        and flows."destinationIPAddress" = attacks."srcIP"
        and attacks."datetime" between flows."flowStartMicroseconds" and flows."flowEndMicroseconds";
    
    create table attacks_flows as 
    SELECT * FROM attacks_forward
    UNION ALL
    SELECT * FROM attacks_backward;
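
For comparison, the same forward/backward split can be sketched in pandas with two inner merges and a `pd.concat` in place of `UNION ALL`. This is not the author's code and no claim is made about how it performs against PostgreSQL at the sizes mentioned in the thread. The toy data is hypothetical and omits the `label` and `sublabel` columns used in the SQL, and `join_direction` is a made-up helper name.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
flows = pd.DataFrame({
    "sourceIPAddress": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "destinationIPAddress": ["10.0.0.9", "10.0.0.8", "10.0.0.7"],
    "flowStartMicroseconds": pd.to_datetime(
        ["2021-05-01 07:00:00.100", "2021-05-01 07:00:00.200", "2021-05-01 07:00:00.300"]),
    "flowEndMicroseconds": pd.to_datetime(
        ["2021-05-01 07:00:00.150", "2021-05-01 07:00:00.250", "2021-05-01 07:00:00.350"]),
})
attacks = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2021-05-01 07:00:00.120", "2021-05-01 07:00:00.500", "2021-05-01 07:00:00.320"]),
    "srcIP": ["10.0.0.1", "10.0.0.2", "10.0.0.7"],
    "dstIP": ["10.0.0.9", "10.0.0.8", "10.0.0.3"],
})


def join_direction(flows, attacks, src_col, dst_col):
    """Inner join on both IP columns, then apply the BETWEEN time filter."""
    merged = flows.merge(
        attacks, how="inner",
        left_on=["sourceIPAddress", "destinationIPAddress"],
        right_on=[src_col, dst_col])
    in_window = merged["datetime"].between(
        merged["flowStartMicroseconds"], merged["flowEndMicroseconds"])
    return merged[in_window]


# Forward: flow source = attack source. Backward: flow source = attack destination.
attacks_forward = join_direction(flows, attacks, "srcIP", "dstIP")
attacks_backward = join_direction(flows, attacks, "dstIP", "srcIP")

# Equivalent of UNION ALL: stack both results without de-duplicating.
attacks_flows = pd.concat([attacks_forward, attacks_backward], ignore_index=True)
print(attacks_flows)
```

Unlike the SQL, each merge here builds all IP-matching pairs first and filters by time afterwards, so the intermediate result can be larger than the final one.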
    

    【Comments】:
