【Question title】: Create new column based on next row
【Posted】: 2014-11-14 21:21:02
【Question】:

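I have been opening a .csv file (exported from MS SQL 2012) in Excel and working on it with formulas.
My data has jumped from 300K to 3.5 million rows and no longer fits. (Cue laughter.)

I have been playing with R and looked closely at dplyr's mutate.
However, what I need seems to go one step beyond R's excellent data manipulation.

I am adding new columns based on logic that operates on the next row. Sometimes the result is a number, sometimes a string.

I am a Python newbie and have a hunch that it might be a better tool than R for this particular task, or maybe not.

I have looked around and searched, and still have not found an example similar to the problem I am facing.

Previously I would take this source.csv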

id,event,eventDate,direction  
id1,apple,1977-06-26 00:00:00.000,positive  
id1,apple,1980-07-01 00:00:00.000,positive  
id1,candy,1980-05-01 00:00:00.000,negative  
id1,apple,1980-11-21 00:00:00.000,positive  
id2,fruit,1980-06-26 00:00:00.000,positive  
id2,cookie,1990-06-26 00:00:00.000,negative  
id2,cavity,1991-07-15 00:00:00.000,negative  
id2,apple,1991-07-16 00:00:00.000,positive  
id2,apple,1997-01-16 00:00:00.000,positive  
id3,cookie,2010-04-20 00:00:00.000,negative  
id4,cookie,2010-04-20 00:00:00.000,negative  
id4,cookie,2010-04-20 00:00:01.000,negative  

and create this output.csv

id,event,eventDate,direction,idEventNumber,nextEvent,daysUntilNextEvent  
id1,apple,1977-06-26 00:00:00.000,positive,1000,negative,1040  
id1,apple,1980-07-01 00:00:00.000,positive,1001,positive,143  
id1,candy,1980-05-01 00:00:00.000,negative,1002,positive,61  
id1,apple,1980-11-21 00:00:00.000,positive,1003,noFurtherEvent,-1  
id2,fruit,1980-06-26 00:00:00.000,positive,1000,negative,3652  
id2,cookie,1990-06-26 00:00:00.000,negative,1001,negative,384  
id2,cavity,1991-07-15 00:00:00.000,negative,1002,positive,1  
id2,apple,1991-07-16 00:00:00.000,positive,1003,positive,2011  
id2,apple,1997-01-16 00:00:00.000,positive,1004,noFurtherEvent,-1  
id3,cookie,2010-04-20 00:00:00.000,negative,1000,noFurtherEvent,-1  
id4,cookie,2010-04-20 00:00:00.000,negative,1000,negative,0  
id4,cookie,2010-04-20 00:00:01.000,negative,1001,noFurtherEvent,-1  

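My new columns will
- number the events per id (start at 1000; check whether the next row's id matches, and if it does add one, otherwise start over at 1000)
- copy the next event's direction into nextEvent (if a next event exists)
- count daysUntilNextEvent (arithmetic between the MS SQL datetime outputs, no fractional days, -1 for the last event)

How would you approach this?

Thank you for your time | ideas | encouragement | pointers | examples.

Correction: the original output.csv example above contained an error. It has since been corrected, but only after several quick responses had come in, so their valid questions and comments may now look out of place.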

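【Question discussion】:

  • PowerShell can do this quite easily. I imagine something like (pseudocode) Import-CSV $path | ForEach{add-member calls to $lastline; output $lastline object to pipe;$lastline=$_} | Export-CSV $newpath
  • The eventDate values in the second and third rows seem to be swapped. Is that intentional?
  • I accidentally swapped the 07 and 05 dates, but many replies had already addressed the information above, so I left the rows in their order and corrected output.csv

Tags: python r powershell csv dplyr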


【Solution 1】:

Here is my approach using data.table:

require(data.table) ## 1.9.4+
DT = fread("input.csv")[, eventDate := as.Date(eventDate)]   ## -(1)

DT[order(id, eventDate),                                     ## -(2)
     `:=`(idEventNumber = seq.int(1000L, length.out=.N), 
          nextEvent = c(tail(direction, -1L), "noFurtherEvent"), 
          daysUntilNextEvent = c(diff(eventDate), -1L)), 
by=id]

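1. First, we read the csv with fread, a fast file reader, and convert eventDate from character to Date format.

2. Then we order by id, eventDate so that the dates are in ascending order. On that ordering, we group by id and add three columns by reference, that is, the columns are added to DT in place:

    • idEventNumber - we start at 1000 and keep incrementing up to a length of .N, a special variable that holds the number of observations in each group.
    • nextEvent - we take all values of direction except the first for this group, and append noFurtherEvent as the last value.
    • daysUntilNextEvent - we take the differences between all eventDate values for this group and append -1L for the last observation.

Note that the input order is preserved, while the days are computed in the correct (date-sorted) order.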
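For readers coming from Python, the same idea (compute in date order within each id, but leave the rows where they were) can be sketched with the standard library only. This is a minimal illustration, not a translation of data.table's internals; the function name add_next_event_columns and the tuple layout are assumptions made for the example.

```python
from datetime import datetime
from itertools import groupby

FMT = "%Y-%m-%d %H:%M:%S.%f"  # matches values such as 1980-07-01 00:00:00.000


def add_next_event_columns(rows):
    """rows: list of (id, event, eventDate, direction) tuples in input order.

    Returns a new list in the same order, each tuple extended with
    (idEventNumber, nextEvent, daysUntilNextEvent).
    """
    # Sort row positions by (id, eventDate); the rows themselves are not moved.
    order = sorted(range(len(rows)),
                   key=lambda i: (rows[i][0], datetime.strptime(rows[i][2], FMT)))
    out = [None] * len(rows)
    for _, grp in groupby(order, key=lambda i: rows[i][0]):
        idx = list(grp)
        for n, i in enumerate(idx):
            if n + 1 < len(idx):
                nxt = rows[idx[n + 1]]
                days = (datetime.strptime(nxt[2], FMT)
                        - datetime.strptime(rows[i][2], FMT)).days
                extra = (1000 + n, nxt[3], days)
            else:
                extra = (1000 + n, "noFurtherEvent", -1)
            out[i] = rows[i] + extra  # write back at the original position
    return out


rows = [
    ("id1", "apple", "1977-06-26 00:00:00.000", "positive"),
    ("id1", "apple", "1980-07-01 00:00:00.000", "positive"),
    ("id1", "candy", "1980-05-01 00:00:00.000", "negative"),
    ("id1", "apple", "1980-11-21 00:00:00.000", "positive"),
]
result = add_next_event_columns(rows)
for r in result:
    print(",".join(str(v) for v in r))
```

For the four id1 rows this gives event numbers 1000, 1002, 1001, 1003 in input order, because the numbering follows the date order rather than the file order.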


Here is the output:

#      id  event  eventDate  direction idEventNumber      nextEvent daysUntilNextEvent
#  1: id1  apple 1977-06-26 positive            1000     negative                 1040
#  2: id1  apple 1980-07-01 positive            1002     positive                  143
#  3: id1  candy 1980-05-01 negative            1001     positive                   61
#  4: id1  apple 1980-11-21 positive            1003 noFurtherEvent                 -1
#  5: id2  fruit 1980-06-26 positive            1000     negative                 3652
#  6: id2 cookie 1990-06-26 negative            1001     negative                  384
#  7: id2 cavity 1991-07-15 negative            1002     positive                    1
#  8: id2  apple 1991-07-16 positive            1003     positive                 2011
#  9: id2  apple 1997-01-16 positive            1004 noFurtherEvent                 -1
# 10: id3 cookie 2010-04-20 negative            1000 noFurtherEvent                 -1
# 11: id4 cookie 2010-04-20 negative            1000     negative                    0
# 12: id4 cookie 2010-04-20 negative            1001 noFurtherEvent                 -1

【Discussion】:

    【Solution 2】:

    You can do this in R with dplyr. If your data frame is called ana, you could try the following.

    library(dplyr)
    
    ana %>%
        mutate(group = cumsum(!duplicated(id)),
               eventDate = as.Date(eventDate, format = "%Y-%m-%d"))%>%
        arrange(id, eventDate) %>%
        group_by(group) %>%
        mutate(num = row_number() + 999,
              nextEvent = lead(direction, default = "noFurtherEvent"),
              daysUntilNextEvent = as.numeric(lead(eventDate) - eventDate),
              daysUntilNextEvent = replace(daysUntilNextEvent, is.na(daysUntilNextEvent), -1))
    
    #    id  event  eventDate  direction group  num      nextEvent daysUntilNextEvent
    #1  id1  apple 1977-06-26 positive       1 1000     negative                 1040
    #2  id1  candy 1980-05-01 negative       1 1001     positive                   61
    #3  id1  apple 1980-07-01 positive       1 1002     positive                  143
    #4  id1  apple 1980-11-21 positive       1 1003 noFurtherEvent                 -1
    #5  id2  fruit 1980-06-26 positive       2 1000     negative                 3652
    #6  id2 cookie 1990-06-26 negative       2 1001     negative                  384
    #7  id2 cavity 1991-07-15 negative       2 1002     positive                    1
    #8  id2  apple 1991-07-16 positive       2 1003     positive                 2011
    #9  id2  apple 1997-01-16 positive       2 1004 noFurtherEvent                 -1
    #10 id3 cookie 2010-04-20 negative       3 1000 noFurtherEvent                 -1
    #11 id4 cookie 2010-04-20 negative       4 1000     negative                    0
    #12 id4 cookie 2010-04-20 negative       4 1001 noFurtherEvent                 -1
    

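    【Discussion】:

    • I think this takes about 2 hours with 3 million rows, 4 obs
    • @ben_says Thanks for your comment. I realized I could put replace inside mutate. That should certainly speed things up. However, it won't be as fast as Arun's approach.
    【Solution 3】:

    Your output sample is inconsistent with the input sample: "id1,apple,1980-07-01" is "positive" in the input but "negative" in the output. With that in mind, here is an example in PowerShell: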

    $sInFile = "infile.csv"
    $sOutFile = "outfile.csv"
    
    $cInTable = Import-Csv -Path $sInFile `
        | Sort-Object -Property @("id", "eventDate")
    $cOutTable = $cInTable
    
    $oIdCounters = New-Object PSObject
    
    for ($i = 0; $i -lt $cInTable.Count; $i++) {
        if ([Int]$oIdCounters.($cInTable[$i].id) -lt 1000) {
            $oIdCounters | Add-Member -MemberType "NoteProperty" `
                -Name $cInTable[$i].id -Value 1000 
        } else {
            $oIdCounters.($cInTable[$i].id) += 1
        }
    
        $cOutTable[$i] | Add-Member -MemberType "NoteProperty" `
            -Name "idEventNumber" -Value $oIdCounters.($cInTable[$i].id)
    }
    
    for ($i = $cInTable.Count - 1; $i -ge 0; $i--) {
        if ($cOutTable[$i].idEventNumber -eq $oIdCounters.($cInTable[$i].id)) {
            $sNextEvent = "noFurtherEvent"
            $iDaysUntilNextEvent = -1
        } else {
            $sNextEvent = $cInTable[$i+1].direction
            $iDaysUntilNextEvent = ([DateTime]$cInTable[$i+1].eventDate -`
                                    [DateTime]$cInTable[$i].eventDate).Days
        }
    
        $cOutTable[$i] | Add-Member -MemberType "NoteProperty" `
            -Name "nextEvent" -Value $sNextEvent
        $cOutTable[$i] | Add-Member -MemberType "NoteProperty" `
            -Name "daysUntilNextEvent" -Value $iDaysUntilNextEvent
    }
    
    $cOutTable | Export-Csv -Path $sOutFile -NoTypeInformation
    

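    【Discussion】:

    • I may have misread you, but nextEvent should list the direction of the next row's event. The input is (id1,apple,1980-07-01 00:00:00.000,positive), so the output is the same plus the next row's data (id1,candy,1980-05-01 00:00:00.000,negative). That is why the output reads positive,positive and then positive,negative: the positive second row follows the positive first row, and the negative third row follows the positive second row.
    • Yes. $cInTable[$i+1].direction is the direction of the next row's event.
    • Ah, I see what you mean... I did not intend to mess up the date order... thanks. The source will always be sorted by date correctly. Much appreciated.
    【Solution 4】:

    I went in a slightly different direction. I store the last entry in a variable, modify it and pass it along when processing the next entry, and then catch the final entry after the ForEach loop.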

    $Results = @()
    $IDCount=1000
    $LastLine = $false
    Import-CSV $InPath | sort id,eventdate | ForEach{
        If($LastLine -and $LastLine.ID -eq $_.ID){
            Add-Member -InputObject $LastLine -NotePropertyName 'IDEventNumber' -NotePropertyValue $IDCount
            Add-Member -InputObject $LastLine -NotePropertyName 'nextEvent' -NotePropertyValue $_.Direction
            $Results += Add-Member -InputObject $LastLine -NotePropertyName 'daysUntilNextEvent' -NotePropertyValue ([datetime]$_.EventDate - [datetime]$LastLine.EventDate|Select -Expand Days) -PassThru
            $IDCount++
        }ElseIf($LastLine){
            # Number the last row of the previous id first, then reset the counter
            Add-Member -InputObject $LastLine -NotePropertyName 'IDEventNumber' -NotePropertyValue $IDCount
            Add-Member -InputObject $LastLine -NotePropertyName 'nextEvent' -NotePropertyValue 'NoFurtherEvent'
            $Results += Add-Member -InputObject $LastLine -NotePropertyName 'daysUntilNextEvent' -NotePropertyValue '-1' -PassThru
            $IDCount=1000}
        $LastLine = $_}
    Add-Member -InputObject $LastLine -NotePropertyName 'IDEventNumber' -NotePropertyValue $IDCount
    Add-Member -InputObject $LastLine -NotePropertyName 'nextEvent' -NotePropertyValue 'NoFurtherEvent'
    $Results += Add-Member -InputObject $LastLine -NotePropertyName 'daysUntilNextEvent' -NotePropertyValue '-1' -PassThru
    $Results | Export-CSV $OutPath -NoTypeInformation
    

    The output is:

    "id","event","eventDate","direction","IDEventNumber","nextEvent","daysUntilNextEvent"
    "id1","apple","1977-06-26 00:00:00.000","positive","1000","negative","1040"
    "id1","candy","1980-05-01 00:00:00.000","negative","1001","positive","61"
    "id1","apple","1980-07-01 00:00:00.000","positive","1002","positive","143"
    "id1","apple","1980-11-21 00:00:00.000","positive","1003","NoFurtherEvent","-1"
    "id2","fruit","1980-06-26 00:00:00.000","positive","1000","negative","3652"
    "id2","cookie","1990-06-26 00:00:00.000","negative","1001","negative","384"
    "id2","cavity","1991-07-15 00:00:00.000","negative","1002","positive","1"
    "id2","apple","1991-07-16 00:00:00.000","positive","1003","positive","2011"
    "id2","apple","1997-01-16 00:00:00.000","positive","1004","NoFurtherEvent","-1"
    "id3","cookie","2010-04-20 00:00:00.000","negative","1000","NoFurtherEvent","-1"
    "id4","cookie","2010-04-20 00:00:00.000","negative","1000","negative","0"
    "id4","cookie","2010-04-20 00:00:01.000","negative","1001","NoFurtherEvent","-1"
    
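The one-row look-back used here carries over to Python as a generator, which may suit a 3.5 million row file because only the previous row is held at a time. The sketch below is an illustration under stated assumptions: the name with_next_event and the tuple layout are hypothetical, and the rows are assumed to arrive already sorted by id and then eventDate.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"  # matches values such as 2010-04-20 00:00:01.000


def with_next_event(rows):
    """Yield each row plus (idEventNumber, nextEvent, daysUntilNextEvent).

    Assumes rows are (id, event, eventDate, direction) tuples that are
    already sorted by id, then eventDate.
    """
    prev = None
    count = 1000
    for row in rows:
        if prev is not None:
            if prev[0] == row[0]:
                # Same id: the current row is the "next event" of prev.
                days = (datetime.strptime(row[2], FMT)
                        - datetime.strptime(prev[2], FMT)).days
                yield prev + (count, row[3], days)
                count += 1
            else:
                # New id: prev was the last event of its id.
                yield prev + (count, "noFurtherEvent", -1)
                count = 1000  # reset only after prev has been numbered
        prev = row
    if prev is not None:
        # The final row of the input has no successor.
        yield prev + (count, "noFurtherEvent", -1)


sorted_rows = [
    ("id3", "cookie", "2010-04-20 00:00:00.000", "negative"),
    ("id4", "cookie", "2010-04-20 00:00:00.000", "negative"),
    ("id4", "cookie", "2010-04-20 00:00:01.000", "negative"),
]
streamed = list(with_next_event(sorted_rows))
for r in streamed:
    print(",".join(str(v) for v in r))
```

In real use the rows could come from csv.reader and each yielded tuple could go straight to csv.writer, so the whole result never needs to sit in memory.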

    【Discussion】:

      【Solution 5】:

      Here is my solution in Python:

      from datetime import datetime, timedelta
      
      _data = '''id1,apple,1977-06-26 00:00:00.000,positive
      id1,apple,1980-07-01 00:00:00.000,positive
      id1,candy,1980-05-01 00:00:00.000,negative
      id1,apple,1980-11-21 00:00:00.000,positive
      id2,fruit,1980-06-26 00:00:00.000,positive
      id2,cookie,1990-06-26 00:00:00.000,negative
      id2,cavity,1991-07-15 00:00:00.000,negative
      id2,apple,1991-07-16 00:00:00.000,positive
      id2,apple,1997-01-16 00:00:00.000,positive
      id3,cookie,2010-04-20 00:00:00.000,negative
      id4,cookie,2010-04-20 00:00:00.000,negative
      id4,cookie,2010-04-20 00:00:01.000,negative'''
      

      I first build a dictionary keyed by id, holding the list of items for each id:

      data = {}
      for line in _data.split('\n'):
          fields = line.split(',')
          data.setdefault(fields[0], []).append(fields[1:])
      

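      Then I iterate over this dict in sorted() order to preserve the order of the ids. For each id, I build a new list made up of pairs of rows or a single row. For each id, I initialize it_id to 1000 and increment it for every row printed for that id.

      Then I iterate over that list. Depending on whether we are working with a pair or a single row, I either compute the delta or not.

      for item in sorted(data):
          it_id = 1000
          # Slide a window of up to two rows over this id's rows.
          for sub in [data[item][i:i+2] for i in range(len(data[item]))]:
              if len(sub) == 2:
                  # A pair: next row's date minus this row's date.
                  delta = datetime.strptime(sub[1][1][:-4], '%Y-%m-%d %H:%M:%S') - datetime.strptime(sub[0][1][:-4], '%Y-%m-%d %H:%M:%S')
                  print('%s,%s,%d,%s,%d' % (item, ','.join(sub[0]), it_id, sub[1][2], delta.days))
              else:
                  # A single row: the last event for this id.
                  print('%s,%s,%d,%s,%d' % (item, ','.join(sub[0]), it_id, 'noFurtherEvent', -1))
              it_id += 1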
      

      Output:

      id1,apple,1977-06-26 00:00:00.000,positive,1000,positive,1101
      id1,apple,1980-07-01 00:00:00.000,positive,1001,negative,-61
      id1,candy,1980-05-01 00:00:00.000,negative,1002,positive,204
      id1,apple,1980-11-21 00:00:00.000,positive,1003,noFurtherEvent,-1
      id2,fruit,1980-06-26 00:00:00.000,positive,1000,negative,3652
      id2,cookie,1990-06-26 00:00:00.000,negative,1001,negative,384
      id2,cavity,1991-07-15 00:00:00.000,negative,1002,positive,1
      id2,apple,1991-07-16 00:00:00.000,positive,1003,positive,2011
      id2,apple,1997-01-16 00:00:00.000,positive,1004,noFurtherEvent,-1
      id3,cookie,2010-04-20 00:00:00.000,negative,1000,noFurtherEvent,-1
      id4,cookie,2010-04-20 00:00:00.000,negative,1000,negative,0
      id4,cookie,2010-04-20 00:00:01.000,negative,1001,noFurtherEvent,-1
      

      As another post suggested, your example output may be wrong as far as the deltas are concerned.
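      The -61 in the output above comes from taking the rows in file order, where the 1980-07-01 row precedes the 1980-05-01 row. A small sketch of the date arithmetic makes this visible; days_between is a hypothetical helper, and it parses the milliseconds with %f instead of slicing them off.

```python
from datetime import datetime


def days_between(earlier, later, fmt="%Y-%m-%d %H:%M:%S.%f"):
    """Whole days from `earlier` to `later`, using timedelta.days."""
    return (datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)).days


# File order: the 07-01 row is followed by the 05-01 row.
unsorted_delta = days_between("1980-07-01 00:00:00.000", "1980-05-01 00:00:00.000")
# Date order: 05-01 comes first.
sorted_delta = days_between("1980-05-01 00:00:00.000", "1980-07-01 00:00:00.000")
# One second apart counts as zero whole days.
one_second = days_between("2010-04-20 00:00:00.000", "2010-04-20 00:00:01.000")

print(unsorted_delta, sorted_delta, one_second)  # -61 61 0
```

      Sorting each id's rows by eventDate before pairing them would therefore give the non-negative deltas shown in the corrected output.csv.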

      【Discussion】:
