Modify fields of not unique lines in a text file (answers)

【Question title】: Modify fields of not unique lines in text file
【Posted】: 2014-10-31 00:15:52
【Question】:

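I'm writing a script that needs to take the duplicated lines from a single text file and change the value of the date field, but only the time part of it. The field separator is TAB, so ...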

# cat enviando4
1414743351      2014-11-01 09:00:00
1414743351      2014-10-31 09:15:51
1414743351      2014-10-30 23:00:00
1414743351      2014-10-31 09:15:51
1414743351      2014-10-30 23:00:00
1414743351      2014-10-31 10:25:00
1414743351      2014-10-31 09:15:51
1414743351      2014-11-01 10:25:00

I sort the lines by date:

/bin/sort enviando4 -k2 -t $'\t' -o enviando4

# cat enviando4
1414743351      2014-10-30 23:00:00
1414743351      2014-10-30 23:00:00
1414743351      2014-10-31 09:15:51
1414743351      2014-10-31 09:15:51
1414743351      2014-10-31 09:15:51
1414743351      2014-10-31 10:25:00
1414743351      2014-11-01 09:00:00
1414743351      2014-11-01 10:25:00

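Now I need to add at least 4 minutes (never subtract) to all but one of each set of duplicated dates, so that I end up with only unique dates. It would look like this: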

# cat enviando4
1414743351      2014-10-30 23:04:00 --> add 4
1414743351      2014-10-30 23:00:00 --> no change
1414743351      2014-10-31 09:19:51 --> add 4
1414743351      2014-10-31 09:23:51 --> add 8
1414743351      2014-10-31 09:15:51 --> no change
1414743351      2014-10-31 10:25:00 --> unique, no change 
1414743351      2014-11-01 09:00:00 --> unique, no change
1414743351      2014-11-01 10:25:00 --> unique, no change

I also need to verify that these changes have not produced new duplicate values. I'm stuck on this. Thanks.

【Question comments】:

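If you reversed the sort order, you would need to add the 4 minutes to the last duplicated line, which is easier to do in something like awk: keep one line in a buffer, check for a change of key, adjust the held line, ...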

Tags: bash sorting unique

【Solution 1】:

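Your task is not that difficult. Bash, together with the date utility (GNU date is assumed here), handles date manipulation well. What you need to do is sort the original list, then read each line of the sorted file, compare its date/time to the previous date/time and, using a counter, increase each duplicate time by an offset of counter * 4 min, then write the new date/time to your output file. There are many ways to handle the time adjustment. The easiest is to convert the date/time string to seconds since the epoch, add the offset to the duplicate time, and convert the result back to the desired date/time format.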

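The following example shows one way to do it. Several of the operations could be combined, but I kept the offset calculations separate to make them more readable. The script takes the input file as its first argument (I default it to dat/env4.dat for my testing; set it as you like). The script then sorts the input into a temp file, reads the temp file, applies the time adjustment to duplicates, writes the output to inputfile.out, and removes the temp file before exiting. Let me know if you have any questions: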

#!/bin/bash

ifn="${1:-dat/env4.dat}"            # set input filename (ifn) and validate

[ -r "$ifn" ] || {
    printf "\n  Error: input file not readable. Usage: %s [<filename> (dat/env4.dat)]\n\n" "${0//*\//}" >&2
    exit 1
}

## initialize variables
tfn="/tmp/${ifn//*\//}.tmp"         # set temp filename  (tfn)
ofn="${ifn}.out"                    # set output filename (ofn)
:> "$ofn"                           # truncate output file
pdate=0                             # initialize prior date
cnt=0                               # counter variable
tos=240                             # time offset in seconds (4 min.)
tse=0                               # time since epoch in seconds

sort "$ifn" > "$tfn"                # sort input file into temp file & validate

[ -r "$tfn" ] || {
    printf "\n  Error: sort failed to produce a tmp file or tmp file not readable\n\n" >&2
    exit 1
}

## read temp file into index/idate and add 4 min to each successive duplicate
while read -r index idate || [ -n "$idate" ]; do

    if [ "$pdate" = "$idate" ]; then
        tse=$(date -d "$idate" +%s) # get time since epoch for idate
        cnt=$((cnt+1))              # increase counter
        nos=$((cnt*tos))            # set new time offset (not Nitrous Oxide)
        ntm=$((tse+nos))            # set new time including offset
        # write new time to output
        printf "%s\t%s\n" "$index" "$(date -d "@${ntm}" +"%F %T" )" >> "$ofn"
    else
        cnt=0; nos=0                # reset counter and new time offset
        # write output unchanged
        printf "%s\t%s\n" "$index" "$idate" >> "$ofn"
    fi

    pdate="$idate"                  # save current date/time as prior date/time

done <"$tfn"

[ -r "$tfn" ] && rm "$tfn"          # remove temp file

Input file:

$ cat dat/env4.dat
1414743351      2014-11-01 09:00:00
1414743351      2014-10-31 09:15:51
1414743351      2014-10-30 23:00:00
1414743351      2014-10-31 09:15:51
1414743351      2014-10-30 23:00:00
1414743351      2014-10-31 10:25:00
1414743351      2014-10-31 09:15:51
1414743351      2014-11-01 10:25:00

Output file:

$ cat dat/env4.dat.out
1414743351      2014-10-30 23:00:00
1414743351      2014-10-30 23:04:00
1414743351      2014-10-31 09:15:51
1414743351      2014-10-31 09:19:51
1414743351      2014-10-31 09:23:51
1414743351      2014-10-31 10:25:00
1414743351      2014-11-01 09:00:00
1414743351      2014-11-01 10:25:00
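
The question also asks to verify that the adjusted times contain no new duplicates; an offset time could, for example, collide with a later time that was originally unique. A minimal sketch of such a check follows. It assumes the date/time is the second tab-separated field, and `has_dup_times` is a hypothetical helper name, not part of the script above.

```shell
#!/bin/bash
# has_dup_times FILE
# Succeeds (exit 0) and prints each repeated value if any value in the
# second tab-separated field of FILE occurs more than once.
# Fails (exit 1) and prints nothing if every value is unique.
has_dup_times() {
    local dups
    dups=$(cut -f2 -- "$1" | sort | uniq -d)   # uniq -d prints one copy of each repeated line
    [ -n "$dups" ] && printf '%s\n' "$dups"
}
```

It could be called as `has_dup_times dat/env4.dat.out && echo "duplicates remain" >&2`; if it reports anything, one option is to run the offset pass again on the output file.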

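Note: if you want to flip the duplicates so that the larger offset times appear first, you can do that as a separate step on the output file. Doing it inside the offset while loop would overcomplicate the logic for this question. If you do want to include the extra code in the offset while loop, the basic approach is to store the prior date and any matching dates in an array, then offset the array date/time values and write them out in reverse order. Unset the array each time a new date/time is encountered.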
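
The array approach described in the note could be sketched roughly as follows. This is an illustration only, assuming GNU date and already sorted, tab-separated input on stdin; `flush_group`, `offset_reversed` and the global array `group` are hypothetical names that do not appear in the script above.

```shell
#!/bin/bash
# Print the buffered lines of the current duplicate group, last entry first,
# then empty the buffer. "group" is a global array shared with offset_reversed.
flush_group() {
    local i
    for (( i=${#group[@]}-1; i>=0; i-- )); do
        printf '%s\n' "${group[i]}"
    done
    group=()
}

# Read sorted "index<TAB>date time" lines on stdin, add 4 min per successive
# duplicate, and write each group with the largest offset first.
offset_reversed() {
    local index idate tse ndt cnt=0 tos=240
    local pdate=$'\n'                      # sentinel that no input line can equal
    group=()
    while IFS=$'\t' read -r index idate || [ -n "$index" ]; do
        if [ "$idate" = "$pdate" ]; then
            cnt=$((cnt+1))                 # n-th duplicate gets n * 4 min
            tse=$(date -d "$idate" +%s)
            ndt=$(date -d "@$((tse + cnt*tos))" +'%F %T')
            group+=("$index"$'\t'"$ndt")
        else
            flush_group                    # new date/time: write out the previous group
            cnt=0
            group=("$index"$'\t'"$idate")
        fi
        pdate=$idate
    done
    flush_group                            # write out the final group
}
```

With the temp file from the script it would be used as `offset_reversed < "$tfn" > "$ofn"`. Buffering one group at a time keeps memory use small, at the cost of the extra flush after the loop.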

Addendum: including an e-mail and an adjustment field

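If you want to add an e-mail to the output, and then add a time adjustment between the date portion and the time portion of the new date field, you can do so relatively easily: add the e-mail at the beginning, split the new string returned by date into a date part and a time part, and insert 00:0n:00 between the two in the output. It makes no difference whether you use printf or echo. printf is more flexible, but echo sometimes has its advantages.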

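Note: in the code below I form 00:0n:00 (where n is 4, 8, etc.), which assumes there are only 2 duplicates. With 3 or more, the adjustment exceeds 8 minutes and you will have to adjust the logic to form 00:nn:00 (e.g. 12, 16, 20, ... for the 3rd, 4th, 5th, ... duplicate).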

Let me know if you have any further questions.

## beginning part of script unchanged
# tse=0                               # time since epoch in seconds
email="mi@email.com"                # email to output
adjtm=4                             # simple value to provide adjustment in 00:04:00, etc.

sort "$ifn" > "$tfn"                # sort input file into temp file & validate

[ -r "$tfn" ] || {
    printf "\n  Error: sort failed to produce a tmp file or tmp file not readable\n\n" >&2
    exit 1
}

## read temp file into index/idate and add 4 min to each successive duplicate
while read -r index idate || [ -n "$idate" ]; do

    if [ "$pdate" = "$idate" ]; then
        tse=$(date -d "$idate" +%s) # get time since epoch for idate
        cnt=$((cnt+1))              # increase counter
        adj=$((cnt*adjtm))          # compute 4, 8, ... for 00:0n:00 output
        nos=$((cnt*tos))            # set new time offset (not Nitrous Oxide)
        ntm=$((tse+nos))            # set new time including offset
        ndt="$(date -d "@${ntm}" +"%F %T" )"  # new date/time value
        nd1=${ndt% *}               # date portion (first field) of ndt
        nd2=${ndt#* }               # time portion (second field) of ndt
        ncmb="$nd1 00:0${adj}:00 $nd2" # new combined "date 00:0n:00 time" string
        # write new time to output
        printf "%s\t%s\t%s\n" "$email" "$index" "$ncmb" >> "$ofn"
    else
        cnt=0; nos=0                # reset counter and new time offset
        nd1=${idate% *}             # date portion (first field) of idate
        nd2=${idate#* }             # time portion (second field) of idate
        ncmb="$nd1 00:00:00 $nd2"   # new combined "date 00:00:00 time" string (no adj)
        # write output unchanged
        printf "%s\t%s\t%s\n" "$email" "$index" "$ncmb" >> "$ofn"
    fi

    pdate="$idate"                  # save current date as prior date

done <"$tfn"

[ -r "$tfn" ] && rm "$tfn"          # remove temp file

Output file (same input):

$ bash env4-2.sh
mi@email.com    1414743351      2014-10-30 00:00:00 23:00:00
mi@email.com    1414743351      2014-10-30 00:04:00 23:04:00
mi@email.com    1414743351      2014-10-31 00:00:00 09:15:51
mi@email.com    1414743351      2014-10-31 00:04:00 09:19:51
mi@email.com    1414743351      2014-10-31 00:08:00 09:23:51
mi@email.com    1414743351      2014-10-31 00:00:00 10:25:00
mi@email.com    1414743351      2014-11-01 00:00:00 09:00:00
mi@email.com    1414743351      2014-11-01 00:00:00 10:25:00
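
The addendum notes that the 00:0n:00 field only works while the adjustment stays below 10 minutes. One possible way around that, sketched here under the assumption that the adjustment is a plain non-negative integer number of minutes, is to let printf do the zero-padding. `fmt_adj` is a hypothetical helper name.

```shell
#!/bin/bash
# fmt_adj MINUTES
# Format an adjustment given in whole minutes as HH:MM:00,
# e.g. 4 -> 00:04:00 and 12 -> 00:12:00.
fmt_adj() {
    printf '%02d:%02d:00\n' "$(( $1 / 60 ))" "$(( $1 % 60 ))"
}
```

In the addendum script the combined string could then be built as `ncmb="$nd1 $(fmt_adj "$adj") $nd2"`, which also covers adjustments of an hour or more.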

【Comments】:

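Thanks for your script :-). It helped me a lot. I have been trying to adapt the script to my needs, but I ran into a problem with: printf "%s\t%s\n" "$index" "$(date -d "@${ntm}" +"%F %T" )" >> "$ofn" I use more than two fields. This is the output: mi@email.com 186808 2014-11-02 00:04:00 12:06:00 . The old date was 2014-11-02 12:06:00, and I get 00:04:00 and 12:06:00 instead of 2014-11-02 12:10:00. I can post my code if necessary.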
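No need to post it. All of the changes will be handled by read, followed by concatenation after the read. Before I write an addendum, please confirm that we basically need to disregard the 00:04:00 and treat 2014-11-02 12:06:00 as the date string to check for duplicates and to adjust within your full string mi@email.com 186808 2014-11-02 00:04:00 12:06:00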
Yes. My initial string is: pg 1358 mi@email.com 186808 2014-11-02 12:06:00 0 2 and it should end up as: pg 1358 mi@email.com 186808 2014-11-02 12:10:00 0 2 The line I use to write it back is: printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "$SISTEMA" "$MENSAXE_ID" "$EMAIL" "$NUMERO_DESTINATARIOS" "$(/bin/date -d "@${NTM}" +"%F %T" )" "$ESTADO" "$GRUPO" >> pendientes.sorted but I get: pg 1358 mi@email.com 186808 2014-11-02 00:04:00 12:06:00 0 2 Your script works really well. Thanks
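I think I've got it. Look at the output. (I adjusted both the 00:0n:00 and the actual time. If you don't need the actual time adjusted, just remove the 00:0n:00 part of the code :)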
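No, I don't need the actual time. My question is: if the date is 2014-11-02 23:58:00, then when the script adds those 4 minutes the date should become 2014-11-03 00:02:00. Will it roll over to 2014-11-03 anyway? I used to do /bin/date +"%Y-%m-%d %T" --date "now +$ADD_MINUTES mins" for this, but I can't get it to work because I can't put a variable in place of now. Thanks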
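
On the last question: because the script in the answer does its arithmetic in seconds since the epoch, the date part rolls over along with the time. The same idea can take the starting timestamp from a variable instead of now. A minimal sketch, assuming GNU date; `add_minutes` is a hypothetical helper name.

```shell
#!/bin/bash
# add_minutes "YYYY-MM-DD HH:MM:SS" N
# Print the timestamp N minutes after the given one, in the same format.
add_minutes() {
    local tse
    tse=$(date -d "$1" +%s) || return 1        # timestamp -> seconds since epoch
    date -d "@$(( tse + $2 * 60 ))" +'%F %T'   # add N minutes and convert back
}
```

For example, `ADD_MINUTES=4; add_minutes "2014-11-02 23:58:00" "$ADD_MINUTES"` prints 2014-11-03 00:02:00, as long as no daylight-saving change falls inside the interval. Going through epoch seconds also avoids appending something like "+4 mins" directly after a time of day, where GNU date may read the signed number as a time zone offset.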