Manipulate nth column of a .csv file with awk or sed: answers

【Question title】: Manipulate nth column of a .csv file with awk or sed
【Posted】: 2020-07-17 17:24:17
【Question description】:

I have a .csv file with 6 columns:

source  raised_time cleared_time    cause   pcause  sproblem
source1 rtime1  ctime1  cause1  communicationsSubsystemFailure#model.route.1.2  oMCIFailure#model.route.1.2
source2 rtime2  ctime2  cause2  equipmentMalfunction#model.route.1.2    deviceNotActive#model.route.1.2

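I want to manipulate columns 5 and 6 of the .csv file using the following rules:

Convert the first letter of columns 5 and 6 to uppercase
Keep the string up to the "#" character and delete the rest (the # and everything after it)
Insert a space between a lowercase letter and an uppercase letter

So the desired format is: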

source  raised_time cleared_time    cause   pcause  sproblem
source1 rtime1  ctime1  cause1  Communication Subsystem Failure OMCI Failure
source2 rtime2  ctime2  cause2  Equipment Malfunction   Device Not Active

How can I do this with awk or sed?

I tried the following command to convert the first letter to uppercase:

awk 'BEGIN {$5 = toupper(substr($5,1,1))
    substr($5, 2)}1' input_file

But it doesn't work.
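The attempt above prints the input unchanged for two reasons. A BEGIN block runs before any input line has been read, so $5 is still empty there and whatever is assigned to it is discarded once the first record is read. In addition, the line break after toupper(...) ends the statement, so the second substr() call is never concatenated onto the first. A minimal sketch of just the first-letter rule, assuming tab-separated input with a header on line 1 (the function name cap_first_5 is hypothetical):

```shell
# Hypothetical helper: capitalize the first letter of field 5 on data rows.
# Assumes tab-separated input whose first line is a header.
cap_first_5() {
    awk 'BEGIN { FS = OFS = "\t" }   # only set separators here; no record exists yet
         NR > 1 { $5 = toupper(substr($5, 1, 1)) substr($5, 2) }   # one statement, one line
         { print }' "$@"
}

printf 'h1\th2\th3\th4\th5\nv1\tv2\tv3\tv4\tequipmentMalfunction\n' | cap_first_5
```

The field assignment lives in a per-record rule rather than in BEGIN, and both halves of the concatenation sit on the same line.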

【Question comments】:

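Your description would produce the output O M C I Failure. How do you want to handle those (apparently) unwanted spaces?
What did you search for, and what did you find? What did you try, and how did it fail?
@tripleee, I tried the following command to convert the first letter to uppercase: "awk 'BEGIN {$5 = toupper(substr($5,1,1)) substr($5, 2)}1' input_file" but it doesn't work.
@WilliamPursell, yes, you are right. Maybe I should edit the rule like this: insert a space between a lowercase letter and an uppercase letter.
If not now, then later you will regret breaking data that belongs in one column into an unknowable 0-n number of space-separated pieces. I'd suggest converting to Equipment_Malfunction (with underscores, not spaces). Store your data in that format, and if you have picky users who don't understand underscores, then sed 's/_/ /g' file > report_version.txt will give them what they want in their reports, and you will still have a regular data set, i.e. $1,$2,$3,$4,$5. Good luck.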
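The underscore suggestion in the last comment could look roughly like the following. This is only a sketch under the assumption that the input is tab-separated with a header row; the function name underscore_fields is hypothetical, and the word-splitting loop is borrowed from the awk answer in this thread with "_" as the joiner instead of a space.

```shell
# Hypothetical helper: rewrite fields 5..NF as Underscore_Separated_Words.
# Assumes tab-separated input whose first line is a header.
underscore_fields() {
    awk 'BEGIN { FS = OFS = "\t" }
         NR > 1 {
             for (i = 5; i <= NF; i++) {
                 s = $i
                 sub(/#.*/, "", s)               # drop the "#" and everything after it
                 out = ""
                 # each match is one capitalized word, e.g. "Malfunction"
                 while (match(s, /[[:upper:]][[:lower:]]+/)) {
                     out = out substr(s, 1, RSTART - 1) "_" substr(s, RSTART, RLENGTH)
                     s = substr(s, RSTART + RLENGTH)
                 }
                 out = out s
                 $i = toupper(substr(out, 1, 1)) substr(out, 2)
             }
         }
         { print }' "$@"
}

printf 'a\tb\tc\td\te\tf\nsource2\trtime2\tctime2\tcause2\tequipmentMalfunction#model.route.1.2\tdeviceNotActive#model.route.1.2\n' | underscore_fields
```

With this format every value stays a single whitespace-free token, and sed 's/_/ /g' can still produce the human-readable version for reports.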

Tags: awk sed data-manipulation

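【Solution 1】:

You say your input is CSV (comma-separated values), but there are no commas in it, and there is apparently random spacing between the fields, so I'm assuming you actually mean TSV (tab-separated values). If so, this should do what you want: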

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR > 1 {
    for (i=5; i<=NF; i++) {
        new = ""
        old = $i
        sub(/#.*/,"",old)
        while ( match(old,/[[:upper:]][[:lower:]]+/) ) {
            new = new substr(old,1,RSTART-1) " " substr(old,RSTART,RLENGTH)
            old = substr(old,RSTART+RLENGTH)
        }
        new = new old
        $i = toupper(substr(new,1,1)) substr(new,2)
    }
}
{ print }

$ awk -f tst.awk file
source  raised_time     cleared_time    cause   pcause  sproblem
source1 rtime1  ctime1  cause1  Communications Subsystem Failure        OMCI Failure
source2 rtime2  ctime2  cause2  Equipment Malfunction   Device Not Active
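To see why OMCI stays in one piece instead of becoming O M C I, it may help to run the match() loop from the script above on single words. The regexp only matches an uppercase letter followed by at least one lowercase letter, so a run of capitals is never split internally. A small sketch (the function name split_words is hypothetical; it reads one word per line):

```shell
# Hypothetical helper: apply the same match() loop to each input line.
split_words() {
    awk '{
        old = $0; new = ""
        # RSTART/RLENGTH describe the leftmost capitalized word still in "old"
        while (match(old, /[[:upper:]][[:lower:]]+/)) {
            new = new substr(old, 1, RSTART - 1) " " substr(old, RSTART, RLENGTH)
            old = substr(old, RSTART + RLENGTH)
        }
        print new old    # whatever is left over has no capitalized word in it
    }'
}

printf 'oMCIFailure\ndeviceNotActive\n' | split_words
# prints:
# oMCI Failure
# device Not Active
```

For oMCIFailure the first match is "Failure" (RSTART=5, RLENGTH=7), so "oMCI" is copied through untouched and only one space is inserted.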

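【Comments】:

Thank you very much. This works well and is very simple.

【Solution 2】:

A GNU sed implementation, assuming the input file format is TSV (tab-separated values):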

sed -E '1! {
s/\t/\n/4
h
s/[^\n]*//
s/#[^\t]*//g
s/\B[[:upper:]][[:lower:]]/ &/g
s/\b[[:lower:]]/\U&/g
H
g
s/\n.*\n/\t/
}' file.tsv

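If the fields are separated by , then just replace \t with , in the script.
If the fields are instead delimited by runs of whitespace (a transition from non-whitespace to whitespace), put s/^\s+//; s/\s+$//; s/\s+/\t/g at the beginning of the sed expression.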
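The whitespace-normalizing prefix relies on GNU sed extensions such as \s. As a possible alternative for that one step, awk's default field splitting already strips leading and trailing blanks and treats any run of blanks as one separator. A minimal sketch, assuming no field legitimately contains blanks (the function name to_tsv is hypothetical):

```shell
# Hypothetical helper: collapse runs of blanks into single tabs.
to_tsv() {
    # Assigning to $1 makes awk rebuild the record using OFS (a tab here).
    awk -v OFS='\t' '{ $1 = $1; print }' "$@"
}

printf '  source1   rtime1  ctime1 \n' | to_tsv
```

Its output can then be piped into either the sed script or the awk script from this thread, both of which expect tab-separated fields.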

【Comments】:

Thanks. This works well too, but I think I'll use the awk approach because it's easier to understand.