How to process a file with awk: answers

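【Question title】: How to treat a file use awk
【Posted】: 2017-04-08 00:19:16
【Question description】:

I want to read a file with awk, but I'm stuck on the fourth field: it gets split automatically at the commas inside it.

Data (test.txt):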

"A","B","ls","This,is,the,test"
"k","O","mv","This,is,the,2nd test"
"C","J","cd","This,is,the,3rd test"

cat test.txt | awk -F , '{ OFS="|" ; print $2, $3, $4 }'

Output:

"B"|"ls"|"This
"O"|"mv"|"This
"J"|"cd"|"This

But the output should look like this:

"B"|"ls"|"This,is,the,test"
"O"|"mv"|"This,is,the,2nd test"
"J"|"cd"|"This,is,the,3rd test"

Any ideas?

【Question discussion】:

awk does not understand quoting. You need to use a different tool, or write your own custom splitting function.

Tags: shell awk
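
To make the "custom splitting function" idea concrete, here is a minimal sketch in Python. The name `split_quoted` is made up for this illustration, and it assumes the simplest case only: quotes are always balanced and never escaped inside a field.

```python
def split_quoted(line, sep=",", quote='"'):
    """Split line on sep, but ignore separators that sit inside quotes."""
    fields, current, in_quotes = [], [], False
    for ch in line:
        if ch == quote:
            in_quotes = not in_quotes      # toggle state, keep the quote itself
            current.append(ch)
        elif ch == sep and not in_quotes:
            fields.append("".join(current))  # separator outside quotes ends a field
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))          # last field has no trailing separator
    return fields


line = '"A","B","ls","This,is,the,test"'
print("|".join(split_quoted(line)[1:4]))
# prints: "B"|"ls"|"This,is,the,test"
```

The slice `[1:4]` picks fields 2 to 4, because Python lists are indexed from 0.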

【Solution 1】:

With awk, you can also use:

awk -F'\",\"' 'BEGIN{OFS="\"|\""}{print "\""$2,$3,$4}' filename

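Note: this only works if the sequence "," never occurs inside a field value, because that sequence is what is being used as the field separator.

Output: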

"B"|"ls"|"This,is,the,test"
"O"|"mv"|"This,is,the,2nd test"
"J"|"cd"|"This,is,the,3rd test"

Or, slightly better:

awk -F'^\"|\",\"|\"$' 'BEGIN{OFS="\"|\""}{print "\""$3,$4,$5"\""}' filename
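
For readers who want to see what that separator regex does to a line, the same split can be sketched in Python. This is only an illustration of the technique, not part of the answer; `fields_by_separator` is a made-up name.

```python
import re

# Same idea as -F'^\"|\",\"|\"$': a leading quote, a "," sequence,
# or a trailing quote each count as a field separator.
SEPARATOR = re.compile(r'^"|","|"$')


def fields_by_separator(line):
    return SEPARATOR.split(line)


parts = fields_by_separator('"A","B","ls","This,is,the,test"')
print(parts)
# prints: ['', 'A', 'B', 'ls', 'This,is,the,test', '']

# awk's $3, $4, $5 are parts[2], parts[3], parts[4] here (0-based indexing)
print("|".join('"%s"' % p for p in parts[2:5]))
# prints: "B"|"ls"|"This,is,the,test"
```

The empty first and last items come from the leading and trailing quotes, which is why the awk command prints $3, $4 and $5 rather than $2, $3 and $4.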

【Discussion】:

【Solution 2】:

With GNU awk, using FPAT:

$ awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS='|' '{print $2,$3,$4}' file
"B"|"ls"|"This,is,the,test"
"O"|"mv"|"This,is,the,2nd test"
"J"|"cd"|"This,is,the,3rd test"

See http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content

With other awks:
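
The idea behind FPAT, describing what a field looks like instead of what separates fields, can be sketched in Python as well. This is an illustration under one stated assumption: Python's `re` tries alternatives from left to right rather than taking the longest match, so the quoted alternative is listed first here. `fields_by_content` is a made-up name.

```python
import re

# A field is either a quoted string (which may contain commas)
# or a run of characters that are not commas.
FIELD = re.compile(r'"[^"]+"|[^,]+')


def fields_by_content(line):
    return FIELD.findall(line)


fields = fields_by_content('"A","B","ls","This,is,the,test"')
print("|".join(fields[1:4]))
# prints: "B"|"ls"|"This,is,the,test"
```

Like the `+` in the FPAT above, this pattern requires at least one character per field, so empty fields are skipped rather than returned.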

$ cat tst.awk
BEGIN { OFS="|" }
{
    nf=0
    delete f
    while ( match($0,/([^,]+)|(\"[^\"]+\")/) ) {
        f[++nf] = substr($0,RSTART,RLENGTH)
        $0 = substr($0,RSTART+RLENGTH)
    }
    print f[2], f[3], f[4]
}

$ awk -f tst.awk file
"B"|"ls"|"This,is,the,test"
"O"|"mv"|"This,is,the,2nd test"
"J"|"cd"|"This,is,the,3rd test"

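【Discussion】:

Thanks, dear... (+1) I'll delete my answer, since readers should pick the best portable solution... thanks for teaching me something new... happy coding.
Ed Morton, suppose I'm trying to process a line like "A","B","ls",,"This,is,the,test" with the method above; then I can't print $4. Any ideas?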
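
The empty-field case in the last comment is one that a real CSV parser handles without special treatment. The following sketch does not fix the awk script; it only shows, using Python's standard `csv` module, what such a line parses to. `parse_line` is a made-up helper name.

```python
import csv


def parse_line(line):
    # csv.reader accepts any iterable of lines; a one-item list is enough here
    return next(csv.reader([line]))


row = parse_line('"A","B","ls",,"This,is,the,test"')
print(row)
# prints: ['A', 'B', 'ls', '', 'This,is,the,test']

# The fourth field (index 3) is present, but empty
print(repr(row[3]))
# prints: ''
```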

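【Solution 3】:

In awk:

awk -F'"' '{for(i=4;i<=8;i+=2) {if(i==4){s="\""$i"\""}else{s = s "|\"" $i"\""}}; print s}' test.txt

Explanation

-F'"' sets the field separator to the double-quote character, so the quoted values end up in the even-numbered fields and the commas between them in the odd-numbered ones.

The awk script, explained: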

{
## use for-loop to go over fields
## skips the comma field (i.e. increment by +2)
## OP wanted to start at field 2, this means the 4th term
## OP wanted to end at field 4, this means the 8th term
for(i=4;i<=8;i+=2) {

    if(i==4){
        ## initialization
        ## use variable s to hold output (i.e. quoted first field $i)
        s="\"" $i "\""
    } else {
        ## for rest of field $i,
        ## prepend '|' and add quotes around $i
        s = s "|\"" $i "\""
    }
};

## print output
print s 
}
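
The index arithmetic in the loop above may be easier to follow in a short Python sketch of the same technique. `quoted_fields` and its parameters are made up for this illustration, and it assumes every field in the line is quoted.

```python
def quoted_fields(line, first=2, last=4):
    """Return quoted values number first..last, joined with '|'."""
    parts = line.split('"')
    # Splitting on the quote gives: '', A, ',', B, ',', ls, ',', <4th value>, ''
    # so value number n sits at index 2*n - 1 (awk's 1-based field 2*n).
    values = [parts[2 * n - 1] for n in range(first, last + 1)]
    return "|".join('"%s"' % v for v in values)


print(quoted_fields('"A","B","ls","This,is,the,test"'))
# prints: "B"|"ls"|"This,is,the,test"
```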

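【Discussion】:

The loop should start at 4 so that it skips the first field, as the OP indicates in his question.
@Birei The OP said his code got stuck on field 4. I tried to solve that by using " (not ,) as the field separator.
I mean that your command prints the first field, whereas in the example he seems to want to skip it, regardless of the comma problem.
@csiu Thanks for your reply, but I only want to extract fields 2, 3 and 4 from the file, whereas the code above returns all the fields in the file.
@akhileshchand -- oops, I missed that. I've changed the for-loop condition... how about now?

【Solution 4】:

I'm not fond of awk for this kind of task. My suggestion is to use a CSV parser; Python, for example, has a built-in module that handles this. You can use it like this: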

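import csv
import sys

# newline='' lets the csv module handle line endings itself
with open(sys.argv[1], 'r', newline='') as csvfile:
    csvreader = csv.reader(csvfile)
    # quote every field, join with '|', and end rows with '\n' (the default is '\r\n')
    csvwriter = csv.writer(sys.stdout, delimiter='|', quoting=csv.QUOTE_ALL,
                           lineterminator='\n')
    for row in csvreader:
        # skip the first column, keep the rest
        csvwriter.writerow(row[1:])

Then run it like this:

python3 script.py infile

This prints the following to stdout: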

"B"|"ls"|"This,is,the,test"
"O"|"mv"|"This,is,the,2nd test"
"J"|"cd"|"This,is,the,3rd test"

【Discussion】:

【Solution 5】:

awk '{sub(/^..../,""); gsub(/","/,"\042""|""\042")}1' file

"B"|"ls"|"This,is,the,test"
"O"|"mv"|"This,is,the,2nd test"
"J"|"cd"|"This,is,the,3rd test"

【Discussion】:
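
The one-liner in this solution does two things: it deletes the first four characters of each line, then replaces every "," sequence with "|". A Python sketch of the same two steps shows the assumption it depends on, namely that the first field is exactly one character long. `drop_first_field` is a made-up name.

```python
def drop_first_field(line):
    rest = line[4:]                    # like sub(/^..../, ""): drop the first 4 characters
    return rest.replace('","', '"|"')  # like gsub(/","/, "\"|\"")


print(drop_first_field('"A","B","ls","This,is,the,test"'))
# prints: "B"|"ls"|"This,is,the,test"

# With a longer first field, the fixed-width cut lands in the wrong place
print(drop_first_field('"AB","C"'))
# prints: ,"C"
```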