【Posted】: 2014-11-20 11:11:35
【Question】:
I have a huge log file with more than 100M lines. It has 19 columns:
time | date | host | user | domain | category | source | port | URL | etc
Example:
time date host user domain category source port URL etc
2:10:21 18.11.2014 192.168.56.101 %username1% %domainname% "many words" stackoverflow.com "80" http://stackoverflow.com/
2:10:22 18.11.2014 192.168.56.101 %username2% %domainname% "done" stackoverflow.com "80" http://stackoverflow.com/
2:10:23 18.11.2014 192.168.56.101 %username3% %domainname% "denied site" stackoverflow.com "80" http://stackoverflow.com/
2:10:24 18.11.2014 192.168.56.101 %username4% %domainname% "suspicious" stackoverflow.com "80" http://stackoverflow.com/
2:10:25 18.11.2014 192.168.56.101 %username5% %domainname% "uncategorized" stackoverflow.com "80" http://stackoverflow.com/
2:10:26 18.11.2014 192.168.56.101 %username6% %domainname% "denied site" stackoverflow.com "80" http://stackoverflow.com/
2:10:27 18.11.2014 192.168.56.101 %username7% %domainname% "many words" stackoverflow.com "80" http://stackoverflow.com/
When I try to extract a single column, quoted values that contain spaces get cut apart:
user@stand-01:~/folder$cat file |awk '{FS=" ";print$6}'
category
"many
"done"
"denied
"suspicious"
"uncategorized"
"denied
"many
So when I print column 7, it contains data that belongs to another column:
user@stand-01:~/folder$cat file |awk '{FS=" ";print$7}'
source
words"
stackoverflow.com
site"
stackoverflow.com
stackoverflow.com
site"
words"
How can I use a space as the delimiter without splitting the text inside quotes?
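One possible approach, sketched here under assumptions rather than offered as a definitive answer: instead of relying on awk's field splitting, walk each line with `match()` and take either a complete `"..."` run or a run of characters that are neither blanks nor quotes as one token. The function name `quoted_col` is hypothetical and exists only for illustration. The sketch assumes quotes are balanced and never escaped inside a value.

```shell
# quoted_col N: print column N from stdin, keeping "quoted text" together.
# quoted_col is a hypothetical helper name, not an awk built-in.
quoted_col() {
    awk -v col="$1" '
    {
        n = 0
        rest = $0
        # Each token is either a full "..." run or a run of characters
        # that are not a space, a tab, or a double quote.
        while (match(rest, /"[^"]*"|[^ \t"]+/)) {
            n++
            if (n == col) {
                print substr(rest, RSTART, RLENGTH)
                next
            }
            rest = substr(rest, RSTART + RLENGTH)
        }
        print ""    # the line has fewer than col tokens
    }'
}
```

It would be called as `quoted_col 6 < file`. The quotes stay part of the printed value. If GNU awk is available, its `FPAT` variable describes fields by content instead of by separator and can express the same idea more compactly, for example `gawk -v FPAT='[^ ]+|"[^"]*"' '{print $6}' file`.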
【Comments】:
-
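Rather than hunting for a complicated regular expression, consider changing how this file is written so that it is comma-separated (CSV), tab-separated, or similar; that is, separated by a character that never occurs inside a field. Otherwise it will most likely cause you more problems in the future.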
-
Do you mean this?
awk -v FS="\"" '{print $2}' file
-
Your file is tab-delimited, not space-delimited. Check it with this command:
head -1 logFile | cat -vte
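
The two suggestions in the comments above can be sketched as small shell helpers. The function names are hypothetical and only illustrate the idea. Whether the real log is tab-delimited is an assumption that the first helper is meant to check: `cat -vte` renders each tab as `^I` and marks each line end with `$`.

```shell
# Hypothetical helper names, for illustration only.

# Show the first line of a file with tabs rendered as ^I and line ends as $.
show_delims() {
    head -n 1 "$1" | cat -vte
}

# If the file turns out to be tab-delimited, split on tabs only,
# so spaces inside a quoted value no longer break the column.
tab_col() {
    awk -F '\t' -v col="$1" '{ print $col }'
}

# The quote-as-separator idea: with FS set to a double quote,
# field 2 is the text inside the first pair of quotes on the line.
first_quoted() {
    awk -F '"' '{ print $2 }'
}
```

For example, `show_delims file` inspects the header, `tab_col 6 < file` prints the category column of a tab-delimited file, and `first_quoted < file` prints the first quoted value without its quotes. A line that contains no quotes at all, such as the header, yields an empty line from `first_quoted`.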