Answers: Regarding UNIX Grep Command

【Question title】: Regarding UNIX Grep Command
【Posted】: 2010-02-22 08:32:24
【Question】:

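I need to write a shell script that picks up all the files (not directories) in the /exp/files directory. For each file in the directory, I want to find out whether the last line of the file has been received. The last line of the file is a trailer record. The third field of the last line is the number of data records, i.e. 2315 (the total number of lines in the file minus 2 for the header and trailer). In my UNIX shell script, I want to check whether the last line is a trailer record by checking for T, and I want to check whether the number of lines in the file equals (2315+2). If that succeeds, I want to move the file to a different directory, /exp/ready.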

tail -1 test.csv 
T,Test.csv,2315,80045.96

Also, in the input file, the fields of the trailer record may or may not be enclosed in double quotes:

"T","Test.csv","2315","80045.96"
"T", Test.csv, 2212,"80045.96"
T,Test.csv,2315,80045.96

【Discussion】:

OK, problem solved. See my original post stackoverflow.com/questions/2309673/regarding-unix-grep-command/… and look under Update for what I had to do to fix it and why.

Tags: shell awk unix

【Solution 1】:

You can test whether the trailer line is present with the following:

tail -1 ${filename} | egrep '^T,|^"T",' >/dev/null 2>&1
rc=$?

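At this point, $rc will be 0 if the line starts with T, or "T", (assuming that is enough to identify a trailer record).

Once that is established, you can get the actual line count with: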

lc=$(cat ${filename} | wc -l)

You can get the expected line count with:

elc=$(tail -1 ${filename} | awk -F, '{sub(/^"/,"",$3);print 2+$3}')

Then compare the two.
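
As a small sketch of that comparison, the three steps can be wrapped in one function. The name `trailer_ok` is made up for illustration, and stripping every quote and space from the third field is an assumption that goes slightly beyond the `sub()` call shown above, so that a field such as ` "2212"` would also be handled.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer): returns 0 when the
# last line is a trailer and its record count matches the file's line count.
trailer_ok() {
    local file=$1 last lc elc
    last=$(tail -1 "$file")
    # The trailer must start with T, or "T",
    printf '%s\n' "$last" | grep -Eq '^("T"|T),' || return 1
    # Actual number of lines in the file.
    lc=$(wc -l < "$file")
    # Expected lines: record count from field 3 plus header and trailer.
    elc=$(printf '%s\n' "$last" | awk -F, '{gsub(/[" ]/,"",$3); print 2+$3}')
    [ "$lc" -eq "$elc" ]
}
```

It could then be used as `if trailer_ok "$filename"; then ...; fi`, since the function's exit status carries the result.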

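Putting all of that together, this would be a good start. It outputs the file itself (my test files are num[1-9].tst) along with a message indicating whether the file is okay or why it isn't.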

#!/bin/bash
cd /exp/files
for fspec in *.tst ; do
    if [[ -f ${fspec} ]] ; then
        cat ${fspec} | sed 's/^/   /'
        tail -1 ${fspec} | egrep '^T,|^"T",' >/dev/null 2>&1
        rc=$?
        if [[ ${rc} -eq 0 ]] ; then
            lc=$(cat ${fspec} | wc -l)
            elc=$(tail -1 ${fspec} | awk -F, '{sub(/^"/,"",$3);print 2+$3}')
            if [[ ${lc} -eq ${elc} ]] ; then
                echo '***' File ${fspec} is done and dusted.
            else
                echo '***' File ${fspec} line count mismatch: ${lc}/${elc}.
            fi
        else
            echo '***' File ${fspec} has no valid trailer.
        fi
    else
        ls -ald ${fspec} | sed 's/^/   /'
        echo '***' File ${fspec} is not a regular file.
    fi
done

A sample run, showing the test files I used:

   H,Test.csv,other rubbish goes here
   this file does not have a trailer
*** File num1.tst has no valid trailer.
   H,Test.csv,other rubbish goes here
   this file does have a trailer with all quotes and correct count
   "T","Test.csv","1","80045.96"
*** File num2.tst is done and dusted.
   H,Test.csv,other rubbish goes here
   this file does have a trailer with all quotes but bad count
   "T","Test.csv","9","80045.96"
*** File num3.tst line count mismatch: 3/11.
   H,Test.csv,other rubbish goes here
   this file does have a trailer with all quotes except T, and correct count
   T,"Test.csv","1","80045.96"
*** File num4.tst is done and dusted.
   H,Test.csv,other rubbish goes here
   this file does have a trailer with no quotes on T or count and correct count
   T,"Test.csv",1,"80045.96"
*** File num5.tst is done and dusted.
   H,Test.csv,other rubbish goes here
   this file does have a traier with quotes on T only, and correct count
   "T",Test.csv,1,80045.96
*** File num6.tst is done and dusted.
   drwxr-xr-x+ 2 pax None 0 Feb 23 09:55 num7.tst
*** File num7.tst is not a regular file.
   H,Test.csv,other rubbish goes here
   this file does have a trailer with all quotes except the bad count
   "T","Test.csv",8,"80045.96"
*** File num8.tst line count mismatch: 3/10.
   H,Test.csv,other rubbish goes here
   this file does have a trailer with no quotes and a bad count
   T,Test.csv,7,80045.96
*** File num9.tst line count mismatch: 3/9.

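【Discussion】:

My trailer record can be either of the following: "T", Test.csv, 2212,"80045.96" or T,Test.csv,"2212",80045.96. Does elc=$(tail -1 ${fspec} | awk -F, '{print 2+$3}') handle the record count 2212 whether or not it appears in double quotes? If not, how can I modify it?
Good catch, @arav. I really should test my code before inflicting it on an unsuspecting public :-) The new code should fix that (I've also added some unit tests, which will hopefully give you some measure of confidence as well).
@paxdiablo: you can make your code more efficient by getting rid of all the unneeded pipes. For example, cat ${fspec} | sed 's/^/ /' can be simplified to sed 's/^/ /' "$fspec", and lc=$(cat ${fspec} | wc -l) to lc=$(wc -l < "$fspec"). Also, it is very important to always quote variables when dealing with strings that may contain spaces.
I got into the habit long ago of starting pipelines with cat, simply because it looks "cleaner" to me (every other stage of the pipeline is a pure stdin/stdout process), and habits harden with age. I realise it's less efficient, but I rarely care with shell scripts: the cost of the extra stdout-to-stdin hop is usually tiny compared with the actual processing. But point taken. I also actively hunt down and kill people who put spaces in file names :-) You won't see any of those abominations on systems I manage.
Thanks a lot. What does the following line do? ls -ald ${fspec} | sed 's/^/ /' I will try this program and let you know.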

【Solution 2】:

If you want to move files once they have been written and closed, you should consider using something like inotify, incron, FAM, gamin, etc.
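
A rough sketch of the inotify route, assuming the `inotifywait` tool from inotify-tools is installed (Linux only). The function name `report_closed_files` is hypothetical, and it only prints each path; the trailer check and the `mv` would go where the `echo` is.

```shell
#!/bin/bash
# Hypothetical sketch: read one path per line on stdin and report the ones
# that are regular files. Meant to sit at the end of an inotifywait pipeline.
report_closed_files() {
    local path
    while IFS= read -r path; do
        if [ -f "$path" ]; then
            echo "closed after write: $path"
        fi
    done
}

# Usage (not run here; requires inotifywait):
#   -m keeps watching, -e close_write fires when a file that was open for
#   writing is closed, and --format '%w%f' prints directory plus file name.
#   inotifywait -m -q -e close_write --format '%w%f' /exp/files | report_closed_files
```

Reacting to `close_write` avoids polling the directory, although the trailer check is probably still worth keeping in case a writer closes a file early.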

【Discussion】:

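【Solution 3】:

This code does all of the logic in a single call to awk, which makes it very efficient. It also does NOT hard-code the example value 2315; instead it uses the value contained in the trailer line, since I believe that was your intent.

Remember to remove the echo once you are happy with the results.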

#!/bin/bash

for file in /exp/files/*; do
  if [[ -f "$file" ]]; then
    if nawk -F, '{v0=$0;v1=$1;v3=$3}END{gsub(/"/,"",v1);gsub(/"/,"",v3);exit !(v1 == "T" && NR == v3+2)}' "$file"; then
      echo mv "$file" /exp/ready
    fi
  fi
done

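Update

I had to add {v0=$0;v1=$1;v3=$3} because the SunOS awk implementation does not let END{} access the field variables ($0, $1, $2, etc.); you have to save them in user-defined variables if you want to process them inside END{}. See the last row of the first table in This awk feature comparison link.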
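
To illustrate the point, here is a minimal sketch: the fields of the most recent line are copied to ordinary variables on every record, so `END` only reads those variables and never touches `$1` or `$3`. The function name `trailer_matches` is hypothetical, and plain `awk` is used instead of `nawk` on the assumption that either is available.

```shell
#!/bin/bash
# Hypothetical helper: exit status 0 when the last line is a T trailer whose
# third field equals the number of lines minus header and trailer.
trailer_matches() {
    awk -F, '
        { t = $1; n = $3 }      # remember the fields of the most recent line
        END {
            gsub(/"/, "", t)    # strip quotes from the saved copies
            gsub(/"/, "", n)
            exit !(t == "T" && NR == n + 2)
        }' "$1"
}
```

Because `exit` sets awk's exit status, the function can be used directly as an `if` condition in bash.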

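【Discussion】:

What does gsub do? Does the exit in awk break out of the for loop?
gsub() strips the quotes, if there are any. exit is actually part of the awk command, not bash. So no, it doesn't break out of the for loop; it sets awk's return value as seen by bash: "0" if it matched and "1" if it didn't.
You really should check for a regular file first. Your (clever, I admit) trick of discarding stderr to hide the errors from catting a directory won't work well on, for example, a pipe made with mkfifo. It will never finish reading that pipe. Still, an elegant solution.
Good advice, code updated to reflect it. I kept the stderr redirection to hide any kind of permission-denied problem. If you want to see those, just remove the 2>/dev/null part.
Thanks a lot. SiegeX, "won't work well on a pipe made with mkfifo": what does that mean? Does this code handle that case?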
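
On the mkfifo question: a FIFO matches the /exp/files/* glob like any other entry, but reading it blocks until some other process opens it for writing, so a reader such as awk can wait indefinitely. A `-f` test skips it because a FIFO is not a regular file. The following is only a sketch of that distinction; the function name `classify_entries` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: say what kind of directory entry each path is,
# without ever opening it for reading.
classify_entries() {
    local f
    for f in "$1"/*; do
        if [[ -f $f ]]; then
            echo "regular: ${f##*/}"          # safe to hand to awk
        elif [[ -p $f ]]; then
            echo "fifo (skipped): ${f##*/}"   # reading this could block
        else
            echo "other (skipped): ${f##*/}"  # directories, sockets, ...
        fi
    done
}
```

Running it on a directory that holds a plain file, a FIFO created with `mkfifo`, and a subdirectory shows that only the plain file passes the `-f` test.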

【Solution 4】:

I don't have a UNIX shell to hand, but

#!/bin/bash
mapfile -t files < <(find /exp/files -type f)

should put all the files into a Bash array; then iterating over each of them as paxdiablo suggested above should get you sorted.
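
One possible way to build and walk such an array, sketched under a few assumptions: the helper name `collect_files` is made up, `-maxdepth` and `-print0` are GNU/BSD find extensions rather than POSIX, and limiting the search to one level is my reading of "files in the directory".

```shell
#!/bin/bash
# Hypothetical helper: fill the global array "files" with the regular files
# directly inside the given directory. NUL-delimited output keeps names
# containing spaces or newlines intact.
collect_files() {
    local dir=$1 f
    files=()
    while IFS= read -r -d '' f; do
        files+=("$f")
    done < <(find "$dir" -maxdepth 1 -type f -print0)
}

# Usage (not run here):
#   collect_files /exp/files
#   for f in "${files[@]}"; do echo "would check: $f"; done
```

Quoting `"${files[@]}"` in the loop is what keeps each path as a single word.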

【Discussion】:

【Solution 5】:

destination=/exp/ready
for file in /exp/files/*.csv
do
    var=$(tail -1 "$file" | awk -F"," '{ gsub(/\042|\047/,"") }
    $1=="T" && $3 == "2315" { print "ok" }')
    if [ "$var" = "ok" ]; then
        echo mv "$file" "$destination"
    else
        echo "invalid: $file"
    fi
done

【Discussion】:

【Solution 6】:

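#!/bin/bash

# Write the helper script with ex; the inserted text must start in column 0.
ex -s findready.sh <<'HERE'
i
#!/bin/bash

# Line count via redirection, so wc prints only the number.
NUMLINES=$(wc -l < "$1")
# First character of the last line, with any double quotes removed.
TRAILER=$(tail -1 "$1" | tr -d '"' | sed 's/^\(.\).*$/\1/')

if [[ $NUMLINES -eq 2317 && $TRAILER == "T" ]] ; then
    mv "$1" /exp/ready/
fi
.
wq
HERE

chmod a+x findready.sh

find /exp/files/ -type f -name '*.csv' -exec ./findready.sh {} ';' > /dev/null 2>&1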

【Discussion】: