Answers: bash to identify and verify file headers

【Question title】: bash to identify and verify file headers
【Posted】: 2017-04-17 17:09:06
【Question description】:

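Using the tab-delimited file below, I am trying to validate header line 1 by counting its fields, then store that number in the variable $header for use in a couple of if statements. If $header equals 10, the output should be file has expected number of fields; if $header is less than 10, the output should be file is missing header for: with the missing header fields printed underneath. The bash seems close, and the awk works perfectly when I run it by itself, but I cannot seem to use it inside the if. Thank you :).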

file.txt

Index   Chr Start   End Ref Alt Freq    Qual    Score   Input
1    1    1    100    C    -    1    GOOD    10    .
2    2    20    200    A    C    .002    STRAND BIAS    2    .
3    2    270    400    -    GG    .036    GOOD    6    .

file2.txt

Index   Chr Start   End Ref Alt Freq    Qual    Score
1    1    1    100    C    -    1    GOOD    10
2    2    20    200    A    C    .002    STRAND BIAS    2
3    2    270    400    -    GG    .036    GOOD    6

bash

for f in /home/cmccabe/Desktop/validate/*.txt; do
   bname=`basename $f`
   pref=${bname%%.txt}
   header=$(awk -F'\t' '{print NF, "fields detected in file and they are:" ORS $0; exit}') $f >> ${pref}_output  # detect header row in file and store in header and write to output
       if [[ $header == "10" ]]; then   # display results
          echo "file has expected number of fields"   # file is validated for headers
      else
          echo "file is missing header for:"  # missing header field ...in file not-validated
          echo "$header"
      fi  # close if.... else    
done >> ${pref}_output

Desired output for file.txt

file has expected number of fields

Desired output for file2.txt

file is missing header for:
Input

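【Question discussion】:

Is 10 fixed, or should it vary based on the header record?
10 is fixed, since that is the required number of header fields. If NF is less than that, the missing fields should be printed. Thank you :).
What if a file has 11 or more columns? How much of a problem is that? So far you have expressed concern about 9 or fewer columns, but you could also have too many.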

Tags: bash

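【Solution 1】:

You can use awk if you like, but bash is fully capable of handling the first-line field comparison on its own. If you maintain an array of expected field names, you can easily split the first line into fields, compare the result against the expected number of fields, and, whenever a file yields fewer fields than expected, print which fields are missing.

Below is a short example that takes the filenames as arguments (for a large number of files you would read the filenames from stdin, or use xargs as needed). The script simply reads the first line of each file, separates the line into fields, checks the field count, and prints any missing fields in a short error message: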

#!/bin/bash

declare -i header=10    ## header has 10 fields
## array of field names (can be read from 1st file)
fields=( "Index"
         "Chr"
         "Start"
         "End"
         "Ref"
         "Alt"
         "Freq"
         "Qual"
         "Score"
         "Input" )

for i in "$@"; do           ## for each file given as argument
    read -r line < "$i"     ## read first line from file into 'line'

    oldIFS="$IFS"           ## save current Internal Field Separator (IFS)
    IFS=$'\t'               ## set IFS to word-split on '\t'

    fldarray=( $line );     ## fill 'fldarray' with fields in line

    IFS="$oldIFS"           ## restore original IFS

    nfields=${#fldarray[@]} ## get number of fields in 'line'

    if (( nfields < header ))   ## test against header
    then
        printf "error: only '%d' fields in file '%s'\nmissing:" "$nfields" "$i"
        for j in "${fields[@]}" ## for each expected field
        do  ## check against those in line, if not present print
            [[ $line =~ $j ]] || printf " %s" "$j"
        done
        printf "\n\n"   ## tidy up with newlines
    fi
done

Example Input

$ cat dat/hdr.txt
Index   Chr     Start   End     Ref     Alt     Freq    Qual    Score   Input
1       1       1       100     C       -       1       GOOD    10      .
2       2       20      200     A       C       .002    STRAND BIAS     2       .
3       2       270     400     -       GG      .036    GOOD    6       .

$ cat dat/hdr2.txt
Index   Chr     Start   End     Ref     Alt     Freq    Qual    Score
1       1       1       100     C       -       1       GOOD    10
2       2       20      200     A       C       .002    STRAND BIAS     2
3       2       270     400     -       GG      .036    GOOD    6

$ cat dat/hdr3.txt
Index   Chr     Start   End     Alt     Freq    Qual    Score   Input
1       1       1       100     -       1       GOOD    10      .
2       2       20      200     C       .002    STRAND BIAS     2       .
3       2       270     400     GG      .036    GOOD    6       .

Example Use/Output

$ bash hdrfields.sh dat/hdr.txt dat/hdr2.txt dat/hdr3.txt
error: only '9' fields in file 'dat/hdr2.txt'
missing: Input

error: only '9' fields in file 'dat/hdr3.txt'
missing: Ref

Look things over. While awk can do many things that bash cannot do on its own, bash is quite capable of parsing text.
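One caveat with the script above: `[[ $line =~ $j ]]` is a regular-expression match against the whole header line, so an expected name that happens to be a substring of another column name would be treated as present. The following is a minimal sketch of an exact-name check, not the answerer's method. The function name `missing_fields` and the hard-coded name list are assumptions for illustration; it uses only POSIX shell features, so it also runs under bash.

```shell
#!/bin/bash
# Hypothetical helper (not part of the answer above): print the expected
# header names that do not appear as a whole tab-delimited field on the
# first line of the file given as $1.
missing_fields() (
    tab=$(printf '\t')
    IFS= read -r line < "$1"          # first line only, whitespace preserved
    missing=
    for name in Index Chr Start End Ref Alt Freq Qual Score Input; do
        # Pad the line with tabs so every field is surrounded by delimiters,
        # then look for the name as a complete field.
        case "$tab$line$tab" in
            *"$tab$name$tab"*) ;;                          # present
            *) missing="${missing:+$missing }$name" ;;     # absent
        esac
    done
    printf '%s\n' "$missing"
)
```

Called as `missing_fields dat/hdr3.txt`, it prints the missing names separated by spaces, or an empty line when every expected name is present. The function body runs in a subshell, so its variables do not leak into the calling script.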

【Discussion】:

【Solution 2】:

Here is one in GNU awk (it uses nextfile):

$ awk '
FNR==NR {
    for(n=1;n<=NF;n++)
        a[$n]
    nextfile
}
NF==(n-1) {
    print FILENAME " file has expected number of fields"
    nextfile
}
{
    delete b                      # forget headers seen in earlier files
    for(i=1;i<=NF;i++)
        b[$i]
    print FILENAME " is missing header for: " 
    for(i in a)
        if(!(i in b))             # expected name absent from this file
            print i
    nextfile
}' file1 file1 file2
file1 file has expected number of fields
file2 is missing header for: 
Input

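The first file the script processes defines the headers (a) that the following files should have; the headers of each later file (b) are then compared against them.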
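The script above needs a reference file to be named first (and again as a file to check), and it depends on `nextfile`. As a hedged alternative sketch, the expected names can be passed in a variable instead. The function name `check_headers`, its argument layout, and the third message (for a header that contains every expected name but has a different field count) are assumptions for illustration, not part of the answer.

```shell
#!/bin/bash
# Hypothetical wrapper: $1 is a space-separated list of expected header
# names, the remaining arguments are tab-delimited files to check.
check_headers() (
    expected=$1
    shift
    awk -F'\t' -v expected="$expected" '
        BEGIN { n = split(expected, want, " ") }
        FNR == 1 {
            split("", have)                 # clear names from the previous file
            for (i = 1; i <= NF; i++)
                have[$i]                    # record each header name
            missing = ""
            for (i = 1; i <= n; i++)
                if (!(want[i] in have))
                    missing = missing " " want[i]
            if (missing != "")
                print FILENAME " is missing header for:" missing
            else if (NF == n)
                print FILENAME " file has expected number of fields"
            else
                print FILENAME " has " NF " fields, expected " n
        }
    ' "$@"
)
```

For example, `check_headers "Index Chr Start End Ref Alt Freq Qual Score Input" file.txt file2.txt` reports one line per file. Without `nextfile`, awk still reads the rest of each file, which is acceptable for small inputs but wasteful for large ones.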

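【Discussion】:

Thank you all for the great solutions, they all work perfectly :).

【Solution 3】:

This code will do exactly what you asked for. Please let me know whether it works for you.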

 for f in ./*.txt; do

      [[ $( head -1 $f | awk '{ print NF}' ) -eq 10 ]]  && echo "File $f has all the fields on its header" || echo "File $f is missing " $( echo "Index   Chr Start   End Ref Alt Freq    Qual    Score   Input $( head -1 $f )" | tr ' ' '\n' | sort | uniq -c |  awk '/1 / {print $2}' ); 
 done

Output:

File ./file2.txt is missing  Input
File ./file.txt has all the fields on its header

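【Discussion】:

I can't help thinking the "one-liner" would be better shown with if, then and else (and fi) spread over multiple lines. I have reservations about the script at the far end of the line, but it is unreadable because it is far too long for a single line. Please use newlines judiciously, and more liberally than at present.
A pipeline from grep to awk is also a code smell, since it means one is not making good use of awk (which can trivially do grep's job: awk '/1 / {print $2}'). There are also a lot of missing quotes here, which shellcheck.net would catch.
@JonathanLeffler Yes, I know. I wasn't trying to be neat, just efficient. It gets the job done.
@CharlesDuffy You're right, Charles, a "grep to awk" is cheesy. I have edited it now. Thanks! :)
I don't see where there is a (computer) efficiency problem in the code that is solved by formatting it all on one line. From the point of view of human efficiency, all-on-one-line coding makes it hard to read. That is just my opinion, of course, but… (It is easier to spot inefficiencies of the grep | awk type when you can see the code.)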
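As a sketch of what the comments above ask for, the one-liner can be spread over several lines with its expansions quoted. This is not the answerer's code: the function name `report_header` and the variable `expected` are assumptions for illustration, and `uniq -u` stands in for the `uniq -c | awk` step. Like the original, it lists any name that occurs exactly once across the expected list and the actual header, so an unexpected extra column would also be reported.

```shell
#!/bin/bash
# Expected header names, space-separated (same list as in the answer).
expected="Index Chr Start End Ref Alt Freq Qual Score Input"

# Hypothetical helper: report on the header of the file given as $1.
report_header() (
    file=$1
    # Count the tab-separated fields on the first line only.
    nf=$(awk -F'\t' 'NR == 1 { print NF; exit }' "$file")

    if [ "${nf:-0}" -eq 10 ]; then
        echo "File $file has all the fields on its header"
    else
        # One name per line, then keep the names that occur exactly once.
        missing=$(
            printf '%s\n' "$expected" "$(head -n 1 "$file")" |
                tr ' \t' '\n\n' | sort | uniq -u | tr '\n' ' '
        )
        echo "File $file is missing ${missing% }"
    fi
)
```

Running `for f in ./*.txt; do report_header "$f"; done` reproduces the loop from the answer, one file per call.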