How do I remove gaps between int values in a file? Answers

【Question title】: How do I remove gaps between int values in a file?
【Posted】: 2020-06-12 15:26:15
【Question description】:

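Given a file that contains two columns of integers, I would like to remove the gaps between the integer values. By a gap I mean the following: take two integers A and B that appear in the file such that no integer C in the file satisfies A < C < B; after the transformation, B - A should be 1.

to this:

In the first two columns, the integers present are {1,2,3,5,6,7,9,11}. The missing values are {4,8,10}. The goal is to decrease each integer by the number of missing values smaller than it. So 5, 6 and 7 are decreased by 1, 9 is decreased by 2, and 11 is decreased by 3. The values {1,2,3,5,6,7,9,11} are therefore replaced by {1,2,3,4,5,6,7,8}. Does anyone know how to do this efficiently, using Linux commands, a bash script, or awk? Thanks!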

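Edit: I tried to do this, but I could not find a way to do it in a shell script alone, so I had to write a C program that executes shell commands. The first part only sorts the file; the second part does what I describe in the question.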

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>


#define MAX_INTS 100000000

void process_file(char *path){
    //FIRST PART
    char *outfpath="tmpfile";
    char *command=calloc(456+3*strlen(path)+strlen(outfpath),sizeof(char));

    sprintf(command,"#!/bin/bash \nvar1=$( cat %s | head -n 4  && ( cat %s | tail -n +5  | awk '{split( $0, a, \" \" ); asort( a ); for( i = 1; i <= length(a); i++ ) printf( \"%c%c \", a[i] ); printf( \"\\n\" ); }' | sort -n -k1,1 -k2 | uniq) )\nvar2=$( ( (echo \"$var1\" | tail -n +5 | cut -f 1 -d\" \") && (echo \"$var1\" | tail -n +5 | cut -f 2 -d\" \" ) ) | sort -n -k1,1 | uniq | awk '{for(i=p+1; i<$1; i++) print i} {p=$1}' )\necho \"$var1\" > %s\necho \"$var2\"| tr \"\\n\" \" \" > %s",path,path,'%','s',path,outfpath);

    if(system(command)==-1){
        fprintf(stderr,"Error while executing the command \n%s\n",command);
    }
    //the first part only sorts the file and puts in outpath the list of the missing integers

    //SECOND PART
    long unsigned start=0,end=0,val,index=0;
    long unsigned *intvals=calloc(MAX_INTS,sizeof(long unsigned));
    FILE *f=fopen(outfpath,"r");

    //reads the files and loads the missing ints to the array intvals
    while(fscanf(f,"%lu ",&val)==1){
        end=index;
        intvals[index]=val;
        index++;
    }
    if (index==0) return;
    intvals=realloc(intvals,index*sizeof(long unsigned));
    fclose(f);
    free(command);


    f=fopen(path,"r+");
    char *line=calloc(1000,sizeof(char));
    command=calloc(1000,sizeof(char));
    char *str;
    long unsigned v1,v2,
        d1=0,d2=0,
        c=0,prec=-1,start_l=0;
    int pos1, pos2;  

    //read a file containing two columns of ints 
    //for each pair v1 v2, count d1 d2, 
    //such as d1 is the number of missing values smaller than v1, d2 the number of missing values smaller than v2
    //and overwrite the line in the file using sed with the values v1-d1 and v2-d2

    while(fgets(line,1000,f)!=NULL && line[0]=='#'){ continue; }

    do{
        str=strtok(line," \t");
        v1=atoi(str);
        str=strtok(NULL," \t");
        v2=atoi(str);
        if(prec!=v1) {
            prec=v1;
            d2=d1;
            start_l=start;
        }
        for(index=start;index<=end;index++){ 
            if(intvals[index]<v1){ 
                 d1++; 
                 start++;
                 c=1;
            }else{
                start=d1;
                break;
            }
        }
        for(index=start_l;index<=end;index++){ 
            if(intvals[index]<v2){ 
                d2++; 
                start_l++;
                c=1; 
            }else{ 
                break;
            }
        }         
        if(c){
            sprintf(command,"sed -i 's/%lu %lu/%lu %lu/' %s",v1,v2,v1-d1,v2-d2,path);
            if(system(command)==-1){
                fprintf(stderr,"Error while executing the command \n%s\n",command);
            }
        }
        c=0;
    }while(fgets(line,1000,f)!=NULL);
    fclose(f);
    free(command);
    free(line);
    free(intvals);
}



int main(int argc,char* argv[]){

    process_file(argv[1]);
    return 0;
}

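【Question discussion】:

@kvantour I see, but the thing is, although it looks simple to me, I can't seem to find an efficient way to do it. I've added what I tried.
@Inian I've added what I tried. The reason it doesn't work is that it takes too long to run: because I use sed to replace each line with the new values, the time complexity is n^2, and I'm looking for a more efficient way.
I guess my biggest question is: why? Are you trying to create a specific output? Then ignore this and just write that output. If what you specifically need is to edit this file according to a rule, then I haven't clearly understood the rule.
There are some good ideas in the answers below. Your choice to call sed from this complicated set of C code suggests that you aren't familiar with awk. It lets tasks like yours be handled in a single process and will significantly reduce your run time. With JohnBrown's solution you can even use built-in functionality to reduce your code base. Not sure how your data works, but hopefully you know about the *nix utility tsort (topological sort). It might be another good tool for your toolbox. Good luck!
Oh, and ++ for improving your Q by showing your code. Good luck.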

Tags: linux bash shell awk command-line

【Solution 1】:

This might do it:

awk '(NR==FNR){for(i=1;i<=NF;++i) {a[$i]; max=(max<$i?$i:max)};next}
     (FNR==1) {for(i=1;i<=max;++i) if(i in a) a[i]=++c }
     {for(i=1;i<=NF;++i) $i=a[$i]}1' file file

If file has as input:

then the command above returns:

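The idea of this method is to keep track of an array a that is indexed by the old value and returns the new value: a[old]=new. We scan the file twice. On the first pass we record every value that occurs as an index of a. At the start of the second pass we walk from 1 to the largest value seen and assign the next new value to each index that exists. After that, we simply replace every field with its new value and print the result.

The above can also be done with a single read of the file, at the cost of buffering the lines in memory: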
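Since the sample input and output are not shown here, the following is a minimal sketch of the two-pass idea on made-up data. The function name renumber and the file sample.txt are hypothetical, and the sketch assumes the file holds only positive integers separated by whitespace.

```shell
# Hypothetical wrapper around the two-pass awk approach described above.
renumber() {
    # $1: path to a file of whitespace-separated positive integers
    awk '(NR==FNR) {                       # first pass: record every value
             for (i = 1; i <= NF; ++i) {
                 a[$i]
                 if ($i + 0 > max) max = $i + 0
             }
             next
         }
         (FNR==1) {                        # start of second pass: a[old] = new
             for (i = 1; i <= max; ++i) if (i in a) a[i] = ++c
         }
         { for (i = 1; i <= NF; ++i) $i = a[$i] } 1' "$1" "$1"
}

# Made-up input using the value set from the question: {1,2,3,5,6,7,9,11}
printf '1 2\n3 5\n6 7\n9 11\n' > sample.txt
renumber sample.txt
# prints:
# 1 2
# 3 4
# 5 6
# 7 8
```

The loop from 1 to max touches every integer in the range, so the cost grows with the largest value rather than with the number of distinct values; for sparse data with very large values, a sort-based mapping may be the better fit.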

awk '{b[FNR]=$0;for(i=1;i<=NF;++i) {a[$i]; max=(max<$i?$i:max)}}
     END {
        for(i=1;i<=max;++i) if(i in a) a[i]=++c
        for(n=1;n<=FNR;++n) {
          $0=b[n]
          for(i=1;i<=NF;++i) $i=a[$i]
          print
        }
     }' file

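【Discussion】:

It wasn't me, but I just tried it and it does not give the requested output: pastebin.com/LUYzMRcp
What does your file look like? (See my answer for my input.)

【Solution 2】:

Using GNU awk and asorti():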

$ gawk '{                         # GNU awk only or implement sort
    a[$1];a[$2]                   # hash field values to a array
    f1[NR]=$1;f2[NR]=$2           # hash fields $1 and $2 index on NR
}
END {                             # after all data is hashed
    asorti(a,a,"@ind_num_asc")    # sort index of a where the values are
    for(i in a)                   # make a reverse map 
        b[a[i]]=i
    for(i=1;i<=NR;i++)            # iterate the stored "records"
        print b[f1[i]],b[f2[i]]   # print and fetch from reverse map
}' file

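a[] stores the unique field values as indices: a[6], a[5]. asorti() then re-indexes a[] so that the sorted old values become the array values: a[1]=5, a[2]=6, which gives us the corresponding new values. b[] is the reverse map of a[]: b[5]=1, b[6]=2. It is used at output time to look up the new value for an old field value.

Output: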
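For awk implementations without asorti(), the same old-to-new map can probably be built with sort instead. This is a swapped-in variant, not the answer's code: the function name renumber_sorted and the file pairs.txt are hypothetical, and the sketch assumes two whitespace-separated integer columns with no leading zeros.

```shell
# Hypothetical POSIX variant: let sort(1) do the numeric ordering.
renumber_sorted() {
    # $1: path to a file with two whitespace-separated integer columns
    # Stage 1: emit every value on its own line, sort numerically, drop duplicates.
    # Stage 2: awk reads that sorted list from stdin ("-") to build rank[old] = new,
    #          then reads the original file and prints the mapped pairs.
    awk '{ print $1; print $2 }' "$1" | sort -n -u |
    awk 'NR==FNR { rank[$1] = FNR; next }
         { print rank[$1], rank[$2] }' - "$1"
}

# Made-up input: the distinct values are 1, 5, 6, 11
printf '5 6\n6 11\n1 5\n' > pairs.txt
renumber_sorted pairs.txt
# prints:
# 2 3
# 3 4
# 1 2
```

The line order of the input is preserved because only the list of distinct values is sorted, not the file itself.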

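【Discussion】:

【Solution 3】:

Suppose your input looks like this:

input.txt

Note: there is no 3 in col1 and no 8 in col2, just to make it easier to follow.

Then sort each column separately and store it: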

$ sort -n -k1,1 input.txt | awk '{ print $1 }' > 1_sorted
$ cat 1_sorted
1
2
4
5
6
7
8
9


$ sort -n -k2,2 input.txt | awk '{ print $2 }' > 2_sorted
$ cat 2_sorted
1
2
3
4
5
6
7
9

Now just merge the two files:

$ paste -d' ' 1_sorted 2_sorted > merged_again

$ cat merged_again
1 1
2 2
4 3
5 4
6 5
7 6
8 7
9 9

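There may be a more efficient or more elegant way, but I can't think of one right now.

【Discussion】:

I think you did not understand my question; please take a look at my example and my program. The position of the values does not matter. What matters is that if we take any two different integers A and B in the file, where A > B and there is no integer C in the file such that B < C < A, then A - B = 1 after the transformation. To achieve this, I list all the integers between the smallest and the largest integer in the file that are not present in the file. Then, for each integer in the file, I subtract from its value the number of missing integers smaller than it.
That's clear now. I think @kvantour's answer is the right solution for what you want.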
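The rule restated in the comment above (subtract from each value the number of missing integers below it) can also be sketched directly in a single awk process, without calling sed per line. The function name shift_by_missing and the file gaps.txt are hypothetical; the sketch assumes whitespace-separated non-negative integers and buffers the whole file in memory.

```shell
# Hypothetical single-process version of the asker's rule.
shift_by_missing() {
    # $1: path to a file of whitespace-separated non-negative integers
    awk '{
             line[NR] = $0                          # buffer the line for the END block
             for (i = 1; i <= NF; ++i) {
                 v = $i + 0
                 seen[v]
                 if (NR == 1 && i == 1) { min = v; max = v }
                 if (v < min) min = v
                 if (v > max) max = v
             }
         }
         END {
             gap = 0
             # below[v] = number of integers in [min, v) that never occur in the file
             for (v = min; v <= max; ++v) {
                 below[v] = gap
                 if (!(v in seen)) gap++
             }
             for (n = 1; n <= NR; ++n) {
                 k = split(line[n], f, " ")
                 for (i = 1; i <= k; ++i)
                     printf "%s%d", (i > 1 ? " " : ""), f[i] - below[f[i] + 0]
                 print ""
             }
         }' "$1"
}

# Made-up input using the value set from the question: missing values are 4, 8, 10
printf '1 2\n3 5\n6 7\n9 11\n' > gaps.txt
shift_by_missing gaps.txt
# prints:
# 1 2
# 3 4
# 5 6
# 7 8
```

Unlike the ranking approaches in the answers, this keeps the smallest value unchanged, so the output starts at the file's minimum rather than at 1. For the question's example the minimum is 1, so both give the same result.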