【Question Title】: Search and replace string in a very big file
【Posted】: 2016-05-01 17:40:39
【Description】:

I like to use shell commands to get work done. I have a very, very big file, about 2.8 GB, whose content is JSON. Everything is on a single line, and I've been told there are at least 1.5 million records in there.

I have to prepare the file for consumption. Each record must be on its own line. Sample:

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},{"RecordId":"2",...},{"RecordId":"3",...},{"RecordId":"4",...},{"RecordId":"5",...} }}

Or, with the following...

{"Accounts":{"Customer":[{"AccountHolderId":"9c585258-c94c-442b-a2f0-1ebbcc274795","Title":"Mrs","Forename":"Tina","Surname":"Wright","DateofBirth":"1988-01-01","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"1","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"2","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"3","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"4","Superseded":"Yes" }, {"Contact_Info":"christian.bale@hollywood.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"5","Superseded":"NO" },{"Contact_Info":"15482475584","TypeId":"Mobile_Phone","PrimaryFlag":"No","Index":"6","Superseded":"No" }],"Address":[{"AddressPtr":"5","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB100KP","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"6","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB10V6T","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"6884133655531279","Field_B":"887.07","Field_C":"A Loan Product",...,"FieldY_":"2015-09-18","Field_Z":"24275627"}]},{"AccountHolderId":"92a5788f-cd8f-423d-ae5f-4eb0ceb457fd","_Title":"Dr","_Forename":"Christopher","_Surname":"Carroll","_DateofBirth":"1977-02-02","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"7","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"8","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"9","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email 
Address","PrimaryFlag":"Yes","Index":"10","Superseded":"Yes" }],"Address":[{"AddressPtr":"11","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB11TXF","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"12","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB11O8W","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"4121879819185553","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_X":"2015-09-18","Field_Z":"25679434"}]},{"AccountHolderId":"4aa10284-d9aa-4dc0-9652-70f01d22b19e","_Title":"Dr","_Forename":"Cheryl","_Surname":"Ortiz","_DateofBirth":"1977-03-03","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"13","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"14","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"15","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"16","Superseded":"Yes" }],"Address":[{"AddressPtr":"17","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB12SQR","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"18","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB12BAQ","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"3288214945919484","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_Y":"2015-09-18","Field_Z":"66264768"}]}]}}

The end result should be:

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
{"RecordId":"2",...},
{"RecordId":"3",...},
{"RecordId":"4",...},
{"RecordId":"5",...} }}

Commands attempted:

  • sed -e 's/,{"RecordId"/}]},\n{"RecordId"/g' sample.dat
  • awk '{gsub(",{\"RecordId\"",",\n{\"RecordId\"",$0); print $0}' sample.dat

The attempted commands work perfectly on small files, but they do not work on the 2.8 GB file I have to manipulate. Sed quits midway after 10 minutes for no apparent reason. Awk errored out with a segmentation fault (core dumped) after several hours. I tried a Perl search and replace and got an "Out of memory" error.

Any help/ideas would be great!

Additional info about my machine:

  • More than 105 GB of free disk space.
  • 8 GB of RAM
  • 4-core CPU
  • Running Ubuntu 14.04

【Comments】:

  • We need some better sample data, not necessarily a dump of your data, but something that illustrates the problem at hand. Also: have you considered using a parser?
  • I think the basic problem is that all three of those tools read a line at a time and are therefore overwhelmed by the single gigantic line. Try preprocessing first with something like tr ',' '\012' to replace the commas with newlines. The line-at-a-time tools will then work much better.
  • Try Perl again, but set $/ to ",". Also try sed's -u (--unbuffered) option.
  • The real fix is for you to show us the Awk and Perl programs you used, so we can help you fix them.
  • @dat789 ... OK. When I suggested in my answer that you could use perl, I didn't mean you should use it like sed. I was referring to things like decode_json(). If you are using a language that can actually understand your data structure, use those tools!
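The preprocessing idea from the comments above can be sketched as follows (a hypothetical pipeline; sample.dat is the OP's file, sample-split.dat an assumed output name):

```shell
# Replace every comma with a newline ('\012' is the octal escape for
# newline), so that line-oriented tools see many short lines instead of
# one gigantic one.
tr ',' '\012' < sample.dat > sample-split.dat
```

Note that this naively splits on commas inside field values as well, so the result is not valid JSON; it only makes the stream digestible for sed/awk.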

Tags: json perl awk large-files data-manipulation


【Solution 1】:

Try using } as the record separator, for example in Perl:

perl -l -0175 -ne 'print $_, $/' < input

You may need to glue the lines that contain only } back onto the previous line.
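That glue-back step could be sketched like this (a hypothetical awk pass, assuming the split output was saved to split.txt):

```shell
# Buffer each line; when a line consists solely of "}" characters,
# append it to the buffered previous line instead of printing it alone.
awk '
NR > 1 {
    if ($0 ~ /^}+$/) { buf = buf $0 }   # bare "}" line: glue to previous
    else             { print buf; buf = $0 }
    next
}
{ buf = $0 }
END { print buf }
' split.txt
```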

【Comments】:

【Solution 2】:

This avoids the memory problem by not treating the data as a single record, but it may go too far in the other direction performance-wise (it processes one character at a time). Note also that it requires gawk for the built-in RT variable (the value of the current record terminator):

    $ cat j.awk
    BEGIN { RS="[[:print:]]" }
    RT == "{" { bal++}
    RT == "}" { bal-- }
    { printf "%s", RT }
    RT == "," && bal == 2 { print "" }
    END { print "" }
    
    $ gawk -f j.awk j.txt
    {"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
    {"RecordId":"2",...},
    {"RecordId":"3",...},
    {"RecordId":"4",...},
    {"RecordId":"5",...} }}
    

【Comments】:

【Solution 3】:

Since you have tagged your question with sed, awk, and perl, I gather that what you really need is a tool recommendation. While that is a bit off-topic, I believe jq is what you should use here. It will be better than sed or awk because it actually understands JSON. Everything shown here with jq could also be done in perl with a bit of programming.

Assuming content as follows (based on your sample):

      {"RomanCharacters":{"Alphabet": [ {"RecordId":"1","data":"data"},{"RecordId":"2","data":"data"},{"RecordId":"3","data":"data"},{"RecordId":"4","data":"data"},{"RecordId":"5","data":"data"} ] }}
      

You can easily reformat it to "prettify" it:

      $ jq '.' < data.json
      {
        "RomanCharacters": {
          "Alphabet": [
            {
              "RecordId": "1",
              "data": "data"
            },
            {
              "RecordId": "2",
              "data": "data"
            },
            {
              "RecordId": "3",
              "data": "data"
            },
            {
              "RecordId": "4",
              "data": "data"
            },
            {
              "RecordId": "5",
              "data": "data"
            }
          ]
        }
      }
      

And we can dig into the data to retrieve only the records you are interested in (regardless of what they contain):

      $ jq '.[][][]' < data.json
      {
        "RecordId": "1",
        "data": "data"
      }
      {
        "RecordId": "2",
        "data": "data"
      }
      {
        "RecordId": "3",
        "data": "data"
      }
      {
        "RecordId": "4",
        "data": "data"
      }
      {
        "RecordId": "5",
        "data": "data"
      }
      

This is far more readable, both for humans and for tools like awk that process content line by line. If you want to join your lines for processing as per your question, awk becomes much simpler:

      $ jq '.[][][]' < data.json | awk '{printf("%s ",$0)} /}/{printf("\n")}'
      {   "RecordId": "1",   "data": "data" }
      {   "RecordId": "2",   "data": "data" }
      {   "RecordId": "3",   "data": "data" }
      {   "RecordId": "4",   "data": "data" }
      {   "RecordId": "5",   "data": "data" }
      

Or, as @peak suggested in the comments, eliminate the awk portion of this entirely by using jq's -c (compact output) option:

      $ jq -c '.[][][]' < data.json
      {"RecordId":"1","data":"data"}
      {"RecordId":"2","data":"data"}
      {"RecordId":"3","data":"data"}
      {"RecordId":"4","data":"data"}
      {"RecordId":"5","data":"data"}
      

【Comments】:

  • Note that jq's -c option can be used here. With the data.json example, jq -c '.[][][]' < data.json produces the result obtained above using awk as a post-processor. In fact, I suspect that if we had a better idea of what the OP is trying to accomplish, the whole thing could easily be done quite economically in jq.
  • @peak - brilliant, thank you, for some reason that never occurred to me. :-) I've added it to my answer.
【Solution 4】:

Regarding perl: try setting the input record separator $/ to "}," like this:

    #!/usr/bin/perl
    $/ = "},";
    while (<>) {
       print "$_\n";
    }
    

Or, as a one-liner:

      $ perl -e '$/="},";while(<>){print "$_\n"}' sample.dat 
      

【Comments】:

  • This has proven to work quickly. I've modified your script into a solution to the problem. I'll share it soon.
  • It may be worth looking at the -n and -p flags. I think you could write it as: perl -pe 'BEGIN{ $/ = "}," } print "\n"'
【Solution 5】:

Using the sample data provided here (the one that begins with {Accounts:{Customer...), the solution to this problem is to read the file and, while reading, count the occurrences of the delimiter defined in $/. For every 10,000 delimiters counted, it writes out a new file, and it appends a newline after each delimiter found. The script looks like this:

    #!/usr/bin/perl
    
    $base = "/home/dat789/incoming";
    #$_ = "sample.dat";
    
    $/ = "}]},";          # delimiter to find and insert a newline after
    $n = 0;
    $match = "";
    $filecount = 0;
    $recsPerFile = 10000; # set number of records in a file
    
    print "Processing " . $ARGV[0] . "\n";
    
    sub write_part {
       my $newfile = "partfile" . $recsPerFile . "-" . $filecount . ".dat";
       open( OUTPUT, '>', $newfile ) or die "Cannot open $newfile: $!";
       print OUTPUT $match;
       close(OUTPUT);
       $match = "";
       $filecount++;
       $n = 0;
       print "Wrote file " . $newfile . "\n";
    }
    
    while (<>) {
       $match = $match . $_ . "\n";
       $n++;
       print ".";                        # so that we know it has done something
       write_part() if $n >= $recsPerFile;
    }
    
    write_part() if $match ne "";        # flush the final partial batch
    
    print "Finished\n\n";
    

I've used this script against the big 2.8 GB file, whose content is unformatted single-line JSON. The resulting output files will be missing the proper JSON headers and footers, but that is easily fixed.
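As a sketch of that fix (the header and footer strings here are assumptions based on the sample data, and partfile*.dat matches the names the script writes):

```shell
# Wrap each part file with an opening header and a closing footer so
# that every part parses as a JSON document of its own. The exact
# strings depend on the real data and are assumed here.
header='{"Accounts":{"Customer":['
footer=']}}'
for f in partfile*.dat; do
  { printf '%s\n' "$header"; cat "$f"; printf '%s\n' "$footer"; } > "${f%.dat}-fixed.dat"
done
```

In practice the first part already carries the header and the last one the footer, and the trailing comma left by the "}]}," delimiter would still need trimming before the footer.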

Thank you so much for your contributions!

【Comments】:
