Replace newline in quoted strings in huge files: answers

【Question title】: Replace newline in quoted strings in huge files
【Posted】: 2022-01-25 07:38:47
【Question description】:

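I have some huge files whose values are delimited by the pipe (|) symbol. The strings are quoted, but sometimes there is a newline inside a quoted string.

I need to read these files with Oracle's external tables, but it gives me errors on the newlines. So I need to replace them with a space.

I already run some other perl commands on these files to fix other errors, so I would like a solution in the form of a one-line perl command.

I found some similar questions on Stack Overflow, but they don't do quite the same thing, and I couldn't solve my problem with the solutions mentioned there.

The statement I tried, which doesn't work: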

perl -pi -e 's/"(^|)*\n(^|)*"/ /g' test.txt

Sample text:

4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline
in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline
"
4457|.....

Should become:

4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....

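【Question discussion】:

You are reading one line at a time. The pattern can't match because it needs to match characters that come after the end of the line. The simple solution is to use -0777, which tells Perl to treat the whole file as a single line. That may be a problem for you ("huge files").
@WiktorStribiżew: I added an example
@ikegami: Can you give me a one-liner that then processes the file as a whole?
Maybe perl -pi -e 's/(\r?\n)(?!\d{4,}\|)/ /g' test.txt (skipping the newlines that are followed by the digits)
or perl -0777 -pi -e 's/\R++(?!\d{4,}\|)/ /g' file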

Tags: regex perl awk sed

【Solution 1】:

It sounds like you want a CSV parser such as Text::CSV_XS (install it through your OS's package manager or your favorite CPAN client):

$ perl -MText::CSV_XS -e '
my $csv = Text::CSV_XS->new({sep => "|", binary => 1});
while (my $row = $csv->getline(*ARGV)) {
  $csv->say(*STDOUT, [ map { tr/\n/ /r } @$row ]) 
}' test.txt
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "

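This one-liner reads each record using | as the field separator instead of the usual comma, replaces newlines with spaces in every field, and then prints the transformed record.

【Discussion】:

Exactly this. (Well, I would go with tr/\n/ / for @$row; $csv->say(*STDOUT, $row);, but the time it saves isn't that great.) I would upvote twice if I could.
Same here :) A perfect example of how useful that library is

【Solution 2】:

In your specific case, you can also consider a workaround with GNU sed or awk.

The awk command would look like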

awk 'NR==1 {print;next;} /^[0-9]{4,}\|/{print "\n" $0;next;}1' ORS="" file > newfile

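ORS (the output record separator) is set to an empty string, which means that \n is only added before lines that start with four or more digits followed by a | character (matched by the ^[0-9]{4,}\| POSIX ERE pattern).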
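One thing to be aware of: with ORS set to the empty string, continuation lines are glued on without a separating space, and the output gets no final newline. Below is a minimal sketch of a variant that inserts the space and terminates the last record. It relies on the same assumption as above (every record starts with four or more digits followed by |). `join_records` is a hypothetical helper name, and the digit class is spelled out because some awk builds do not support `{4,}` intervals.

```shell
# join_records: hypothetical helper, not part of the original answer.
# Assumes each record begins with four or more digits followed by "|".
join_records() {
  awk '
    # Before every line except the first, emit a newline if the line
    # starts a new record, otherwise a space (it continues a quoted string).
    NR > 1 { printf "%s", (($0 ~ /^[0-9][0-9][0-9][0-9]+[|]/) ? "\n" : " ") }
    # Emit the line itself without a trailing newline.
    { printf "%s", $0 }
    # Terminate the last record.
    END { if (NR > 0) printf "\n" }
  ' "$@"
}
```

It can be used as `join_records file > newfile`, or at the end of a pipe.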

The GNU sed command would look like

sed -i ':a;$!{N;/\n[0-9]\{4,\}|/!{s/\n/ /;ba}};P;D' file

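This reads two consecutive lines into the pattern space. Whenever the second line does not start with four or more digits followed by a | character (see the [0-9]\{4,\}| POSIX BRE pattern), the line break between the two lines is replaced with a space. The search and replace is repeated until there is no match or the end of the file is reached.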
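Since -i rewrites the file in place, it can be worth previewing the result first. The sketch below is the same script spread over several lines with comments and without -i, so it only writes to stdout. `preview_join` is a hypothetical helper name, and GNU sed is assumed.

```shell
# preview_join: hypothetical helper; same sed script as above but without -i,
# so the result goes to stdout and the input file is left unchanged.
# Assumes GNU sed.
preview_join() {
  sed '
    :a
    # Unless this is the last input line ...
    $!{
      # ... append the next line to the pattern space.
      N
      # If the appended line does not start a new record, join and loop.
      /\n[0-9]\{4,\}|/!{
        s/\n/ /
        ba
      }
    }
    # Print and delete the completed first line, then restart the cycle.
    P
    D
  ' "$@"
}
```

Running `preview_join file | head` shows the first joined records before committing to the in-place edit.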

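With perl, if the file is huge but still fits into memory, you can use a short one-liner:

perl -0777 -pi -e 's/\R++(?!\d{4,}\|)/ /g' file

With -0777 you slurp the whole file, and the \R++(?!\d{4,}\|) pattern matches one or more line breaks (\R++) that are not followed by four or more digits and a | character. The ++ possessive quantifier is required so that the (?!...) negative lookahead cannot cause backtracking into the line break matching pattern.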

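【Discussion】:

【Solution 3】:

With your shown samples, this can be done simply in an awk program. Written and tested in GNU awk; it should work in any awk. It should also be fast on huge files (better than slurping the whole file into memory, since the OP mentioned it may be used on huge files).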

awk 'gsub(/"/,"&")%2!=0{if(val==""){val=$0} else{print val " " $0;val=""};next} 1' Input_file

Explanation: adding a detailed explanation for the above.

awk '                                    ##Start of the awk program.
gsub(/"/,"&")%2!=0{                      ##If the number of " on this line is odd, a quoted string is opened or closed on this line.
  if(val==""){ val=$0                }   ##If val is empty, a string was just opened: save the current line in val.
  else       {print val " " $0;val=""}   ##Otherwise the string is closed here: print val, a space and the current line, then reset val.
  next                                   ##Skip the remaining statements for this line.
}
1                                        ##Lines with an even number of " skip the block above (the gsub condition) and are simply printed.
' Input_file                             ##Input file name.
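The program above joins a quoted string that spans exactly two lines. A minimal sketch of the same quote-counting idea that also copes with strings spanning three or more lines keeps a running parity instead of a saved line. `join_quoted_newlines` is a hypothetical helper name, and the sketch assumes that any quote inside a string is escaped by doubling it, which keeps the count even.

```shell
# join_quoted_newlines: hypothetical helper, not part of the original answer.
# Assumes quotes inside a string, if any, are escaped by doubling them (""),
# which keeps the running count even.
join_quoted_newlines() {
  awk '
    {
      # Update the parity with the number of double quotes on this line;
      # gsub with "&" replaces each quote by itself and returns the count.
      inside = (inside + gsub(/"/, "&")) % 2
      if (inside) printf "%s ", $0   # inside a quoted string: join with a space
      else        print              # string closed: end the record normally
    }
  ' "$@"
}
```

Like the original, this streams the input line by line, so memory use does not grow with the file size: `join_quoted_newlines Input_file > newfile`.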

【Discussion】: