【Question Title】: Merging csv files with slightly different headers
【Posted】: 2012-12-08 16:20:03
【Question】:

I have 2 CSV files.

One has a header that looks like:

header1,header2,header3,header4
a,b,c,d

The other has a header that looks like:

header1,header3,header4,header5
e,f,g,h

I would like the output to be a CSV file like this:

header1,header2,header3,header4,header5
a,b,c,d,
e, ,f,g,h

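I would prefer a command-line utility that can handle this kind of merge (everything is run from a batch file on Windows), but I am open to any solution.

If the headers were identical this would be easy, but because they differ slightly I have hit a wall.

Any help would be greatly appreciated.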

【Comments】:

    Tags: join csv command-line


    【Answer 1】:

    I have a solution based on a Ruby script that can be run from the console.

      【Comments】:

      【Answer 2】:

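      Batch files can easily parse most CSV lines with FOR /F, as long as none of the column values contain commas. But a FOR /F solution can choke on missing values. Your CSV may contain consecutive commas, which indicate a missing value, yet FOR /F treats consecutive delimiters as a single delimiter. This problem can be solved in batch, but I don't think it is worth the effort.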
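
To make the missing-value problem concrete, here is a small sketch in Python (Python is not part of this answer and is used only for illustration). The helper `split_collapsing` is a hypothetical stand-in that mimics the "a run of delimiters counts as one" behavior described above; it is not how FOR /F is implemented.

```python
import re


def split_collapsing(line, delim=","):
    """Hypothetical stand-in: treat a run of delimiters as a single delimiter."""
    tokens = re.split(re.escape(delim) + "+", line)
    return [tok for tok in tokens if tok != ""]


def split_preserving(line, delim=","):
    """Plain split: every delimiter starts a new field, so empty fields survive."""
    return line.split(delim)


row = "e,,f,g,h"  # the second column is missing
print(split_collapsing(row))   # ['e', 'f', 'g', 'h']
print(split_preserving(row))   # ['e', '', 'f', 'g', 'h']
```

With collapsing, `f` slides into the second column and every later value shifts left, which is why a naive FOR /F loop misaligns rows that have empty fields.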

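      PowerShell probably has a good solution for parsing CSV. I know that .NET has a class for parsing CSV, and PowerShell has access to .NET. But I don't really know PowerShell.

      There are free text-processing tools available for Windows, such as sed. But that requires a download.

      I have written an easy-to-use hybrid batch/JScript utility called REPL.BAT that performs a regular expression search and replace on text files.

      Assuming you never have a quoted comma within the value of the first column of the second file, the solution can be as simple as: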

      @echo off
      >new.csv (
        echo header1,header2,header3,header4,header5
        findstr /v /c:"header1,header2,header3,header4" file1.csv | repl "^(.*)$" "$1,"
        findstr /v /c:"header1,header3,header4,header5" file2.csv | repl "^([^,]*)," "$1, ,"
      )
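
As a rough illustration of what the two `repl` calls above do to each data row, the same regular expressions can be tried in Python (an assumption made for illustration only; the function names `pad_file1_row` and `pad_file2_row` are hypothetical). Note that Python writes the back-reference as `\1` where JScript uses `$1`.

```python
import re


def pad_file1_row(line):
    # Same pattern as the first repl call: append a trailing comma,
    # which creates the empty header5 field.
    return re.sub(r"^(.*)$", r"\1,", line, count=1)


def pad_file2_row(line):
    # Same pattern as the second repl call: after the first field,
    # insert a blank (single space) value for the missing header2.
    return re.sub(r"^([^,]*),", r"\1, ,", line)


print(pad_file1_row("a,b,c,d"))  # a,b,c,d,
print(pad_file2_row("e,f,g,h"))  # e, ,f,g,h
```

The second pattern relies on `[^,]*` stopping at the first comma, which is why a quoted comma inside the first column would break it.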
      

      Here is the REPL.BAT utility that makes the above solution possible. Full documentation is built into the script.

      @if (@X)==(@Y) @end /* Harmless hybrid line that begins a JScript comment
      
      ::************ Documentation ***********
      :::
      :::REPL  Search  Replace  [Options  [SourceVar]]
      :::REPL  /?
      :::
      :::  Performs a global search and replace operation on each line of input from
      :::  stdin and prints the result to stdout.
      :::
      :::  Each parameter may be optionally enclosed by double quotes. The double
      :::  quotes are not considered part of the argument. The quotes are required
      :::  if the parameter contains a batch token delimiter like space, tab, comma,
      :::  semicolon. The quotes should also be used if the argument contains a
      :::  batch special character like &, |, etc. so that the special character
      :::  does not need to be escaped with ^.
      :::
      :::  If called with a single argument of /? then prints help documentation
      :::  to stdout.
      :::
      :::  Search  - By default this is a case sensitive JScript (ECMA) regular
      :::            expression expressed as a string.
      :::
      :::            JScript syntax documentation is available at
      :::            http://msdn.microsoft.com/en-us/library/ae5bf541(v=vs.80).aspx
      :::
      :::  Replace - By default this is the string to be used as a replacement for
      :::            each found search expression. Full support is provided for
      :::            substitution patterns available to the JScript replace method.
      :::            A $ literal can be escaped as $$. An empty replacement string
      :::            must be represented as "".
      :::
      :::            Replace substitution pattern syntax is documented at
      :::            http://msdn.microsoft.com/en-US/library/efy6s3e6(v=vs.80).aspx
      :::
      :::  Options - An optional string of characters used to alter the behavior
      :::            of REPL. The option characters are case insensitive, and may
      :::            appear in any order.
      :::
      :::            I - Makes the search case-insensitive.
      :::
      :::            L - The Search is treated as a string literal instead of a
      :::                regular expression. Also, all $ found in Replace are
      :::                treated as $ literals.
      :::
      :::            E - Search and Replace represent the name of environment
      :::                variables that contain the respective values. An undefined
      :::                variable is treated as an empty string.
      :::
      :::            M - Multi-line mode. The entire contents of stdin is read and
      :::                processed in one pass instead of line by line. ^ anchors
      :::                the beginning of a line and $ anchors the end of a line.
      :::
      :::            X - Enables extended substitution pattern syntax with support
      :::                for the following escape sequences:
      :::
      :::                \\     -  Backslash
      :::                \b     -  Backspace
      :::                \f     -  Formfeed
      :::                \n     -  Newline
      :::                \r     -  Carriage Return
      :::                \t     -  Horizontal Tab
      :::                \v     -  Vertical Tab
      :::                \xnn   -  Ascii (Latin 1) character expressed as 2 hex digits
      :::                \unnnn -  Unicode character expressed as 4 hex digits
      :::
      :::                Escape sequences are supported even when the L option is used.
      :::
      :::            S - The source is read from an environment variable instead of
      :::                from stdin. The name of the source environment variable is
      :::                specified in the next argument after the option string.
      :::
      
      ::************ Batch portion ***********
      @echo off
      if .%2 equ . (
        if "%~1" equ "/?" (
          findstr "^:::" "%~f0" | cscript //E:JScript //nologo "%~f0" "^:::" ""
          exit /b 0
        ) else (
          call :err "Insufficient arguments"
          exit /b 1
        )
      )
      echo(%~3|findstr /i "[^SMILEX]" >nul && (
        call :err "Invalid option(s)"
        exit /b 1
      )
      cscript //E:JScript //nologo "%~f0" %*
      exit /b 0
      
      :err
      >&2 echo ERROR: %~1. Use REPL /? to get help.
      exit /b
      
      ************* JScript portion **********/
      var env=WScript.CreateObject("WScript.Shell").Environment("Process");
      var args=WScript.Arguments;
      var search=args.Item(0);
      var replace=args.Item(1);
      var options="g";
      if (args.length>2) {
        options+=args.Item(2).toLowerCase();
      }
      var multi=(options.indexOf("m")>=0);
      var srcVar=(options.indexOf("s")>=0);
      if (srcVar) {
        options=options.replace(/s/g,"");
      }
      if (options.indexOf("e")>=0) {
        options=options.replace(/e/g,"");
        search=env(search);
        replace=env(replace);
      }
      if (options.indexOf("l")>=0) {
        options=options.replace(/l/g,"");
        search=search.replace(/([.^$*+?()[{\\|])/g,"\\$1");
        replace=replace.replace(/\$/g,"$$$$");
      }
      if (options.indexOf("x")>=0) {
        options=options.replace(/x/g,"");
        replace=replace.replace(/\\\\/g,"\\B");
        replace=replace.replace(/\\b/g,"\b");
        replace=replace.replace(/\\f/g,"\f");
        replace=replace.replace(/\\n/g,"\n");
        replace=replace.replace(/\\r/g,"\r");
        replace=replace.replace(/\\t/g,"\t");
        replace=replace.replace(/\\v/g,"\v");
        replace=replace.replace(/\\x[0-9a-fA-F]{2}|\\u[0-9a-fA-F]{4}/g,
          function($0,$1,$2){
            return String.fromCharCode(parseInt("0x"+$0.substring(2)));
          }
        );
        replace=replace.replace(/\\B/g,"\\");
      }
      var search=new RegExp(search,options);
      
      if (srcVar) {
        WScript.Stdout.Write(env(args.Item(3)).replace(search,replace));
      } else {
        while (!WScript.StdIn.AtEndOfStream) {
          if (multi) {
            WScript.Stdout.Write(WScript.StdIn.ReadAll().replace(search,replace));
          } else {
            WScript.Stdout.WriteLine(WScript.StdIn.ReadLine().replace(search,replace));
          }
        }
      }
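
The batch solution hard-codes both headers and assumes there are no quoted commas in the first column of the second file. For readers who can run Python, a CSV-aware library avoids both assumptions. The following is a minimal sketch, not part of the original answer: `merge_csv` is a hypothetical helper built on the standard `csv` module, and it fills missing columns with an empty string rather than the single space shown in the question.

```python
import csv
import io


def merge_csv(sources, out):
    """Merge CSV streams with differing headers; return the combined header."""
    header = []
    tables = []
    for src in sources:
        reader = csv.DictReader(src)
        rows = list(reader)
        # Build the union of column names, keeping first-seen order.
        for name in reader.fieldnames or []:
            if name not in header:
                header.append(name)
        tables.append(rows)
    # restval supplies the value for columns a row does not have.
    writer = csv.DictWriter(out, fieldnames=header, restval="",
                            lineterminator="\n")
    writer.writeheader()
    for rows in tables:
        writer.writerows(rows)
    return header


# In-memory stand-ins for the two files from the question.
file1 = io.StringIO("header1,header2,header3,header4\na,b,c,d\n")
file2 = io.StringIO("header1,header3,header4,header5\ne,f,g,h\n")
merged = io.StringIO()
merge_csv([file1, file2], merged)
print(merged.getvalue())
```

With real files, open each one with `newline=""` as the `csv` module documentation recommends, and pass the open file objects in place of the `io.StringIO` buffers. The column order here is "first seen", which happens to match the desired output for the two sample files; a different ordering rule would need an explicit header list.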
      

      【Comments】:
