Splitting a string only when the delimiter is not enclosed in quotation marks (answers)

【Question title】: Splitting a string only when the delimiter is not enclosed in quotation marks
【Posted】: 2009-12-30 19:10:09
【Question】:

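I need to write a split function in JavaScript that splits a string into an array on commas... but only on commas that are not enclosed in quotation marks (' or ").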

Here are three examples and what the result (an array) should look like:

"peanut, butter, jelly"
  -> ["peanut", "butter", "jelly"]

"peanut, 'butter, bread', 'jelly'"
  -> ["peanut", "butter, bread", "jelly"]

'peanut, "butter, bread", "jelly"'
  -> ["peanut", 'butter, bread', "jelly"]

The reason I can't use JavaScript's split method is that it also splits when the delimiter is enclosed in quotes.
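To make the problem concrete, here is a minimal sketch of what the built-in split does with the second example. The variable names are only for illustration.

```javascript
// Naive split: the comma inside the quoted segment is treated as a delimiter too.
var input = "peanut, 'butter, bread', 'jelly'";
var naive = input.split(",");

console.log(naive);
// -> ["peanut", " 'butter", " bread'", " 'jelly'"]  (four pieces instead of three)
```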

How can I accomplish this, perhaps with a regular expression?

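As for context, I will use this to split the arguments passed in the third element of the third parameter of the function you create when extending jQuery's $.expr[':']. This parameter is usually named meta, and it is an array containing certain information about the filter.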

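Anyway, the third element of this array is a string containing the arguments passed to the filter; since the arguments arrive as a string, I need to split them correctly before I can parse them.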

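【Comments】:

Can you control the whole collection to ensure that all elements are enclosed in single quotes and do not themselves contain any single quotes?
More context on this question would be interesting. It looks like you are trying to parse JavaScript, or really JSON, from a string. There may be better ways to solve this than regex and split, even in the simplest case of parsing an array like this.
Regex is the wrong tool, as has been discussed many times...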

Tags: javascript regex

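【Answer 1】:

What you're asking for is essentially a JavaScript CSV parser. Do a Google search for "Javascript CSV Parser" and you'll get lots of hits, many of them with complete scripts. See also: Javascript code to parse CSV data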
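For a sense of what such a parser does at its core, here is a minimal sketch of a character-by-character scanner. It is not taken from any of the scripts mentioned above; splitOutsideQuotes is a hypothetical name, and the sketch assumes that quotes are not escaped, that the surrounding quotes should be dropped, and that whitespace around each field should be trimmed (as in the question's examples).

```javascript
// Hypothetical helper: split on commas that are outside '...' or "..." pairs.
function splitOutsideQuotes(str) {
  var parts = [];
  var current = "";
  var quote = null; // the quote character we are currently inside, or null

  for (var i = 0; i < str.length; ++i) {
    var ch = str.charAt(i);
    if (quote) {
      if (ch === quote) {
        quote = null;    // closing quote: leave quoted mode, drop the quote
      } else {
        current += ch;   // inside quotes everything is literal, commas included
      }
    } else if (ch === '"' || ch === "'") {
      quote = ch;        // opening quote: enter quoted mode, drop the quote
    } else if (ch === ",") {
      parts.push(current.replace(/^\s+|\s+$/g, "")); // field ends: trim and store
      current = "";
    } else {
      current += ch;
    }
  }
  parts.push(current.replace(/^\s+|\s+$/g, "")); // last field
  return parts;
}

console.log(splitOutsideQuotes("peanut, 'butter, bread', 'jelly'"));
// -> ["peanut", "butter, bread", "jelly"]
```

An unterminated quote simply swallows the rest of the string in this sketch; a real CSV parser would report that as an error.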

【Comments】:

As an aside, "lexer" is a more fitting term than "parser" for this problem.

【Answer 2】:

Well, I already have a jackhammer of a solution lying around (general-purpose code written for something else), so just for kicks . . .

function Lexer () {
  this.setIndex = false;
  this.useNew = false;
  for (var i = 0; i < arguments.length; ++i) {
    var arg = arguments [i];
    if (arg === Lexer.USE_NEW) {
      this.useNew = true;
    }
    else if (arg === Lexer.SET_INDEX) {
      this.setIndex = Lexer.DEFAULT_INDEX;
    }
    else if (arg instanceof Lexer.SET_INDEX) {
      this.setIndex = arg.indexProp;
    }
  }
  this.rules = [];
  this.errorLexeme = null;
}

Lexer.NULL_LEXEME = {};

Lexer.ERROR_LEXEME = { 
  toString: function () {
    return "[object Lexer.ERROR_LEXEME]";
  }
};

Lexer.DEFAULT_INDEX = "index";

Lexer.USE_NEW = {};

Lexer.SET_INDEX = function (indexProp) {
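  if ( !(this instanceof Lexer.SET_INDEX)) {
    // Called without `new`: construct an instance explicitly.
    // (`new arguments.callee.apply (...)` would try to use `apply` as a constructor and throw.)
    return new Lexer.SET_INDEX (indexProp);
  }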
  if (indexProp === undefined) {
    indexProp = Lexer.DEFAULT_INDEX;
  }
  this.indexProp = indexProp;
};

(function () {
  var New = (function () {
    var fs = [];
    return function () {
      var f = fs [arguments.length];
      if (f) {
        return f.apply (this, arguments);
      }
      var argStrs = [];
      for (var i = 0; i < arguments.length; ++i) {
        argStrs.push ("a[" + i + "]");
      }
      f = new Function ("var a=arguments;return new this(" + argStrs.join () + ");");
      if (arguments.length < 100) {
        fs [arguments.length] = f;
      }
      return f.apply (this, arguments);
    };
  }) ();

  var flagMap = [
      ["global", "g"]
    , ["ignoreCase", "i"]
    , ["multiline", "m"]
    , ["sticky", "y"]
    ];

  function getFlags (regex) {
    var flags = "";
    for (var i = 0; i < flagMap.length; ++i) {
      if (regex [flagMap [i] [0]]) {
        flags += flagMap [i] [1];
      }
    }
    return flags;
  }

  function not (x) {
    return function (y) {
      return x !== y;
    };
  }

  function Rule (regex, lexeme) {
    if (!regex.global) {
      var flags = "g" + getFlags (regex);
      regex = new RegExp (regex.source, flags);
    }
    this.regex = regex;
    this.lexeme = lexeme;
  }

  Lexer.prototype = {
      constructor: Lexer

    , addRule: function (regex, lexeme) {
        var rule = new Rule (regex, lexeme);
        this.rules.push (rule);
      }

    , setErrorLexeme: function (lexeme) {
        this.errorLexeme = lexeme;
      }

    , runLexeme: function (lexeme, exec) {
        if (typeof lexeme !== "function") {
          return lexeme;
        }
        var args = exec.concat (exec.index, exec.input);
        if (this.useNew) {
          return New.apply (lexeme, args);
        }
        return lexeme.apply (null, args);
      }

    , lex: function (str) {
        var index = 0;
        var lexemes = [];
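        if (this.setIndex) {
          // Capture the property name: inside the custom push, `this` is the array, not the Lexer.
          var indexProp = this.setIndex;
          lexemes.push = function () {
            for (var i = 0; i < arguments.length; ++i) {
              if (arguments [i]) {
                arguments [i] [indexProp] = index;
              }
            }
            return Array.prototype.push.apply (this, arguments);
          };
        }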
        while (index < str.length) {
          var bestExec = null;
          var bestRule = null;
          for (var i = 0; i < this.rules.length; ++i) {
            var rule = this.rules [i];
            rule.regex.lastIndex = index;
            var exec = rule.regex.exec (str);
            if (exec) {
              var doUpdate = !bestExec 
                || (exec.index < bestExec.index)
                || (exec.index === bestExec.index && exec [0].length > bestExec [0].length)
                ;
              if (doUpdate) {
                bestExec = exec;
                bestRule = rule;
              }
            }
          }
          if (!bestExec) {
            if (this.errorLexeme) {
              lexemes.push (this.errorLexeme);
              return lexemes.filter (not (Lexer.NULL_LEXEME));
            }
            // No rule matched: skip one character and try again.
            ++index;
          }
          else {
            if (this.errorLexeme && index !== bestExec.index) {
              lexemes.push (this.errorLexeme);
            }
            var lexeme = this.runLexeme (bestRule.lexeme, bestExec);
            lexemes.push (lexeme);
            // Advance only when a rule matched; bestRule is null otherwise.
            index = bestRule.regex.lastIndex;
          }
        }
        return lexemes.filter (not (Lexer.NULL_LEXEME));
      }
  };
}) ();

if (!Array.prototype.filter) {
  Array.prototype.filter = function (fun) {
    var len = this.length >>> 0;
    var res = [];
    var thisp = arguments [1];
    for (var i = 0; i < len; ++i) {
      if (i in this) {
        var val = this [i];
        if (fun.call (thisp, val, i, this)) {
          res.push (val);
        }
      }
    }
    return res;
  };
}

Now, using that code to solve your problem:

function trim (str) {
  str = str.replace (/^\s+/, "");
  str = str.replace (/\s+$/, "");
  return str;
}

var splitter = new Lexer ();
splitter.setErrorLexeme (Lexer.ERROR_LEXEME);
splitter.addRule (/[^,"]*"[^"]*"[^,"]*/g, trim);
splitter.addRule (/[^,']*'[^']*'[^,']*/g, trim);
splitter.addRule (/[^,"']+/g, trim);
splitter.addRule (/,/g, Lexer.NULL_LEXEME);

var strs = [
    "peanut, butter, jelly"
  , "peanut, 'butter, bread', 'jelly'"
  , 'peanut, "butter, bread", "jelly"'
  ];

// NOTE: I'm lazy here, so I'm using Array.prototype.map, 
//       which isn't supported in all browsers.
var splitStrs = strs.map (function (str) {
  return splitter.lex (str);
});
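One thing to be aware of: as far as I can tell from reading the rules above, the quoted rules match the quotation marks as part of the lexeme, so a field comes back as 'butter, bread' with its quotes still attached, whereas the question wants them removed. A small post-processing step could take care of that. The helper below is a hypothetical sketch, not part of the Lexer code; it assumes a lexeme is either a string or some other object (such as the error lexeme) that should be passed through untouched.

```javascript
// Hypothetical helper: remove one pair of matching quotes wrapping the whole lexeme.
function stripQuotes(lexeme) {
  if (typeof lexeme !== "string") {
    return lexeme; // e.g. an error lexeme object: leave it alone
  }
  // Group 1 is the opening quote; \1 requires the same character at the end.
  var m = /^(['"])([\s\S]*)\1$/.exec(lexeme);
  return m ? m[2] : lexeme;
}

console.log(stripQuotes("'butter, bread'")); // -> butter, bread
console.log(stripQuotes("peanut"));          // -> peanut
```

With the lexer in place this could be applied per field, for example by calling stripQuotes inside the callback passed to addRule after trimming.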

【Comments】:

【Answer 3】:

var str = 'text, foo, "haha, dude", bar';
var fragments = str.match(/[a-z]+|(['"]).*?\1/g);

Better (supports escaped " or ' inside strings):

var str = 'text_123 space, foo, "text, here\", dude", bar, \'one, two\', blob';
var fragments = str.match(/[^"', ][^"',]+[^"', ]|(["'])(?:[^\1\\\\]|\\\\.)*\1/g);

// Result:
0: text_123 space
1: foo
2: "text, here\", dude"
3: bar
4: 'one, two'
5: blob

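【Comments】:

Doesn't handle swapped ' and " delimiters, or non-alphabetic characters.
I ran it with a simple test, "a,b", and it returned null. Apparently it doesn't extract words shorter than 3 characters.
Some other problems: 1) \1 doesn't work as a group reference inside a character class (it just matches 1); 2) you only need two backslashes in a regex literal to match a literal backslash, not four; 3) to put a backslash-quote in the target string, you have to write \\" in the string literal, not \"; 4) unless the asker says otherwise, whitespace should be treated as significant in CSV data (which means the first part of your regex should just be ["',\s]+).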
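For what it's worth, here is a sketch of how the points raised in these comments might be addressed. It is not the answerer's code: matchFields is a hypothetical name, each quote type gets its own alternative instead of a backreference inside a character class, a literal backslash is written with two backslashes, and the test string uses \\" so the backslash actually reaches the regex. Contrary to point 4, this sketch assumes unquoted fields should be trimmed, since that is what the question's examples show; empty fields are dropped and escape sequences are left in the output as-is.

```javascript
// Hypothetical helper: collect fields with a regex instead of splitting.
function matchFields(str) {
  // Alternative 1: "..." with backslash escapes; 2: '...' likewise; 3: unquoted run.
  var re = /"((?:[^"\\]|\\.)*)"|'((?:[^'\\]|\\.)*)'|([^,"']+)/g;
  var fields = [];
  var m;
  while ((m = re.exec(str)) !== null) {
    if (m[1] !== undefined) {
      fields.push(m[1]);               // contents of a double-quoted field
    } else if (m[2] !== undefined) {
      fields.push(m[2]);               // contents of a single-quoted field
    } else {
      var text = m[3].replace(/^\s+|\s+$/g, "");
      if (text !== "") {
        fields.push(text);             // unquoted field, trimmed; blanks are skipped
      }
    }
  }
  return fields;
}

var str = 'text_123 space, foo, "text, here\\", dude", bar, \'one, two\', blob';
var fragments = matchFields(str);
console.log(fragments.length);     // -> 6
console.log(matchFields("a,b"));   // -> ["a", "b"]
```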

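【Answer 4】:

If you can control the input so that the string as a whole is enclosed in double quotes ", every element inside it is enclosed in single quotes ', and no element may itself contain a single quote, then you can split on , '. If you can't control the input, then using regex to sort/filter/split it is about as useful as using regex to match xhtml (see: RegEx match open tags except XHTML self-contained tags)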
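Under those constraints the split becomes trivial. The sketch below is a slight variation on the idea: it splits on the full ', ' boundary rather than on , ' so that no quotes are left behind. splitControlled is a hypothetical name, and the code assumes every element is wrapped in single quotes and elements are separated by exactly a comma and one space.

```javascript
// Hypothetical helper for the controlled-input case only.
// Assumes: every element is single-quoted, separated by ", ",
// and no element contains a single quote itself.
function splitControlled(str) {
  if (str.length < 2) {
    return [];
  }
  // Drop the very first and very last quote, then split on the
  // quote-comma-space-quote boundary between elements.
  return str.slice(1, -1).split("', '");
}

var input = "'peanut', 'butter, bread', 'jelly'";
console.log(splitControlled(input));
// -> ["peanut", "butter, bread", "jelly"]
```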

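【Comments】:

I don't see what the thread you linked to has to do with this one. OK, I wouldn't use regex in this case either, but the problem faced here has little to do with trying to parse (x)html with regex. That can't be done because of the recursive nature of (x)html, but this problem has nothing to do with that at all. It seems that in every thread containing the word "regex", a link to that thread (#1732348) gets posted...
The point is that if you're not sure what your input might contain, then no regex can parse it correctly.
What I mean is that post #1732348 has little to do with this one. And the fact that you don't know exactly what your input is, is precisely what regex is for: you define a pattern for things that may vary.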