Parsing single-quoted strings with backslash-escaped single quotes with nom: answers

【Question title】: Parsing single-quoted strings with backslash-escaped single quotes with nom
【Posted】: 2021-06-20 16:39:11
【Question】:

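This is a variant of "Parsing single-quoted string with escaped quotes with Nom 5" and "Parse string with escaped single quotes". I want to parse a string like '1 \' 2 \ 3 \\ 4' (a raw character sequence) into "1 \\' 2 \\ 3 \\\\ 4" (a Rust string), so I do not care about any escapes other than the possibility of \' inside the string. My attempts, using the code from the linked questions: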

use nom::{
  branch::alt,
  bytes::complete::{escaped, tag},
  character::complete::none_of,
  combinator::recognize,
  multi::{many0, separated_list0},
  sequence::delimited,
  IResult,
};

fn parse_quoted_1(input: &str) -> IResult<&str, &str> {
  delimited(
    tag("'"),
    alt((escaped(none_of("\\\'"), '\\', tag("'")), tag(""))),
    tag("'"),
  )(input)
}

fn parse_quoted_2(input: &str) -> IResult<&str, &str> {
  delimited(
    tag("'"),
    recognize(separated_list0(tag("\\'"), many0(none_of("'")))),
    tag("'"),
  )(input)
}

fn main() {
  println!("{:?}", parse_quoted_1(r#"'1'"#));
  println!("{:?}", parse_quoted_2(r#"'1'"#));
  println!("{:?}", parse_quoted_1(r#"'1 \' 2'"#));
  println!("{:?}", parse_quoted_2(r#"'1 \' 2'"#));
  println!("{:?}", parse_quoted_1(r#"'1 \' 2 \ 3'"#));
  println!("{:?}", parse_quoted_2(r#"'1 \' 2 \ 3'"#));
  println!("{:?}", parse_quoted_1(r#"'1 \' 2 \ 3 \\ 4'"#));
  println!("{:?}", parse_quoted_2(r#"'1 \' 2 \ 3 \\ 4'"#));
}

/*
Ok(("", "1"))
Ok(("", "1"))
Ok(("", "1 \\' 2"))
Ok((" 2'", "1 \\"))
Err(Error(Error { input: "1 \\' 2 \\ 3'", code: Tag }))
Ok((" 2 \\ 3'", "1 \\"))
Err(Error(Error { input: "1 \\' 2 \\ 3 \\\\ 4'", code: Tag }))
Ok((" 2 \\ 3 \\\\ 4'", "1 \\"))
*/

Only the first 3 cases work as expected.

【Question comments】:

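Please provide the expected output for each println.
> I want to parse a string like '1 \' 2 \ 3 \\ 4' (a raw character sequence) into "1 \\' 2 \\ 3 \\\\ 4" (a Rust string)
I guess I could do this by looping over the input manually, but maybe there is a nice way to do it with combinators.
You can do string.replace(r#"\"#, r#"\\"#);
I am parsing files of several GB, and strings like this are just one case in their grammar. Since the input is several GB, I cannot run replace over it up front, and doing so would not be faithful to the rest of the grammar anyway.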

Tags: parsing rust nom

【Solution 1】:

A not-so-nice, imperative solution:

use nom::{bytes::complete::take, character::complete::char, sequence::delimited, IResult};

fn parse_quoted(input: &str) -> IResult<&str, &str> {
  fn escaped(input: &str) -> IResult<&str, &str> {
    let mut pc = 0 as char;
    let mut n = 0;
    for (i, c) in input.chars().enumerate() {
      if c == '\'' && pc != '\\' {
        break;
      }
      pc = c;
      n = i + 1;
    }
    take(n)(input)
  }
  delimited(char('\''), escaped, char('\''))(input)
}

fn main() {
  println!("{:?}", parse_quoted(r#"'' ..."#));
  println!("{:?}", parse_quoted(r#"'1' ..."#));
  println!("{:?}", parse_quoted(r#"'1 \' 2' ..."#));
  println!("{:?}", parse_quoted(r#"'1 \' 2 \ 3' ..."#));
  println!("{:?}", parse_quoted(r#"'1 \' 2 \ 3 \\ 4' ..."#));
}

/*
Ok((" ...", ""))
Ok((" ...", "1"))
Ok((" ...", "1 \\' 2"))
Ok((" ...", "1 \\' 2 \\ 3"))
Ok((" ...", "1 \\' 2 \\ 3 \\\\ 4"))
*/

To allow '...\\' (a string that ends in an escaped backslash), we can similarly keep track of more of the previous characters:

    let mut pc = 0 as char;
    let mut ppc = 0 as char;
    let mut pppc = 0 as char;
    let mut n = 0;
    for (i, c) in input.chars().enumerate() {
      if (c == '\'' && pc != '\\') || (c == '\'' && pc == '\\' && ppc == '\\' && pppc != '\\') {
        break;
      }
      pppc = ppc;
      ppc = pc;
      pc = c;
      n = i + 1;
    }
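
Keeping a fixed number of previous characters still breaks once the run of backslashes is longer than that window (four backslashes before the closing quote, for example). One possible alternative, sketched here with the standard library only, is to keep a single "escape pending" flag, so that a backslash always escapes exactly the next character. The names split_at_unescaped_quote and parse_quoted_flag are made up for this illustration and are not part of nom or of the answer above; the (rest, body) order simply mirrors nom's convention.

```rust
/// Hypothetical helper (an assumption, not from the answer): splits `input`
/// at the first single quote that is not escaped by a backslash.
/// Returns (body, rest), where `rest` starts at that quote, or is empty
/// when no unescaped quote exists.
fn split_at_unescaped_quote(input: &str) -> (&str, &str) {
    let mut escaped = false;
    for (i, c) in input.char_indices() {
        if escaped {
            // This character was escaped by the previous backslash.
            escaped = false;
        } else if c == '\\' {
            // A backslash escapes whatever character comes next.
            escaped = true;
        } else if c == '\'' {
            // `i` is a byte offset on a char boundary, so split_at is safe.
            return input.split_at(i);
        }
    }
    (input, "")
}

/// Parses `'...'` at the start of `input`.
/// Returns Some((rest, body)), or None if a quote is missing.
fn parse_quoted_flag(input: &str) -> Option<(&str, &str)> {
    let after_open = input.strip_prefix('\'')?;
    let (body, rest) = split_at_unescaped_quote(after_open);
    let rest = rest.strip_prefix('\'')?;
    Some((rest, body))
}
```

Under these assumptions, the five inputs from main above give the same bodies as before, and '1 \\' ... now stops at the closing quote. To plug this into a nom parser, the scanning function would still need to be wrapped so that it returns an IResult.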

【Comments】:

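【Solution 2】:

This is how I parse quoted strings.

It returns a Cow: a borrowed reference into the original string when nothing needs unescaping, or otherwise an owned copy of the string with the escaping backslashes removed.

You may need to adjust is_qdtext and is_quoted_char to your needs.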

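use std::borrow::Cow;

use nom::{
    bytes::complete::tag,
    character::complete::satisfy,
    sequence::{pair, tuple},
    IResult, InputTakeAtPosition,
};

// Characters that may appear unescaped inside the quotes: tab, space,
// printable ASCII except the single quote and the backslash, and any
// non-ASCII character.
fn is_qdtext(chr: char) -> bool {
    match chr {
        '\t' | ' ' => true,
        '!'..='&' => true,
        '('..='[' => true,
        ']'..='~' => true,
        // `chr as u8` would truncate code points above U+00FF
        _ => !chr.is_ascii(),
    }
}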

// Characters that may follow a backslash: tab, printable ASCII,
// and any non-ASCII character.
fn is_quoted_char(chr: char) -> bool {
    match chr {
        ' '..='~' => true,
        '\t' => true,
        // `chr as u8` would truncate code points above U+00FF
        _ => !chr.is_ascii(),
    }
}

/// parse single escaped character
fn parse_quoted_pair(data: &str) -> IResult<&str, char> {
    let (data, (_, chr)) = pair(tag("\\"), satisfy(is_quoted_char))(data)?;
    Ok((data, chr))
}

// parse content of quoted string
fn parse_quoted_content(data: &str) -> IResult<&str, Cow<'_, str>> {
    let (mut data, content) = data.split_at_position_complete(|item| !is_qdtext(item))?;

    if data.chars().next() == Some('\\') {
        // some characters need unescaping, so build an owned copy
        let mut content = content.to_string();
        while data.chars().next() == Some('\\') {
            // unescape next char
            let (next_data, chr) = parse_quoted_pair(data)?;
            content.push(chr);
            data = next_data;

            // parse next plain text chunk
            let (next_data, extra_content) =
                data.split_at_position_complete(|item| !is_qdtext(item))?;
            content.push_str(extra_content);
            data = next_data;
        }
        Ok((data, Cow::Owned(content)))
    } else {
        // fast path: nothing to unescape, so borrow from the input
        Ok((data, Cow::Borrowed(content)))
    }
}

fn parse_quoted_string(data: &str) -> IResult<&str, Cow<'_, str>> {
    let (data, (_, content, _)) = tuple((tag("'"), parse_quoted_content, tag("'")))(data)?;

    Ok((data, content))
}
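
The borrow-or-copy behaviour described above can also be looked at in isolation. The following is a minimal standard-library sketch, not the answer's code: unescape is a hypothetical helper that takes the already extracted body of a quoted string, and unlike is_qdtext / is_quoted_char it accepts every character, which is a simplifying assumption.

```rust
use std::borrow::Cow;

/// Hypothetical helper (an assumption, not part of the answer): removes the
/// backslash from every `\x` pair in `body`.
/// Borrows when `body` contains no backslash, allocates only otherwise.
fn unescape(body: &str) -> Cow<'_, str> {
    if !body.contains('\\') {
        // Nothing to rewrite: hand back a reference into the input.
        return Cow::Borrowed(body);
    }
    let mut out = String::with_capacity(body.len());
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            // Keep the escaped character; a trailing lone backslash is kept as-is.
            out.push(chars.next().unwrap_or('\\'));
        } else {
            out.push(c);
        }
    }
    Cow::Owned(out)
}
```

With this sketch, unescape("abc") borrows from its argument, while unescape(r"1 \' 2") allocates and yields 1 ' 2. For multi-GB inputs where most strings contain no escapes, this keeps allocation limited to the strings that actually need rewriting.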

【Comments】: