【Question title】: Parsing single-quoted strings with backslash-escaped single quotes with nom
【Posted】: 2021-06-20 16:39:11
【Question description】:

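This is a variant of "Parsing single-quoted string with escaped quotes with Nom 5" and "Parse string with escaped single quotes". I want to parse a string like '1 \' 2 \ 3 \\ 4' (a raw character sequence) into "1 \\' 2 \\ 3 \\\\ 4" (a Rust string), so I do not care about any escapes other than the possibility of \' appearing inside the string. My attempts, using the code from the linked questions: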

use nom::{
  branch::alt,
  bytes::complete::{escaped, tag},
  character::complete::none_of,
  combinator::recognize,
  multi::{many0, separated_list0},
  sequence::delimited,
  IResult,
};

fn parse_quoted_1(input: &str) -> IResult<&str, &str> {
  delimited(
    tag("'"),
    alt((escaped(none_of("\\\'"), '\\', tag("'")), tag(""))),
    tag("'"),
  )(input)
}

fn parse_quoted_2(input: &str) -> IResult<&str, &str> {
  delimited(
    tag("'"),
    recognize(separated_list0(tag("\\'"), many0(none_of("'")))),
    tag("'"),
  )(input)
}

fn main() {
  println!("{:?}", parse_quoted_1(r#"'1'"#));
  println!("{:?}", parse_quoted_2(r#"'1'"#));
  println!("{:?}", parse_quoted_1(r#"'1 \' 2'"#));
  println!("{:?}", parse_quoted_2(r#"'1 \' 2'"#));
  println!("{:?}", parse_quoted_1(r#"'1 \' 2 \ 3'"#));
  println!("{:?}", parse_quoted_2(r#"'1 \' 2 \ 3'"#));
  println!("{:?}", parse_quoted_1(r#"'1 \' 2 \ 3 \\ 4'"#));
  println!("{:?}", parse_quoted_2(r#"'1 \' 2 \ 3 \\ 4'"#));
}

/*
Ok(("", "1"))
Ok(("", "1"))
Ok(("", "1 \\' 2"))
Ok((" 2'", "1 \\"))
Err(Error(Error { input: "1 \\' 2 \\ 3'", code: Tag }))
Ok((" 2 \\ 3'", "1 \\"))
Err(Error(Error { input: "1 \\' 2 \\ 3 \\\\ 4'", code: Tag }))
Ok((" 2 \\ 3 \\\\ 4'", "1 \\"))
*/

Only the first 3 cases work as expected.

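【Comments】:

  • Please provide the expected output for each println.
  • > I want to parse a string like '1 \' 2 \ 3 \\ 4' (a raw character sequence) into "1 \\' 2 \\ 3 \\\\ 4" (a Rust string)
  • I guess I could do this by manually looping over the input and so on, but maybe there is a nice way to do it with combinators.
  • You could do string.replace(r#"\"#, r#"\\"#);
  • I am parsing files that are several GB in size, and such strings are just one case in their grammar. Since the input is several GB, I cannot run replace on it up front, and doing so would be unfaithful w.r.t. the rest of the grammar anyway.

Tags: parsing rust nom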


【Solution 1】:

A not-so-nice, imperative solution:

use nom::{bytes::complete::take, character::complete::char, sequence::delimited, IResult};

fn parse_quoted(input: &str) -> IResult<&str, &str> {
  fn escaped(input: &str) -> IResult<&str, &str> {
    let mut pc = 0 as char;
    let mut n = 0;
    for (i, c) in input.chars().enumerate() {
      if c == '\'' && pc != '\\' {
        break;
      }
      pc = c;
      n = i + 1;
    }
    take(n)(input)
  }
  delimited(char('\''), escaped, char('\''))(input)
}

fn main() {
  println!("{:?}", parse_quoted(r#"'' ..."#));
  println!("{:?}", parse_quoted(r#"'1' ..."#));
  println!("{:?}", parse_quoted(r#"'1 \' 2' ..."#));
  println!("{:?}", parse_quoted(r#"'1 \' 2 \ 3' ..."#));
  println!("{:?}", parse_quoted(r#"'1 \' 2 \ 3 \\ 4' ..."#));
}

/*
Ok((" ...", ""))
Ok((" ...", "1"))
Ok((" ...", "1 \\' 2"))
Ok((" ...", "1 \\' 2 \\ 3"))
Ok((" ...", "1 \\' 2 \\ 3 \\\\ 4"))
*/

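To allow '...\\' (a string whose last character is a backslash preceded by another backslash), we can similarly keep track of more of the preceding characters: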

    let mut pc = 0 as char;
    let mut ppc = 0 as char;
    let mut pppc = 0 as char;
    let mut n = 0;
    for (i, c) in input.chars().enumerate() {
      if (c == '\'' && pc != '\\') || (c == '\'' && pc == '\\' && ppc == '\\' && pppc != '\\') {
        break;
      }
      pppc = ppc;
      ppc = pc;
      pc = c;
      n = i + 1;
    }
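
The loops above look back a fixed number of characters. As a more general sketch, assuming the intended rule is that a quote is escaped exactly when an odd number of backslashes immediately precedes it, the length of the backslash run can be counted instead. `split_quoted` is a hypothetical helper, not part of nom; it uses only std and returns `(rest, content)` in the same order as nom's `IResult`.

```rust
/// Hypothetical helper (not a nom API): splits `'content' rest` into
/// `(rest, content)`, or returns `None` if the input does not start with a
/// quote or the closing quote is missing.
/// Assumed rule: a quote preceded by an odd number of backslashes is escaped.
fn split_quoted(input: &str) -> Option<(&str, &str)> {
    let body = input.strip_prefix('\'')?;
    // Length of the run of backslashes directly before the current character.
    let mut backslashes = 0usize;
    for (i, c) in body.char_indices() {
        match c {
            // An unescaped quote closes the string.
            '\'' if backslashes % 2 == 0 => return Some((&body[i + 1..], &body[..i])),
            '\\' => backslashes += 1,
            // Any other character (or an escaped quote) ends the run.
            _ => backslashes = 0,
        }
    }
    None
}

fn main() {
    println!("{:?}", split_quoted(r"'1 \' 2 \ 3 \\ 4' ..."));
    println!("{:?}", split_quoted(r"'ends with \\' ..."));
}
```

Unlike the fixed look-back, this also treats a quote after four, six, or more backslashes as the closing quote. It works on byte offsets from `char_indices`, so no character counting is needed when slicing.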

【Comments】:

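    【Solution 2】:

    This is how I parse quoted strings.

    It returns a Cow: a reference into the original string when nothing needs unescaping, or otherwise an owned copy of the string with the escaping backslashes removed.

    You may need to adjust is_qdtext and is_quoted_char to suit your needs.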

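    // Imports assume nom 6 or 7.
    use std::borrow::Cow;

    use nom::{
        bytes::complete::tag,
        character::complete::satisfy,
        sequence::{pair, tuple},
        IResult, InputTakeAtPosition,
    };

    // Is this a plain character that may appear unescaped inside the quotes?
    fn is_qdtext(chr: char) -> bool {
        match chr {
            '\t' | ' ' | '!' => true,
            // '#'..='[' minus the single quote, which must end the plain text
            '#'..='&' | '('..='[' => true,
            ']'..='~' => true,
            // any non-ASCII character (`chr as u8` would truncate the code point)
            _ => !chr.is_ascii(),
        }
    }
    
    // Is this a character that may follow a backslash?
    fn is_quoted_char(chr: char) -> bool {
        match chr {
            ' '..='~' | '\t' => true,
            _ => !chr.is_ascii(),
        }
    }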
    
    /// parse single escaped character
    fn parse_quoted_pair(data: &str) -> IResult<&str, char> {
        let (data, (_, chr)) = pair(tag("\\"), satisfy(is_quoted_char))(data)?;
        Ok((data, chr))
    }
    
    // parse content of quoted string
    fn parse_quoted_content(data: &str) -> IResult<&str, Cow<'_, str>> {
        let (mut data, content) = data.split_at_position_complete(|item| !is_qdtext(item))?;
    
        if data.chars().next() == Some('\\') {
            // at least one escape sequence follows, so build an owned, unescaped copy
            let mut content = content.to_string();
            while data.chars().next() == Some('\\') {
                // unescape next char
                let (next_data, chr) = parse_quoted_pair(data)?;
                content.push(chr);
                data = next_data;
    
                // parse next plain text chunk
                let (next_data, extra_content) =
                    data.split_at_position_complete(|item| !is_qdtext(item))?;
                content.push_str(extra_content);
                data = next_data;
            }
            Ok((data, Cow::Owned(content)))
        } else {
            // fast path: there is nothing to unescape, so borrow from the input
            Ok((data, Cow::Borrowed(content)))
        }
    }
    
    fn parse_quoted_string(data: &str) -> IResult<&str, Cow<'_, str>> {
        let (data, (_, content, _)) = tuple((tag("'"), parse_quoted_content, tag("'")))(data)?;
    
        Ok((data, content))
    }
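
The borrow-or-own idea behind the Cow return type can be seen in isolation in the std-only sketch below. `unescape` is a hypothetical helper, not part of the answer's code or of nom, and it assumes a simpler rule than the predicates above: a backslash is dropped and whatever character follows it is kept.

```rust
use std::borrow::Cow;

/// Hypothetical helper: removes escaping backslashes from the content of an
/// already-delimited string. Borrows when there is nothing to do.
fn unescape(content: &str) -> Cow<'_, str> {
    if !content.contains('\\') {
        // No escapes at all: hand back the input slice without allocating.
        return Cow::Borrowed(content);
    }
    let mut out = String::with_capacity(content.len());
    let mut chars = content.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                // Drop the backslash, keep the character it escapes.
                Some(next) => out.push(next),
                // Assumption: a trailing lone backslash is kept as-is.
                None => out.push('\\'),
            }
        } else {
            out.push(c);
        }
    }
    Cow::Owned(out)
}

fn main() {
    println!("{:?}", unescape("plain"));
    println!("{:?}", unescape(r"1 \' 2"));
}
```

When most strings in a multi-gigabyte input contain no backslash, the borrowed branch means they cost no allocation or copy.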
    

    【Comments】:
