【Question title】: How can I case fold a string in Rust?
【Posted】: 2016-10-25 22:49:14
【Question description】:

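I'm writing a simple full-text search library and need case folding to check whether two words are equal. For this use case, the existing .to_lowercase() and .to_uppercase() methods are not enough.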
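A minimal, standard-library-only sketch of why neither method is sufficient on its own. The helper names eq_lower and eq_upper are hypothetical, and the two character pairs are illustrative picks: one pair only matches after uppercasing, the other only after lowercasing.

```rust
/// Hypothetical helper: compare two strings after lowercasing both.
fn eq_lower(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

/// Hypothetical helper: compare two strings after uppercasing both.
fn eq_upper(a: &str, b: &str) -> bool {
    a.to_uppercase() == b.to_uppercase()
}

fn main() {
    // U+03C2 (final sigma) vs U+03C3 (sigma): equal only after uppercasing.
    assert!(!eq_lower("\u{03C2}", "\u{03C3}"));
    assert!(eq_upper("\u{03C2}", "\u{03C3}"));

    // U+00E5 (a with ring above) vs U+212B (angstrom sign): equal only after lowercasing.
    assert!(eq_lower("\u{00E5}", "\u{212B}"));
    assert!(!eq_upper("\u{00E5}", "\u{212B}"));
}
```

Whichever single mapping is chosen, at least one of these pairs is reported as different, which is the gap case folding is meant to close.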

From a quick search of crates.io, I can find libraries for normalization and word splitting, but not for case folding. regex-syntax does have case folding code, but it isn't exposed in its API.

【Comments】:

Tags: unicode rust


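【Answer 1】:

For my use case, I've found the caseless crate to be the most useful.

As far as I know, this is the only library that supports normalization. This matters when you want, for example, "㎒" (U+3392 SQUARE MHZ) and "mhz" to match. See Chapter 3 - Default Caseless Matching in the Unicode Standard for details on how this works.

Here is some example code that matches strings case-insensitively: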

extern crate caseless;
use caseless::Caseless;

let a = "100 ㎒";
let b = "100 mhz";

// These strings don't match with just case folding,
// but do match after compatibility (NFKD) normalization
assert!(!caseless::default_caseless_match_str(a, b));
assert!(caseless::compatibility_caseless_match_str(a, b));

To get the case-folded string directly, you can use the default_case_fold_str function:

let s = "Twilight Sparkle ちゃん";
assert_eq!(caseless::default_case_fold_str(s), "twilight sparkle ちゃん");

caseless doesn't expose a corresponding function that also normalizes, but you can write one using the unicode-normalization crate:

extern crate unicode_normalization;
use caseless::Caseless;
use unicode_normalization::UnicodeNormalization;

fn compatibility_case_fold(s: &str) -> String {
    s.nfd().default_case_fold().nfkd().default_case_fold().nfkd().collect()
}

let a = "100 ㎒";
assert_eq!(compatibility_case_fold(a), "100 mhz");

Note that multiple rounds of normalization and case folding are needed to get a correct result.
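For reference, the chain of calls in compatibility_case_fold appears to mirror the definition of a compatibility caseless match in Section 3.13 of the Unicode Standard (D146, if the numbering is recalled correctly; worth checking against the current edition). Two strings X and Y match when:

```latex
\mathrm{NFKD}\bigl(\mathrm{toCasefold}\bigl(\mathrm{NFKD}\bigl(\mathrm{toCasefold}(\mathrm{NFD}(X))\bigr)\bigr)\bigr)
=
\mathrm{NFKD}\bigl(\mathrm{toCasefold}\bigl(\mathrm{NFKD}\bigl(\mathrm{toCasefold}(\mathrm{NFD}(Y))\bigr)\bigr)\bigr)
```

Reading from the inside out, this is the same order as nfd, default_case_fold, nfkd, default_case_fold, nfkd in the function above.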

(Thanks to BurntSushi5 for pointing me to this library.)

【Comments】:

  • This answer is five years old. Would you do anything differently today?
【Answer 2】:

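The unicase crate doesn't expose case folding directly, but it provides a generic wrapper type that implements Eq, Ord and Hash in a case-insensitive manner. The master branch (unreleased) supports both ASCII case folding (as an optimization) and Unicode case folding (though only invariant case folding is supported).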
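To illustrate the wrapper-type idea, here is a minimal standard-library sketch. AsciiCaseless and folded are hypothetical names, this is not unicase's implementation or API, and it ignores ASCII case only (no Unicode case folding).

```rust
use std::cmp::Ordering;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical wrapper (not part of unicase): Eq, Ord and Hash
/// all ignore ASCII case, and only ASCII case.
#[derive(Debug, Clone, Copy)]
struct AsciiCaseless<'a>(&'a str);

/// Bytes of `s` with ASCII letters lowercased; non-ASCII bytes are untouched.
fn folded(s: &str) -> impl Iterator<Item = u8> + '_ {
    s.bytes().map(|b| b.to_ascii_lowercase())
}

impl PartialEq for AsciiCaseless<'_> {
    fn eq(&self, other: &Self) -> bool {
        self.0.eq_ignore_ascii_case(other.0)
    }
}
impl Eq for AsciiCaseless<'_> {}

impl PartialOrd for AsciiCaseless<'_> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for AsciiCaseless<'_> {
    fn cmp(&self, other: &Self) -> Ordering {
        folded(self.0).cmp(folded(other.0))
    }
}

impl Hash for AsciiCaseless<'_> {
    // Hash the folded bytes so that equal values hash equally.
    fn hash<H: Hasher>(&self, state: &mut H) {
        for b in folded(self.0) {
            state.write_u8(b);
        }
        state.write_u8(0xff); // terminator; 0xff never occurs in UTF-8
    }
}

fn main() {
    assert_eq!(AsciiCaseless("Rust"), AsciiCaseless("rUST"));
    assert!(AsciiCaseless("apple") < AsciiCaseless("BANANA"));

    let mut set = HashSet::new();
    set.insert(AsciiCaseless("Hello"));
    assert!(set.contains(&AsciiCaseless("HELLO")));
}
```

The point of the wrapper design is that the original string is kept as-is while comparisons and hash lookups behave caselessly, so it can be used directly as a HashMap or BTreeMap key.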

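【Comments】:

    【Answer 3】:

    In case anyone does want to stick with the standard library, I wanted some actual data on this. I pulled the full list of two-byte characters that fail with to_lowercase or to_uppercase. Then I ran this test: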

    fn lowercase(left: char, right: char) -> bool {
       for c in left.to_lowercase() {
          for d in right.to_lowercase() {
             if c == d { return true }
          }
       }
       false
    }
    
    fn uppercase(left: char, right: char) -> bool {
       for c in left.to_uppercase() {
          for d in right.to_uppercase() {
             if c == d { return true }
          }
       }
       false
    }
    
    fn main() {
       let pairs = &[
          &['\u{00E5}','\u{212B}'],&['\u{00C5}','\u{212B}'],&['\u{0399}','\u{1FBE}'],
          &['\u{03B9}','\u{1FBE}'],&['\u{03B2}','\u{03D0}'],&['\u{03B5}','\u{03F5}'],
          &['\u{03B8}','\u{03D1}'],&['\u{03B8}','\u{03F4}'],&['\u{03D1}','\u{03F4}'],
          &['\u{03B9}','\u{1FBE}'],&['\u{0345}','\u{03B9}'],&['\u{0345}','\u{1FBE}'],
          &['\u{03BA}','\u{03F0}'],&['\u{00B5}','\u{03BC}'],&['\u{03C0}','\u{03D6}'],
          &['\u{03C1}','\u{03F1}'],&['\u{03C2}','\u{03C3}'],&['\u{03C6}','\u{03D5}'],
          &['\u{03C9}','\u{2126}'],&['\u{0392}','\u{03D0}'],&['\u{0395}','\u{03F5}'],
          &['\u{03D1}','\u{03F4}'],&['\u{0398}','\u{03D1}'],&['\u{0398}','\u{03F4}'],
          &['\u{0345}','\u{1FBE}'],&['\u{0345}','\u{0399}'],&['\u{0399}','\u{1FBE}'],
          &['\u{039A}','\u{03F0}'],&['\u{00B5}','\u{039C}'],&['\u{03A0}','\u{03D6}'],
          &['\u{03A1}','\u{03F1}'],&['\u{03A3}','\u{03C2}'],&['\u{03A6}','\u{03D5}'],
          &['\u{03A9}','\u{2126}'],&['\u{0398}','\u{03F4}'],&['\u{03B8}','\u{03F4}'],
          &['\u{03B8}','\u{03D1}'],&['\u{0398}','\u{03D1}'],&['\u{0432}','\u{1C80}'],
          &['\u{0434}','\u{1C81}'],&['\u{043E}','\u{1C82}'],&['\u{0441}','\u{1C83}'],
          &['\u{0442}','\u{1C84}'],&['\u{0442}','\u{1C85}'],&['\u{1C84}','\u{1C85}'],
          &['\u{044A}','\u{1C86}'],&['\u{0412}','\u{1C80}'],&['\u{0414}','\u{1C81}'],
          &['\u{041E}','\u{1C82}'],&['\u{0421}','\u{1C83}'],&['\u{1C84}','\u{1C85}'],
          &['\u{0422}','\u{1C84}'],&['\u{0422}','\u{1C85}'],&['\u{042A}','\u{1C86}'],
          &['\u{0463}','\u{1C87}'],&['\u{0462}','\u{1C87}']
       ];
    
       let (mut upper, mut lower) = (0, 0);
    
       for pair in pairs.iter() {
          print!("U+{:04X} ", pair[0] as u32);
          print!("U+{:04X} pass: ", pair[1] as u32);
          if uppercase(pair[0], pair[1]) {
             print!("to_uppercase ");
             upper += 1;
          } else {
             print!("             ");
          }
          if lowercase(pair[0], pair[1]) {
             print!("to_lowercase");
             lower += 1;
          }
          println!();
       }
    
       println!("upper pass: {}, lower pass: {}", upper, lower);
    }
    

    Results are below. Interestingly, one of the pairs fails with both. But based on this, to_uppercase is the best option.

    U+00E5 U+212B pass:              to_lowercase
    U+00C5 U+212B pass:              to_lowercase
    U+0399 U+1FBE pass: to_uppercase
    U+03B9 U+1FBE pass: to_uppercase
    U+03B2 U+03D0 pass: to_uppercase
    U+03B5 U+03F5 pass: to_uppercase
    U+03B8 U+03D1 pass: to_uppercase
    U+03B8 U+03F4 pass:              to_lowercase
    U+03D1 U+03F4 pass:
    U+03B9 U+1FBE pass: to_uppercase
    U+0345 U+03B9 pass: to_uppercase
    U+0345 U+1FBE pass: to_uppercase
    U+03BA U+03F0 pass: to_uppercase
    U+00B5 U+03BC pass: to_uppercase
    U+03C0 U+03D6 pass: to_uppercase
    U+03C1 U+03F1 pass: to_uppercase
    U+03C2 U+03C3 pass: to_uppercase
    U+03C6 U+03D5 pass: to_uppercase
    U+03C9 U+2126 pass:              to_lowercase
    U+0392 U+03D0 pass: to_uppercase
    U+0395 U+03F5 pass: to_uppercase
    U+03D1 U+03F4 pass:
    U+0398 U+03D1 pass: to_uppercase
    U+0398 U+03F4 pass:              to_lowercase
    U+0345 U+1FBE pass: to_uppercase
    U+0345 U+0399 pass: to_uppercase
    U+0399 U+1FBE pass: to_uppercase
    U+039A U+03F0 pass: to_uppercase
    U+00B5 U+039C pass: to_uppercase
    U+03A0 U+03D6 pass: to_uppercase
    U+03A1 U+03F1 pass: to_uppercase
    U+03A3 U+03C2 pass: to_uppercase
    U+03A6 U+03D5 pass: to_uppercase
    U+03A9 U+2126 pass:              to_lowercase
    U+0398 U+03F4 pass:              to_lowercase
    U+03B8 U+03F4 pass:              to_lowercase
    U+03B8 U+03D1 pass: to_uppercase
    U+0398 U+03D1 pass: to_uppercase
    U+0432 U+1C80 pass: to_uppercase
    U+0434 U+1C81 pass: to_uppercase
    U+043E U+1C82 pass: to_uppercase
    U+0441 U+1C83 pass: to_uppercase
    U+0442 U+1C84 pass: to_uppercase
    U+0442 U+1C85 pass: to_uppercase
    U+1C84 U+1C85 pass: to_uppercase
    U+044A U+1C86 pass: to_uppercase
    U+0412 U+1C80 pass: to_uppercase
    U+0414 U+1C81 pass: to_uppercase
    U+041E U+1C82 pass: to_uppercase
    U+0421 U+1C83 pass: to_uppercase
    U+1C84 U+1C85 pass: to_uppercase
    U+0422 U+1C84 pass: to_uppercase
    U+0422 U+1C85 pass: to_uppercase
    U+042A U+1C86 pass: to_uppercase
    U+0463 U+1C87 pass: to_uppercase
    U+0462 U+1C87 pass: to_uppercase
    upper pass: 46, lower pass: 8
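As a hedged aside, the one pair that fails both checks above (U+03D1 / U+03F4) does compare equal if each character is uppercased and the result is then lowercased. The sketch below uses a hypothetical helper, upper_then_lower, built on the context-free char mappings from the standard library. It is not Unicode case folding and has only been checked against the three pairs shown, so treat it as an approximation rather than a replacement for a proper case-folding crate.

```rust
/// Hypothetical helper: uppercase every char, then lowercase the result.
/// Uses std's per-char mappings; this is NOT Unicode case folding.
fn upper_then_lower(s: &str) -> String {
    s.chars()
        .flat_map(char::to_uppercase)
        .flat_map(char::to_lowercase)
        .collect()
}

fn main() {
    // U+03D1 / U+03F4: failed both to_uppercase and to_lowercase above.
    assert_eq!(upper_then_lower("\u{03D1}"), upper_then_lower("\u{03F4}"));
    // U+00E5 / U+212B: passed only to_lowercase above.
    assert_eq!(upper_then_lower("\u{00E5}"), upper_then_lower("\u{212B}"));
    // U+03C2 / U+03C3: passed only to_uppercase above.
    assert_eq!(upper_then_lower("\u{03C2}"), upper_then_lower("\u{03C3}"));
}
```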
    

    【Comments】:
