【Question title】: How do I remove diacritics (accents) from a string in .NET?
【Posted】: 2010-09-19 22:19:12
【Question description】:

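I'm trying to convert some strings that are in French Canadian and, basically, I'd like to take the French accent marks off the letters while keeping the letters themselves (e.g. convert é to e, so crème brûlée would become creme brulee).

What is the best way to achieve this?

【Comments】: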

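  • A warning: this approach may work in some specific cases, but in general you cannot just remove diacritics. In some cases and in some languages this changes the meaning of the text. You don't say why you want to do this; if it is for comparing strings or searching, you are most probably better off using a Unicode-aware library.
  • Since most techniques for achieving this rely on Unicode normalization, the document describing the standard may be useful reading: unicode.org/reports/tr15
  • I think the Azure team fixed this issue: I tried uploading a file named "Mémo de la réunion.pdf" and the operation succeeded.
  • In our case the limitation comes from the ltree data type in a Postgres database, which only allows [a-zA-Z0-9_]. For us it really is necessary, in order to have fast searches.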
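
If the goal is comparison or search rather than producing a stripped string, as the first comment suggests, .NET's culture-aware comparison can ignore combining marks without touching the text. A minimal sketch, assuming the invariant culture is acceptable and the runtime is not running in invariant-globalization mode; the class and method names are made up for illustration:

```csharp
using System;
using System.Globalization;

public static class AccentInsensitiveSketch
{
    private static readonly CompareInfo Invariant = CultureInfo.InvariantCulture.CompareInfo;

    // True when the two strings differ only by non-spacing marks such as accents.
    public static bool EqualsIgnoringAccents(string a, string b)
    {
        return Invariant.Compare(a, b, CompareOptions.IgnoreNonSpace) == 0;
    }

    // Index of the first accent-insensitive match of value in source, or -1.
    public static int IndexOfIgnoringAccents(string source, string value)
    {
        return Invariant.IndexOf(source, value, CompareOptions.IgnoreNonSpace);
    }
}
```

With this, comparing "crème brûlée" against "creme brulee" should report a match while the stored text keeps its accents.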

Tags: .net string diacritics


【Solution 1】:

I haven't used this method myself, but Michael Kaplan describes a way of doing this in his (confusingly titled) blog post about stripping diacritics: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)

static string RemoveDiacritics(string text) 
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();

    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}

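Note that this is a follow-up to his earlier post: Stripping diacritics....

The approach uses String.Normalize to split the input string into its constituent glyphs (basically separating the "base" characters from the diacritics), then scans the result and keeps only the base characters. It's a little complicated, but really you're looking at a complicated problem.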
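
To see what the normalization step actually produces, here is a small sketch that lists each UTF-16 code unit of the FormD string together with its Unicode category. The class and method names are hypothetical; only String.Normalize and CharUnicodeInfo.GetUnicodeCategory come from the answer above.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DecompositionDemo
{
    // Lists every UTF-16 code unit of the FormD-normalized text with its Unicode category.
    public static string Describe(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text.Normalize(NormalizationForm.FormD))
        {
            sb.AppendFormat("U+{0:X4} {1}; ", (int)c, CharUnicodeInfo.GetUnicodeCategory(c));
        }
        return sb.ToString();
    }
}
```

DecompositionDemo.Describe("é") gives "U+0065 LowercaseLetter; U+0301 NonSpacingMark; ": the base letter e followed by the combining acute accent, which is the part the loop in RemoveDiacritics skips.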

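Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.

【Discussion】:

  • This is wrong. The German characters ä, ö and ü are latinized as ae, oe and ue, not a, o, u ...
  • Also, the Polish letter ł is ignored.
  • The Norse ø is ignored as well.
  • @StefanSteiger You know, in Czech there are letters like áčěů which we usually "latinize" to aceu, even though they sound different and this may cause confusion in words such as "hrábě" /hra:bje/, "hrabě" /hrabje/ and "hrabe" /hrabe/. To me, removing diacritics looks like a purely graphic matter, independent of the phonetics or the history of the letter. Letters like ä ö ü were created by adding a superscript "e" to the base letters, so the "ae" decomposition makes sense historically. It depends on the goal: removing the graphic marks, or decomposing the letter into ASCII characters.
  • This function is language agnostic. It doesn't know whether the string is German or some other language. If we consider that replacing ö with oe is fine in German text but makes no sense in Turkish, we see that this problem is not really solvable without detecting the language.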
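
One way to address the points raised in these comments is to run a small lookup table before the normalization pass: letters such as ł and ø have no canonical decomposition, so Normalize(FormD) leaves them untouched, and the German ae/oe/ue expansion has to be opt-in because it is language-specific. The sketch below is only an illustration; the tables are deliberately incomplete and every name in it is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class FoldingSketch
{
    // Letters with no canonical decomposition; Normalize(FormD) would pass them through.
    private static readonly Dictionary<char, string> Special = new Dictionary<char, string>
    {
        { 'ł', "l" }, { 'Ł', "L" },
        { 'ø', "o" }, { 'Ø', "O" },
        { 'ß', "ss" },
    };

    // German-style expansions, applied only when the caller asks for them.
    private static readonly Dictionary<char, string> German = new Dictionary<char, string>
    {
        { 'ä', "ae" }, { 'ö', "oe" }, { 'ü', "ue" },
        { 'Ä', "Ae" }, { 'Ö', "Oe" }, { 'Ü', "Ue" },
    };

    public static string Fold(string text, bool germanRules = false)
    {
        // Pass 1: table lookups on the composed form, so each letter is a single char.
        var mappedText = new StringBuilder(text.Length);
        foreach (char c in text.Normalize(NormalizationForm.FormC))
        {
            string mapped;
            if (germanRules && German.TryGetValue(c, out mapped)) mappedText.Append(mapped);
            else if (Special.TryGetValue(c, out mapped)) mappedText.Append(mapped);
            else mappedText.Append(c);
        }

        // Pass 2: strip the remaining combining marks, as in the answer above.
        var result = new StringBuilder(mappedText.Length);
        foreach (char c in mappedText.ToString().Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                result.Append(c);
            }
        }
        return result.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

FoldingSketch.Fold("Łódź") yields "Lodz", and FoldingSketch.Fold("Müller", germanRules: true) yields "Mueller" where the default call yields "Muller". Deciding when to pass germanRules is exactly the language-detection problem the last comment describes; the sketch does not solve it.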
【Solution 2】:

In case anyone is interested, I was looking for something similar and ended up writing the following:

public static string NormalizeStringForUrl(string name)
{
    String normalizedString = name.Normalize(NormalizationForm.FormD);
    StringBuilder stringBuilder = new StringBuilder();

    foreach (char c in normalizedString)
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.DecimalDigitNumber:
                stringBuilder.Append(c);
                break;
            case UnicodeCategory.SpaceSeparator:
            case UnicodeCategory.ConnectorPunctuation:
            case UnicodeCategory.DashPunctuation:
                stringBuilder.Append('_');
                break;
        }
    }
    string result = stringBuilder.ToString();
    return String.Join("_", result.Split(new char[] { '_' }
        , StringSplitOptions.RemoveEmptyEntries)); // remove duplicate underscores
}

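【Discussion】:

  • You should preallocate the StringBuilder buffer to name.Length to minimize memory allocation overhead. The final Split/Join call to remove consecutive duplicate _ is interesting. Perhaps we should just avoid adding them in the loop: keep a flag recording whether the previous character was a _, and don't emit another one if it is set.
  • Two very good points. If I ever get the time to come back to this part of the code I'll rewrite it :)
  • Nice. Apart from the IDisposables comment, we should probably also check c < 128, to make sure we don't pick up any UTF characters (see here).
  • Or, possibly more efficient, c < 123 (see ASCII).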
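
The suggestions in these comments could look roughly like the following. This is a sketch rather than the answer author's rewrite: it preallocates the builder, uses a flag instead of the Split/Join pass, and adds the c < 128 check, which changes behavior compared with the answer (non-ASCII letters such as Greek or Cyrillic are now dropped). The class name is hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class UrlNameSketch
{
    public static string NormalizeStringForUrl(string name)
    {
        string normalized = name.Normalize(NormalizationForm.FormD);
        // Preallocate: the output is never longer than the normalized input.
        var sb = new StringBuilder(normalized.Length);
        // Starts as true so that no leading underscore is emitted.
        bool lastWasUnderscore = true;

        foreach (char c in normalized)
        {
            switch (CharUnicodeInfo.GetUnicodeCategory(c))
            {
                case UnicodeCategory.LowercaseLetter:
                case UnicodeCategory.UppercaseLetter:
                case UnicodeCategory.DecimalDigitNumber:
                    if (c < 128) // keep ASCII letters and digits only
                    {
                        sb.Append(c);
                        lastWasUnderscore = false;
                    }
                    break;
                case UnicodeCategory.SpaceSeparator:
                case UnicodeCategory.ConnectorPunctuation:
                case UnicodeCategory.DashPunctuation:
                    if (!lastWasUnderscore) // collapse runs of separators into one '_'
                    {
                        sb.Append('_');
                        lastWasUnderscore = true;
                    }
                    break;
            }
        }

        // Drop the single trailing underscore, if there is one.
        if (sb.Length > 0 && sb[sb.Length - 1] == '_')
        {
            sb.Length--;
        }
        return sb.ToString();
    }
}
```

For example, "  Crème  brûlée - 2010 " becomes "Creme_brulee_2010".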
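【Solution 3】:

This works fine for me...

// Example input; the variable must be assigned before it is used.
string accentedStr = "crème brûlée";
// Encoding to ISO-8859-8 maps accented Latin letters to their closest ASCII letters.
byte[] tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);

Quick & short!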

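【Discussion】:

  • This is the best method I've seen.
  • I like this solution a lot, and it works well for Windows Store apps. However, it doesn't work for Windows Phone apps, because the ISO-8859-8 encoding doesn't seem to be available there. Is there another encoding that can be used instead?
  • This works for the most common characters, but many special characters, such as « » (as single characters), get changed in the process, which is not the case with the accepted solution.
  • Note that this does not work on .NET Core on Linux: System.ArgumentException: 'ISO-8859-8' is not a supported encoding name.
  • If you're on .NET Core, install System.Text.Encoding.CodePages from NuGet, then call this to register the provider: Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); once you've done that, you can use ISO-8859-8.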
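
Put together, the registration described in the last comment might look like this. It is a sketch under assumptions: the System.Text.Encoding.CodePages package (or a runtime that already ships CodePagesEncodingProvider) is available, and the class and method names are invented for the example. Whether the best-fit mapping gives the same letters on every platform is not something this snippet verifies.

```csharp
using System;
using System.Text;

public static class CodePageSketch
{
    // Registers the code-pages provider, then resolves ISO-8859-8 by name.
    public static Encoding GetIso88598()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("ISO-8859-8");
    }

    // Same round trip as the answer above, with the provider registered first.
    public static string RemoveAccents(string text)
    {
        byte[] bytes = GetIso88598().GetBytes(text);
        return Encoding.ASCII.GetString(bytes);
    }
}
```

Calling RegisterProvider once at application startup is the usual place for it; it is done inside the helper here only to keep the example self-contained.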
【Solution 4】:

Here is the VB version (works with Greek):

Imports System.Text

Imports System.Globalization

Public Function RemoveDiacritics(ByVal s As String) As String
    Dim normalizedString As String
    Dim stringBuilder As New StringBuilder
    normalizedString = s.Normalize(NormalizationForm.FormD)
    Dim i As Integer
    Dim c As Char
    For i = 0 To normalizedString.Length - 1
        c = normalizedString(i)
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            stringBuilder.Append(c)
        End If
    Next
    Return stringBuilder.ToString()
End Function

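【Discussion】:

  • It may be an old answer, but why use separate lines for the variable declaration and the first assignment?
【Solution 5】:

I often use an extension method based on another version I found here (see Replacing characters in C# (ascii)). A quick explanation:

  • Normalizing to form D splits characters like è into an e and a non-spacing `
  • From this, the non-spacing characters are removed
  • The result is normalized back to form C (I'm not sure whether this is necessary)

Code: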

using System.Linq;
using System.Text;
using System.Globalization;

// namespace here
public static class Utility
{
    public static string RemoveDiacritics(this string str)
    {
        if (null == str) return null;
        var chars =
            from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
            let uc = CharUnicodeInfo.GetUnicodeCategory(c)
            where uc != UnicodeCategory.NonSpacingMark
            select c;

        var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC);

        return cleanStr;
    }

    // or, alternatively
    public static string RemoveDiacritics2(this string str)
    {
        if (null == str) return null;
        var chars = str
            .Normalize(NormalizationForm.FormD)
            .ToCharArray()
            .Where(c=> CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();

        return new string(chars).Normalize(NormalizationForm.FormC);
    }
}

【Discussion】:

    【Solution 6】:

    Try the HelperSharp package.

    It has a RemoveAccents method:

     public static string RemoveAccents(this string source)
     {
         //8 bit characters 
         byte[] b = Encoding.GetEncoding(1251).GetBytes(source);
    
         // 7 bit characters
         string t = Encoding.ASCII.GetString(b);
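         // The "=-_/" characters are assumed to belong inside the character class;
         // outside it, the pattern would almost never match anything.
         Regex re = new Regex("[^a-zA-Z0-9=\\-_/]");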
         string c = re.Replace(t, " ");
         return c;
     }
    

    【Discussion】:

      【Solution 7】:

      This is how I replace diacritic characters with non-diacritic ones in all my .NET programs

      C#:

      //Transforms the culture of a letter to its equivalent representation in the 0-127 ascii table, such as the letter 'é' is substituted by an 'e'
      public string RemoveDiacritics(string s)
      {
          string normalizedString = null;
          StringBuilder stringBuilder = new StringBuilder();
          normalizedString = s.Normalize(NormalizationForm.FormD);
          int i = 0;
          char c = '\0';
      
          for (i = 0; i <= normalizedString.Length - 1; i++)
          {
              c = normalizedString[i];
              if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
              {
                  stringBuilder.Append(c);
              }
          }
      
          return stringBuilder.ToString().ToLower();
      }
      

      VB .NET:

      'Transforms the culture of a letter to its equivalent representation in the 0-127 ascii table, such as the letter "é" is substituted by an "e"'
      Public Function RemoveDiacritics(ByVal s As String) As String
          Dim normalizedString As String
          Dim stringBuilder As New StringBuilder
          normalizedString = s.Normalize(NormalizationForm.FormD)
          Dim i As Integer
          Dim c As Char
      
          For i = 0 To normalizedString.Length - 1
              c = normalizedString(i)
              If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
                  stringBuilder.Append(c)
              End If
          Next
          Return stringBuilder.ToString().ToLower()
      End Function
      

      【Discussion】:

        【Solution 8】:

        You can use the string extension from the MMLib.Extensions NuGet package:

        using MMLib.RapidPrototyping.Generators;
        public void ExtensionsExample()
        {
          string target = "aácčeéií";
          Assert.AreEqual("aacceeii", target.RemoveDiacritics());
        } 
        

        NuGet page: https://www.nuget.org/packages/MMLib.Extensions/ Codeplex project site: https://mmlib.codeplex.com/

        【Discussion】:

          【Solution 9】:
          Imports System.Text
          Imports System.Globalization
          
           Public Function DECODE(ByVal x As String) As String
                  Dim sb As New StringBuilder
                  For Each c As Char In x.Normalize(NormalizationForm.FormD).Where(Function(a) CharUnicodeInfo.GetUnicodeCategory(a) <> UnicodeCategory.NonSpacingMark)  
                      sb.Append(c)
                  Next
                  Return sb.ToString()
              End Function
          

          【Discussion】:

          • Using NFD instead of NFC would cause changes far beyond those requested.
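
The comment above can be illustrated with a character that has nothing to do with accents. A function that returns the FormD string without recomposing it leaves every decomposable character in pieces, not only the accented ones. In this sketch (names are hypothetical) the recompose flag switches the final Normalize(FormC) on and off:

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class RecomposeDemo
{
    public static string Strip(string text, bool recompose)
    {
        // Decompose, then drop the combining marks.
        char[] kept = text.Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();
        string result = new string(kept);

        // Without this step the output stays in decomposed form.
        return recompose ? result.Normalize(NormalizationForm.FormC) : result;
    }
}
```

For the single Hangul syllable "\uD55C" (한), Strip(..., recompose: false) returns three separate jamo characters, while Strip(..., recompose: true) returns the original one-character string.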
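          【Solution 10】:

          What this person said:

          Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(text));

          It effectively splits characters like å into a plus some kind of modifier, and then the ASCII conversion drops the modifier, leaving only a.

          【Discussion】:

            【Solution 11】:

            I needed something that converts all major Unicode characters, and the voted answer left a few out, so I created a C# version of CodeIgniter's convert_accented_characters($str) that is easy to customize: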

            using System;
            using System.Text;
            using System.Collections.Generic;
            
            public static class Strings
            {
                static Dictionary<string, string> foreign_characters = new Dictionary<string, string>
                {
                    { "äæǽ", "ae" },
                    { "öœ", "oe" },
                    { "ü", "ue" },
                    { "Ä", "Ae" },
                    { "Ü", "Ue" },
                    { "Ö", "Oe" },
                    { "ÀÁÂÃÄÅǺĀĂĄǍΑΆẢẠẦẪẨẬẰẮẴẲẶА", "A" },
                    { "àáâãåǻāăąǎªαάảạầấẫẩậằắẵẳặа", "a" },
                    { "Б", "B" },
                    { "б", "b" },
                    { "ÇĆĈĊČ", "C" },
                    { "çćĉċč", "c" },
                    { "Д", "D" },
                    { "д", "d" },
                    { "ÐĎĐΔ", "Dj" },
                    { "ðďđδ", "dj" },
                    { "ÈÉÊËĒĔĖĘĚΕΈẼẺẸỀẾỄỂỆЕЭ", "E" },
                    { "èéêëēĕėęěέεẽẻẹềếễểệеэ", "e" },
                    { "Ф", "F" },
                    { "ф", "f" },
                    { "ĜĞĠĢΓГҐ", "G" },
                    { "ĝğġģγгґ", "g" },
                    { "ĤĦ", "H" },
                    { "ĥħ", "h" },
                    { "ÌÍÎÏĨĪĬǏĮİΗΉΊΙΪỈỊИЫ", "I" },
                    { "ìíîïĩīĭǐįıηήίιϊỉịиыї", "i" },
                    { "Ĵ", "J" },
                    { "ĵ", "j" },
                    { "ĶΚК", "K" },
                    { "ķκк", "k" },
                    { "ĹĻĽĿŁΛЛ", "L" },
                    { "ĺļľŀłλл", "l" },
                    { "М", "M" },
                    { "м", "m" },
                    { "ÑŃŅŇΝН", "N" },
                    { "ñńņňŉνн", "n" },
                    { "ÒÓÔÕŌŎǑŐƠØǾΟΌΩΏỎỌỒỐỖỔỘỜỚỠỞỢО", "O" },
                    { "òóôõōŏǒőơøǿºοόωώỏọồốỗổộờớỡởợо", "o" },
                    { "П", "P" },
                    { "п", "p" },
                    { "ŔŖŘΡР", "R" },
                    { "ŕŗřρр", "r" },
                    { "ŚŜŞȘŠΣС", "S" },
                    { "śŝşșšſσςс", "s" },
                    { "ȚŢŤŦτТ", "T" },
                    { "țţťŧт", "t" },
                    { "ÙÚÛŨŪŬŮŰŲƯǓǕǗǙǛŨỦỤỪỨỮỬỰУ", "U" },
                    { "ùúûũūŭůűųưǔǖǘǚǜυύϋủụừứữửựу", "u" },
                    { "ÝŸŶΥΎΫỲỸỶỴЙ", "Y" },
                    { "ýÿŷỳỹỷỵй", "y" },
                    { "В", "V" },
                    { "в", "v" },
                    { "Ŵ", "W" },
                    { "ŵ", "w" },
                    { "ŹŻŽΖЗ", "Z" },
                    { "źżžζз", "z" },
                    { "ÆǼ", "AE" },
                    { "ß", "ss" },
                    { "Ĳ", "IJ" },
                    { "ĳ", "ij" },
                    { "Œ", "OE" },
                    { "ƒ", "f" },
                    { "ξ", "ks" },
                    { "π", "p" },
                    { "β", "v" },
                    { "μ", "m" },
                    { "ψ", "ps" },
                    { "Ё", "Yo" },
                    { "ё", "yo" },
                    { "Є", "Ye" },
                    { "є", "ye" },
                    { "Ї", "Yi" },
                    { "Ж", "Zh" },
                    { "ж", "zh" },
                    { "Х", "Kh" },
                    { "х", "kh" },
                    { "Ц", "Ts" },
                    { "ц", "ts" },
                    { "Ч", "Ch" },
                    { "ч", "ch" },
                    { "Ш", "Sh" },
                    { "ш", "sh" },
                    { "Щ", "Shch" },
                    { "щ", "shch" },
                    { "ЪъЬь", "" },
                    { "Ю", "Yu" },
                    { "ю", "yu" },
                    { "Я", "Ya" },
                    { "я", "ya" },
                };
            
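                public static char RemoveDiacritics(this char c){
                    foreach(KeyValuePair<string, string> entry in foreign_characters)
                    {
                        if(entry.Key.IndexOf (c) != -1)
                        {
                            // Some entries map to "" (e.g. Ъ, ь); indexing [0] there would throw,
                            // so fall back to the original character in that case.
                            return entry.Value.Length > 0 ? entry.Value[0] : c;
                        }
                    }
                    return c;
                }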
            
                public static string RemoveDiacritics(this string s) 
                {
                    StringBuilder sb = new StringBuilder(s.Length);

                    foreach (char c in s)
                    {
                        bool replaced = false;

                        foreach(KeyValuePair<string, string> entry in foreign_characters)
                        {
                            if(entry.Key.IndexOf (c) != -1)
                            {
                                sb.Append(entry.Value);
                                replaced = true;
                                break;
                            }
                        }

                        // Track the match with a flag rather than comparing lengths, so that
                        // characters mapped to "" (e.g. Ъ, ь) are actually dropped.
                        if (!replaced) {
                            sb.Append(c);
                        }
                    }
                    return sb.ToString();
                }
            }
            

            Usage

            // for strings
            "crème brûlée".RemoveDiacritics (); // creme brulee
            
            // for chars
            "Ã"[0].RemoveDiacritics (); // A
            

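            【Discussion】:

            • Your implementation does the job, but it should be improved before being used in production code.
            • Why not simply replace this if (entry.Key.IndexOf(c) != -1) with if (entry.Key.Contains(c))?
            • I've posted an answer below using @Alexander's link: stackoverflow.com/a/56797567/479701
            • I don't understand why there is so much hoop jumping to use { "äæǽ", "ae" } rather than { "ä", "ae" }, { "æ", "ae" }, { "ǽ", "ae" } and then just calling if (foreign_characters.TryGetValue(...)) .... You've completely defeated the purpose of the index the dictionary already has.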
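
What the last comment proposes could be sketched as follows: expand the grouped table once into a per-character dictionary, then do a single TryGetValue per input character. The class and method names are hypothetical, and the sketch assumes the input uses precomposed characters, as the table in the answer does.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PerCharTable
{
    // Expands grouped entries such as { "äæǽ", "ae" } into one entry per character.
    public static Dictionary<char, string> Expand(IEnumerable<KeyValuePair<string, string>> grouped)
    {
        var map = new Dictionary<char, string>();
        foreach (KeyValuePair<string, string> entry in grouped)
        {
            foreach (char c in entry.Key)
            {
                // Keep the first mapping when a character appears in several groups.
                if (!map.ContainsKey(c))
                {
                    map.Add(c, entry.Value);
                }
            }
        }
        return map;
    }

    // One dictionary lookup per character instead of scanning every group.
    public static string RemoveDiacritics(string s, Dictionary<char, string> map)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            string replacement;
            if (map.TryGetValue(c, out replacement)) sb.Append(replacement);
            else sb.Append(c);
        }
        return sb.ToString();
    }
}
```

With a table built from { "èéêë", "e" } and { "û", "u" }, RemoveDiacritics("crème brûlée", map) returns "creme brulee". Entries that map to the empty string are dropped from the output, since the lookup succeeds and appends nothing.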
            【Solution 12】:

            The code page for Greek (ISO) can do it

            Information about this code page is available through System.Text.Encoding.GetEncodings(). Learn more at: https://msdn.microsoft.com/pt-br/library/system.text.encodinginfo.getencoding(v=vs.110).aspx

            Greek (ISO) has code page 28597 and the name iso-8859-7

            On to the code... \o/

            string text = "Você está numa situação lamentável";
            
            string textEncode = System.Web.HttpUtility.UrlEncode(text, Encoding.GetEncoding("iso-8859-7"));
            //result: "Voce+esta+numa+situacao+lamentavel"
            
            string textDecode = System.Web.HttpUtility.UrlDecode(textEncode);
            //result: "Voce esta numa situacao lamentavel"
            

            So, write this function...

            public string RemoveAcentuation(string text)
            {
                return
                    System.Web.HttpUtility.UrlDecode(
                        System.Web.HttpUtility.UrlEncode(
                            text, Encoding.GetEncoding("iso-8859-7")));
            }
            

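            Note... Encoding.GetEncoding("iso-8859-7") is equivalent to Encoding.GetEncoding(28597), because the first is the name and the second is the code page of the encoding.

            【Discussion】:

            • That's brilliant! Short and efficient!
            • Great stuff. Almost all the characters I tested passed (äáčďěéíľľňôóřŕšťúůýž ÄÁČĎĚÉÍĽĽŇÔÓŘŔŠŤÚŮÝŽ ÖÜË łŁđĐ ţŢşŞçÇ øı). Problems turned up only with ßə, which are converted to ?, but such exceptions can always be handled separately. Before putting this into production, it would be best to test it against all the Unicode areas that contain letters with diacritics.
            【Solution 13】:

            It's funny that such a question can get so many answers, and yet none of them fit my requirements :) There are so many languages around that, AFAIK, a fully language-agnostic solution is not really possible, since, as others have mentioned, FormC and FormD both cause problems.

            Since the original question was about French, the simplest working answer is indeed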

                public static string ConvertWesternEuropeanToASCII(this string str)
                {
                    return Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(str));
                }
            

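            1251 should be replaced by the encoding code of the input language.

            However, this only replaces one character with one character. Since I also work with German input, I wrote a manual conversion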

                public static string LatinizeGermanCharacters(this string str)
                {
                    StringBuilder sb = new StringBuilder(str.Length);
                    foreach (char c in str)
                    {
                        switch (c)
                        {
                            case 'ä':
                                sb.Append("ae");
                                break;
                            case 'ö':
                                sb.Append("oe");
                                break;
                            case 'ü':
                                sb.Append("ue");
                                break;
                            case 'Ä':
                                sb.Append("Ae");
                                break;
                            case 'Ö':
                                sb.Append("Oe");
                                break;
                            case 'Ü':
                                sb.Append("Ue");
                                break;
                            case 'ß':
                                sb.Append("ss");
                                break;
                            default:
                                sb.Append(c);
                                break;
                        }
                    }
                    return sb.ToString();
                }
            

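            It may not deliver the best performance, but at least it is very easy to read and extend. Regex is a no-go: it is much slower than any char/string handling.

            I also have a very simple method for removing spaces: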

                public static string RemoveSpace(this string str)
                {
                    return str.Replace(" ", string.Empty);
                }
            

            In the end, I use a combination of all 3 extensions above:

                public static string LatinizeAndConvertToASCII(this string str, bool keepSpace = false)
                {
                    str = str.LatinizeGermanCharacters().ConvertWesternEuropeanToASCII();            
                    return keepSpace ? str : str.RemoveSpace();
                }
            

            And a small (not exhaustive) unit test for it, which passes.

                [TestMethod()]
                public void LatinizeAndConvertToASCIITest()
                {
                    string europeanStr = "Bonjour ça va? C'est l'été! Ich möchte ä Ä á à â ê é è ë Ë É ï Ï î í ì ó ò ô ö Ö Ü ü ù ú û Û ý Ý ç Ç ñ Ñ";
                    string expected = "Bonjourcava?C'estl'ete!IchmoechteaeAeaaaeeeeEEiIiiiooooeOeUeueuuuUyYcCnN";
                    string actual = europeanStr.LatinizeAndConvertToASCII();
                    Assert.AreEqual(expected, actual);
                }
            

            【Discussion】:

              【Solution 14】:

              I really like the concise and functional code provided by azrafe7, so I changed it slightly to turn it into an extension method:

              public static class StringExtensions
              {
                  public static string RemoveDiacritics(this string text)
                  {
                      const string SINGLEBYTE_LATIN_ASCII_ENCODING = "ISO-8859-8";
              
                      if (string.IsNullOrEmpty(text))
                      {
                          return string.Empty;
                      }
              
                      return Encoding.ASCII.GetString(
                          Encoding.GetEncoding(SINGLEBYTE_LATIN_ASCII_ENCODING).GetBytes(text));
                  }
              }
              

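              【Discussion】:

              • This is the only approach that works with all Polish diacritics. The accepted answer does not work with the Ł and ł characters.
              【Solution 15】:

              Popping this library in here in case you haven't already considered it. It looks like it has a full range of unit tests.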

              https://github.com/thomasgalliker/Diacritics.NET

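              【Discussion】:

                【Solution 16】:

                Not having enough reputation, apparently I cannot comment on Alexander's excellent link. Lucene appears to be the only solution that works in reasonably generic cases.

                For those who want a simple copy-paste solution, here it is, leveraging the code in Lucene: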

                string testbed = "ÁÂÄÅÇÉÍÎÓÖØÚÜÞàáââãåæçèéêëìíîïïðñóôööøúüāăčĐęğıŁłńŌōřŞşšźžžşțệủ";

                Console.WriteLine(Lucene.latinizeLucene(testbed));

                AAAACEIIOOOUUTHaaaaaaaaeeeeeiiiidnoooouuaacDegiLlnOorSsszzsteu

                //////////

                public static class Lucene
                {
                    // source: https://raw.githubusercontent.com/apache/lucenenet/master/src/Lucene.Net.Analysis.Common/Analysis/Miscellaneous/ASCIIFoldingFilter.cs
                    // idea: https://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net (scroll down, search for lucene by Alexander)
                    public static string latinizeLucene(string arg)
                    {
                        char[] argChar = arg.ToCharArray();
                
                        // latinizeLuceneImpl can expand one char up to four chars - e.g. Þ to TH, or æ to ae, or in fact ⑽ to (10)
                        char[] resultChar = new String(' ', arg.Length * 4).ToCharArray();
                
                        int outputPos = Lucene.latinizeLuceneImpl(argChar, 0, ref resultChar, 0, arg.Length);
                
                        string ret = new string(resultChar);
                        ret = ret.Substring(0, outputPos);
                
                        return ret;
                    }
                
                    /// <summary>
                    /// Converts characters above ASCII to their ASCII equivalents.  For example,
                    /// accents are removed from accented characters. 
                    /// <para/>
                    /// @lucene.internal
                    /// </summary>
                    /// <param name="input">     The characters to fold </param>
                    /// <param name="inputPos">  Index of the first character to fold </param>
                    /// <param name="output">    The result of the folding. Should be of size >= <c>length * 4</c>. </param>
                    /// <param name="outputPos"> Index of output where to put the result of the folding </param>
                    /// <param name="length">    The number of characters to fold </param>
                    /// <returns> length of output </returns>
                    private static int latinizeLuceneImpl(char[] input, int inputPos, ref char[] output, int outputPos, int length)
                    {
                        int end = inputPos + length;
                        for (int pos = inputPos; pos < end; ++pos)
                        {
                            char c = input[pos];
                
                            // Quick test: if it's not in range then just keep current character
                            if (c < '\u0080')
                            {
                                output[outputPos++] = c;
                            }
                            else
                            {
                                switch (c)
                                {
                                    case '\u00C0': // À  [LATIN CAPITAL LETTER A WITH GRAVE]
                                    case '\u00C1': // Á  [LATIN CAPITAL LETTER A WITH ACUTE]
                                    case '\u00C2': // Â  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX]
                                    case '\u00C3': // Ã  [LATIN CAPITAL LETTER A WITH TILDE]
                                    case '\u00C4': // Ä  [LATIN CAPITAL LETTER A WITH DIAERESIS]
                                    case '\u00C5': // Å  [LATIN CAPITAL LETTER A WITH RING ABOVE]
                                    case '\u0100': // Ā  [LATIN CAPITAL LETTER A WITH MACRON]
                                    case '\u0102': // Ă  [LATIN CAPITAL LETTER A WITH BREVE]
                                    case '\u0104': // Ą  [LATIN CAPITAL LETTER A WITH OGONEK]
                                    case '\u018F': // Ə  http://en.wikipedia.org/wiki/Schwa  [LATIN CAPITAL LETTER SCHWA]
                                    case '\u01CD': // Ǎ  [LATIN CAPITAL LETTER A WITH CARON]
                                    case '\u01DE': // Ǟ  [LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON]
                                    case '\u01E0': // Ǡ  [LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON]
                                    case '\u01FA': // Ǻ  [LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE]
                                    case '\u0200': // Ȁ  [LATIN CAPITAL LETTER A WITH DOUBLE GRAVE]
                                    case '\u0202': // Ȃ  [LATIN CAPITAL LETTER A WITH INVERTED BREVE]
                                    case '\u0226': // Ȧ  [LATIN CAPITAL LETTER A WITH DOT ABOVE]
                                    case '\u023A': // Ⱥ  [LATIN CAPITAL LETTER A WITH STROKE]
                                    case '\u1D00': // ᴀ  [LATIN LETTER SMALL CAPITAL A]
                                    case '\u1E00': // Ḁ  [LATIN CAPITAL LETTER A WITH RING BELOW]
                                    case '\u1EA0': // Ạ  [LATIN CAPITAL LETTER A WITH DOT BELOW]
                                    case '\u1EA2': // Ả  [LATIN CAPITAL LETTER A WITH HOOK ABOVE]
                                    case '\u1EA4': // Ấ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE]
                                    case '\u1EA6': // Ầ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE]
                                    case '\u1EA8': // Ẩ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
                                    case '\u1EAA': // Ẫ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE]
                                    case '\u1EAC': // Ậ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
                                    case '\u1EAE': // Ắ  [LATIN CAPITAL LETTER A WITH BREVE AND ACUTE]
                                    case '\u1EB0': // Ằ  [LATIN CAPITAL LETTER A WITH BREVE AND GRAVE]
                                    case '\u1EB2': // Ẳ  [LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE]
                                    case '\u1EB4': // Ẵ  [LATIN CAPITAL LETTER A WITH BREVE AND TILDE]
                                    case '\u1EB6': // Ặ  [LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW]
                                    case '\u24B6': // Ⓐ  [CIRCLED LATIN CAPITAL LETTER A]
                                    case '\uFF21': // A  [FULLWIDTH LATIN CAPITAL LETTER A]
                                        output[outputPos++] = 'A';
                                        break;
                                    case '\u00E0': // à  [LATIN SMALL LETTER A WITH GRAVE]
                                    case '\u00E1': // á  [LATIN SMALL LETTER A WITH ACUTE]
                                    case '\u00E2': // â  [LATIN SMALL LETTER A WITH CIRCUMFLEX]
                                    case '\u00E3': // ã  [LATIN SMALL LETTER A WITH TILDE]
                                    case '\u00E4': // ä  [LATIN SMALL LETTER A WITH DIAERESIS]
                                    case '\u00E5': // å  [LATIN SMALL LETTER A WITH RING ABOVE]
                                    case '\u0101': // ā  [LATIN SMALL LETTER A WITH MACRON]
                                    case '\u0103': // ă  [LATIN SMALL LETTER A WITH BREVE]
                                    case '\u0105': // ą  [LATIN SMALL LETTER A WITH OGONEK]
                                    case '\u01CE': // ǎ  [LATIN SMALL LETTER A WITH CARON]
                                    case '\u01DF': // ǟ  [LATIN SMALL LETTER A WITH DIAERESIS AND MACRON]
                                    case '\u01E1': // ǡ  [LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON]
                                    case '\u01FB': // ǻ  [LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE]
                                    case '\u0201': // ȁ  [LATIN SMALL LETTER A WITH DOUBLE GRAVE]
                                    case '\u0203': // ȃ  [LATIN SMALL LETTER A WITH INVERTED BREVE]
                                    case '\u0227': // ȧ  [LATIN SMALL LETTER A WITH DOT ABOVE]
                                    case '\u0250': // ɐ  [LATIN SMALL LETTER TURNED A]
                                    case '\u0259': // ə  [LATIN SMALL LETTER SCHWA]
                                    case '\u025A': // ɚ  [LATIN SMALL LETTER SCHWA WITH HOOK]
                                    case '\u1D8F': // ᶏ  [LATIN SMALL LETTER A WITH RETROFLEX HOOK]
                                    case '\u1D95': // ᶕ  [LATIN SMALL LETTER SCHWA WITH RETROFLEX HOOK]
                                    case '\u1E01': // ạ  [LATIN SMALL LETTER A WITH RING BELOW]
                                    case '\u1E9A': // ả  [LATIN SMALL LETTER A WITH RIGHT HALF RING]
                                    case '\u1EA1': // ạ  [LATIN SMALL LETTER A WITH DOT BELOW]
                                    case '\u1EA3': // ả  [LATIN SMALL LETTER A WITH HOOK ABOVE]
                                    case '\u1EA5': // ấ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE]
                                    case '\u1EA7': // ầ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE]
                                    case '\u1EA9': // ẩ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
                                    case '\u1EAB': // ẫ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE]
                                    case '\u1EAD': // ậ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
                                    case '\u1EAF': // ắ  [LATIN SMALL LETTER A WITH BREVE AND ACUTE]
                                    case '\u1EB1': // ằ  [LATIN SMALL LETTER A WITH BREVE AND GRAVE]
                                    case '\u1EB3': // ẳ  [LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE]
                                    case '\u1EB5': // ẵ  [LATIN SMALL LETTER A WITH BREVE AND TILDE]
                                    case '\u1EB7': // ặ  [LATIN SMALL LETTER A WITH BREVE AND DOT BELOW]
                                    case '\u2090': // ₐ  [LATIN SUBSCRIPT SMALL LETTER A]
                                    case '\u2094': // ₔ  [LATIN SUBSCRIPT SMALL LETTER SCHWA]
                                    case '\u24D0': // ⓐ  [CIRCLED LATIN SMALL LETTER A]
                                    case '\u2C65': // ⱥ  [LATIN SMALL LETTER A WITH STROKE]
                                    case '\u2C6F': // Ɐ  [LATIN CAPITAL LETTER TURNED A]
                                    case '\uFF41': // a  [FULLWIDTH LATIN SMALL LETTER A]
                                        output[outputPos++] = 'a';
                                        break;
                                    case '\uA732': // Ꜳ  [LATIN CAPITAL LETTER AA]
                                        output[outputPos++] = 'A';
                                        output[outputPos++] = 'A';
                                        break;
                                    case '\u00C6': // Æ  [LATIN CAPITAL LETTER AE]
                                    case '\u01E2': // Ǣ  [LATIN CAPITAL LETTER AE WITH MACRON]
                                    case '\u01FC': // Ǽ  [LATIN CAPITAL LETTER AE WITH ACUTE]
                                    case '\u1D01': // ᴁ  [LATIN LETTER SMALL CAPITAL AE]
                                        output[outputPos++] = 'A';
                                        output[outputPos++] = 'E';
                                        break;
                                    case '\uA734': // Ꜵ  [LATIN CAPITAL LETTER AO]
                                        output[outputPos++] = 'A';
                                        output[outputPos++] = 'O';
                                        break;
                                    case '\uA736': // Ꜷ  [LATIN CAPITAL LETTER AU]
                                        output[outputPos++] = 'A';
                                        output[outputPos++] = 'U';
                                        break;
                
                        // etc. etc. etc.
                        // see link above for complete source code
                        // 
                        // unfortunately, postings are limited, as in
                        // "Body is limited to 30000 characters; you entered 136098."
                
                                    [...]
                
                                    case '\u2053': // ⁓  [SWUNG DASH]
                                    case '\uFF5E': // ~  [FULLWIDTH TILDE]
                                        output[outputPos++] = '~';
                                        break;
                                    default:
                                        output[outputPos++] = c;
                                        break;
                                }
                            }
                        }
                        return outputPos;
                    }
                }
                

                【Comments】:

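                  【Solution 17】:

                  TL;DR - C# string extension method

                  I think the best solution that preserves the meaning of the string is to convert the characters instead of stripping them. The example shows this well: stripping turns crème brûlée into crme brle, whereas converting gives creme brulee.

                  I checked out Alexander's comment above and saw that the Lucene.Net code is Apache 2.0 licensed, so I modified the class into a simple string extension method. You can use it like this: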

                  var originalString = "crème brûlée";
                  var maxLength = originalString.Length; // limit output length as necessary
                  var foldedString = originalString.FoldToASCII(maxLength); 
                  // "creme brulee"
                  

                  The function is too long to post in a StackOverflow answer (about 139k characters against a 30k limit, lol), so I made a gist and attributed the authors:

                  /*
                   * Licensed to the Apache Software Foundation (ASF) under one or more
                   * contributor license agreements.  See the NOTICE file distributed with
                   * this work for additional information regarding copyright ownership.
                   * The ASF licenses this file to You under the Apache License, Version 2.0
                   * (the "License"); you may not use this file except in compliance with
                   * the License.  You may obtain a copy of the License at
                   *
                   *     http://www.apache.org/licenses/LICENSE-2.0
                   *
                   * Unless required by applicable law or agreed to in writing, software
                   * distributed under the License is distributed on an "AS IS" BASIS,
                   * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
                   * See the License for the specific language governing permissions and
                   * limitations under the License.
                   */
                  
                  /// <summary>
                  /// This class converts alphabetic, numeric, and symbolic Unicode characters
                  /// which are not in the first 127 ASCII characters (the "Basic Latin" Unicode
                  /// block) into their ASCII equivalents, if one exists.
                  /// <para/>
                  /// Characters from the following Unicode blocks are converted; however, only
                  /// those characters with reasonable ASCII alternatives are converted:
                  /// 
                  /// <ul>
                  ///   <item><description>C1 Controls and Latin-1 Supplement: <a href="http://www.unicode.org/charts/PDF/U0080.pdf">http://www.unicode.org/charts/PDF/U0080.pdf</a></description></item>
                  ///   <item><description>Latin Extended-A: <a href="http://www.unicode.org/charts/PDF/U0100.pdf">http://www.unicode.org/charts/PDF/U0100.pdf</a></description></item>
                  ///   <item><description>Latin Extended-B: <a href="http://www.unicode.org/charts/PDF/U0180.pdf">http://www.unicode.org/charts/PDF/U0180.pdf</a></description></item>
                  ///   <item><description>Latin Extended Additional: <a href="http://www.unicode.org/charts/PDF/U1E00.pdf">http://www.unicode.org/charts/PDF/U1E00.pdf</a></description></item>
                  ///   <item><description>Latin Extended-C: <a href="http://www.unicode.org/charts/PDF/U2C60.pdf">http://www.unicode.org/charts/PDF/U2C60.pdf</a></description></item>
                  ///   <item><description>Latin Extended-D: <a href="http://www.unicode.org/charts/PDF/UA720.pdf">http://www.unicode.org/charts/PDF/UA720.pdf</a></description></item>
                  ///   <item><description>IPA Extensions: <a href="http://www.unicode.org/charts/PDF/U0250.pdf">http://www.unicode.org/charts/PDF/U0250.pdf</a></description></item>
                  ///   <item><description>Phonetic Extensions: <a href="http://www.unicode.org/charts/PDF/U1D00.pdf">http://www.unicode.org/charts/PDF/U1D00.pdf</a></description></item>
                  ///   <item><description>Phonetic Extensions Supplement: <a href="http://www.unicode.org/charts/PDF/U1D80.pdf">http://www.unicode.org/charts/PDF/U1D80.pdf</a></description></item>
                  ///   <item><description>General Punctuation: <a href="http://www.unicode.org/charts/PDF/U2000.pdf">http://www.unicode.org/charts/PDF/U2000.pdf</a></description></item>
                  ///   <item><description>Superscripts and Subscripts: <a href="http://www.unicode.org/charts/PDF/U2070.pdf">http://www.unicode.org/charts/PDF/U2070.pdf</a></description></item>
                  ///   <item><description>Enclosed Alphanumerics: <a href="http://www.unicode.org/charts/PDF/U2460.pdf">http://www.unicode.org/charts/PDF/U2460.pdf</a></description></item>
                  ///   <item><description>Dingbats: <a href="http://www.unicode.org/charts/PDF/U2700.pdf">http://www.unicode.org/charts/PDF/U2700.pdf</a></description></item>
                  ///   <item><description>Supplemental Punctuation: <a href="http://www.unicode.org/charts/PDF/U2E00.pdf">http://www.unicode.org/charts/PDF/U2E00.pdf</a></description></item>
                  ///   <item><description>Alphabetic Presentation Forms: <a href="http://www.unicode.org/charts/PDF/UFB00.pdf">http://www.unicode.org/charts/PDF/UFB00.pdf</a></description></item>
                  ///   <item><description>Halfwidth and Fullwidth Forms: <a href="http://www.unicode.org/charts/PDF/UFF00.pdf">http://www.unicode.org/charts/PDF/UFF00.pdf</a></description></item>
                  /// </ul>
                  /// <para/>
                  /// See: <a href="http://en.wikipedia.org/wiki/Latin_characters_in_Unicode">http://en.wikipedia.org/wiki/Latin_characters_in_Unicode</a>
                  /// <para/>
                  /// For example, '&amp;agrave;' will be replaced by 'a'.
                  /// </summary>
                  public static partial class StringExtensions
                  {
                      /// <summary>
                      /// Converts characters above ASCII to their ASCII equivalents.  For example,
                      /// accents are removed from accented characters. 
                      /// </summary>
                      /// <param name="input">     The string of characters to fold </param>
                      /// <param name="length">    The length of the folded return string </param>
                      /// <returns> The folded string </returns>
                      public static string FoldToASCII(this string input, int? length = null)
                      {
                          // See https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c
                      }
                  }
                  

                  Hope that helps someone else; this is the most robust solution I've found!
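The full mapping lives in the gist, but the underlying idea of converting rather than stripping can be sketched with a small lookup table. This is a minimal illustration only: `FoldSketch`, `Map`, and `Fold` are hypothetical names, not the gist's or Lucene.Net's API, and the table covers just a handful of characters.

```csharp
using System.Collections.Generic;
using System.Text;

public static class FoldSketch
{
    // Hypothetical, deliberately tiny table; the real folding table is far larger.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        ['\u00E0'] = "a",  // à
        ['\u00E8'] = "e",  // è
        ['\u00E9'] = "e",  // é
        ['\u00EA'] = "e",  // ê
        ['\u00FB'] = "u",  // û
        ['\u00C6'] = "AE", // Æ (one input character may fold to several)
        ['\u00E6'] = "ae", // æ
    };

    public static string Fold(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Characters without a table entry are copied through unchanged.
            if (Map.TryGetValue(c, out var replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Because each entry maps to a string, ligatures such as Æ can expand to two letters, which the normalization-based approaches in the other answers cannot do.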

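                  【Comments】:

                  • Caveats: 1) The concept is locale-dependent. For example, "ä" could become "a" or "aa". 2) Misnamed/misdescribed: the result does not necessarily come only from the C0 Controls and Basic Latin block. It only converts Latin letters and some symbol variants to "equivalents". (Of course, one could make another pass afterwards to replace or remove characters outside the C0 Controls and Basic Latin block.) But it does what it does well.
                  • Thanks for posting this. I believe you have a trailing } bracket at the end of the file.
                  【Solution 18】:

                  This code worked for me: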

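                  // Requires: using System.Globalization; using System.Linq; using System.Text;
                  // Wrap the filtered characters in a string; ToArray() alone yields a char[].
                  var updatedText = new string(text.Normalize(NormalizationForm.FormD)
                       .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                       .ToArray());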
                  

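                  However, please don't do this with names. It is not only an insult to people with umlauts/accents in their name, it can also be dangerously wrong in certain situations (see below). There are alternative spellings that do more than just drop the accent.

                  Furthermore, it is simply wrong and dangerous if, for example, the user has to provide his name exactly as it appears in his passport.

                  For example, my name is written Zuberbühler, and in the machine-readable part of my passport you will find Zuberbuehler. With the umlaut removed, the name matches neither form. This can cause problems for users.

                  You should rather disallow umlauts/accents in the name input form, so that the user writes his name correctly without the umlaut or accent.

                  A practical example: if the web service for ESTA applications (https://www.application-esta.co.uk/special-characters-and) used the code above instead of transliterating umlauts correctly, the ESTA application would either be refused or the traveller would have problems with American border control when entering the States.

                  Another example is flight tickets. Suppose you have a flight booking web application, the user enters his name with an accent, and your implementation just removes the accent and then calls the airline's web service to book the ticket! Your customer may not be allowed to board, because the name does not match any part of his/her passport.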
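As a rough sketch of the "transliterate instead of strip" point, the German convention mentioned above (ü becomes ue, as in Zuberbuehler) can be applied with an explicit mapping. `NameTransliterationSketch` and `TransliterateGerman` are hypothetical names, and the table is an assumption covering only the German umlauts and ß; other languages follow different conventions, so this is not a general solution.

```csharp
using System.Text;

public static class NameTransliterationSketch
{
    public static string TransliterateGerman(string text)
    {
        var sb = new StringBuilder(text.Length + 8);

        // Compose first so that "u" + combining diaeresis is seen as a single 'ü'.
        foreach (char c in text.Normalize(NormalizationForm.FormC))
        {
            switch (c)
            {
                case '\u00E4': sb.Append("ae"); break; // ä
                case '\u00F6': sb.Append("oe"); break; // ö
                case '\u00FC': sb.Append("ue"); break; // ü
                case '\u00C4': sb.Append("Ae"); break; // Ä
                case '\u00D6': sb.Append("Oe"); break; // Ö
                case '\u00DC': sb.Append("Ue"); break; // Ü
                case '\u00DF': sb.Append("ss"); break; // ß
                default: sb.Append(c); break;          // everything else is left untouched
            }
        }
        return sb.ToString();
    }
}
```

Calling `NameTransliterationSketch.TransliterateGerman("Zuberbühler")` yields the passport-style spelling rather than "Zuberbuhler", which is the mismatch the answer warns about.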

                  【Comments】:

                  • This does not work for Korean; FormC is needed.
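The Korean issue in the comment can be illustrated as follows: FormD splits each precomposed Hangul syllable into its jamo, and only a final FormC pass puts them back together. This is a sketch under that assumption; `HangulNormalizationSketch` and its members are hypothetical names.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;

public static class HangulNormalizationSketch
{
    // Strips nonspacing marks but leaves the text decomposed (FormD).
    public static string StripOnly(string text) =>
        new string(text.Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray());

    // Same filter, followed by recomposition to FormC.
    public static string StripAndRecompose(string text) =>
        StripOnly(text).Normalize(NormalizationForm.FormC);
}
```

With `StripOnly`, a Hangul syllable comes back as several separate jamo code units, so the result no longer compares equal to the input; `StripAndRecompose` restores the original syllables while still removing accents from Latin text.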
                  【Solution 19】:

                  Same as the accepted answer, but faster, using Span instead of StringBuilder.
                  Requires .NET Core 3.1 or a newer .NET.

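                  // Requires: using System; using System.Globalization; using System.Text;
                  static string RemoveDiacritics(string text) 
                  {
                      ReadOnlySpan<char> normalizedString = text.Normalize(NormalizationForm.FormD);
                      int i = 0;
                      // Size the buffer from the normalized string: FormD can be longer than the
                      // input even after marks are removed (e.g. Hangul syllables become several jamo).
                      Span<char> span = normalizedString.Length < 1000
                          ? stackalloc char[normalizedString.Length]
                          : new char[normalizedString.Length];
                  
                      foreach (char c in normalizedString)
                      {
                          if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                              span[i++] = c;
                      }
                  
                      // Copy only the i characters actually written, not the unused tail of the buffer.
                      return new string(span.Slice(0, i)).Normalize(NormalizationForm.FormC);
                  }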
                  

                  This is also extensible to additional character replacements, e.g. for the Polish Ł:

                  span[i++] = c switch
                  {
                      'Ł' => 'L',
                      'ł' => 'l',
                      _ => c
                  };
                  

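                  A small note: stack allocation (stackalloc) is faster than heap allocation (new), and it leaves less work for the garbage collector. The 1000 is a threshold that avoids allocating large structures on the stack, which could cause a StackOverflowException. While 1000 is a pretty safe value, in most cases 10000 or even 100000 would also work (100k chars take up to 200 kB on the stack, while the default stack size is 1 MB). However, 100k looks a bit dangerous to me.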
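Putting the method and the replacement switch together might look like the sketch below. It assumes the buffer is sized from the normalized string and that only the written part of the buffer is copied out; the class name `SpanDiacritics` is hypothetical, and the Ł/ł entries are the only extra replacements shown.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class SpanDiacritics
{
    public static string RemoveDiacritics(string text)
    {
        ReadOnlySpan<char> normalized = text.Normalize(NormalizationForm.FormD);

        // Small inputs use the stack; larger ones fall back to a heap array.
        Span<char> buffer = normalized.Length < 1000
            ? stackalloc char[normalized.Length]
            : new char[normalized.Length];

        int written = 0;
        foreach (char c in normalized)
        {
            // Drop combining marks produced by the decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // Letters with no canonical decomposition need an explicit mapping.
            buffer[written++] = c switch
            {
                '\u0141' => 'L', // Ł
                '\u0142' => 'l', // ł
                _ => c
            };
        }

        // Only the first 'written' characters are valid output.
        return new string(buffer.Slice(0, written)).Normalize(NormalizationForm.FormC);
    }
}
```

The explicit mapping is needed because Ł has no canonical decomposition, so FormD leaves it as a single character and the category filter never sees a separate mark to remove.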

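                  【Comments】:

                    【Solution 20】:

                    The accepted answer is totally correct, but nowadays it should be updated to use the Rune type instead of CharUnicodeInfo, since C# and .NET changed the way strings are analysed in recent versions (Rune was added in .NET Core 3.0).

                    The following code for .NET 5+ is now recommended, as it goes further for non-Latin characters: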

                    static string RemoveDiacritics(string text) 
                    {
                        var normalizedString = text.Normalize(NormalizationForm.FormD);
                        var stringBuilder = new StringBuilder();
                    
                        foreach (var c in normalizedString.EnumerateRunes())
                        {
                            var unicodeCategory = Rune.GetUnicodeCategory(c);
                            if (unicodeCategory != UnicodeCategory.NonSpacingMark)
                            {
                                stringBuilder.Append(c);
                            }
                        }
                    
                        return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
                    }
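One way to see why enumerating runes goes further: a combining mark outside the Basic Multilingual Plane is stored as a surrogate pair, so a char-by-char check sees two surrogates and never a nonspacing mark. The sketch below uses U+1D167 (MUSICAL SYMBOL COMBINING TREMOLO-1) as an assumed example of such a mark; `RuneCategorySketch` and its members are hypothetical names.

```csharp
using System.Globalization;
using System.Text;

public static class RuneCategorySketch
{
    // 'a' followed by a supplementary-plane combining mark (two UTF-16 code units).
    public const string Sample = "a\U0001D167";

    // Category as seen by a char-based loop: looks only at the high surrogate.
    public static UnicodeCategory CharView(string s) =>
        CharUnicodeInfo.GetUnicodeCategory(s[1]);

    // Category as seen by a Rune-based loop: looks at the whole code point.
    public static UnicodeCategory RuneView(string s) =>
        Rune.GetUnicodeCategory(Rune.GetRuneAt(s, 1));
}
```

A char-based filter therefore keeps both surrogate halves of such a mark, while the Rune-based filter can recognise and drop it.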
                    

                    【Comments】:
