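How do you convert Html to plain text? Answers

【Question title】: How do you convert Html to plain text?
【Posted】: 2010-09-22 03:45:54
【Question body】:

I have snippets of Html stored in a table. Not entire pages, no tags or the like, just basic formatting.

I would like to be able to display that Html as text only, with no formatting, on a given page (actually just the first 30 - 50 characters, but that's the easy bit).

How do I place the "text" within that Html into a string as plain text?

So this piece of code: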

<b>Hello World.</b><br/><p><i>Is there anyone out there?</i><p>

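Becomes:

Hello World. Is there anyone out there?

【Comments on the question】:

You may want to use SgmlReader. code.msdn.microsoft.com/SgmlReader
blackbeltcoder.com/Articles/strings/convert-html-to-text has some pretty simple and straightforward code for converting HTML to plain text.
This was the right answer for what I needed - thanks!
There are some good suggestions from the W3C here: w3.org/Tools/html2things.html
How can a question be marked as a duplicate of one that was asked 6 months later? Seems a bit backwards...

Tags: c# asp.net html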

【Solution 1】:

Here is my solution:

public string StripHTML(string html)
{
    if (string.IsNullOrWhiteSpace(html)) return "";

    // could be stored in static variable
    var regex = new Regex("<[^>]+>|\\s{2}", RegexOptions.IgnoreCase);
    return System.Web.HttpUtility.HtmlDecode(regex.Replace(html, ""));
}

Example:

StripHTML("<p class='test' style='color:red;'>Here is my solution:</p>");
// output -> Here is my solution:
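
One detail worth checking in the pattern above: the `\s{2}` alternative deletes whitespace in pairs instead of collapsing it, so a `\r\n` between two words removes the gap entirely. A minimal sketch of the difference, assuming only System.Text.RegularExpressions; `WhitespaceDemo` and both method names are hypothetical and not part of the answer:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answer.
public static class WhitespaceDemo
{
    // Same pattern as StripHTML above: tags and whitespace PAIRS are deleted.
    public static string PairsRemoved(string html) =>
        Regex.Replace(html, "<[^>]+>|\\s{2}", "");

    // Alternative: delete tags, then collapse any whitespace run to one space.
    public static string Collapsed(string html) =>
        Regex.Replace(Regex.Replace(html, "<[^>]+>", ""), "\\s+", " ").Trim();
}
```

With the input `"<p>Hello\r\nWorld</p>"`, `PairsRemoved` joins the two words, while `Collapsed` keeps a single space between them. Which behaviour is wanted depends on how the stored snippets were written.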

【Comments】:

【Solution 2】:

The MIT-licensed HtmlAgilityPack has, in one of its samples, a method that converts HTML to plain text.

var plainText = HtmlUtilities.ConvertToPlainText(html);

Feed it an HTML string like

<b>hello, <i>world!</i></b>

And you'll get a plain text result like:

hello world!

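【Comments】:

I have used HtmlAgilityPack before, but I can't see any reference to ConvertToPlainText. Can you tell me where I can find it?
Horatio, it is included in one of the samples that come with HtmlAgilityPack: htmlagilitypack.codeplex.com/sourcecontrol/changeset/view/…
Actually, there isn't a built-in method for this in the Agility Pack. What you linked to is a sample that uses the Agility Pack to traverse the node tree, remove script and style tags, and write the inner text of the other elements into the output string. I doubt it has seen much testing with real-world input.
Can someone please provide working code, as opposed to links to samples that need to be retrofitted to work properly?
The sample can now be found here: github.com/ceee/ReadSharp/blob/master/ReadSharp/…

【Solution 3】:

For anyone looking for an exact solution to the OP's question, a textual abbreviation of a given html document without newlines and HTML tags, please find the solution below.

Just as with every proposed solution, the code below makes some assumptions:

script or style tags should not contain script and style tags as part of the script
only major inline elements are inlined without a space, i.e. he<span>ll</span>o should output hello. List of inline tags: https://www.w3schools.com/htmL/html_blocks.asp

Considering the above, the following string extension with compiled regular expressions outputs the expected plain text, with html escaped characters decoded, and returns null on null input.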

public static class StringExtensions
{
    public static string ConvertToPlain(this string html)
    {
        if (html == null)
        {
            return html;
        }

        html = scriptRegex.Replace(html, string.Empty);
        html = inlineTagRegex.Replace(html, string.Empty);
        html = tagRegex.Replace(html, " ");
        html = HttpUtility.HtmlDecode(html);
        html = multiWhitespaceRegex.Replace(html, " ");

        return html.Trim();
    }

    private static readonly Regex inlineTagRegex = new Regex("<\\/?(a|span|sub|sup|b|i|strong|small|big|em|label|q)[^>]*>", RegexOptions.Compiled | RegexOptions.Singleline);
    private static readonly Regex scriptRegex = new Regex("<(script|style)[^>]*?>.*?</\\1>", RegexOptions.Compiled | RegexOptions.Singleline);
    private static readonly Regex tagRegex = new Regex("<[^>]+>", RegexOptions.Compiled | RegexOptions.Singleline);
    private static readonly Regex multiWhitespaceRegex = new Regex("\\s+", RegexOptions.Compiled | RegexOptions.Singleline);
}

【Comments】:

【Solution 4】:

I think it has a simple answer:

public string RemoveHTMLTags(string HTMLCode)
{
    string str=System.Text.RegularExpressions.Regex.Replace(HTMLCode, "<[^>]*>", "");
    return str;
}
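
The function above removes tags only: entities such as `&amp;` stay encoded, and text from adjacent blocks runs together. A minimal sketch of one way to get the short preview the question asks for, assuming `System.Net.WebUtility` is available; `TagStripper` and `ToPreview` are hypothetical names. It replaces each tag with a space, which keeps block text apart but also splits inline markup such as he<span>ll</span>o into three words:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class TagStripper
{
    // Same pattern as RemoveHTMLTags above: anything between '<' and '>'.
    private static readonly Regex Tag = new Regex("<[^>]*>");

    // Strip tags, decode entities, collapse whitespace,
    // then keep at most maxLength characters.
    public static string ToPreview(string html, int maxLength)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;
        string text = Tag.Replace(html, " ");            // tag -> space
        text = WebUtility.HtmlDecode(text);              // &amp; -> &
        text = Regex.Replace(text, @"\s+", " ").Trim();  // collapse whitespace
        return text.Length <= maxLength ? text : text.Substring(0, maxLength);
    }
}
```

For the snippet in the question, `TagStripper.ToPreview(html, 50)` yields `Hello World. Is there anyone out there?`. This is a display helper only and should not be treated as a sanitizer.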

【Comments】:

【Solution 5】:

I faced a similar problem and found the best solution. The code below works perfectly for me.

  private string ConvertHtml_Totext(string source)
    {
     try
      {
      string result;

    // Remove HTML Development formatting
    // Replace line breaks with space
    // because browsers inserts space
    result = source.Replace("\r", " ");
    // Replace line breaks with space
    // because browsers inserts space
    result = result.Replace("\n", " ");
    // Remove step-formatting
    result = result.Replace("\t", string.Empty);
    // Remove repeating spaces because browsers ignore them
    result = System.Text.RegularExpressions.Regex.Replace(result,
                                                          @"( )+", " ");

    // Remove the header (prepare first by clearing attributes)
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*head([^>])*>","<head>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"(<( )*(/)( )*head( )*>)","</head>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(<head>).*(</head>)",string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // remove all scripts (prepare first by clearing attributes)
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*script([^>])*>","<script>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"(<( )*(/)( )*script( )*>)","</script>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    //result = System.Text.RegularExpressions.Regex.Replace(result,
    //         @"(<script>)([^(<script>\.</script>)])*(</script>)",
    //         string.Empty,
    //         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"(<script>).*(</script>)",string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // remove all styles (prepare first by clearing attributes)
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*style([^>])*>","<style>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"(<( )*(/)( )*style( )*>)","</style>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(<style>).*(</style>)",string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // insert tabs in spaces of <td> tags
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*td([^>])*>","\t",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // insert line breaks in places of <BR> and <LI> tags
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*br( )*>","\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*li( )*>","\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // insert line paragraphs (double line breaks) in place
    // if <P>, <DIV> and <TR> tags
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*div([^>])*>","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*tr([^>])*>","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*p([^>])*>","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // Remove remaining tags like <a>, links, images,
    // comments etc - anything that's enclosed inside < >
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<[^>]*>",string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // replace special characters:
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&nbsp;"," ",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&bull;"," * ",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&lsaquo;","<",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&rsaquo;",">",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&trade;","(tm)",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&frasl;","/",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&lt;","<",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&gt;",">",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&copy;","(c)",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&reg;","(r)",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    // Remove all others. More can be added, see
    // http://hotwired.lycos.com/webmonkey/reference/special_characters/
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&(.{2,6});", string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // for testing
    //System.Text.RegularExpressions.Regex.Replace(result,
    //       this.txtRegex.Text,string.Empty,
    //       System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // make line breaking consistent
    result = result.Replace("\n", "\r");

    // Remove extra line breaks and tabs:
    // replace over 2 breaks with 2 and over 4 tabs with 4.
    // Prepare first to remove any whitespaces in between
    // the escaped characters and remove redundant tabs in between line breaks
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\r)( )+(\r)","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\t)( )+(\t)","\t\t",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\t)( )+(\r)","\t\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\r)( )+(\t)","\r\t",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    // Remove redundant tabs
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\r)(\t)+(\r)","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    // Remove multiple tabs following a line break with just one tab
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\r)(\t)+","\r\t",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    // Initial replacement target string for line breaks
    string breaks = "\r\r\r";
    // Initial replacement target string for tabs
    string tabs = "\t\t\t\t\t";
    for (int index=0; index<result.Length; index++)
    {
        result = result.Replace(breaks, "\r\r");
        result = result.Replace(tabs, "\t\t\t\t");
        breaks = breaks + "\r";
        tabs = tabs + "\t";
    }

    // That's it.
    return result;
}
catch
{
    MessageBox.Show("Error");
    return source;
}

}

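Escape characters such as \n and \r have to be removed first, because they cause the regexes to stop working as expected.

Moreover, to make the result string display correctly in a textbox, you might need to split it up and set the textbox's Lines property instead of assigning to the Text property.

this.txtResult.Lines = ConvertHtml_Totext(this.txtSource.Text).Split("\r".ToCharArray());

Source: https://www.codeproject.com/Articles/11902/Convert-HTML-to-Plain-Text-2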
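
One thing to be aware of in the code above: the head, script and style removals use a greedy `.*`. Because line breaks have already been replaced by spaces at that point, the match runs from the first opening tag to the last closing tag, so any text between two script blocks is dropped as well. A minimal sketch of the difference; `GreedyVsLazy` and its method names are hypothetical, and the lazy `.*?` variant is a suggested tweak, not something the source article uses:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answer.
public static class GreedyVsLazy
{
    // Pattern as used in the answer: greedy, spans first <script> to last </script>.
    public static string Greedy(string html) =>
        Regex.Replace(html, "(<script>).*(</script>)", string.Empty,
                      RegexOptions.IgnoreCase);

    // Lazy variant: each script block is matched and removed on its own.
    public static string Lazy(string html) =>
        Regex.Replace(html, "(<script>).*?(</script>)", string.Empty,
                      RegexOptions.IgnoreCase);
}
```

For `"<script>a()</script>keep me<script>b()</script>"`, `Greedy` returns an empty string while `Lazy` returns `keep me`.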

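【Comments】:

This worked almost perfectly for me. I needed one small fix. This case does not produce a new line: <li xmlns=\"http://www.w3.org/1999/xhtml\">. With a simple tweak to the regex, I changed Regex.Replace(result, @"<( )*li( )*>", "\r" to Regex.Replace(result, @"<( )*li( )*[^>]*>", "\r"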

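【Solution 6】:

It has a limitation, in that it does not collapse long inline whitespace, but it is definitely portable and respects layout the way a web browser does.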

static string HtmlToPlainText(string html) {
  string buf;
  string block = "address|article|aside|blockquote|canvas|dd|div|dl|dt|" +
    "fieldset|figcaption|figure|footer|form|h\\d|header|hr|li|main|nav|" +
    "noscript|ol|output|p|pre|section|table|tfoot|ul|video";

  string patNestedBlock = $"(\\s*?</?({block})[^>]*?>)+\\s*";
  buf = Regex.Replace(html, patNestedBlock, "\n", RegexOptions.IgnoreCase);

  // Replace br tag to newline.
  buf = Regex.Replace(buf, @"<(br)[^>]*>", "\n", RegexOptions.IgnoreCase);

  // (Optional) remove styles and scripts.
  buf = Regex.Replace(buf, @"<(script|style)[^>]*?>.*?</\1>", "", RegexOptions.Singleline);

  // Remove all tags.
  buf = Regex.Replace(buf, @"<[^>]*(>|$)", "", RegexOptions.Multiline);

  // Replace HTML entities.
  buf = WebUtility.HtmlDecode(buf);
  return buf;
}

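【Comments】:

@Prof.Falken I admit it. I think every piece of code has pros and cons. Its weak point is robustness, and its strong point may be simplicity (in terms of sloc). You are welcome to post code using XDocument.

【Solution 7】:

I did not write this, but I use it: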

using HtmlAgilityPack;
using System;
using System.IO;
using System.Text.RegularExpressions;

namespace foo {
  //small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
  public static class HtmlToText {

    public static string Convert(string path) {
      HtmlDocument doc = new HtmlDocument();
      doc.Load(path);
      return ConvertDoc(doc);
    }

    public static string ConvertHtml(string html) {
      HtmlDocument doc = new HtmlDocument();
      doc.LoadHtml(html);
      return ConvertDoc(doc);
    }

    public static string ConvertDoc(HtmlDocument doc) {
      using (StringWriter sw = new StringWriter()) {
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
      }
    }

    internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {
      foreach (HtmlNode subnode in node.ChildNodes) {
        ConvertTo(subnode, outText, textInfo);
      }
    }
    public static void ConvertTo(HtmlNode node, TextWriter outText) {
      ConvertTo(node, outText, new PreceedingDomTextInfo(false));
    }
    internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {
      string html;
      switch (node.NodeType) {
        case HtmlNodeType.Comment:
          // don't output comments
          break;
        case HtmlNodeType.Document:
          ConvertContentTo(node, outText, textInfo);
          break;
        case HtmlNodeType.Text:
          // script and style must not be output
          string parentName = node.ParentNode.Name;
          if ((parentName == "script") || (parentName == "style")) {
            break;
          }
          // get text
          html = ((HtmlTextNode)node).Text;
          // is it in fact a special closing node output as text?
          if (HtmlNode.IsOverlappedClosingElement(html)) {
            break;
          }
          // check the text is meaningful and not a bunch of whitespaces
          if (html.Length == 0) {
            break;
          }
          if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace) {
            html = html.TrimStart();
            if (html.Length == 0) { break; }
            textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
          }
          outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
          if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1])) {
            outText.Write(' ');
          }
          break;
        case HtmlNodeType.Element:
          string endElementString = null;
          bool isInline;
          bool skip = false;
          int listIndex = 0;
          switch (node.Name) {
            case "nav":
              skip = true;
              isInline = false;
              break;
            case "body":
            case "section":
            case "article":
            case "aside":
            case "h1":
            case "h2":
            case "header":
            case "footer":
            case "address":
            case "main":
            case "div":
            case "p": // stylistic - adjust as you tend to use
              if (textInfo.IsFirstTextOfDocWritten) {
                outText.Write("\r\n");
              }
              endElementString = "\r\n";
              isInline = false;
              break;
            case "br":
              outText.Write("\r\n");
              skip = true;
              textInfo.WritePrecedingWhiteSpace = false;
              isInline = true;
              break;
            case "a":
              if (node.Attributes.Contains("href")) {
                string href = node.Attributes["href"].Value.Trim();
                if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase) == -1) {
                  endElementString = "<" + href + ">";
                }
              }
              isInline = true;
              break;
            case "li":
              if (textInfo.ListIndex > 0) {
                outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
              } else {
                outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
              }
              isInline = false;
              break;
            case "ol":
              listIndex = 1;
              goto case "ul";
            case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
              endElementString = "\r\n";
              isInline = false;
              break;
            case "img": //inline-block in reality
              if (node.Attributes.Contains("alt")) {
                outText.Write('[' + node.Attributes["alt"].Value);
                endElementString = "]";
              }
              if (node.Attributes.Contains("src")) {
                outText.Write('<' + node.Attributes["src"].Value + '>');
              }
              isInline = true;
              break;
            default:
              isInline = true;
              break;
          }
          if (!skip && node.HasChildNodes) {
            ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten) { ListIndex = listIndex });
          }
          if (endElementString != null) {
            outText.Write(endElementString);
          }
          break;
      }
    }
  }
  internal class PreceedingDomTextInfo {
    public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten) {
      IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
    }
    public bool WritePrecedingWhiteSpace { get; set; }
    public bool LastCharWasSpace { get; set; }
    public readonly BoolWrapper IsFirstTextOfDocWritten;
    public int ListIndex { get; set; }
  }
  internal class BoolWrapper {
    public BoolWrapper() { }
    public bool Value { get; set; }
    public static implicit operator bool(BoolWrapper boolWrapper) {
      return boolWrapper.Value;
    }
    public static implicit operator BoolWrapper(bool boolWrapper) {
      return new BoolWrapper { Value = boolWrapper };
    }
  }
}

【Comments】:

【Solution 8】:

I had the same question, except that my html had a simple, known layout, like:

<DIV><P>abc</P><P>def</P></DIV>

So I ended up using code as simple as this:

string.Join (Environment.NewLine, XDocument.Parse (html).Root.Elements ().Select (el => el.Value))

Which outputs:

abc
def

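【Comments】:

【Solution 9】:

Three-step process for converting HTML into plain text

First, you need to install the NuGet package for HtmlAgilityPack.

Second, create this class: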

public class HtmlToText
{
    public HtmlToText()
    {
    }

    public string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    private void ConvertContentTo(HtmlNode node, TextWriter outText)
    {
        foreach(HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText);
        }
    }

    public void ConvertTo(HtmlNode node, TextWriter outText)
    {
        string html;
        switch(node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;

            case HtmlNodeType.Document:
                ConvertContentTo(node, outText);
                break;

            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;

                // get text
                html = ((HtmlTextNode)node).Text;

                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;

                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html));
                }
                break;

            case HtmlNodeType.Element:
                switch(node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }

                if (node.HasChildNodes)
                {
                    ConvertContentTo(node, outText);
                }
                break;
        }
    }
}

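Use the above class with reference to Judah Himango's answer.

Third, you need to create an object of the above class and use the ConvertHtml(HTMLContent) method to convert HTML into plain text, rather than ConvertToPlainText(string html);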

HtmlToText htt=new HtmlToText();
var plainText = htt.ConvertHtml(HTMLContent);

【Comments】:

Can I skip converting the links in the html? I need to keep the links in the html when converting to text.

【Solution 10】:

The easiest way I found:

HtmlFilter.ConvertToPlainText(html);

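The HtmlFilter class is located in Microsoft.TeamFoundation.WorkItemTracking.Controls.dll

The dll can be found in a folder like this: %ProgramFiles%\Common Files\microsoft shared\Team Foundation Server\14.0\

In VS 2015, the dll also requires a reference to Microsoft.TeamFoundation.WorkItemTracking.Common.dll, located in the same folder.

【Comments】:

Does it take care of script tags, and does it handle formatting such as bold, italic, etc.?
Introducing a Team Foundation dependency just to convert html to plain text is very questionable...

【Solution 11】:

HttpUtility.HtmlEncode() is meant to handle encoding HTML tags as strings. It takes care of all the heavy lifting for you. From the MSDN Documentation:

If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and > are encoded as &lt; and &gt; for HTTP transmission.

The HttpUtility.HtmlEncode() method, detailed here: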

public static void HtmlEncode(
  string s,
  TextWriter output
)

Usage:

String TestString = "This is a <Test String>.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();
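
Note that encoding keeps the markup visible as literal text; it does not remove it. A minimal sketch of the round trip outside ASP.NET, assuming `System.Net.WebUtility` in place of `Server.HtmlEncode` (a swapped-in API, not the one the answer uses); `EncodeDemo` is a hypothetical name:

```csharp
using System;
using System.Net;

// Hypothetical demo class; WebUtility stands in for Server.HtmlEncode here.
public static class EncodeDemo
{
    // Encoding turns '<' and '>' into entities so the tags show up as text.
    public static string Encode(string s) => WebUtility.HtmlEncode(s);

    // Decoding reverses the encoding.
    public static string Decode(string s) => WebUtility.HtmlDecode(s);
}
```

`EncodeDemo.Encode("This is a <Test String>.")` returns `This is a &lt;Test String&gt;.`, and passing that result to `Decode` gives back the original string.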

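【Comments】:

Really great answer, George, thanks. It also highlighted how poorly I asked the question the first time. Sorry.
html agility pack is outdated and does not support html5

【Solution 12】:

I could not use HtmlAgilityPack, so I wrote a second-best solution for myself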

private static string HtmlToPlainText(string html)
{
    const string tagWhiteSpace = @"(>|$)(\W|\n|\r)+<";//matches one or more (white space or line breaks) between '>' and '<'
    const string stripFormatting = @"<[^>]*(>|$)";//match any character between '<' and '>', even when end tag is missing
    const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>";//matches: <br>,<br/>,<br />,<BR>,<BR/>,<BR />
    var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
    var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
    var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

    var text = html;
    //Decode html specific characters
    text = System.Net.WebUtility.HtmlDecode(text); 
    //Remove tag whitespace/line breaks
    text = tagWhiteSpaceRegex.Replace(text, "><");
    //Replace <br /> with line breaks
    text = lineBreakRegex.Replace(text, Environment.NewLine);
    //Strip formatting
    text = stripFormattingRegex.Replace(text, string.Empty);

    return text;
}

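【Comments】:

&lt;blabla&gt; was stripped out, so I moved text = System.Net.WebUtility.HtmlDecode(text); to the bottom of the method
This is great. I also added a multi-space condenser, as the html might have been generated from a CMS: var spaceRegex = new Regex("[ ]{2,}", RegexOptions.None);
Sometimes the html code contains the coder's own new lines (a new line can't be seen in a comment, so I show it as [new line]), like: I[new line]miss[new line]you, so it is supposed to show "I miss you", but it shows I[new line]miss[new line]you. This makes the plain text look painful. Do you know how to fix it?
@123iamking you can use this before returning text: text.Replace("[new line]", "\n");
I was using this and realized that it sometimes leaves '>' at the beginning of the string. The other solution of applying the regex <[^>]*> works fine.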
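
The first comment above, about moving HtmlDecode to the bottom of the method, comes down to ordering: decoding first turns encoded text into something that looks like a tag, which the stripping step then removes. A minimal sketch of both orders, reusing the answer's stripFormatting pattern; `DecodeOrder` and its method names are hypothetical:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answer.
public static class DecodeOrder
{
    // Same pattern as stripFormatting in the answer above.
    private static readonly Regex Tag = new Regex(@"<[^>]*(>|$)");

    // Decode first: "&lt;blabla&gt;" becomes "<blabla>" and is then stripped.
    public static string DecodeThenStrip(string html) =>
        Tag.Replace(WebUtility.HtmlDecode(html), string.Empty);

    // Strip first: the entity survives stripping and is decoded last.
    public static string StripThenDecode(string html) =>
        WebUtility.HtmlDecode(Tag.Replace(html, string.Empty));
}
```

For `"<p>Use &lt;blabla&gt; here</p>"`, `DecodeThenStrip` loses the word in angle brackets, while `StripThenDecode` returns `Use <blabla> here`. If the result is later written back into an HTML page, it needs to be encoded again.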

【Solution 13】:

There is no method named "ConvertToPlainText" in HtmlAgilityPack, but you can convert an html string to a CLEAR string with:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
var textString = doc.DocumentNode.InnerText;
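textString = Regex.Replace(textString, @"<(.|\n)*?>", string.Empty).Replace("&nbsp;", "");

That works for me. But I did not find a method named "ConvertToPlainText" in "HtmlAgilityPack".

【Comments】:

OK, this one is not good - you use an additional library just to find the document root node and then apply a regex to the whole root node? Either use HtmlAgilityPack to parse the html node by node, or use a regex to process the whole text in one go.

【Solution 14】:

I think the easiest way is to make a "string" extension method (based on what user Richard has suggested):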

using System;
using System.Text.RegularExpressions;

public static class StringHelpers
{
    public static string StripHTML(this string HTMLText)
        {
            var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
            return reg.Replace(HTMLText, "");
        }
}

Then just use this extension method on any "string" variable in your program:

var yourHtmlString = "<div class=\"someclass\"><h2>yourHtmlText</h2></span>";
var yourTextString = yourHtmlString.StripHTML();

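I use this extension method to convert html-formatted comments to plain text so that they are displayed correctly on a Crystal Report, and it works perfectly!

【Comments】:

【Solution 15】:

To add to vfilby's answer, you can just perform a RegEx replace within your code; no new classes are necessary. In case other newbies like myself stumble upon this question.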

using System.Text.RegularExpressions;

Then...

private string StripHtml(string source)
{
        string output;

        //get rid of HTML tags
        output = Regex.Replace(source, "<[^>]*>", string.Empty);

        //get rid of multiple blank lines
        output = Regex.Replace(output, @"^\s*$\n", string.Empty, RegexOptions.Multiline);

        return output;
}

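【Comments】:

NOT GOOD! This can be tricked into letting script through by omitting the closing angle bracket. GUYS, never do blacklisting. You cannot sanitize input by blacklisting. This is so wrong.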
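
The objection in the comment above is easy to reproduce: `<[^>]*>` only matches when a closing `>` is present, so an unterminated tag passes through untouched. A minimal sketch; `UnclosedTagDemo`, its method names and the example.invalid URL are hypothetical. The second pattern, which also accepts end of input, is the one used in another answer in this thread; it closes this particular gap but is still not a sanitizer:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answer.
public static class UnclosedTagDemo
{
    // Pattern from the answer above: requires the closing '>'.
    public static string StripClosedOnly(string source) =>
        Regex.Replace(source, "<[^>]*>", string.Empty);

    // Variant that also treats end of input as the end of a tag.
    public static string StripToEnd(string source) =>
        Regex.Replace(source, @"<[^>]*(>|$)", string.Empty);
}
```

For `"Hello <script src=//example.invalid/x.js"`, `StripClosedOnly` returns the input unchanged, while `StripToEnd` returns `Hello ` with the trailing space.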

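【Solution 16】:

If you are talking about tag stripping, it is relatively straightforward as long as you don't have to worry about things like <script> tags. If all you need to do is display the text without the tags, you can accomplish that with a regular expression: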

<[^>]*>

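If you do have to worry about <script> tags and the like, then you will need something a bit more powerful than regular expressions, because you need to track state, something more like a Context Free Grammar (CFG). Although you might be able to accomplish it with "left to right" or non-greedy matching.

If you can use regular expressions, there are many web pages out there with good info:

If you need the more complex behaviour of a CFG, I would suggest using a third-party tool; unfortunately I don't know of a good one to recommend.

【Comments】:

You also have to worry about > in attribute values, comments, PIs/CDATA in XML, and various common malformednesses in legacy HTML. In general [X][HT]ML is not amenable to parsing with regexps.
This is a terrible method. The correct way is to parse the HTML with a lib and traverse the dom, outputting only whitelisted content.
@usr: The part you are referring to is the CFG part of the answer. Regex can be used for quick and dirty tag stripping; it has its weaknesses, but it is quick and easy. For more complicated parsing, use a CFG-based tool (in your parlance, a lib that generates a DOM). I haven't run the tests, but I'd wager that DOM parsing is slower than regex stripping, in case performance needs to be considered.
@vfilby: No! Tag stripping is blacklisting. Just as an example of what you forgot: your regex won't strip tags that are missing the closing '>'. Did you think of that? I'm not sure whether this can be a problem, but it proves at least that you missed this case. Who knows what else you missed. Here's another one: you miss images with a javascript src attribute. NEVER do blacklisting, except when security is not important.
@vfilby, the first attack that comes to mind is writing "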

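【Solution 17】:

Depends on what you mean by "html". The most complex case would be complete web pages. That is also the easiest to handle, since you can use a text-mode web browser. See the Wikipedia article listing web browsers, including text-mode browsers. Lynx is probably the best known, but one of the others may be better for your needs.

【Comments】:

As he said, "I have snippets of Html stored in a table."

【Solution 18】:

public static string StripTags2(string html) { return html.Replace("<", "&lt;").Replace(">", "&gt;"); }

This way you escape every "<" and ">" in the string. Is this what you want?

【Comments】:

...ah. Well, now that the answer (along with the interpretation of the ambiguous question) has completely changed, I'll pick nits at the lack of &amp; encoding instead. ;-)
I don't think it's a good idea to reinvent the wheel - especially when your wheel is square. You should use HTMLEncode instead.

【Solution 19】:

If you have data that has HTML tags and you want to display it so that a person can SEE the tags, use HttpServerUtility::HtmlEncode.

If you have data that has HTML tags in it and you want the user to see the tags rendered, then display the text as is. If the text represents an entire web page, use an IFRAME for it.

If you have data that has HTML tags and you want to strip out the tags and just display the unformatted text, use a regular expression.

【Comments】:

In php there is a function called striptags(); perhaps you have something similar
"use a regular expression" NO! This would be blacklisting. You can only be safe doing whitelisting. For example, did you remember that the style attribute can contain "background: url('javascript:...');"? Of course not, I wouldn't have either. That's why blacklisting doesn't work.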