Trim string to length ignoring HTML: answers

【Question title】: Trim string to length ignoring HTML
【Posted】: 2009-04-09 22:50:14
【Question】:

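This one is a challenging problem. Our application lets users post news on the home page. The news is entered through a rich text editor that allows HTML. On the home page we only want to show a truncated summary of each news item.

For example, here is the full text we are displaying, including HTML: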

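In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us.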

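We want to trim the news item to 250 characters, not counting the HTML.

The method we currently use for trimming counts the HTML as well, which means that HTML-heavy news posts get truncated far too early.

For example, if the sample above contained a lot of HTML, it could end up looking like this:

In an attempt to make a bit more space in the office, kitchen, I've pulled...

This is not what we want.

Does anyone have a way of tokenizing the HTML tags so that their positions in the string are preserved, performing the length check and/or trim on the string, and then restoring the HTML to its old positions?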

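【Comments】:

I guess the real problem is closing the tags that are still open once the maximum text length is reached.
Can we assume from your profile that the app is written in asp.net?
Yes, it's ASP.NET, C#. To take care of the closing tags, we just run the result through SGML Reader to convert it back to XHTML.

Tags: html string truncate tokenize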

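【Solution 1】:

Start at the first character of the post and step over each character. Every time you step over a character, increment a counter. When you find a '<' character, stop incrementing the counter until you hit a '>' character. Your position when the counter reaches 250 is where you actually want to cut.

Note that this leaves you with another problem to deal with: an HTML tag may have been opened, but not closed, before the cutoff.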
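
A minimal Java sketch of the counting idea described above might look like the following. The class and method names (`VisibleCut`, `cutIndex`) are made up for illustration, and the sketch assumes that every '<' opens a tag and every '>' closes one, i.e. that literal angle brackets in the text have already been entity-encoded.

```java
final class VisibleCut {

    /**
     * Returns the index just past the limit-th visible (non-tag) character,
     * or html.length() if the text has fewer visible characters than limit.
     * Assumption: '<' and '>' only ever appear as tag delimiters.
     */
    static int cutIndex(String html, int limit) {
        if (limit <= 0) {
            return 0;
        }
        boolean inTag = false;
        int visible = 0;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;            // stop counting until the tag ends
            } else if (c == '>' && inTag) {
                inTag = false;           // resume counting after the tag
            } else if (!inTag) {
                visible++;
                if (visible == limit) {
                    return i + 1;        // cut right after this character
                }
            }
        }
        return html.length();
    }
}
```

Cutting with `html.substring(0, VisibleCut.cutIndex(html, 250))` never splits a tag, but it can still leave tags unclosed, which is exactly the follow-up problem mentioned above.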

【Discussion】:

It's amazing how you can be too close to a problem to see the simplest solution. This works like a charm.
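You'll run into trouble the first time the text itself contains a "<" or ">", unless you can be 100% sure your text will never contain those characters.
Yes, we encode the content before this step.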
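You need to keep a stack of open tags (when you find '<', push the name up to the first space or '>', whichever comes first) and pop when a tag is closed; when you're done, pop every item left on the stack and add its closing tag.
@Osama ALASSIRY: No, that would be silly, because an HTML tag can't contain another HTML tag (i.e. nesting one element inside another is legal, but putting a tag inside another tag's angle brackets is not).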

【Solution 2】:

Following the suggestion of a 2-state finite state machine, I have just written a simple HTML parser for this purpose, in Java:

http://pastebin.com/jCRqiwNH

Here is a test case:

http://pastebin.com/37gCS4tV

And here is the Java code:

import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

public class HtmlShortener {

    private static final String TAGS_TO_SKIP = "br,hr,img,link";
    private static final String[] tagsToSkip = TAGS_TO_SKIP.split(",");
    private static final int STATUS_READY = 0;

    private int cutPoint = -1;
    private String htmlString = "";

    final List<String> tags = new LinkedList<String>();

    StringBuilder sb = new StringBuilder("");
    StringBuilder tagSb = new StringBuilder("");

    int charCount = 0;
    int status = STATUS_READY;

    public HtmlShortener(String htmlString, int cutPoint){
        this.cutPoint = cutPoint;
        this.htmlString = htmlString;
    }

    public String cut(){

        // reset 
        tags.clear();
        sb = new StringBuilder("");
        tagSb = new StringBuilder("");
        charCount = 0;
        status = STATUS_READY;

        String tag = "";

        if (cutPoint < 0){
            return htmlString;
        }

        if (null != htmlString){

            if (cutPoint == 0){
                return "";
            }

            for (int i = 0; i < htmlString.length(); i++){

                String strC = htmlString.substring(i, i+1);


                if (strC.equals("<")){

                    // new tag or tag closure

                    // previous tag reset
                    tagSb = new StringBuilder("");
                    tag = "";

                    // find tag type and name
                    for (int k = i; k < htmlString.length(); k++){

                        String tagC = htmlString.substring(k, k+1);
                        tagSb.append(tagC);

                        if (tagC.equals(">")){
                            tag = getTag(tagSb.toString());
                            if (tag.startsWith("/")){

                                // closure: guard against a stray closing tag with nothing open
                                if (!isToSkip(tag) && !tags.isEmpty()){
                                    sb.append("</").append(tags.get(tags.size() - 1)).append(">");
                                    tags.remove((tags.size() - 1));
                                }

                            } else {

                                // new tag
                                sb.append(tagSb.toString());

                                if (!isToSkip(tag)){
                                    tags.add(tag);  
                                }

                            }

                            i = k;
                            break;
                        }

                    }

                } else {

                    sb.append(strC);
                    charCount++;

                }

                // cut check
                if (charCount >= cutPoint){

                    // close previously open tags
                    Collections.reverse(tags);
                    for (String t : tags){
                        sb.append("</").append(t).append(">");
                    }
                    break;
                } 

            }

            return sb.toString();

        } else {
            return null;
        }

    }

    private boolean isToSkip(String tag) {

        if (tag.startsWith("/")){
            tag = tag.substring(1, tag.length());
        }

        for (String tagToSkip : tagsToSkip){
            if (tagToSkip.equals(tag)){
                return true;
            }
        }

        return false;
    }

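    private String getTag(String tagString) {

        // strip the leading "<" and the trailing ">"
        String name = tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(">"));

        if (name.contains(" ")){
            // tag with attributes: the name ends at the first space
            name = name.substring(0, name.indexOf(" "));
        }

        if (name.length() > 1 && name.endsWith("/")){
            // self-closing tag such as <br/>: drop the slash so isToSkip() can match it
            name = name.substring(0, name.length() - 1);
        }

        return name;
    }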

}

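【Discussion】:

【Solution 3】:

If I understand the problem correctly, you want to keep the HTML formatting, but you don't want it counted as part of the length of the string you keep.

You can do this with code that implements a simple finite state machine.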

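2 states: InTag, OutOfTag
InTag:
- Goes to OutOfTag if a > character is encountered
- Goes to itself on any other character
OutOfTag:
- Goes to InTag if a < character is encountered
- Goes to itself on any other character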

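Your starting state will be OutOfTag.

You implement the finite state machine by processing one character at a time. Processing each character takes you to a new state.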
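
The two states and their transitions could be written down in Java roughly as below. This is only an illustrative sketch: `TagStateMachine`, `next` and `visibleLength` are hypothetical names, and the code assumes '<' and '>' never appear outside of tag delimiters.

```java
final class TagStateMachine {

    enum State { IN_TAG, OUT_OF_TAG }

    /** Transition function: the state reached after reading c in state s. */
    static State next(State s, char c) {
        switch (s) {
            case IN_TAG:
                return c == '>' ? State.OUT_OF_TAG : State.IN_TAG;
            default: // OUT_OF_TAG
                return c == '<' ? State.IN_TAG : State.OUT_OF_TAG;
        }
    }

    /** Counts the characters that lie outside of tags. */
    static int visibleLength(String html) {
        State s = State.OUT_OF_TAG;   // starting state
        int length = 0;
        for (int i = 0; i < html.length(); i++) {
            State t = next(s, html.charAt(i));
            // Only a character read in OUT_OF_TAG that keeps us there is text;
            // the '<' and '>' delimiters themselves are not counted.
            if (s == State.OUT_OF_TAG && t == State.OUT_OF_TAG) {
                length++;
            }
            s = t;
        }
        return length;
    }
}
```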

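As you run the text through the finite state machine, you also want to keep an output buffer and a variable holding the length encountered so far (so that you know when to stop).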

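1. Every time you are in the OutOfTag state and process another character, increment the Length variable. You can optionally choose not to increment it for whitespace characters.
2. You end the algorithm when there are no more characters or when you have reached the desired length mentioned in #1.
3. In your output buffer, include the characters you encounter up to the length mentioned in #1.
4. Keep a stack of unclosed tags. When you reach the length, add a closing tag for each element on the stack. While running the algorithm, you can tell when you have encountered a tag by keeping a current_tag variable. This current_tag variable is started when you enter the InTag state and ended when you enter the OutOfTag state (or when a whitespace character is encountered while in the InTag state). If you have a start tag, push it onto the stack. If you have an end tag, pop it from the stack.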
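
Putting the four steps together, a compact Java sketch could look like this. It is an assumption-laden illustration rather than a full HTML parser: `HtmlTruncator`, `truncate` and `VOID_TAGS` are hypothetical names, the list of void elements is only a subset, whitespace is counted as visible text, and attribute values or comments containing '>' are not handled.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class HtmlTruncator {

    // Elements that never take a closing tag (illustrative subset, an assumption).
    private static final Set<String> VOID_TAGS =
            new HashSet<>(Arrays.asList("br", "hr", "img", "input", "link", "meta"));

    /** Keeps at most limit visible characters and closes any tags left open. */
    static String truncate(String html, int limit) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // stack of unclosed tag names
        int visible = 0;
        int i = 0;
        while (i < html.length() && visible < limit) {
            char c = html.charAt(i);
            if (c != '<') {                        // OutOfTag: copy and count
                out.append(c);
                visible++;
                i++;
                continue;
            }
            int end = html.indexOf('>', i);        // InTag: copy the whole tag
            if (end < 0) {
                break;                             // unterminated tag: drop the rest
            }
            String tag = html.substring(i, end + 1);
            String name = tagName(tag);
            out.append(tag);
            if (tag.startsWith("</")) {
                if (name.equals(open.peek())) {
                    open.pop();                    // matching end tag
                }
            } else if (!name.isEmpty() && !tag.endsWith("/>") && !VOID_TAGS.contains(name)) {
                open.push(name);                   // start tag that needs closing
            }
            i = end + 1;
        }
        while (!open.isEmpty()) {                  // close what is still open, innermost first
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    /** Extracts the lower-cased element name from "<name ...>" or "</name>". */
    private static String tagName(String tag) {
        int start = tag.startsWith("</") ? 2 : 1;
        int end = start;
        while (end < tag.length() && Character.isLetterOrDigit(tag.charAt(end))) {
            end++;
        }
        return tag.substring(start, end).toLowerCase(Locale.ROOT);
    }
}
```

For example, `HtmlTruncator.truncate("<p>Hello <b>brave new</b> world</p>", 9)` keeps nine visible characters and returns `<p>Hello <b>bra</b></p>`.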

【Discussion】:

【Solution 4】:

Here is the implementation I came up with, in C#:

public static string TrimToLength(string input, int length)
{
  if (string.IsNullOrEmpty(input))
    return string.Empty;

  if (input.Length <= length)
    return input;

  bool inTag = false;
  int targetLength = 0;

  for (int i = 0; i < input.Length; i++)
  {
    char c = input[i];

    if (c == '>')
    {
      inTag = false;
      continue;
    }

    if (c == '<')
    {
      inTag = true;
      continue;
    }

    if (inTag || char.IsWhiteSpace(c))
    {
      continue;
    }

    targetLength++;

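    if (targetLength == length)
    {
      // ConvertToXhtml is not shown in this answer; it is expected to close any
      // tags left open by the cut (the question's comments mention SGML Reader for this).
      return ConvertToXhtml(input.Substring(0, i + 1));
    }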
  }

  return input;
}

And here are a few of the unit tests I used while developing it with TDD:

[Test]
public void Html_TrimReturnsEmptyStringWhenNullPassed()
{
  Assert.That(Html.TrimToLength(null, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsEmptyStringWhenEmptyPassed()
{
  Assert.That(Html.TrimToLength(string.Empty, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsUnmodifiedStringWhenSameAsLength()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                  "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                  "<br/>" +
                  "In an attempt to make a bit more space in the office, kitchen, I";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(source));
}

[Test]
public void Html_TrimWellFormedHtml()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
             "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
             "<br/>" +
             "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
             "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>" +
             "</div>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                    "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                    "<br/>" +
                    "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(expected));
}

[Test]
public void Html_TrimMalformedHtml()
{
  string malformedHtml = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                         "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                         "<br/>" +
                         "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
                         "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
              "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
              "<br/>" +
              "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(malformedHtml, 250), Is.EqualTo(expected));
}

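【Discussion】:

What happens if you have a table as part of your HTML? Your code won't trim the string in the middle of a tag, but it may trim it before the tag has been closed.
How would it do that, given that it doesn't trim inside an open tag?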