Need RegEx to return first paragraph or first n words: answers

【Question title】: Need RegEx to return first paragraph or first n words
【Posted】: 2025-11-28 16:00:02
【Question description】:

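I'm looking for a regular expression that returns the first [n] words of a paragraph or, if the paragraph contains fewer than [n] words, the whole paragraph.

For example, suppose I need at most the first 7 words: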

<p>one two <tag>three</tag> four five, six seven eight nine ten.</p><p>ignore</p>

I would get:

one two <tag>three</tag> four five, six seven

And the same RegEx, applied to a paragraph containing fewer than the requested number of words:

<p>one two <tag>three</tag> four five.</p><p>ignore</p>

would simply return:

one two <tag>three</tag> four five.

My attempt at the problem produced the following regular expression:

^(?:\<p.*?\>)((?:\w+\b.*?){1,7}).*(?:\</p\>)

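However, this doesn't work: it only returns the first word, "one". I think the .*? (after \w+\b) is causing the problem.

Where am I going wrong? Can anyone provide a regular expression that works?

FYI, I'm using the .NET 3.5 RegEx engine (via C#).

Many thanks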

【Question discussion】:

Tags: c# regex

【Solution 1】:

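OK, complete re-edit to acknowledge the new "spec" :)

I'm pretty sure you can't do this with a single regex. The best tool is definitely an HTML parser. The closest I can get with regexes is a two-step approach.

First, isolate the contents of each paragraph: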

<p>(.*?)</p>

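If paragraphs can span multiple lines, you need to set RegexOptions.Singleline.

Then, in the next step, iterate over your matches and apply the following regex once to each match's Groups[1].Value: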

((?:(\S+\s+){1,6})\w+)

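This will match the first seven items separated by spaces/tabs/newlines, ignoring any trailing punctuation or non-word characters.

BUT it will treat a tag that is delimited by whitespace as one of those items, i.e. in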

One, two three <br\> four five six seven

it will only match up to "six". I guess that, with regexes, there is no way around that.
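
The two steps above could be wired together roughly as follows. This is only a sketch: the class name ParagraphWords and the method FirstWordsOfEachParagraph are made up for illustration, and the fallback for a paragraph that contains a single word (which the second regex cannot match, since it requires at least one whitespace-separated item before the final word) is an assumption that is not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class ParagraphWords
{
    // Step 1: capture the contents of each <p>...</p>.
    // Singleline lets '.' match newlines, so paragraphs may span lines.
    private static readonly Regex Paragraph =
        new Regex(@"<p>(.*?)</p>", RegexOptions.Singleline);

    // Step 2: up to six whitespace-terminated items, then one final word.
    private static readonly Regex FirstSeven =
        new Regex(@"((?:(\S+\s+){1,6})\w+)");

    public static List<string> FirstWordsOfEachParagraph(string html)
    {
        var results = new List<string>();
        foreach (Match p in Paragraph.Matches(html))
        {
            string content = p.Groups[1].Value;
            Match words = FirstSeven.Match(content);

            // Assumption: when the paragraph has no whitespace at all,
            // the second regex fails, so return the content unchanged.
            results.Add(words.Success ? words.Value : content);
        }
        return results;
    }
}
```

For the first example in the question, the first list entry would be "one two <tag>three</tag> four five, six seven" and the second would be "ignore". For the shorter example the first entry ends at "five", without the trailing period, because the final \w+ does not consume punctuation.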

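【Discussion】:

This is perfect - cheers! I know there will never be nested p tags, so RegEx is a good fit.
Thanks for your effort - I really appreciate it (and thanks for pointing out the oversight in my original "spec").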

【Solution 2】:

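1. Use an HTML parser to get the first paragraph and flatten its structure (i.e. remove the decorative HTML tags inside the paragraph).
2. Search for the position of the nth whitespace character.
3. Take the substring from 0 to that position.

Edit: I removed the regex proposal for steps 2 and 3 because it was wrong (thanks to the commenters). Also, the HTML structure needs to be flattened.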
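
Steps 2 and 3 might look something like the sketch below. It assumes step 1 has already produced flat text with no tags left, and it counts runs of whitespace rather than single whitespace characters so that double spaces do not throw off the word count. The names WordTruncation and FirstNWords are invented for this example.

```csharp
using System;

// Hypothetical helper illustrating steps 2 and 3 only.
public static class WordTruncation
{
    // Returns the text up to (not including) the whitespace that follows
    // the nth word, or the whole trimmed text if it has n or fewer words.
    public static string FirstNWords(string flatText, int n)
    {
        if (n <= 0) return string.Empty;

        // After trimming, a non-empty string starts with a non-whitespace
        // character, so the text[i - 1] access below is never reached at i == 0.
        string text = flatText.Trim();
        int wordsSeen = 0;

        for (int i = 0; i < text.Length; i++)
        {
            // A word ends where non-whitespace is followed by whitespace.
            if (char.IsWhiteSpace(text[i]) && !char.IsWhiteSpace(text[i - 1]))
            {
                wordsSeen++;
                if (wordsSeen == n) return text.Substring(0, i);
            }
        }

        // Fewer than n word endings found: return everything.
        return text;
    }
}
```

With the flattened text of the first example, FirstNWords("one two three four five, six seven eight nine ten.", 7) returns "one two three four five, six seven"; with only two words in the input, the whole input comes back.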

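【Discussion】:

In a character class, \b matches a backspace character. Also, the problem definition seems to have changed since you posted this; \w and \W won't cut it.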
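
A small check of the \b behaviour mentioned in this comment, as a sketch against the .NET regex engine. The class and method names are invented for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are not from the thread.
public static class BackspaceDemo
{
    // Outside a character class, \b is a zero-width word boundary.
    public static bool HasWordBoundary(string input)
    {
        return Regex.IsMatch(input, @"\b");
    }

    // Inside a character class, [\b] matches the backspace character (U+0008).
    public static bool HasBackspace(string input)
    {
        return Regex.IsMatch(input, @"[\b]");
    }
}
```

HasBackspace("one two") is false even though the string contains word boundaries, while HasBackspace("one\btwo") is true, because the C# string literal "\b" is an actual backspace character.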

【Solution 3】:

I had the same problem and combined a few Stack Overflow answers into this class. It uses HtmlAgilityPack, which is a better tool for the job. Call:

 Words(string html, int n)

to get the first n words.

using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;


namespace UmbracoUtilities
{
    public class Text
    {
      /// <summary>
      /// Return the first n words in the html
      /// </summary>
      /// <param name="html"></param>
      /// <param name="n"></param>
      /// <returns></returns>
      public static string Words(string html, int n)
      {
        //strip the markup first, then truncate the plain text
        string text = StripHtml(html);
        return GetNWords(text, n);
      }


      /// <summary>
      /// Returns the first n words in text
      /// Assumes text is not a html string
      /// http://*.com/questions/13368345/get-first-250-words-of-a-string
      /// </summary>
      /// <param name="text"></param>
      /// <param name="n"></param>
      /// <returns></returns>
      public static string GetNWords(string text, int n)
      {
        //collapse runs of whitespace into single spaces
        //http://*.com/questions/1279859/how-to-replace-multiple-white-spaces-with-one-white-space
        string cleanedString = System.Text.RegularExpressions.Regex.Replace(text, @"\s+", " ");

        //drop empty entries so leading/trailing whitespace is not counted as a word
        //(the original used Take(n + 1) to compensate for a leading space)
        IEnumerable<string> words = cleanedString
          .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
          .Take(n);

        //join with single spaces; ToArray keeps this usable on .NET 3.5
        return string.Join(" ", words.ToArray());
      }


      /// <summary>
      /// Returns a string of html with tags removed
      /// </summary>
      /// <param name="html"></param>
      /// <returns></returns>
      public static string StripHtml(string html)
      {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        var root = document.DocumentNode;
        var stringBuilder = new StringBuilder();

        foreach (var node in root.DescendantsAndSelf())
        {
          if (!node.HasChildNodes)
          {
            string text = node.InnerText;
            if (!string.IsNullOrEmpty(text))
              stringBuilder.Append(" " + text.Trim());
          }
        }

        return stringBuilder.ToString();
      }



    }
}

Merry Christmas!

【Discussion】: