[Posted]: 2019-08-31 04:39:21
[Question]:
I have an HTML file with the following content:
</div><div class="\"more-detail-caption\"">More Numbers :</div><div id="\"moreHLNumbers\"" title="\"HSBC" bank="" helpline="" number\"="" class="\"more-detail-text\""><a href='tel:18605002277'>1860 500 2277 </a><a class='cchlOtherNoDescription'>( Credit Card - From India )</a><br><a href='tel:18602662667'>1860 266 2667 </a><a class='cchlOtherNoDescription'>( Personal Banking - From India )</a><br><a href='tel:18605002255'>1860 500 2255 </a><a class='cchlOtherNoDescription'>( Personal Banking - From India )</a><br><a href='tel:18004192266'>1800 419 2266 </a><a class='cchlOtherNoDescription'>( Corporate Cards - From India )</a><br><a href='tel:18001026922'>1800 102 6922 </a><a class='cchlOtherNoDescription'>( Corporate Cards - From India )</a><br><a href='tel:18002673456'>1800 267 3456 </a><a class='cchlOtherNoDescription'>( HSBC Advance - From India )</a><br><a href='tel:18001022208'>1800 102 2208 </a><a class='cchlOtherNoDescription'>( HSBC Advance - From India )</a><br><a href='tel:18002663456'>1800 266 3456 </a><a class='cchlOtherNoDescription'>( HSBC Premier - From India )</a><br><a href='tel:18001034722'>1800 103 4722 </a><a class='cchlOtherNoDescription'>( HSBC Premier - From India )</a><br><a href='tel:+912266800001'>022 66800001 </a><a class='cchlOtherNoDescription'>( Credit Card - From Overseas )
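I want to extract these numbers together with their descriptions using a regular expression, for example "1860 266 2667 (Personal Banking - From India)", along with the corresponding XPath, in C#. So far I have come up with the following code, which only removes the extra tags and defines the regular expression for extracting the numbers.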
using System.IO;
using System.Linq;
using HtmlAgilityPack;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    public class Program
    {
        // Pattern intended to match phone numbers.
        private static string phoneReg = @"[\+]{0,1}(\d{10,13}|[\(][\+]{0,1}\d{2,}[\13)]*\d{5,13}|\d{2,6}[\-]{1}\d{2,13}[\-]*\d{3,13})";
        private static Regex phoneRegex = new Regex(phoneReg, RegexOptions.IgnoreCase);

        public static void Main()
        {
            HtmlDocument doc = new HtmlDocument();
            doc.Load(@"C:\htmldoc\htmlsample.html");

            // Remove elements that are not needed. ToList() takes a snapshot,
            // so nodes can be removed while iterating.
            doc.DocumentNode.Descendants()
                .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "svg" || n.Name == "button"
                    || n.Name == "li" || n.Name == "link" || n.Name == "img" || n.Name == "head" || n.Name == "header" || n.Name == "input")
                .ToList()
                .ForEach(n => n.Remove());

            // Search the remaining text for phone numbers (the result is not used yet).
            var phoneMatches = phoneRegex.Matches(doc.DocumentNode.InnerText);

            // Write the cleaned HTML. The verbatim string @"\t" matches a literal
            // backslash followed by 't', not a tab character.
            File.WriteAllText(@"C:\htmldoc\new.html", doc.DocumentNode.InnerHtml.Replace(@"\t", ""));
        }
    }
}
However, I am also running into some problems extracting the numbers. Can someone help me with this?
Thanks in advance.
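One likely reason the pattern above finds nothing in this sample is that none of its alternatives allows spaces between digit groups, while the visible numbers are written as "1860 500 2277". A minimal regex-only sketch follows, assuming the markup always looks like the sample: a single-quoted `tel:` link immediately followed by a `cchlOtherNoDescription` link whose text is wrapped in parentheses. The class name `HelplineExtractor` and the group names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class HelplineExtractor
{
    // Pairs each tel: link with the description link that directly follows it.
    // Assumes single-quoted attributes exactly as in the sample HTML.
    private static readonly Regex EntryRegex = new Regex(
        @"<a\s+href='tel:(?<tel>\+?\d+)'>\s*(?<number>[^<]*?)\s*</a>\s*" +
        @"<a\s+class='cchlOtherNoDescription'>\s*\(\s*(?<desc>[^<)]*?)\s*\)",
        RegexOptions.IgnoreCase);

    // Returns one "number (description)" string per matched pair.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in EntryRegex.Matches(html))
        {
            string number = m.Groups["number"].Value; // text shown to the user
            string desc = m.Groups["desc"].Value;     // text between the parentheses
            results.Add(number + " (" + desc + ")");
        }
        return results;
    }
}
```

It could be called as `HelplineExtractor.Extract(File.ReadAllText(@"C:\htmldoc\htmlsample.html"))`. The `tel` group holds the digits-only form from the `href` in case that is more useful than the displayed text. This approach is brittle: any change in quoting or attribute order would break the match.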
[Comments]:
-
Hi, why don't you use an HTML parser such as Html Agility Pack for this job: html-agility-pack.net/?z=codeplex. That sounds a lot easier to me.
-
I have already done that; I need the description as well as the phone number. I am using HtmlAgilityPack.
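Since HtmlAgilityPack is already in use, the number and its description could also be paired with XPath instead of a regex. The sketch below is only an illustration under two assumptions: every phone link has an `href` starting with `tel:`, and its description is the next sibling `<a>` with class `cchlOtherNoDescription`. The class name `HelplineXPathExtractor` is made up.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of the original code.
public static class HelplineXPathExtractor
{
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<string>();

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection telLinks =
            doc.DocumentNode.SelectNodes("//a[starts-with(@href, 'tel:')]");
        if (telLinks == null) return results;

        foreach (HtmlNode link in telLinks)
        {
            string number = HtmlEntity.DeEntitize(link.InnerText).Trim();

            // First following sibling <a> that carries the description class.
            // Assumption: every tel: link has its own description link.
            HtmlNode descNode = link.SelectSingleNode(
                "following-sibling::a[@class='cchlOtherNoDescription'][1]");
            string desc = descNode == null
                ? ""
                : HtmlEntity.DeEntitize(descNode.InnerText).Trim();

            results.Add(desc.Length == 0 ? number : number + " " + desc);
        }
        return results;
    }
}
```

If a phone link ever lacks its own description, the `following-sibling` step would pick up the description of a later entry, so the result may need a check that the description node immediately follows the link.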
Tags: c# html regex xpath html-agility-pack