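Removing HTML attributes from an HTML string value using regex: answers

【Question title】: Removing html attributes from an html string value using regex
【Posted】: 2021-11-22 18:21:05
【Question】:

I need to remove HTML attributes from an HTML string. I have some rich-text input fields that let users copy and paste text while keeping basic HTML. The problem is that some text copied from a Word document carries attributes that need to be removed. The regex I am using works in a regex tester, but in my code no attributes are removed.

Code that removes the attributes: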

var stringhtml = '<div class="Paragraph  BCX0 SCXW244271589" paraid="1364880375" paraeid="{8e523337-60c9-4b0d-8c73-fb1a70a2ba58}{165}" style="margin-bottom: 0px;margin-left:96px;padding:0px;user-select:text;-webkit-user-drag:none;-webkit-tap-highlight-color:transparent; overflow-wrap: break-word;">some text</div>'

var regex = /[a-zA-Z]*=".*?"/;

var replacedstring = stringhtml.replace(regex, '');

document.write(replacedstring);

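Any help is appreciated!

【Comments】:

You forgot the g flag: /[a-zA-Z]*=".*?"/g
You could also add the i flag and replace [a-zA-Z] with [a-z]. Also note that both ' and " are valid delimiters for attribute values. You could try the regex \s*[a-zA-Z]*=["'].*?["']\s* since it also removes any whitespace before and after the attribute.
I don't see why you need .*? in the pattern. That looks like an invalid regex to me. How is it different from plain .* here?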
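
To make the comments above concrete, here is a minimal sketch of the three behaviours they discuss: replacing without the g flag, replacing with the suggested pattern, and what a greedy .* matches compared with the lazy .*?. The shortened `html` string and the variable names are illustrative assumptions, not taken from the question.

```javascript
// Shortened, hypothetical input in the spirit of the pasted Word markup.
const html = '<div class="a b" paraid="1" style="margin:0">some text</div>';

// Without the g flag, replace() only touches the first match (class="a b").
const firstOnly = html.replace(/[a-zA-Z]*=".*?"/, '');

// With g (and i), every match is replaced; the \s* parts also consume
// the whitespace around each attribute, as suggested in the comments.
const all = html.replace(/\s*[a-z]*=["'].*?["']\s*/gi, '');

// .*? is valid: it is the lazy form of .* and stops at the first closing quote.
// The greedy form runs to the last quote, so one match swallows every attribute.
const greedyMatch = html.match(/[a-zA-Z]*=".*"/)[0];

console.log(firstOnly);   // <div  paraid="1" style="margin:0">some text</div>
console.log(all);         // <div>some text</div>
console.log(greedyMatch); // class="a b" paraid="1" style="margin:0"
```

Note that the suggested pattern does not pair opening and closing quotes, so a value such as title="it's" would be cut short, and it would also match text like a="b" that appears outside a tag.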

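Tags: javascript html regex

【Solution 1】:

A lot has been written about why parsing HTML with regular expressions can be risky; this famous StackOverflow question is a good example.

As @Polymer pointed out, your current regex will miss attributes that use single quotes, but there are other cases too: data attributes such as data-id="233" will not be handled correctly, and neither will attributes whose values are not quoted at all. There are probably more!

With this approach you end up playing catch-up: every time you meet a new combination in the HTML, you have to change your regex again.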
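
A small sketch of the cases described above, run against the question's pattern with the g flag added. The three sample strings are hypothetical; in particular, the unquoted example (width=100) is an assumption chosen for illustration.

```javascript
// The question's pattern, with the g flag from the comments.
const regex = /[a-zA-Z]*=".*?"/g;

// Hypothetical inputs, one per problem case.
const samples = [
  "<p class='intro'>single quotes</p>",         // single-quoted value
  '<p data-id="233">data attribute</p>',        // hyphenated attribute name
  '<p hidden width=100>unquoted / boolean</p>', // unquoted and valueless attributes
];

const results = samples.map((s) => s.replace(regex, ''));

for (const r of results) console.log(r);
// <p class='intro'>single quotes</p>           (nothing removed)
// <p data->data attribute</p>                  (only id="233" removed, "data-" left behind)
// <p hidden width=100>unquoted / boolean</p>   (nothing removed)
```

Each case could be patched with a more elaborate pattern, which is exactly the catch-up cycle described above.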

A safer approach may be to parse your string as HTML with DOMParser and extract the content from the parsed result:

let stringhtml = '<div class="Paragraph  BCX0 SCXW244271589" paraid="1364880375" paraeid="{8e523337-60c9-4b0d-8c73-fb1a70a2ba58}{165}" style="margin-bottom: 0px;margin-left:96px;padding:0px;user-select:text;-webkit-user-drag:none;-webkit-tap-highlight-color:transparent; overflow-wrap: break-word;">some text</div>'

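// Parse the string into a separate document; nothing is added to the live page.
let parser = new DOMParser();
let parsedResult = parser.parseFromString(stringhtml, 'text/html');

// Create a fresh element with the same tag name and no attributes.
let element = document.createElement(parsedResult.body.firstChild.tagName);

// Copy only the text; any nested tags inside the original element are dropped too.
element.innerText = parsedResult.documentElement.textContent;

console.log(element);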
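
The snippet above keeps only the text of the first element. If nested markup should survive, as the question's "preserve basic HTML" requirement suggests, one possible variation is to walk the parsed tree and remove the attributes in place. This is a sketch under that assumption; `stripAttributes` and `stripAttributesFromHtml` are hypothetical helper names, and the second one needs a browser environment because it relies on DOMParser.

```javascript
// Remove every attribute from every element below `root`.
function stripAttributes(root) {
  for (const el of root.querySelectorAll('*')) {
    // Copy the names first: `attributes` is a live collection
    // that shrinks while attributes are being removed.
    const names = Array.from(el.attributes, (attr) => attr.name);
    for (const name of names) {
      el.removeAttribute(name);
    }
  }
  return root;
}

// Browser-only wrapper: parse the string, strip attributes, serialize again.
function stripAttributesFromHtml(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return stripAttributes(doc.body).innerHTML;
}

// In a browser:
// stripAttributesFromHtml('<div class="a" style="x"><b id="k">some</b> text</div>')
// returns '<div><b>some</b> text</div>'
```

This removes attributes only. It is not a sanitizer: elements such as script are left in place, so untrusted input still needs separate handling.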

【Comments】: