【Question title】: Splitting strings with words with any kind of characters inside as whole words
【Posted】: 2022-07-06 15:36:45
【Question description】:

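I am trying to extract all terms from a text correctly. When a term appears inside a sentence and contains (), the string is not split and the regex does not find the term.

I am trying to split correctly on a match that contains (). So instead of this: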

["What is API(Application Programming Interface) and how to use it?"]

I am trying to get this:

["What is", "API(Application Programming Interface)", "and how to use it?"]

The JSON term is extracted correctly, and I get this:

["JSON", "is a Javascript Object Notation"]

That is exactly what I want, but for the API term I do not get this:

["What is", "API(Application Programming Interface)", "and how to use it?"]

Instead I get this, which is not what I want:

["What is API(Application Programming Interface) and how to use it?"]

function getAllTextNodes(element) {
    let node;
    let nodes = [];
    let walk = document.createTreeWalker(element,NodeFilter.SHOW_TEXT,null,false);
    while (node = walk.nextNode()) nodes.push(node);
    return nodes;
  }

const allNodes = getAllTextNodes(document.getElementById("body"))

const terms = [
    {id: 1, definition: 'API stands for Application programming Interface', expression: 'API(Application Programming Interface)'},
    {id: 2, definition: 'JSON stands for JavaScript Object Notation.', expression: 'JSON'}
]

const termMap = new Map(
      [...terms].sort((a, b) => b.expression.length - a.expression.length)
                .map(term => [term.expression.toLowerCase(), term])
    );

const regex = RegExp("\\b(" + Array.from(termMap.keys()).join("|") + ")\\b", "ig");

for (const node of allNodes) {
    const pieces = node.textContent.split(regex).filter(Boolean);
    console.log(pieces)
}
<div id="body">
    <p>API(Application Programming Interface)</p>
    <p>What is API(Application Programming Interface) and how to use it?</p>
    <p>JSON is a Javascript Object Notation</p>
</div>

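【Comments】:

  • What is the question/problem? What have you tried so far to solve it yourself? -> How do I ask a good question?
  • How do I ask a good question?: "Write a title that summarizes the specific problem"
  • @Andreas Sorry about that. I built the regex to match all terms inside #body and to split each node into an array. My only problem is how to split the sentence correctly when a term contains ().
  • Escape the terms in the regex. And you cannot use the \b word boundary if a term can have special characters at its start or end.

Tags: javascript regex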


【Solution 1】:

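Since your "words" can consist of non-word characters, you cannot rely on word boundaries. I suggest switching to either unambiguous word boundaries, (?<!\w) and (?!\w), or adaptive dynamic word boundaries.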
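To make the boundary problem concrete, here is a minimal sketch that tests one sentence from the question against two patterns. It isolates the \b issue by escaping the parentheses by hand; the variable names are only illustrative and are not part of the original code.

```javascript
// Sample sentence and term taken from the question.
const text = "What is API(Application Programming Interface) and how to use it?";
// Parentheses escaped by hand so that only the boundary behaviour differs.
const term = "API\\(Application Programming Interface\\)";

const withWordBoundaries = new RegExp("\\b(" + term + ")\\b", "i");
const withLookarounds = new RegExp("(?<!\\w)(" + term + ")(?!\\w)", "i");

// \b after ")" requires a word character on one side, but ")" and " " are both non-word.
const boundaryMatch = withWordBoundaries.test(text);
// (?!\w) only requires that no word character follows the term.
const lookaroundMatch = withLookarounds.test(text);

console.log(boundaryMatch, lookaroundMatch); // false true
```

The trailing \b is the part that fails here: between ")" and a space there is no transition from a word character to a non-word character, so no word boundary exists at that position.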

Also, you need to escape your terms before using them in a regex.
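As a rough illustration of the escaping step, the sketch below wraps the replace call in a helper. The name escapeRegExp is hypothetical; the character class is the one used in this answer's code.

```javascript
// Hypothetical helper: puts a backslash before every regex metacharacter.
function escapeRegExp(s) {
  return s.replace(/[-\/\\^$*+?.()|[\]{}]/g, "\\$&");
}

const raw = "API(Application Programming Interface)";
const escaped = escapeRegExp(raw);
console.log(escaped); // API\(Application Programming Interface\)

// Unescaped, "(" and ")" form a capturing group, so the pattern looks for
// "APIApplication Programming Interface" and misses the literal text.
const unescapedHit = new RegExp(raw).test(raw);
// Escaped, the parentheses are matched literally.
const escapedHit = new RegExp(escaped).test(raw);
console.log(unescapedHit, escapedHit); // false true
```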

See an example with adaptive word boundaries below:

function getAllTextNodes(element) {
    let node;
    let nodes = [];
    let walk = document.createTreeWalker(element,NodeFilter.SHOW_TEXT,null,false);
    while (node = walk.nextNode()) nodes.push(node);
    return nodes;
  }

const allNodes = getAllTextNodes(document.getElementById("body"))

const terms = [
    {id: 1, definition: 'API stands for Application programming Interface', expression: 'API(Application Programming Interface)'},
    {id: 2, definition: 'JSON stands for JavaScript Object Notation.', expression: 'JSON'}
]

const termMap = new Map(
      [...terms].sort((a, b) => b.expression.length - a.expression.length)
                .map(term => [term.expression.toLowerCase(), term])
    );

const regex = RegExp("(?!\\B\\w)(" + Array.from(termMap.keys()).map(x => x.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&')).join("|") + ")(?<!\\w\\B)", "ig");

for (const node of allNodes) {
    const pieces = node.textContent.split(regex).filter(Boolean);
    console.log(pieces)
}
<div id="body">
    <p>API(Application Programming Interface)</p>
    <p>What is API(Application Programming Interface) and how to use it?</p>
    <p>JSON is a Javascript Object Notation</p>
</div>

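The regex is now (?!\B\w)(api\(application programming interface\)|json)(?<!\w\B), where

  • (?!\B\w) - left-hand adaptive word boundary (a word boundary is required only if the next character is a word character; if it is a non-word character, no context check is performed)
  • (api\(application programming interface\)|json) - Group 1, which matches one of your terms (note the escaped special characters)
  • (?<!\w\B) - right-hand adaptive word boundary (a word boundary is required only if the previous character is a word character; if it is a non-word character, no context check is performed)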
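The pieces described above can be tried without a DOM. The sketch below builds the same pattern from the two term expressions and splits plain strings; splitOnTerms and the third sample string are assumptions added for illustration. Note that the pieces keep their surrounding spaces, because split() does not trim them.

```javascript
// Term expressions copied from the question.
const expressions = ["API(Application Programming Interface)", "JSON"];

// Longest first, so a longer term wins over a shorter one that is its prefix.
const alternation = [...expressions]
  .sort((a, b) => b.length - a.length)
  .map(x => x.replace(/[-\/\\^$*+?.()|[\]{}]/g, "\\$&"))
  .join("|");

const adaptive = new RegExp("(?!\\B\\w)(" + alternation + ")(?<!\\w\\B)", "ig");

// Hypothetical helper: split() keeps the captured term in the result,
// and filter(Boolean) drops the empty strings at the edges.
function splitOnTerms(text) {
  return text.split(adaptive).filter(Boolean);
}

console.log(splitOnTerms("What is API(Application Programming Interface) and how to use it?"));
// [ 'What is ', 'API(Application Programming Interface)', ' and how to use it?' ]
console.log(splitOnTerms("JSON is a Javascript Object Notation"));
// [ 'JSON', ' is a Javascript Object Notation' ]
console.log(splitOnTerms("JSONs are not split"));
// [ 'JSONs are not split' ]
```

The last call shows the right-hand check at work: "JSON" ends in a word character and is followed by "s", so the position is not a word boundary and the term is rejected.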
