【Question title】: Want to remove markup from the annotation - UIMA Ruta
【Posted】: 2016-09-05 16:47:24
【Question】:

I use the P tag (from the Html Annotator) as PASSAGE, and I want to ignore the markup inside the annotation.

Script:

//-------------------------------------------------------------------
// SPECIAL SQUARE HYPHEN PARENTHESIS
//-------------------------------------------------------------------
DECLARE LParen, RParen;
SPECIAL{REGEXP("[(]") -> MARK(LParen)};
SPECIAL{REGEXP("[)]") -> MARK(RParen)};

DECLARE LSQParen, RSQParen;
SPECIAL{REGEXP("[\\[]") -> MARK(LSQParen)};
SPECIAL{REGEXP("[\\]]") -> MARK(RSQParen)};

DECLARE LANGLEBRACKET,RANGLEBRACKET;
SPECIAL{REGEXP("<")->MARK(LANGLEBRACKET)};
AMP{REGEXP("&lt;")->MARK(LANGLEBRACKET)};
SPECIAL{REGEXP(">")->MARK(RANGLEBRACKET)};
AMP{REGEXP("&gt;")->MARK(RANGLEBRACKET)};

DECLARE LBracket,RBracket;

(LParen|LSQParen|LANGLEBRACKET){->MARK(LBracket)};
(RParen|RSQParen|RANGLEBRACKET){->MARK(RBracket)};


DECLARE PASSAGE,TESTPASSAGE;

       "<a name=\"para(.+?)\">(.*?)</a>"->2=PASSAGE;

 RETAINTYPE(WS); // or RETAINTYPE(SPACE, BREAK,...);
 PASSAGE{-> TRIM(WS)};
 RETAINTYPE;

  PASSAGE{->MARK(TESTPASSAGE)};



DECLARE TagContent,PassageFirstToken,InitialTag;
LBracket ANY+? RBracket{-PARTOF(TagContent)->MARK(TagContent,1,3)}; 


 BLOCK(foreach)PASSAGE{}
{
Document{->MARKFIRST(PassageFirstToken)};
}   
TagContent{CONTAINS(PassageFirstToken),-PARTOF(InitialTag)->MARK(InitialTag)};


BLOCK(foreach)PASSAGE{}
{
InitialTag  ANY+{->SHIFT(PASSAGE,2,2)};

}

Sample input:

<p class="Normal"><a name="para1"><h1><b>On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document. </b></a></p>

<p class="Normal"><a name="para2"><aus>On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document.</a></p>

<p class="Normal"><a name="para3">On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document.</a></p>

<p class="Normal"><a name="para4">On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document. </a></p>

<p class="Normal"><a name="para5">On the Insert tab, the <span>galleries</span> include items that are designed to coordinate with the overall look of your document.</a></p>

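The result is PASSAGE(5) and TESTPASSAGE(2). Why are there fewer TESTPASSAGE annotations? Also, InitialTag is not marked at all.

I have attached an image of the output annotations.

【Comments】:

  • Something like FILTERTYPE(P);?
  • Could you provide valid HTML so that the example is reproducible? The HtmlAnnotator throws an exception when it tries to parse the input.

Tags: uima ruta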


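【Solution 1】:

When reproducing the given example, I get 5 PASSAGE annotations and 3 TESTPASSAGE annotations (for the last three PASSAGE annotations). The other two PASSAGE annotations are not annotated with TESTPASSAGE because they start with a MARKUP annotation, which is invisible by default and therefore makes the complete annotation invisible. To avoid this problem, you can either make MARKUP visible or trim the markup from the PASSAGE annotations (is this actually the main question?). Just extend the rules around the TRIM action: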

RETAINTYPE(WS, MARKUP);
PASSAGE{-> TRIM(WS, MARKUP)};
RETAINTYPE;
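
The other option mentioned above, making MARKUP visible instead of trimming it, could be sketched as follows. This is a minimal sketch that assumes the PASSAGE and TESTPASSAGE declarations from the question's script; note that TESTPASSAGE would then still cover the leading and trailing tags.

```ruta
// Sketch: make MARKUP visible only while TESTPASSAGE is created,
// so that passages starting with a tag can be matched by the rule.
RETAINTYPE(MARKUP);
PASSAGE{-> MARK(TESTPASSAGE)};
// Reset the retained types to the default filtering afterwards.
RETAINTYPE;
```

Trimming is probably the better fit if the goal is to keep tags out of the annotation, since this variant only changes which annotations the rule can see.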

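There are no InitialTag annotations because there are no TagContent annotations, and there are no TagContent annotations because the example contains no LBracket annotations.

By the way, you could rewrite some of the rules: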
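
If the tags in the input are covered by MARKUP annotations, as the explanation above suggests, the bracket rules are not needed to find them. The following is only a sketch under that assumption: it reuses the TagContent type declared in the question's script and marks each MARKUP annotation directly.

```ruta
// Sketch (assumption: each HTML tag is covered by one MARKUP annotation).
// TagContent is already declared in the question's script.
RETAINTYPE(MARKUP);
MARKUP{-PARTOF(TagContent) -> MARK(TagContent)};
RETAINTYPE;
```

Whether this matches the intended notion of TagContent depends on the rest of the script, so treat it as a starting point rather than a drop-in replacement.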

PASSAGE{->MARKFIRST(PassageFirstToken)};

(LBracket # RBracket){-PARTOF(TagContent)-> TagContent}; 

DISCLAIMER: I am a developer of UIMA Ruta.

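    【Comments】:

    • I need to ignore the InitialTags in PASSAGE.
    • If I have span tags inside a PASSAGE, can I ignore the span tags in the output annotation? For example: some text <span>Hi</span> some text. Output: some text Hi some text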