【Question title】: How to extract a string from a paragraph?
【Posted】: 2013-12-07 05:58:37
【Question】:
A 24-year-old youth died on the spot, after his motorcycle
 rammed a divider near Golf market on <LOCATION>BelAir</LOCATION> road 
 Thursday night. The deceased has been identified as
 John(24) hailing from <LOCATION>UK</LOCATION>.

He was originally from <LOCATION>Usa</LOCATION>.

These sentences form two separate paragraphs. I would like the output to look like this:

Para 1:BelAir 
       UK

Para 2:Usa

I have worked out this regular expression for the tags:

<(?<tag>\w*)>(?<text>.*)</\k<tag>>

And this one for the paragraphs:

(\n|^).*?(?=\n|$)

Is there a way to combine these? Or should I use split instead?

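【Comments on the question】:

  • Is this embedded in some kind of HTML or other markup, or is it standalone?
  • It is not standalone. It is actually the output of Stanford's NER tagger.

Tags: java regex text-extraction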


【Solution 1】:

Try this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "A 24-year-old youth died on the spot, after his motorcycle " +
                "rammed a divider near Golf market on <LOCATION>BelAir</LOCATION> road" +
                " Thursday night. The deceased has been identified as  John(24) hailing from <LOCATION>UK</LOCATION>." +
                "\n He was originally from <LOCATION>Usa</LOCATION>.";
        String[] paras = str.split("\n"); // divide the string into two paragraphs
        // The lazy (.*?) stops at the first closing tag, so each tag pair is matched separately
        Pattern pattern = Pattern.compile("<LOCATION>(.*?)</LOCATION>");
        for (int i = 0; i < paras.length; i++) {
            System.out.print("Para " + (i + 1) + ": ");
            Matcher matcher = pattern.matcher(paras[i]);
            while (matcher.find()) {
                System.out.println(matcher.group(1));
            }
        }
    }
}

The output will be:

Para 1: BelAir
UK
Para 2: Usa
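
The pattern in this solution hard-codes the LOCATION tag. As a minimal sketch of a more general variant, the asker's named-group regex can be reused for any tag name, provided the greedy `.*` is changed to the lazy `.*?`. With the greedy form, two tags in the same paragraph would be captured as one long match. The class name `TagExtractor` and the method `extract` are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagExtractor {
    // Named groups: "tag" is the tag name, "text" is the enclosed text.
    // \k<tag> is a back-reference, so the closing tag must repeat the opening tag's name.
    // The lazy .*? stops at the first matching closing tag.
    private static final Pattern TAG =
            Pattern.compile("<(?<tag>\\w*)>(?<text>.*?)</\\k<tag>>");

    // Returns "TAG=text" for every tagged span in the paragraph, in order of appearance.
    public static List<String> extract(String paragraph) {
        List<String> found = new ArrayList<>();
        Matcher m = TAG.matcher(paragraph);
        while (m.find()) {
            found.add(m.group("tag") + "=" + m.group("text"));
        }
        return found;
    }

    public static void main(String[] args) {
        String para = "near Golf market on <LOCATION>BelAir</LOCATION> road. "
                + "John(24) hailing from <LOCATION>UK</LOCATION>.";
        System.out.println(extract(para)); // prints [LOCATION=BelAir, LOCATION=UK]
    }
}
```

Because the tag name is captured as well, the same pattern should also pick up other tag types the tagger emits, without any change to the regex.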

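【Comments】:

  • Thanks, I will try this.
  • The thing is, I can't insert a \n manually. I read the content from a text file, and I want to identify all the locations in each paragraph.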
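
When the text comes from a file, no manual \n is needed, because the line breaks are already part of the file's contents. The following is a minimal sketch, assuming that paragraphs are separated by one or more blank lines and that the whole file fits in memory. The class name `ParagraphLocations`, the method `locationsPerParagraph`, and the built-in sample text are hypothetical. A file path can be passed as the first command-line argument.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphLocations {
    // DOTALL lets a tagged span continue across a line break inside a paragraph.
    private static final Pattern LOCATION =
            Pattern.compile("<LOCATION>(.*?)</LOCATION>", Pattern.DOTALL);
    // Two or more consecutive line breaks (optionally followed by spaces or tabs)
    // are treated as a paragraph separator; a single line break is not.
    private static final Pattern PARAGRAPH_BREAK =
            Pattern.compile("(?:\\r?\\n[ \\t]*){2,}");

    // Returns one list of location names per paragraph, in document order.
    public static List<List<String>> locationsPerParagraph(String text) {
        List<List<String>> result = new ArrayList<>();
        for (String para : PARAGRAPH_BREAK.split(text.trim())) {
            List<String> locations = new ArrayList<>();
            Matcher m = LOCATION.matcher(para);
            while (m.find()) {
                locations.add(m.group(1));
            }
            result.add(locations);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the file path is given as the first argument; otherwise a sample is used.
        String text = args.length > 0
                ? new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8)
                : "rammed a divider on <LOCATION>BelAir</LOCATION> road\n"
                  + " John(24) hailing from <LOCATION>UK</LOCATION>.\n\n"
                  + "He was originally from <LOCATION>Usa</LOCATION>.\n";
        List<List<String>> all = locationsPerParagraph(text);
        for (int i = 0; i < all.size(); i++) {
            System.out.println("Para " + (i + 1) + ": " + String.join(", ", all.get(i)));
        }
    }
}
```

If the file uses a different paragraph convention, such as one paragraph per line, only `PARAGRAPH_BREAK` should need to change.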
【Solution 2】:

Check whether the string starts with '\n':

while (/* read the next line into `string` */) {
    if (!string.startsWith("\n")) {
        // apply your regular expression for the tags
        // and store each match in a list
    } else {
        // add a null to the list to mark a paragraph break
    }
}

So your list will look like this:

BelAir
UK
null
Usa

So a new paragraph starts after each null.
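
A minimal runnable sketch of this idea, assuming the input is read with `BufferedReader.readLine()`. Since `readLine()` strips the line terminator, a returned line never starts with '\n', so the sketch treats a blank line as the paragraph separator instead. The class name `NullSeparatedLocations` and the method `collect` are hypothetical, and the sample text stands in for the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NullSeparatedLocations {
    private static final Pattern LOCATION = Pattern.compile("<LOCATION>(.*?)</LOCATION>");

    // Collects all location names line by line; a null entry marks each blank line.
    public static List<String> collect(BufferedReader reader) throws IOException {
        List<String> result = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                result.add(null); // blank line: paragraph boundary
            } else {
                Matcher m = LOCATION.matcher(line);
                while (m.find()) {
                    result.add(m.group(1));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Sample input standing in for the file; a FileReader could be wrapped the same way.
        String text = "rammed a divider on <LOCATION>BelAir</LOCATION> road\n"
                + " John(24) hailing from <LOCATION>UK</LOCATION>.\n"
                + "\n"
                + "He was originally from <LOCATION>Usa</LOCATION>.\n";
        List<String> list = collect(new BufferedReader(new StringReader(text)));
        System.out.println(list); // prints [BelAir, UK, null, Usa]
    }
}
```

Two caveats apply to this sketch: several consecutive blank lines produce several null entries, and a tag whose text continues onto the next line is not found, because each line is matched on its own.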

【Comments】:
