Trying to extract content from a URL in Java: answers

【Question title】: Trying to extract content from url in java
【Posted】: 2014-02-26 13:58:21
【Question】:

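I am trying to extract the content of a web page from a URL. I have written the code, but I think I made a mistake in the regular expression part. When I run it, only the first line appears in the console. I am using NetBeans. The code I have so far: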

private static String text;

public static void main(String[] args) {
    URL u;
    InputStream is = null;
    DataInputStream dis;
    String s;

    try {
        u = new URL("http://ghr.nlm.nih.gov/gene/AKT1 ");
        is = u.openStream();
        dis = new DataInputStream(new BufferedInputStream(is));

        text = "";
        while ((s = dis.readLine()) != null) {
            text += s;
        }
    } catch (MalformedURLException mue) {
        System.out.println("Ouch - a MalformedURLException happened.");
        mue.printStackTrace();
        System.exit(1);
    } catch (IOException ioe) {
        System.out.println("Oops- an IOException happened.");
        ioe.printStackTrace();
        System.exit(1);
    } finally {
        String pattern = "(?i)(<P>)(.+?)";
        System.out.println(text.split(pattern)[1]);

        try {
            is.close();
        } catch (IOException ioe) {

        }
    }
}
}

【Question comments】:

Obligatory: parsing HTML with regular expressions is discouraged; use an HTML parsing API such as jsoup instead.
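
For readers wondering why the pattern in the question prints so little, here is a small sketch on an inline string. It relies only on `java.util.regex`; the class name `ParagraphRegexDemo`, its method names, and the sample HTML are hypothetical and chosen for illustration. The lazy `(.+?)` at the end of the question's pattern matches exactly one character, so `split` cuts the text at each `<p>` plus one following character, and index 1 holds only the piece up to the next `<p>`. A `Matcher.find()` loop with an explicit closing tag collects every paragraph body instead. This illustrates the regex behavior only; as the comment above says, an HTML parser remains the sturdier choice for real pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphRegexDemo {

    // Matches <p ...>body</p>. DOTALL lets '.' cross line breaks and
    // CASE_INSENSITIVE accepts <P> as well as <p>. The optional
    // (?:\s[^>]*)? group keeps tags such as <pre> from matching.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p(?:\\s[^>]*)?>(.*?)</p>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Reproduces the question's approach: split on the pattern, take index 1.
    static String firstSplitPiece(String html) {
        return html.split("(?i)(<P>)(.+?)")[1];
    }

    // Collects the body of every paragraph, in document order.
    static List<String> extractParagraphs(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher matcher = PARAGRAPH.matcher(html);
        while (matcher.find()) {
            bodies.add(matcher.group(1).trim());
        }
        return bodies;
    }

    public static void main(String[] args) {
        // Stand-in for the downloaded page (an assumption for this demo).
        String html = "<html><p>First</p><P>Second</P></html>";
        System.out.println(firstSplitPiece(html));   // prints: irst</p>
        System.out.println(extractParagraphs(html)); // prints: [First, Second]
    }
}
```

Note that the split result even loses the first letter of the paragraph, because that letter is consumed as part of the delimiter.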

Tags: java regex

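【Solution 1】:

Consider extracting the information from your web page with a dedicated HTML parsing API such as jsoup. A simple example that uses your URL and extracts every element with a <p> tag: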

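import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void main(String[] args) {
    try {
        // Fetch the page and parse it into a DOM
        Document doc = Jsoup.connect("http://ghr.nlm.nih.gov/gene/AKT1").get();
        // CSS selector: every <p> element in the document
        Elements els = doc.select("p");

        for (Element el : els) {
            // text() returns the element's text with the markup stripped
            System.out.println(el.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}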

Console output:

On this page:
The official name of this gene is “v-akt murine thymoma viral oncogene homolog 1.”
AKT1 is the gene's official symbol. The AKT1 gene is also known by other names, listed below.
Read more about gene names and symbols on the About page.
The AKT1 gene provides instructions for making a protein called AKT1 kinase. This protein is found in various cell types throughout the body, where it plays a critical role in many signaling pathways. For example, AKT1 kinase helps regulate cell growth and division (proliferation), the process by which cells mature to carry out specific functions (differentiation), and cell survival. AKT1 kinase also helps control apoptosis, which is the self-destruction of cells when they become damaged or are no longer needed.
...

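【Comments】:

I have put in this code and added the jsoup API, but I think there is an error.
@stella There shouldn't be one unless you are missing a dependency; it works for me as is. What specific error are you getting?
@popoFibo Thanks, it works fine now. But now the problem is that the text has to be saved in my database. How would I capture this text?
@stella Sure. I am only printing the data to the console; you could store el.text() in a Java collection (such as a List or a plain String array), then iterate over it and insert into your table, or accumulate it in a StringBuilder object.
@PopoFibo Like this? String content[]=e1.text();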
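
As a rough sketch of the suggestion in the comments above: `el.text()` returns a single `String`, so it cannot be assigned to a `String[]` directly; each call would instead be added to a collection. The example below uses only the standard library so it runs without jsoup. The class name `ParagraphCollector`, the method `collect`, and the sample strings are hypothetical and stand in for the values that `el.text()` would yield.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParagraphCollector {

    // Copies the non-blank paragraph texts into a list, preserving order.
    static List<String> collect(Iterable<String> paragraphTexts) {
        List<String> rows = new ArrayList<>();
        for (String text : paragraphTexts) {
            if (text != null && !text.trim().isEmpty()) {
                rows.add(text.trim());
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Stand-ins for the strings returned by el.text() (assumed data).
        List<String> scraped = Arrays.asList("On this page:", "  ", "AKT1 is the gene's official symbol.");
        List<String> rows = collect(scraped);

        // Each entry could later become one row in a database insert.
        for (String row : rows) {
            System.out.println(row);
        }
    }
}
```

With jsoup on the classpath, the loop in Solution 1 would call `rows.add(el.text())` where it currently calls `System.out.println(el.text())`.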

【Solution 2】:

The newline characters are lost during string concatenation.
After reading each line, append a newline character to text.

Change:

while ((s = dis.readLine()) != null) {
    text+=s;
}

To:

while ((s = dis.readLine()) != null) {
    text += s + "\n";
}

I suggest using StringBuilder rather than String to build the final text.

StringBuilder text = new StringBuilder( 1024 );
...
while ((s = dis.readLine()) != null) {
    text.append( s ).append( "\n" );
}

...
System.out.println( text.toString() );
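
A self-contained sketch of the same idea follows. It swaps in `BufferedReader`, since `DataInputStream.readLine()` has been deprecated since JDK 1.1; that substitution is mine and not part of the answer. The class name `ReadAllLines` and the method `readAll` are hypothetical, and the demo reads from a `StringReader` so that it runs offline. For a real page, the reader could wrap `new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)` instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadAllLines {

    // Reads every line from the source and appends '\n' after each one,
    // so line boundaries survive in the resulting text.
    static String readAll(Reader source) {
        StringBuilder text = new StringBuilder(1024);
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the downloaded page (an assumption for this demo).
        String page = "<html>\n<p>First</p>\n<p>Second</p>\n</html>";
        System.out.print(readAll(new StringReader(page)));
    }
}
```

Appending to a `StringBuilder` avoids creating a new `String` on every loop iteration, which is the cost of `text += s` on a large page.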

【Comments】: