How to search through a webpage to find a certain text

【Question title】: How to search through a webpage to find a certain text
【Posted】: 2014-05-22 02:34:11
【Question】:

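I am currently writing a Java program that works like a cookbook. I have everything built, but unfortunately I have no recipes.

I looked around and found http://allrecipes.com/. I checked the page source and found the lines that contain the ingredients, the directions, and the nutrition facts.

I remembered using grep in the terminal, and I quickly found that lynx is useful as well. Here is what I have so far (for sample pages).

Get the 100 lines that follow the first mention of the ingredients: lynx -dump "http://allrecipes.com/Recipe/Potato-Crunchy-Tenders/" | grep -n -A 100 "Ingredients"

Get the line number of "Ingredients": lynx -dump "http://allrecipes.com/Recipe/Beef-Tips-and-Noodles/" | grep -n "Ingredients" | cut -f1 -d: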

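I tried a few examples and found that the ingredient list starts 6 lines after the "Ingredients" line, with a new ingredient on every other line, like this:

135:Ingredients [66]Edit and Save
136-
137- Original recipe makes 6 servings [67]Change Servings
138- Makes 6___________________ servings (*) US ( ) Metric [68]Adjust Recipe
139- ([69]Help)
140- * [ ]
141- 1/2 cup vegetable oil for frying
142- * [ ]
143- 1 1/2 cups milk
144- * [ ]
145- 1 egg
146- * [ ]
147- 1 (7.6 ounce) package garlic-flavored instant mashed potatoes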
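The offset rule described above (first ingredient 6 lines after the "Ingredients" header, then one ingredient on every other line) could be sketched in Java roughly as follows. This is only an illustration, not the asker's code: the class name `LynxDumpIngredients`, the method `extract`, and the sample lines are hypothetical, the input is assumed to be raw `lynx -dump` output without the `grep -n` prefixes, and the sketch assumes that a `* [ ]` checkbox line precedes every ingredient, which it uses as the stop condition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LynxDumpIngredients {

    /** Lines between the "Ingredients" header and the first ingredient (observed in the question). */
    private static final int FIRST_INGREDIENT_OFFSET = 6;

    /** Returns the ingredient lines found after the first line that mentions "Ingredients". */
    public static List<String> extract(List<String> dumpLines) {
        List<String> ingredients = new ArrayList<>();

        // Find the first line that contains "Ingredients", like grep would.
        int header = -1;
        for (int i = 0; i < dumpLines.size(); i++) {
            if (dumpLines.get(i).contains("Ingredients")) {
                header = i;
                break;
            }
        }
        if (header < 0) {
            return ingredients; // no ingredient section on this page
        }

        // Step through every other line, starting 6 lines after the header.
        for (int i = header + FIRST_INGREDIENT_OFFSET; i < dumpLines.size(); i += 2) {
            // Assumption: each ingredient is preceded by a "* [ ]" checkbox line.
            // Stop as soon as that pattern no longer holds.
            if (!dumpLines.get(i - 1).trim().equals("* [ ]")) {
                break;
            }
            ingredients.add(dumpLines.get(i).trim());
        }
        return ingredients;
    }

    public static void main(String[] args) {
        // Hypothetical sample shaped like the dump shown in the question.
        List<String> sample = Arrays.asList(
                "Ingredients [66]Edit and Save",
                "",
                " Original recipe makes 6 servings [67]Change Servings",
                " Makes 6____ servings (*) US ( ) Metric [68]Adjust Recipe",
                " ([69]Help)",
                "   * [ ]",
                "   1/2 cup vegetable oil for frying",
                "   * [ ]",
                "   1 1/2 cups milk",
                "Directions");
        for (String ingredient : extract(sample)) {
            System.out.println(ingredient);
        }
    }
}
```

A fixed line offset like this breaks as soon as the page layout changes, which is one reason parsing the HTML itself tends to be more robust.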

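My goal is to somehow get the ingredients into a text file so that I can parse them with Java (which I am comfortable doing). I would like to do the same for the recipe itself.

That way, I can automate this for many recipes and don't have to collect them all by hand.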
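For the "one text file per recipe" idea, a minimal sketch using only the standard library might look like this. The class name `IngredientFiles`, the method names, and the file naming scheme (`<recipeName>.txt`, one ingredient per line, UTF-8) are assumptions made for illustration and are not part of the original post.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class IngredientFiles {

    /** Writes one ingredient per line to dir/recipeName.txt and returns the file path. */
    public static Path save(Path dir, String recipeName, List<String> ingredients) {
        try {
            Files.createDirectories(dir); // no-op if the directory already exists
            Path file = dir.resolve(recipeName + ".txt");
            // Default options create the file or truncate an existing one.
            return Files.write(file, ingredients, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the ingredient lines back in the order they were written. */
    public static List<String> load(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical output directory and recipe name.
        Path file = save(Paths.get("recipes"), "potato-crunchy-tenders",
                Arrays.asList("1/2 cup vegetable oil for frying", "1 1/2 cups milk", "1 egg"));
        for (String line : load(file)) {
            System.out.println(line);
        }
    }
}
```

Looping over a list of recipe URLs and calling `save` once per recipe would cover the "many recipes" case.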

Is there a simpler way to do this in Java?

Cheers.

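【Comments】:

You probably want to parse the site's HTML, and tools such as jsoup can help with that. If this were my project, I would search for it on Google, download it, and give it a try.

Tags: java terminal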

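【Solution 1】:

Thanks to Hovercraft Full Of Eels, I looked into JSoup, and it works well.

I worked out as much as I could tonight, and this is the code I came up with.

To get the ListOfIngredients (which extends ArrayList<Ingredient>):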

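import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Despite its name, the html parameter is the URL of the recipe page.
public static ListOfIngredients getListOfIngredients(final String html) {

    ListOfIngredients tmp = new ListOfIngredients();
    try {
        // Download and parse the page.
        Element body = Jsoup.connect(html).get().body();

        // Each ingredient is marked up with itemprop="ingredients".
        for (Element elem : body.getElementsByAttributeValue("itemprop", "ingredients")) {
            // The amount is optional, so it stays null when the element is missing.
            Element amountElem = elem.getElementsByClass("ingredient-amount").first();
            String amount = (amountElem != null) ? amountElem.text() : null;

            // Skip entries without a name instead of catching a NullPointerException.
            Element nameElem = elem.getElementsByClass("ingredient-name").first();
            if (nameElem == null) {
                continue;
            }
            String ingredient = nameElem.text();
            // Ignore placeholder entries that contain only a non-breaking space.
            if (!ingredient.equals("\u00a0")) {
                tmp.add(new Ingredient(amount, ingredient));
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

    return tmp;
}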

To get the Instructions (which extends ArrayList<String>):

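// Requires org.jsoup.Jsoup, org.jsoup.nodes.Element and java.io.IOException.
// Despite its name, the html parameter is the URL of the recipe page.
public static Instructions getInstructions(final String html) {

    Instructions instr = new Instructions();
    try {
        Element body = Jsoup.connect(html).get().body();
        Element elem = body.getElementsByAttributeValue("itemprop", "recipeInstructions").first();
        // Guard against pages that have no instructions block.
        if (elem != null) {
            // Each step of the recipe is a list item.
            for (Element e : elem.getElementsByTag("li")) {
                instr.add(e.text());
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

    return instr;
}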
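
The answer does not show the `Ingredient`, `ListOfIngredients`, and `Instructions` types that both methods rely on. A minimal sketch of what they might look like is given below. Only the `extends ArrayList<…>` relationships and the `(amount, ingredient)` constructor order come from the answer. The field names, getters, `toString`, and the wrapper class `RecipeTypes` (used only so the sketch fits in one file) are assumptions.

```java
import java.util.ArrayList;

public class RecipeTypes {

    /** One ingredient; amount may be null when the page lists none. */
    public static class Ingredient {
        private final String amount;
        private final String name;

        public Ingredient(String amount, String name) {
            this.amount = amount;
            this.name = name;
        }

        public String getAmount() {
            return amount;
        }

        public String getName() {
            return name;
        }

        @Override
        public String toString() {
            // Leave out the amount when it is missing.
            return (amount == null) ? name : amount + " " + name;
        }
    }

    /** As described in the answer: a list of Ingredient objects. */
    public static class ListOfIngredients extends ArrayList<Ingredient> {
    }

    /** As described in the answer: a list of instruction steps. */
    public static class Instructions extends ArrayList<String> {
    }

    public static void main(String[] args) {
        ListOfIngredients ingredients = new ListOfIngredients();
        ingredients.add(new Ingredient("1 1/2 cups", "milk"));
        ingredients.add(new Ingredient(null, "salt to taste"));
        for (Ingredient ingredient : ingredients) {
            System.out.println(ingredient);
        }
    }
}
```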
