【Question title】: Using JSoup to parse a String with Clojure
【Posted】: 2021-01-06 07:37:07
【Question description】:

I am using JSoup from Clojure to parse an HTML string. The setup and source code are shown below.

Dependencies

:dependencies [[org.clojure/clojure "1.10.1"]
               [org.jsoup/jsoup "1.13.1"]]

Source code

(require '[clojure.string :as str])
(import '(org.jsoup Jsoup))
(def HTML (str "<html><head><title>Website title</title></head>
                <body><p>Sample paragraph number 1 </p>
                      <p>Sample paragraph number 2</p>
                </body></html>"))

(defn fetch_html [html]
  (let [soup (Jsoup/parse html)
        titles (.title soup)
        paragraphs (.getElementsByTag soup "p")]
    {:title titles :paragraph paragraphs}))

(fetch_html HTML)

Expected result

{:title "Website title", 
 :paragraph ["Sample paragraph number 1" 
             "Sample paragraph number 2"]}

Unfortunately, the actual result is not what I expected:

user ==> (fetch_html HTML)
{:title "Website title", :paragraph []}

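【Comments】:

  • Did you try passing "p" instead of "a" to the getElementsByTag method?
  • Could you be more specific about the versions you are using, etc.? Your code works for me. There is no need to use str (nor the import), but that should not affect the result.
  • @cfrick :dependencies [[org.clojure/clojure "1.10.1"] [org.jsoup/jsoup "1.13.1"]], thanks

Tags: java web-scraping clojure jsoup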


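【Solution 1】:

(.getElementsByTag ...) returns a collection of elements (an org.jsoup.select.Elements), not their text. You need to call the .text() method on each element to get its text value. I am using Jsoup version 1.13.1.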


(ns core
  (:import (org.jsoup Jsoup))
  (:require [clojure.string :as str]))

(def HTML (str "<html><head><title>Website title</title></head>
                <body><p>Sample paragraph number 1 </p>
                      <p>Sample paragraph number 2</p>
                </body></html>"))

(defn fetch_html [html]
  (let [soup (Jsoup/parse html)
        titles (.title soup)
        paragraphs (.getElementsByTag soup "p")]
    {:title titles :paragraph (mapv #(.text %) paragraphs)}))

(fetch_html HTML)

You could also consider using Reaver, a Clojure library that wraps JSoup, or any of the other wrappers that others have suggested.
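
As a variation on the mapv call above, here is a minimal sketch that lets Jsoup collect the text itself. It assumes the same Jsoup 1.13.1 dependency; the namespace example.jsoup-text and the function name fetch-html are illustrative choices, not part of the original answer. Note that Elements.eachText skips elements that have no text, whereas mapping .text over the elements would yield an empty string for them.

```clojure
(ns example.jsoup-text
  (:import (org.jsoup Jsoup)))

;; Same sample document as in the question.
(def html
  "<html><head><title>Website title</title></head>
   <body><p>Sample paragraph number 1 </p>
         <p>Sample paragraph number 2</p>
   </body></html>")

(defn fetch-html
  "Returns the page title and the text of every <p> element."
  [html]
  (let [doc (Jsoup/parse html)]
    {:title     (.title doc)
     ;; .select takes a CSS selector and returns an Elements collection;
     ;; .eachText returns a java.util.List with the text of each match,
     ;; which vec turns into a Clojure vector.
     :paragraph (vec (.eachText (.select doc "p")))}))

(fetch-html html)
;; => {:title "Website title",
;;     :paragraph ["Sample paragraph number 1" "Sample paragraph number 2"]}
```

The trailing space in the first paragraph disappears because Jsoup normalizes and trims whitespace when it extracts text.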

【Comments】:

    【Solution 2】:

    I have a Clojure wrapper for TagSoup that may be useful. Try running it in this template project. To use it in your own project, add the following line:

    [tupelo "21.01.05"]
    

    to the :dependencies in your project.clj.


    Code example:

    (ns tst.demo.core
      (:use demo.core tupelo.core tupelo.test)
      (:require
        [tupelo.parse.tagsoup :as tagsoup]
        ))
    
    (dotest
      (let [html "<html>
                    <head><title>Website title</title></head>
                    <body><p>Sample paragraph number 1 </p>
                          <p>Sample paragraph number 2</p>
                    </body></html>"]
        (is= (tagsoup/parse html)
          {:tag     :html,
           :attrs   {},
           :content [{:tag     :head,
                      :attrs   {},
                      :content [{:tag :title, :attrs {}, :content ["Website title"]}]}
                     {:tag     :body,
                      :attrs   {},
                      :content [{:tag :p, :attrs {}, :content ["Sample paragraph number 1 "]}
                                {:tag :p, :attrs {}, :content ["Sample paragraph number 2"]}]}]})))
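
To get from this Enlive-style tree to the map the question asks for, one option is to walk the tree with plain Clojure. The sketch below uses only clojure.core and clojure.string; the tree literal is copied from the expected value in the test above, and the names nodes-with-tag, node-text, and page-summary are hypothetical helpers, not part of tupelo.

```clojure
(ns example.enlive-tree
  (:require [clojure.string :as str]))

;; Enlive-style tree, copied from the expected value in the test above.
(def parsed
  {:tag     :html,
   :attrs   {},
   :content [{:tag     :head,
              :attrs   {},
              :content [{:tag :title, :attrs {}, :content ["Website title"]}]}
             {:tag     :body,
              :attrs   {},
              :content [{:tag :p, :attrs {}, :content ["Sample paragraph number 1 "]}
                        {:tag :p, :attrs {}, :content ["Sample paragraph number 2"]}]}]})

(defn nodes-with-tag
  "Returns every map node in tree whose :tag equals tag, in document order."
  [tree tag]
  (->> (tree-seq map? :content tree)
       (filter #(and (map? %) (= tag (:tag %))))))

(defn node-text
  "Joins the direct string children of node and trims surrounding whitespace."
  [node]
  (str/trim (apply str (filter string? (:content node)))))

(defn page-summary
  "Builds the {:title ... :paragraph [...]} map from a parsed tree."
  [tree]
  {:title     (node-text (first (nodes-with-tag tree :title)))
   :paragraph (mapv node-text (nodes-with-tag tree :p))})

(page-summary parsed)
;; => {:title "Website title",
;;     :paragraph ["Sample paragraph number 1" "Sample paragraph number 2"]}
```

node-text only looks at direct string children, so markup nested inside a paragraph (for example a <b> element) would need a recursive variant.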
    

    Details

    If you look at the source code, you will quickly see why you would want to use a wrapper function!

    (ns tupelo.parse.tagsoup
      (:use tupelo.core)
      (:require
        [schema.core :as s]
        [tupelo.parse.xml :as xml]
        [tupelo.string :as ts]
        [tupelo.schema :as tsk]))
    
    (s/defn ^:private tagsoup-parse-fn
      [input-source :- org.xml.sax.InputSource
       content-handler]
      (doto (org.ccil.cowan.tagsoup.Parser.)
        (.setFeature "http://www.ccil.org/~cowan/tagsoup/features/default-attributes" false)
        (.setFeature "http://www.ccil.org/~cowan/tagsoup/features/cdata-elements" true)
        (.setFeature "http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace" true)
        (.setContentHandler content-handler)
        (.setProperty "http://www.ccil.org/~cowan/tagsoup/properties/auto-detector"
          (proxy [org.ccil.cowan.tagsoup.AutoDetector] []
            (autoDetectingReader [^java.io.InputStream is]
              (java.io.InputStreamReader. is "UTF-8"))))
        (.setProperty "http://xml.org/sax/properties/lexical-handler" content-handler)
        (.parse input-source)))
    
    ; #todo make use string input:  (ts/string->stream html-str)
    (s/defn parse-raw :- tsk/KeyMap
      "Loads and parse an HTML resource and closes the input-stream."
      [html-str :- s/Str]
      (xml/parse-raw-streaming
        (org.xml.sax.InputSource.
          (ts/string->stream html-str))
        tagsoup-parse-fn))
    
    ; #todo make use string input:  (ts/string->stream html-str)
    (s/defn parse :- tsk/KeyMap
      "Loads and parse an HTML resource and closes the input-stream."
      [html-str :- s/Str]
      (xml/enlive-remove-whitespace
        (xml/enlive-normalize
          (parse-raw
            html-str))))
    

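    【Comments】:

    • I will try embedding it in my code. Thank you for your answer.
    • Please see the update. Don't embed it; just use the library by referencing it in project.clj.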