【Question title】: How to improve text processing performance in Clojure?
【Posted】: 2013-04-21 20:11:53
【Question】:

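I am writing a simple desktop search engine in Clojure as a way to learn more about the language. So far, my program performs really poorly during the text processing phase.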

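During text processing I have to:

  • Clean up unwanted characters;
  • Convert the string to lowercase;
  • Split the document to get a list of words;
  • Build a map that associates each word with its occurrences in the document.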

Here is the code:

(ns txt-processing.core
  (:require [clojure.java.io :as cjio])
  (:require [clojure.string :as cjstr])
  (:gen-class))

(defn all-files [path]
  (let [entries (file-seq (cjio/file path))]
    (filter (memfn isFile) entries)))

(def char-val
  (let [value #(Character/getNumericValue %)]
    {:a (value \a) :z (value \z)
     :A (value \A) :Z (value \Z)
     :0 (value \0) :9 (value \9)}))

(defn is-ascii-alpha-num [c]
  (let [n (Character/getNumericValue c)]
    (or (and (>= n (char-val :a)) (<= n (char-val :z)))
        (and (>= n (char-val :A)) (<= n (char-val :Z)))
        (and (>= n (char-val :0)) (<= n (char-val :9))))))

(defn is-valid [c]
    (or (is-ascii-alpha-num c)
        (Character/isSpaceChar c)
        (.equals (str \newline) (str c))))

(defn lower-and-replace [c]
  (if (.equals (str \newline) (str c)) \space (Character/toLowerCase c)))

(defn tokenize [content]
  (let [filtered (filter is-valid content)
        lowered (map lower-and-replace filtered)]
    (cjstr/split (apply str lowered) #"\s+")))

(defn process-content [content]
  (let [words (tokenize content)]
    (loop [ws words i 0 hmap (hash-map)]
      (if (empty? ws)
        hmap
        (recur (rest ws) (+ i 1) (update-in hmap [(first ws)] #(conj % i)))))))

(defn -main [& args]
  (doseq [file (all-files (first args))]
    (let [content (slurp file)
          oc-list (process-content content)]
      (println "File:" (.getPath file)
               "| Words to be indexed:" (count oc-list )))))

Since I have another implementation of this problem in Haskell, I compared the two, as you can see in the following outputs.

Clojure version:

$ lein uberjar
Compiling txt-processing.core
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT.jar
Including txt-processing-0.1.0-SNAPSHOT.jar
Including clojure-1.5.1.jar
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT-standalone.jar
$ time java -jar target/txt-processing-0.1.0-SNAPSHOT-standalone.jar ../data
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418
File: ../data/.directory | Words to be indexed: 3
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642

real    2m2.164s
user    2m3.868s
sys     0m0.978s

Haskell version:

$ ghc -rtsopts --make txt-processing.hs 
[1 of 1] Compiling Main             ( txt-processing.hs, txt-processing.o )
Linking txt-processing ...
$ time ./txt-processing ../data/ +RTS -K12m
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418
File: ../data/.directory | Words to be indexed: 3
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642

real    0m9.086s
user    0m8.591s
sys     0m0.463s

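I think the (string -> lazy sequence) conversion in the Clojure implementation is killing the performance. How can I improve it?

P.S.: All the code and data used in these tests can be downloaded here.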

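【Comments】:

  • What is the JVM startup time for the jar? Does Haskell have a similar overhead?
  • The JVM startup time on my machine is about one second. I think Haskell has some overhead due to the runtime system (RTS), but I should be considerably lower than the JVM's.
  • s/I should/it should/
  • Use (inc i) instead of (+ i 1)
  • @luisgabriel: could you post the final version and the results?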

Tags: clojure text-processing lazy-sequences


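【Solution 1】:

Some things you could do that would probably speed up this code:

1) Instead of mapping chars to char-val, just do direct value comparisons between characters. This is faster for the same reason it would be faster in Java.

2) You repeatedly use str to convert single-character values into full-fledged strings. Again, consider using the character values directly. Object creation is slow, just as in Java.

3) You should replace process-content with clojure.core/frequencies. Perhaps inspect the frequencies source to see how it is faster.

4) If you must update a (hash-map) in a loop, use transient. See: http://clojuredocs.org/clojure_core/clojure.core/transient

Also note that (hash-map) returns a PersistentArrayMap, so each call to update-in creates a new instance, which is slow and is why you should use transients.

5) This is your friend: (set! *warn-on-reflection* true). You have quite a bit of reflection that could benefit from type hints: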

 Reflection warning, scratch.clj:10:13 - call to isFile can't be resolved.
 Reflection warning, scratch.clj:13:16 - call to getNumericValue can't be resolved.
 Reflection warning, scratch.clj:19:11 - call to getNumericValue can't be resolved.
 Reflection warning, scratch.clj:26:9 - call to isSpaceChar can't be resolved.
 Reflection warning, scratch.clj:30:47 - call to toLowerCase can't be resolved.
 Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved.
 Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved.
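
As an illustration of points 1, 2 and 5, the character helpers from the question could be rewritten roughly as follows. This is only a sketch: the function names are made up for the example, and it checks strictly ASCII ranges by code point, which is not identical to what Character/getNumericValue accepts in the original code.

```clojure
;; Print a warning for every interop call the compiler cannot resolve statically.
(set! *warn-on-reflection* true)

(defn ascii-alpha-num?
  "True when c is an ASCII letter or digit (hypothetical helper name).
  Compares code points directly; no lookup map and no str calls."
  [^Character c]
  (let [n (int c)]
    (or (<= (int \a) n (int \z))
        (<= (int \A) n (int \Z))
        (<= (int \0) n (int \9)))))

(defn valid-char?
  "Keeps ASCII letters, digits, space characters and newlines."
  [^Character c]
  (or (ascii-alpha-num? c)
      ;; (char c) yields a primitive char, so the static call resolves without reflection
      (Character/isSpaceChar (char c))
      (= c \newline)))

(defn lower-or-space
  "Turns newlines into spaces and lower-cases everything else."
  [^Character c]
  (if (= c \newline)
    \space
    (Character/toLowerCase (char c))))
```

The same idea applies to the File calls listed in the warnings, for example by hinting the argument as ^java.io.File before calling .isFile or .getPath.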

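【Comments】:

  • Perfect! Just by adding type hints the total time dropped to 7s! ;)
  • I can't use clojure.core/frequencies because I need the word positions for further phases such as indexing and querying.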
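
When positions are needed, the transient advice from point 4 can still be applied with a reduce shaped like the one inside frequencies. The sketch below is an assumption about how that might look, not the asker's final code; index-positions is a hypothetical name, and it stores positions in ascending order in a vector, whereas the original loop conj-es onto nil and therefore builds lists in reverse order.

```clojure
(defn index-positions
  "Maps each word to a vector of the positions at which it occurs in words."
  [words]
  (persistent!
    (reduce (fn [m [i w]]
              ;; assoc! mutates the transient map in place instead of
              ;; allocating a new persistent map for every word
              (assoc! m w (conj (get m w []) i)))
            (transient {})
            (map-indexed vector words))))

(index-positions ["a" "b" "a"])
;; => {"a" [0 2], "b" [1]}
```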
【Solution 2】:

For comparison, here is a regexp-based Clojure version:

(defn re-index
     "Returns lazy sequence of vectors of regexp matches and their start index" 
     [^java.util.regex.Pattern re s]
     (let [m (re-matcher re s)]
       ((fn step []
          (when (. m (find))
            (cons (vector (re-groups m)(.start m)) (lazy-seq (step))))))))

(defn group-by-keep
  "Returns a map of the elements of coll keyed by the result of
  f on each element. The value at each key will be a vector of the
  results of r on the corresponding elements."
  [f r coll]  
  (persistent!
    (reduce
      (fn [ret x]
        (let [k (f x)]
          (assoc! ret k (conj (get ret k []) (r x)))))
      (transient {}) coll)))

(defn word-indexed
  [s]
  (group-by-keep
    (comp clojure.string/lower-case first)
    second
    (re-index #"\w+" s)))
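
Note that .start on a java.util.regex.Matcher reports the character offset of a match, so the positions collected above are character offsets rather than word counts. A small standalone sketch of that behaviour (word-offsets is a hypothetical name used only for this example):

```clojure
(defn word-offsets
  "Returns a vector of [match start-offset] pairs for each run of word characters in s."
  [s]
  (let [m (re-matcher #"\w+" s)]
    (loop [acc []]
      (if (.find m)
        (recur (conj acc [(.group m) (.start m)]))
        acc))))

(word-offsets "The cat")
;; => [["The" 0] ["cat" 4]]
```

If word positions are wanted instead, the matches could be numbered with map-indexed before grouping.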

【Comments】:
