【Question title】: Is there a function in the included Clojure libraries to split a string around another string?
【Posted】: 2023-03-05 17:29:01
【Question description】:

I know that clojure.string has a split function, which returns a sequence of the parts of the string, excluding the given pattern.

(require '[clojure.string :as str-utils])
(str-utils/split "Yes, hello, this is dog yes hello it is me" #"hello")
;; -> ["Yes, " ", this is dog yes " " it is me"]

However, I'm trying to find a function that keeps the tokens as elements of the returned vector. So it would look like

(split-around "Yes, hello, this is dog yes hello it is me" #"hello")
;; -> ["Yes, " "hello" ", this is dog yes " "hello" " it is me"]

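Is there a function in any of the included libraries that does this? Any in external libraries? I've been trying to write it myself but haven't been able to figure it out.

【Comments】:

Isn't the missing word implied between any two adjacent items of the returned array? What are you trying to do?
Good point! This is the best answer.
@Shlomi Yes, it's implicit, but I need the split strings in the returned vec. In this case, since the regex being split on is just a word, yes. But say the regex were \[\[.*?\]\]. In that case there may well be things like [[hello]] and [[yes]], and I need to know what the matched text is and where it is in the string.

Tags: string split clojure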

【Solution 1】:

(require '[clojure.string :as str])

(-> "Yes, hello, this is dog yes hello it is me"
    (str/replace #"hello" "~hello~")
    (str/split #"~"))

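【Comments】:

If the searched string is right at the beginning of the input, this adds an extra empty string.
Yes, leetwinski did it better.
Fails when the input contains ~.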
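
Both caveats from the comments can be reproduced directly. A minimal sketch, assuming the same "~" marker as the answer; the helper name `split-with-marker` and the sample inputs are made up for illustration:

```clojure
(require '[clojure.string :as str])

;; Same approach as the answer: wrap each match in "~", then split on "~".
;; `split-with-marker` is a hypothetical name, not part of any library.
(defn split-with-marker [s]
  (-> s
      (str/replace #"hello" "~hello~")
      (str/split #"~")))

;; A match at the very start of the input leaves a leading empty string.
(split-with-marker "hello world")
;; => ["" "hello" " world"]

;; A "~" already present in the input is treated as a split point too.
(split-with-marker "a~b hello")
;; => ["a" "b " "hello"]
```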

【Solution 2】:

An example using @Shlomi's solution:

(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require [clojure.string :as str]))

(dotest
  (let [input-str "Yes, hello, this is dog yes hello it is me"
        segments  (mapv str/trim
                    (str/split input-str #"hello"))
        result    (interpose "hello" segments)]
    (is= segments ["Yes," ", this is dog yes" "it is me"])
    (is= result ["Yes," "hello" ", this is dog yes" "hello" "it is me"])))

Update

It's better to write a custom loop for this use case. For example:

(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require
    [clojure.string :as str] ))

(defn strseg
  "Will segment a string like '<a><tgt><b><tgt><c>' at each occurrence of `tgt`, producing
   an output vector like [ <a> <tgt> <b> <tgt> <c> ]."
  [tgt source]
  (let [tgt-len  (count tgt)
        segments (loop [result []
                        src    source]
                   (if (empty? src)
                     result
                     (let [i (str/index-of src tgt)]
                       (if (nil? i)
                         (let [result-next (into result [src])
                               src-next    nil]
                           (recur result-next src-next))
                         (let [pre-tgt     (subs src 0 i)
                               result-next (into result [pre-tgt tgt])
                               src-next    (subs src (+ tgt-len i))]
                           (recur result-next src-next))))))
        result   (vec
                   (remove (fn [s] (or (nil? s)
                                     (empty? s)))
                     segments))]
    result))

With unit tests

(dotest
  (is= (strseg "hello" "Yes, hello, this is dog yes hello it is me")
    ["Yes, " "hello" ", this is dog yes " "hello" " it is me"] )
  (is= (strseg "hello" "hello")
    ["hello"])
  (is= (strseg "hello" "") [])
  (is= (strseg "hello" nil) [])
  (is= (strseg "hello" "hellohello") ["hello" "hello" ])
  (is= (strseg "hello" "abchellodefhelloxyz") ["abc" "hello" "def" "hello" "xyz" ])
  )

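【Comments】:

In this case, you won't be able to distinguish between an empty string and "hello".
strseg does not terminate when tgt is empty and source is not.
Good point! The user should add error checking for this and throw an exception.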
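
One way the error check suggested above might look. This is only a sketch: `strseg-checked` is a hypothetical name, and it scans with a start index (the three-argument `clojure.string/index-of`) rather than repeatedly taking substrings, so it is a variant of the answer's loop and not the author's code:

```clojure
(require '[clojure.string :as str])

(defn strseg-checked
  "Hypothetical variant of strseg: same output, but rejects an empty `tgt`
   instead of looping forever."
  [tgt source]
  (when (empty? tgt)
    (throw (IllegalArgumentException. "tgt must be a non-empty string")))
  (let [src     (or source "")          ; treat nil like the empty string
        tgt-len (count tgt)]
    (loop [result []
           start  0]
      (if-let [i (str/index-of src tgt start)]
        ;; Found: keep the non-empty text before the match, then the match.
        (recur (cond-> result
                 (< start i) (conj (subs src start i))
                 true        (conj tgt))
               (+ i tgt-len))
        ;; Not found: keep whatever non-empty tail is left.
        (cond-> result
          (< start (count src)) (conj (subs src start)))))))

(strseg-checked "hello" "abchellodefhelloxyz")
;; => ["abc" "hello" "def" "hello" "xyz"]
```

Because `tgt-len` is at least 1 after the check, `start` strictly increases on every iteration, so the loop always terminates.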

【Solution 3】:

You can also use regex lookahead/lookbehind for this:

user> (clojure.string/split "Yes, hello, this is dog yes hello it is me" #"(?<=hello)|(?=hello)")
;;=> ["Yes, " "hello" ", this is dog yes " "hello" " it is me"]

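You can read it as "split with a zero-length string at the points where the word before or after is 'hello'".

Note that it also omits the dangling empty strings for adjacent patterns and for leading/trailing patterns: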

user> (clojure.string/split "helloYes, hello, this is dog yes hellohello it is mehello" #"(?<=hello)|(?=hello)")
;;=> ["hello"
;;    "Yes, "
;;    "hello"
;;    ", this is dog yes "
;;    "hello"
;;    "hello"
;;    " it is me"
;;    "hello"]

You can wrap it in a function like this, for example:

(defn split-around [source word]
  (let [word (java.util.regex.Pattern/quote word)]
    (->> (format "(?<=%s)|(?=%s)" word word)       
         re-pattern
         (clojure.string/split source))))

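【Comments】:

I think this is the best answer.
If the word being split on contains regex metacharacters, your split-around will fail. If you want to generalize beyond "hello", use Pattern/quote to quote the metacharacters.
@amalloy Fair enough.
This doesn't always produce the correct output. E.g. (split-around "hahaha" "haha") yields ["ha" "ha" "ha"], but based on the behavior of split it should yield ["" "haha" "ha"] or ["haha" "ha"].
Also note that using both lookahead and lookbehind makes this solution slightly less efficient. For example, most occurrences of the separator will be recognized twice.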
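
The quoting and the overlap issue from the comments can both be checked at the REPL. The sketch below repeats the answer's split-around so that it runs on its own; the inputs "1+1=2" and "+" are made up for illustration:

```clojure
(require '[clojure.string :as str])

;; split-around as given in the answer, with Pattern/quote applied.
(defn split-around [source word]
  (let [word (java.util.regex.Pattern/quote word)]
    (->> (format "(?<=%s)|(?=%s)" word word)
         re-pattern
         (str/split source))))

;; Quoting lets a metacharacter such as "+" be used as a literal word;
;; unquoted, "+" would be read as a regex quantifier.
(split-around "1+1=2" "+")
;; => ["1" "+" "1=2"]

;; Overlapping occurrences: the lookarounds fire at offsets 0, 2, 4 and 6,
;; so "haha" never comes back as a single element.
(split-around "hahaha" "haha")
;; => ["ha" "ha" "ha"]
```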

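【Solution 4】:

Here is another solution that avoids the repeated pattern and the double-recognition problem in leetwinski's answer (see my comments there), and that also computes the parts as lazily as possible: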

(defn partition-str [s sep]
  (->> s
       (re-seq
         (->> sep
              java.util.regex.Pattern/quote ; remove this to treat sep as a regex
              (format "((?s).*?)(?:(%s)|\\z)")
              re-pattern))
       (mapcat rest)
       (take-while some?)
       (remove empty?))) ; remove this to keep empty parts

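However, it does not behave correctly/intuitively when the separator is, or matches, the empty string.

Another approach is to use re-seq and split with the same pattern and interleave the resulting sequences, as shown in this related question. Unfortunately, that way each occurrence of the separator is recognized twice.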
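
The interleaving idea might be sketched as follows. The related question's code is not reproduced here, so this is a guess at the approach rather than that answer's implementation; `interleave-split` is a hypothetical name, and the sketch assumes a regex without capture groups whose matches are never empty:

```clojure
(require '[clojure.string :as str])

(defn interleave-split
  "Hypothetical helper: parts from split interleaved with matches from re-seq.
   Assumes `re` has no capture groups and never matches the empty string."
  [re s]
  (let [parts (str/split s re -1)   ; limit -1 keeps trailing empty strings
        seps  (re-seq re s)]        ; second pass over `s` with the same regex
    ;; There is always one more part than there are separators.
    (cons (first parts)
          (interleave seps (rest parts)))))

(interleave-split #"hello" "Yes, hello, this is dog yes hello it is me")
;; => ("Yes, " "hello" ", this is dog yes " "hello" " it is me")

(remove empty? (interleave-split #"hello" "hellohello"))
;; => ("hello" "hello")
```

The two passes over the input (one in `split`, one in `re-seq`) are exactly the double recognition mentioned above.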

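Perhaps a better approach is to build on a more primitive basis, using re-matcher and re-find.

Finally, to answer the original question more directly: AFAIK there is no such function in Clojure's standard library or in any external library. Also, I don't know of any simple and completely problem-free solution to this problem (especially with a regex separator).

Update

Here is the best solution I can think of right now. It works at a lower level, lazily, with a regex separator: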

(defn re-partition [re s]
  (let [mr (re-matcher re s)]
    ((fn rec [i]
       (lazy-seq
         (if-let [m (re-find mr)]
           (list* (subs s i (.start mr)) m (rec (.end mr)))
           (list (subs s i)))))
     0)))

(def re-partition+ (comp (partial remove empty?) re-partition))

Note that we can (re)define:

(def re-split (comp (partial take-nth 2) re-partition))

(def re-seq (comp (partial take-nth 2) rest re-partition))
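
A short usage sketch for these definitions. It repeats re-partition and re-partition+ from above so that it runs on its own; the sample inputs are made up, apart from the \[\[.*?\]\] regex from the question's comments and the "hahaha" case raised under Solution 3:

```clojure
;; Definitions repeated from the answer above.
(defn re-partition [re s]
  (let [mr (re-matcher re s)]
    ((fn rec [i]
       (lazy-seq
         (if-let [m (re-find mr)]
           (list* (subs s i (.start mr)) m (rec (.end mr)))
           (list (subs s i)))))
     0)))

(def re-partition+ (comp (partial remove empty?) re-partition))

;; Non-matching parts sit at even positions, matches at odd positions.
(re-partition #"\[\[.*?\]\]" "a [[hello]] b [[yes]]")
;; => ("a " "[[hello]]" " b " "[[yes]]" "")

;; re-partition+ drops the empty parts.
(re-partition+ #"\[\[.*?\]\]" "a [[hello]] b [[yes]]")
;; => ("a " "[[hello]]" " b " "[[yes]]")

;; The overlapping case from the comments on Solution 3.
(re-partition #"haha" "hahaha")
;; => ("" "haha" "ha")
```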

【Comments】: