【Question title】: Dictionary expansion in R
【Posted】: 2025-12-10 19:15:01
【Question】:

I am looking for a fast, efficient way to expand a dictionary (df1):

                pattern cat1 cat2
1         I want [food]    a    b
2 I'm [amplifier] [pos]    c    d

df1 <- data.frame(pattern=c("I want [food]", "I'm [amplifier] [pos]"),
                      cat1=c("a", "c"), cat2=c("b", "d"), stringsAsFactors=FALSE)

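df1 holds string patterns, some of which contain categories enclosed in square brackets []. Each bracketed name refers to a category listed, in dictionary format, in a second data frame (df2):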

     pattern  category
1      pizza      food
2    hot dog      food
3      chips      food
4       very amplifier
5  very much amplifier
6      happy       pos
7 optimistic       pos

df2 <- structure(list(pattern = c("pizza", "hot dog", "chips", "very", 
"very much", "happy", "optimistic"), category = c("food", "food", 
"food", "amplifier", "amplifier", "pos", "pos")), .Names = c("pattern", 
"category"), row.names = c(NA, -7L), class = "data.frame")

I would like to create an expanded data.frame that takes df1 and expands it with df2, so that it looks like this:

                   pattern cat1 cat2
1             I want pizza    a    b
2           I want hot dog    a    b
3             I want chips    a    b
4           I'm very happy    c    d
5      I'm very much happy    c    d
6      I'm very optimistic    c    d
7 I'm very much optimistic    c    d

output <- structure(list(pattern = c("I want pizza", "I want hot dog", "I want chips", 
"I'm very happy", "I'm very much happy", "I'm very optimistic", 
"I'm very much optimistic"), cat1 = c("a", "a", "a", "c", "c", 
"c", "c"), cat2 = c("b", "b", "b", "d", "d", "d", "d")), .Names = c("pattern", 
"cat1", "cat2"), row.names = c(NA, -7L), class = "data.frame")

【Question comments】:

    Tags: regex r dictionary data.table


    【Solution 1】:

    Here is what I would do:

    library(stringi)
    library(data.table)
    setDT(df1)
    setDT(df2)
    
    capture_patt = "\\[(\\w+)\\]"
    df1[, {
        cats = stri_match_all(pattern, regex = capture_patt)[[1]][, 2]
        new_patt = gsub(capture_patt, "%s", pattern)
    
        subs = do.call(CJ, lapply(cats, function(cat) 
          df2[.(category = cat), on="category", pattern]
        ))
    
        .(res = do.call(sprintf, c(.(fmt = new_patt), subs)))
    }, by=names(df1)]
    
    
    #                  pattern cat1 cat2                      res
    # 1:         I want [food]    a    b             I want chips
    # 2:         I want [food]    a    b           I want hot dog
    # 3:         I want [food]    a    b             I want pizza
    # 4: I'm [amplifier] [pos]    c    d           I'm very happy
    # 5: I'm [amplifier] [pos]    c    d      I'm very optimistic
    # 6: I'm [amplifier] [pos]    c    d      I'm very much happy
    # 7: I'm [amplifier] [pos]    c    d I'm very much optimistic
    

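    How it works

    The objects are:

    • cats holds the categories we need to look up
    • new_patt is the sprintf-ready version of the pattern
    • subs is the table of substitutions that have to be made
    • res is the new column

    The trickier functions are:

    • CJ takes the Cartesian product, like expand.grid in MrFlick's answer
    • do.call(f, list_o_args) calls f with the elements of the list as its arguments.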
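
    To make the roles of these pieces concrete, here is a minimal base R sketch of the same idea for a single pattern. It uses expand.grid instead of CJ so that it runs without data.table; the names template, choices, fmt, combos and expanded are illustrative only and do not appear in the answer above. It assumes the entries of choices are listed in the same order as the bracketed categories in the template.

```r
# Base R sketch: expand one template with every combination of substitutions.
# `template` and `choices` are hypothetical names used only for illustration.
template <- "I'm [amplifier] [pos]"
choices <- list(amplifier = c("very", "very much"),
                pos = c("happy", "optimistic"))

# Turn each [category] into a %s placeholder, as new_patt does above
fmt <- gsub("\\[(\\w+)\\]", "%s", template)

# Cartesian product of the substitutions (the base R counterpart of CJ);
# assumes `choices` is ordered like the placeholders in `template`
combos <- expand.grid(choices, stringsAsFactors = FALSE)

# do.call spreads the list elements out as separate arguments, so this is
# equivalent to sprintf(fmt, combos$amplifier, combos$pos)
expanded <- do.call(sprintf, c(list(fmt), unname(as.list(combos))))
print(expanded)
```

    Note that expand.grid varies its first column fastest, whereas CJ returns a sorted table, so the rows may come out in a different order than in the data.table version.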

    【Comments】:

    • Thanks! This is really helpful!
    【Solution 2】:

    This is certainly quite inefficient, and there are a lot of steps involved.

    # first, find all '[value]' objects
    m.pos <- gregexpr("\\[[^]]+\\]", df1$pattern)
    m.val <- regmatches(df1$pattern,m.pos)
    
    # now we process each row separately
    do.call("rbind", lapply(seq_along(df1$pattern), function(i) {
        # find the values for that row
        tokens <- gsub("(^\\[)|(\\]$)", "", m.val[[i]])
        # get all possible token combinations
        rep.vals <- do.call("expand.grid", list(Map(function(x) df2$pattern[df2$category==x], tokens), stringsAsFactors = FALSE))
        # now do the replacement for each combination
        inreplace <- function(...) {a<-df1$pattern[i]; regmatches(a, m.pos[i]) <- list(c(...)); return(a)}
        ext.vals<-do.call("mapply", c(list(inreplace), rep.vals))
        # merge replaced values with existing columns
        data.frame(pattern = ext.vals, df1[i,-1], row.names=NULL)
    }))
    

    We rbind together all the different data.frames that we created for each row.
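
    The two building blocks that do most of the work here, the replacement form of regmatches and do.call("rbind", ...), can be tried in isolation. The following is a small sketch on a single string; the variable names x, m, y, pieces and combined are made up for this illustration.

```r
# Base R sketch of the two building blocks, on one string.
x <- "I'm [amplifier] [pos]"
m <- gregexpr("\\[[^]]+\\]", x)   # positions of every [token] in x

print(regmatches(x, m))           # a list with one character vector per input string

# Replace the matches in order; the value is a list mirroring that structure
y <- x
regmatches(y, m) <- list(c("very", "happy"))
print(y)

# Stack a list of per-row data frames into a single data frame
pieces <- list(
  data.frame(pattern = "I want pizza", cat1 = "a", stringsAsFactors = FALSE),
  data.frame(pattern = y, cat1 = "c", stringsAsFactors = FALSE)
)
combined <- do.call("rbind", pieces)
print(combined)
```

    In the answer's code the same replacement is applied once per combination of tokens via mapply, and the resulting per-row data frames are stacked in the same way.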
