【Question title】: Matching and replacing strings from a list with another list in R
【Posted】: 2019-01-12 08:32:24
【Question】:

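I have two lists of strings and want to search a column of text, replacing each item from the first list with the corresponding item from the second. The second list is the same as the first, except that its items include HTML formatting tags.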

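I wrote a small function that tries to grep for each item in the first list while substituting the other, but it does not work well. I also tried str_replace, to no avail.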

top_attribute_names<- c("Item Number \\(DPCI\\)", "UPC", "TCIN", "Product Form", "Health Facts", 
"Beauty Purpose", "Package Quantity", "Features", "Suggested Age", 
"Scent")

top_attributes_html<-ifelse(nchar(top_attribute_names)<30,paste("<b>",top_attribute_names,"</b>",sep=""),top_attribute_names) # List adding bold HTML tags for all strings with under 30 char

clean_free_description<-
c("Give your feathered friends a cozy new home with the Ceramic and Wood Birdhouse from Threshold. This simple birdhouse features a natural color scheme that helps it blend in with the tree you hang it from. The ceramic top is easy to remove when you want to clean out the birdhouse, while the small round hole lets birds in and keeps predators out. Sprinkle some seeds inside and watch your bird buddies become more permanent residents of your backyard.\nMaterial: Ceramic, Wood\nDimensions (Overall): 7.7 inches (H) x 8.5 inches (W) x 8.5 inches (L)\nWeight: 2.42 pounds\nAssembly Details: No assembly requiredpets subtype: Bird houses\nProtective Qualities: Weather-resistant\nMount Type: Hanging\nTCIN: 52754553\nUPC: 490840935721\nItem Number (DPCI): 084-09-3572\nOrigin: Imported\n", 
"House your parakeets in style with this Victorian-style bird cage. Featuring multiple colors and faux brickwork, the cage serves as a charming addition to your dcor. It's also equipped with two perches and feeding dishes, making it instantly functional.\nMaterial: Steel, Plastic\nDimensions (Overall): 21.5 inches (H) x 16.0 inches (W) x 16.0 inches (L)\nWeight: 15.0 pounds\nMaterial: Metal (Frame)\nIntended Pet Type: Bird\nIncludes: Feeding Dish, perch\nAssembly Details: Assembly required, no tools needed\nPets subtype: Bird cages\nBreed size: Small (0-25 pounds)\nSustainability Claims: Recyclable\nWarranty: 90 day limited warranty. To obtain a copy of the manufacturer's warranty for this item, please call Target Guest Services at 1-800-591-3869.\nWarranty Information:To obtain a copy of the manufacturer's warranty for this item, please call Target Guest Services at 1-800-591-3869.\nVictorian-style parakeet cage with 2 perches\nFeatures a molded base, a single front door and faux plastic brickwork\nMade of wire and plastic; 5/8\" spacing\nWash with soap and water18\nLx25.5\nHx18\nW\"TCIN: 10159211\nUPC: 048081002940\nItem Number (DPCI): 083-01-0167\n", 
"The Cockatiel Scalloped Top Bird Cage Kit is an ideal starter kit for cockatiels and other medium sized birds. Designer white scalloped style cage features large front door, easy to clean pull out tray, food and water dishes, wooden perches and swing. To help welcome and pamper your new bird, this starter kit also includes perch covers, kabob bird toy, cuttlebone, flavored mineral treat and a cement perch. Easy to assemble.\nMaterial: Metal\nDimensions (Overall): 27.25 inches (H) x 14.0 inches (W) x 18.25 inches (L)\nWeight: 11.0 pounds\nMaterial: Metal (Frame)\nIntended Pet Type: Bird\nPets subtype: Bird cages\nBreed size: All sizes\nTCIN: 16707833\nUPC: 030172016240\nItem Number (DPCI): 083-01-0248\n")

for(i in top_attribute_names){
  clean_free_description[grepl(i, clean_free_description)] <- top_attributes_html[i]
}

In theory, I thought I could also do this with str_replace:

clean_free_description<-str_replace(clean_free_description,top_attribute_names,top_attributes_html)

However, this produces an error:

In stri_replace_first_regex(string, pattern, fix_replacement(replacement), : longer object length is not a multiple of shorter object length

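Of course, I am sure there is a better solution that adds the HTML tags by matching the strings in a regex and adding a text wrapper, which would eliminate a step. Unfortunately, I am not yet good enough with regex to have figured that out.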

【Comments on the question】:

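  • I think a different structure might serve you better. Unless you need all the information for each item in a single string, I think this makes sense as a nested list, where each item in the list has its own attributes such as item number, package quantity, and so on. You could split the strings to build such a structure. Feel free to ignore this, though, since my suggestion departs from your specific question.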

Tags: r


【Solution 1】:

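You could try stringi::stri_replace_all, as shown below. Because of its length I have not printed the full output here, but I provide a shorter example to demonstrate the basic functionality, which I hope is what you are looking for.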

Update: I added a benchmark of the stringi and stringr solutions, which makes clear why I did not stick with your original code and introduced stringi here instead.

stringi::stri_replace_all_regex(c("a", "b", "c"), c("b", "c"), c("x", "y"), vectorize_all = FALSE)
#[1] "a" "x" "y"

stringi::stri_replace_all_regex(clean_free_description, top_attribute_names, top_attributes_html, vectorize_all = FALSE)
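
To make the effect of vectorize_all = FALSE more concrete, here is a rough base R sketch of the same idea: every pattern/replacement pair is applied in turn to every string. It also shows where the loop in the question goes wrong, since that loop indexes the unnamed replacement vector by the pattern string (which yields NA) and overwrites the whole element instead of substituting inside it. The helper name replace_each and the shortened inputs are made up for illustration, and the replacements are written without regex escapes, which is an assumption that differs slightly from the question's data.

```r
# Hypothetical miniature inputs, not the full data from the question
patterns     <- c("Item Number \\(DPCI\\)", "UPC", "TCIN")
replacements <- c("<b>Item Number (DPCI)</b>", "<b>UPC</b>", "<b>TCIN</b>")
descriptions <- c(
  "Mount Type: Hanging\nTCIN: 52754553\nUPC: 490840935721\nItem Number (DPCI): 084-09-3572\n",
  "Breed size: All sizes\nTCIN: 16707833\n"
)

# Illustrative helper (not part of any package): apply each pair to all strings
replace_each <- function(text, patterns, replacements) {
  stopifnot(length(patterns) == length(replacements))
  for (k in seq_along(patterns)) {
    # Index by position, and substitute within the string rather than overwrite it
    text <- gsub(patterns[k], replacements[k], text)
  }
  text
}

result <- replace_each(descriptions, patterns, replacements)
cat(result, sep = "\n")
```

Because the pairs are applied one after another, a later pattern can also match text introduced by an earlier replacement, so the order of the pairs may matter for other inputs.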

library(stringr)
library(stringi)

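# stringr: a named vector maps each pattern (name) to its replacement (value)
f_stringr <- function() {
  names(top_attributes_html) <- top_attribute_names
  str_replace_all(clean_free_description, top_attributes_html)
}

# stringi: vectorize_all = FALSE applies every pattern to every string
f_stringi <- function() {
  stri_replace_all_regex(clean_free_description, top_attribute_names, top_attributes_html, vectorize_all = FALSE)
}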

all.equal(f_stringr(), f_stringi())
# TRUE

microbenchmark::microbenchmark(
   f_stringr(), 
   f_stringi()
)
# Unit: microseconds
#        expr     min      lq      mean   median       uq      max neval
# f_stringr() 937.129 956.274 1041.7329 1053.579 1076.276 1296.743   100
# f_stringi() 122.767 128.491  136.6937  137.372  142.899  245.138   100

【Comments】:

    【Solution 2】:

    I think this should do what you need:

    library(stringr)
    names(top_attributes_html) <- top_attribute_names
    str_replace_all(clean_free_description, top_attributes_html)
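
The question also asks whether the tags could be added by the regex itself, without building a second list. One possible sketch in base R is to join the labels into a single alternation and wrap whatever matched using a backreference. The names labels, pattern, text and tagged are illustrative, the labels are assumed to be stored as plain text rather than pre-escaped regex, and the length filter only approximates the rule used in the question.

```r
# Plain, unescaped labels (hypothetical subset of the attribute names in the question)
labels <- c("Item Number (DPCI)", "UPC", "TCIN")

# Keep only short labels, similar to the 30-character rule in the question
labels <- labels[nchar(labels) < 30]

# \Q...\E makes PCRE treat each label literally, so "(" and ")" need no manual escaping
pattern <- paste0("(", paste0("\\Q", labels, "\\E", collapse = "|"), ")")

text <- "TCIN: 52754553\nUPC: 490840935721\nItem Number (DPCI): 084-09-3572\n"

# \\1 in the replacement stands for whichever label matched
tagged <- gsub(pattern, "<b>\\1</b>", text, perl = TRUE)
cat(tagged)
```

This does the work in a single pass over each string, so text inserted by one replacement is not scanned again by another pattern. The \Q...\E quoting would break for a label that itself contains \E, which is not the case for the names shown here.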
    

    【Comments】:
