Delete characters at positions within a string in R? Answers

【Question title】: Delete characters at positions within a string in R?
【Posted】: 2012-08-16 08:09:57
【Question】:

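I am looking for a way to delete the characters at certain positions within a string in R. For example, given the string "1,2,1,1,2,1,1,1,1,2,1,1", I want to delete the characters at positions 3, 4, 7 and 8. That operation would produce the string "1,1,2,1,1,1,1,2,1,1".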
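
To pin down the intended semantics on the small example above, here is a minimal sketch (not part of the original question) that drops characters by position via their code points. The helper name `drop_chars` is made up for illustration, and it has not been timed on the million-character strings described here.

```r
# Hypothetical helper: remove the characters at the given 1-based positions.
drop_chars <- function(s, pos) {
  if (length(pos) == 0) return(s)   # negative indexing with an empty vector would drop everything
  codes <- utf8ToInt(s)             # one integer code point per character
  intToUtf8(codes[-pos])            # drop the positions, reassemble into one string
}

s <- "1,2,1,1,2,1,1,1,1,2,1,1"
drop_chars(s, c(3, 4, 7, 8))
# [1] "1,1,2,1,1,1,1,2,1,1"
```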

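Unfortunately, using strsplit to break the string into a list is not an option, because the strings I am working with are over 1 million characters long. Given that I have roughly 2,500 strings, that would take quite a while.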

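Alternatively, finding a way to replace the characters with the empty string "" would achieve the same thing, I think. Following that line of thought, I came across this StackOverflow post: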

R: How can I replace let's say the 5th element within a string?

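Unfortunately, the suggested solution is hard to generalize efficiently. For a list of 2,000 positions to delete, it takes about 60 seconds per input string: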

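# Build the result by appending the chunk between consecutive deleted positions.
# Assumes pos is sorted in increasing order.
subchar2 = function(inputstring, pos){
    string = ""
    memory = 0   # last deleted position seen so far
    for(num in pos){
        string = paste(string, substr(inputstring, (memory+1), (num-1)), sep = "")
        memory = num
    }
    # append the tail after the last deleted position
    string = paste(string, substr(inputstring, (memory+1), nchar(inputstring)), sep = "")
    return(string)
}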

Looking through that question, I found a code snippet that appears to replace the characters at given positions with "-":

subchar <- function(string, pos) {
        for(i in pos) {
            string <- gsub(paste("^(.{", i-1, "}).", sep=""), "\\1-", string)
        }
        return(string)
}

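I don't know regular expressions very well (yet), but I strongly suspect that something along these lines would be much faster than the first solution. Unfortunately, this subchar function seems to break when the values in pos get large: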

> test = subchar(data[1], 257)
Error in gsub(paste("^(.{", i - 1, "}).", sep = ""), "\\1-", string) :
invalid regular expression '^(.{256}).', reason 'Invalid contents of {}'
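
The error is consistent with a cap on the repetition count in R's default regex engine (TRE), which, to my knowledge, rejects bounds above 255. A hedged workaround sketch follows, assuming the PCRE engine selected by `perl = TRUE` accepts larger counts. PCRE's own limit is believed to be 65535, so this would still not reach positions deep inside a million-character string. The name `subchar_perl` is hypothetical and not from the linked post.

```r
# Sketch only: same idea as subchar(), but using PCRE and sub() instead of gsub().
subchar_perl <- function(string, pos) {
  for (i in pos) {
    # match the first i-1 characters, then the character at position i
    pattern <- paste0("^(.{", i - 1, "}).")
    string <- sub(pattern, "\\1-", string, perl = TRUE)
  }
  string
}

x <- paste(rep("a", 300), collapse = "")
y <- subchar_perl(x, 257)
substr(y, 257, 257)
# [1] "-"
```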

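I was also considering using SQL to read the string data into a table, but I was hoping for an elegant string-based solution. An SQL implementation for doing this in R seems rather complicated.

Any ideas? Thanks!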

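【Comments】:

Where do the strings come from? It might be easier to preprocess the data with something other than R.
The strings come from an .RData file, but I could quickly write them to a text file, which opens up the options. Any language suggestions?
Are they always numbers separated by commas? If so, wouldn't it be easier to convert each one to a vector, subset it, and convert it back to character? Then you could use numeric indexing to remove elements.
Yes, they are always comma-separated. Unfortunately, as stated in the post, strsplit() takes a long time on strings of more than a million characters. Is there a fast way to convert to a vector?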

Tags: string r character

【Solution 1】:

A quick fix is to take the paste out of the for loop:

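# Collect the kept chunks in a character vector and paste them together once at the end.
# Assumes pos is sorted in increasing order.
subchar3<-function(inputstring, pos){
    string = ""
    memory = 0   # last deleted position seen so far
    for(num in pos){
        string = c(string, substr(inputstring, (memory+1), (num-1)))
        memory = num
    }
    # add the tail, then join all chunks in a single paste() call
    string = paste(c(string, substr(inputstring, (memory+1), nchar(inputstring))), collapse = "")
    return(string)
}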
data<-paste(sample(letters,100000,replace=T),collapse='')
remove<-sample(1:nchar(data),200)
remove<-remove[order(remove)]
s2<-subchar2(data,remove)
s3<-subchar3(data,remove)
identical(s2,s3)
#[1] TRUE

> library(rbenchmark)
> benchmark(subchar2(data,remove),subchar3(data,remove),replications=10)
                    test replications elapsed relative user.self sys.self
1 subchar2(data, remove)           10   43.64 40.78505     39.97      1.9
2 subchar3(data, remove)           10    1.07  1.00000      1.01      0.0
  user.child sys.child
1         NA        NA
2         NA        NA
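
The speedup is plausibly because paste() inside the loop copies the ever-growing result string on every iteration, whereas collecting the pieces and pasting once does not. Taking that one step further, substring() is vectorised over its start and stop arguments, so the explicit loop can be dropped. A minimal sketch under that assumption follows; `subchar_vec` is a name invented here and it was not part of the benchmark above.

```r
# Hypothetical loop-free variant: cut out the kept chunks in one substring() call.
subchar_vec <- function(inputstring, pos) {
  pos <- sort(unique(pos))                  # chunk boundaries must be increasing
  starts <- c(1, pos + 1)                   # each kept chunk starts right after a deleted position
  ends <- c(pos - 1, nchar(inputstring))    # and ends right before the next one
  paste(substring(inputstring, starts, ends), collapse = "")
}

subchar_vec("1,2,1,1,2,1,1,1,1,2,1,1", c(3, 4, 7, 8))
# [1] "1,1,2,1,1,1,1,2,1,1"
```

Adjacent positions simply yield empty chunks (start greater than stop), which vanish when the pieces are joined.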

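【Comments】:

Wow, I just implemented this and so far it works great! Thanks. I find the size of the improvement quite surprising.

【Solution 2】:

Read them with scan(). You can set the separator to "," and what="a". You can scan one "line" at a time with nlines=1, and if the input is a textConnection, the "pipe" will "remember" where the last read stopped.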

x <- paste( sample(0:1, 1000, rep=T), sep=",")
xin <- textConnection(x)

x995 <- scan(xin, sep=",", what="a", nmax=995)
# Read 995 items
x5 <- scan(xin, sep=",", what="a", nmax=995)
# Read 5 items

Here is an illustration with 5 "lines":

> x <- paste( rep( paste(sample(0:1, 50, rep=T), collapse=","),  5),  collapse="\n")
> str(x)
 chr "1,0,0,0,0,1,0,0,1,1,1,0,1,1,0,0,0,1,1,1,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,1,1,1,1,1,0,0,0,1,0,0\n1,0,0,0,0,1,0,0,1,1,1,0,1,"| __truncated__
> xin <- textConnection(x)
> x1 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x2 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x3 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x4 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x5 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x6 <- scan(xin, sep=",", what="a", nlines=1)
Read 0 items
> length(x1)
[1] 50
> length(x1[-c(3,4,7,8)])
[1] 46
> paste(x1, collapse=",")
[1] "1,0,0,0,0,1,0,0,1,1,1,0,1,1,0,0,0,1,1,1,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,1,1,1,1,1,0,0,0,1,0,0"
>
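
Putting the pieces together for a single string: scan() also accepts its input through the `text` argument, so the split, drop, and re-join steps might look like the sketch below. Note that the indices here count comma-separated fields, not characters. Removing characters 3, 4, 7 and 8 of the question's example corresponds to removing fields 2 and 4.

```r
s <- "1,2,1,1,2,1,1,1,1,2,1,1"

# Split on commas into a character vector (quiet = TRUE suppresses the "Read N items" message)
fields <- scan(text = s, sep = ",", what = "", quiet = TRUE)

# Drop fields by index, then glue the rest back together
kept <- fields[-c(2, 4)]
result <- paste(kept, collapse = ",")
result
# [1] "1,1,2,1,1,1,1,2,1,1"
```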

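【Comments】:

【Solution 3】:

strsplit is more than ten times faster if you use fixed = TRUE. Extrapolating roughly from the timings below (about 0.5 seconds for 10 strings), processing 2,500 strings of 1,000,000 comma-separated integers each would take a little over 2 minutes.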

N <- 1000000
x <- sample(0:1, N, replace = TRUE)
s <- paste(x, collapse = ",")

# this is a vector of 10 strings
M <- 10
S <- rep(s, M)

system.time(y <- strsplit(S, split = ","))
# user  system elapsed 
# 6.57    0.00    6.56 
system.time(y <- strsplit(S, split = ",", fixed = TRUE))
# user  system elapsed 
# 0.46    0.03    0.50

This is almost 3 times faster than using scan:

system.time(scan(textConnection(S), sep=",", what="a"))
# Read 10000000 items
# user  system elapsed 
# 1.21    0.09    1.42
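
For completeness, a sketch of the whole round trip with strsplit(fixed = TRUE), applied to a vector of strings. The function name `drop_fields` and the use of vapply() are illustrative assumptions, not taken from the answer. As with any split-based approach, `idx` indexes comma-separated fields rather than characters.

```r
# Hypothetical helper: drop the fields at positions idx from every string in `strings`.
drop_fields <- function(strings, idx) {
  if (length(idx) == 0) return(strings)     # nothing to drop
  parts <- strsplit(strings, split = ",", fixed = TRUE)
  # remove the unwanted fields from each split string and re-join with commas
  vapply(parts, function(p) paste(p[-idx], collapse = ","), character(1))
}

S <- rep("1,2,1,1,2,1,1,1,1,2,1,1", 3)
unique(drop_fields(S, c(2, 4)))
# [1] "1,1,2,1,1,1,1,2,1,1"
```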

【Comments】:

Fastest solution. Happy data processing in R now!
The whole data processing run just finished. It had been estimated to take several hours.