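【Posted】: 2015-03-14 13:10:42
【Question】:
I have a data frame with a character variable that contains long paragraphs, and I need to split them at positions determined by certain phrases. The problem is that in many cases these phrases are fused to the preceding word.
Here is what I am doing: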
data <- readLines(n=2)
= DAY 1 CHALLENGES = syndicated.= DAY 2 CHALLENGES = Red Sea.= DAY 3 CHALLENGES = framework.= DAY 4 CHALLENGES = Did ;-)= DAY 5 CHALLENGES = Paste ...= DAY 6 CHALLENGES = Name
= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result.= DAY 5 CHALLENGES = coffee.= DAY 6 CHALLENGES = Bla.
df <- as.data.frame(data)
delim <- c("= DAY 1 CHALLENGES = ",
"= DAY 2 CHALLENGES = ",
"= DAY 3 CHALLENGES = ",
"= DAY 4 CHALLENGES = ",
"= DAY 5 CHALLENGES = ",
"= DAY 6 CHALLENGES = ")
y <- data.frame(do.call('rbind',
strsplit(as.character(df$data), delim, fixed = FALSE)))
y
X1
1
2 = DAY 1 CHALLENGES = very high.
X2
1 syndicated.= DAY 2 CHALLENGES = Red Sea.= DAY 3 CHALLENGES = framework.= DAY 4 CHALLENGES = Did ;-)= DAY 5 CHALLENGES = Paste ...= DAY 6 CHALLENGES = Name
2 Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result.= DAY 5 CHALLENGES = coffee.= DAY 6 CHALLENGES = Bla.
I want to get the text of each = DAY x CHALLENGES = section, up to the next such header, as a separate variable.
Thanks!
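A likely reason the attempt above produces only two columns: `strsplit()` recycles a vector `split` argument along `x`, so the first string is split only on the DAY 1 header and the second only on the DAY 2 header. The following is a minimal sketch, assuming every header has the exact form `= DAY <number> CHALLENGES = ` and that each string contains all six headers. The single pattern and the column names `day1` to `day6` are illustrative assumptions, not part of the original post.

```r
# Sample input copied from the question (two paragraphs, six sections each).
data <- c(
  "= DAY 1 CHALLENGES = syndicated.= DAY 2 CHALLENGES = Red Sea.= DAY 3 CHALLENGES = framework.= DAY 4 CHALLENGES = Did ;-)= DAY 5 CHALLENGES = Paste ...= DAY 6 CHALLENGES = Name",
  "= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result.= DAY 5 CHALLENGES = coffee.= DAY 6 CHALLENGES = Bla."
)

# One regular expression that matches any of the six headers,
# instead of a vector of six separate delimiters.
header <- "= DAY [0-9]+ CHALLENGES = "

parts <- strsplit(data, header)

# Each string starts with a header, so the first piece is "" and is dropped.
parts <- lapply(parts, function(p) p[-1])

# Bind the pieces row-wise: one row per input string, one column per day.
y <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(y) <- paste0("day", seq_len(ncol(y)))
y
```

Because the header itself is the delimiter, it does not matter which character precedes it, so headers fused to the previous word are still found. The sketch does not handle strings where a header is missing, since `rbind` would then recycle the shorter row.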
Update, using the suggested approach:
> a <- scan(file ="~/Desktop/alm/a.txt", what="")
Read 1 item
> a
[1] "= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result. = DAY 5 CHALLENGES = Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc.com/ DAY 5 CHALLENGE: Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5 = DAY 6 CHALLENGES = Bla."
> b <- scan(file ="~/Desktop/alm/b.txt", what="")
Read 1 item
> b
[1] "= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result. Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc.com/ DAY 5 CHALLENGE: Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5 ?= DAY 6 CHALLENGES = Bla."
> c <- c(a,b)
> df <- as.data.frame(c)
> lst <- strsplit(gsub(" (?=\\= DAY)", ".", c, perl=TRUE),
+ '(?<=[.)])(?=\\=)', perl=TRUE)
> out <- do.call(cbind, lapply(lst, function(x) sub('^=.*= ', '', x)))
Warning message:
In (function (..., deparse.level = 1) :
number of rows of result is not a multiple of vector length (arg 2)
> out
[,1]
[1,] "very high."
[2,] "Rank understand."
[3,] "buy...."
[4,] "result.."
[5,] "Paste the link(s) that you think is Paid Media.http://lebron11.nikeinc.com/ DAY 5 CHALLENGE: Paste the link(s) that you think is Owned Media.http://www.nike.com/ ; https://www.pinterest.com/nikewomen DAY 5 CHALLENGE: Paste the link(s) that you think is BONUS QUESTION DAY 5."
[6,] "Bla."
[,2]
[1,] "very high."
[2,] "Rank understand."
[3,] "buy...."
[4,] "Bla." #this is not the value from the input file
[5,] "very high." #this is missing in the input file, yet a value is getting output
[6,] "Rank understand." #incorrect recognition of ?= DAY 6 CHALLENGES =; the same happens with := and != or similar
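The problems are pointed out in the comments in the output above. An indication of missing values would be useful, rather than having random values inserted.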
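Two things appear to cause the output above. The lookbehind `(?<=[.)])` only splits where the header follows `.` or `)`, so a header preceded by `?`, `:` or `!` is not recognised; and `cbind()` recycles the shorter vector (hence the warning), which is where the unexpected values come from. One way around both, sketched here under assumptions, is to locate the headers themselves, read the day number from each, and place the following text into a matrix pre-filled with `NA`. The helper name `split_days`, its `n_days` argument, and the shortened sample strings are hypothetical and not from the original post.

```r
# Hypothetical helper: one row per input string, one column per day,
# NA wherever a day's header does not occur in the string.
split_days <- function(x, n_days = 6) {
  header <- "= DAY ([0-9]+) CHALLENGES ="
  out <- matrix(NA_character_, nrow = length(x), ncol = n_days,
                dimnames = list(NULL, paste0("day", seq_len(n_days))))
  m <- gregexpr(header, x)
  heads  <- regmatches(x, m)                 # the matched headers
  bodies <- regmatches(x, m, invert = TRUE)  # the text around the headers
  for (i in seq_along(x)) {
    if (length(heads[[i]]) == 0) next        # no header found: row stays NA
    day  <- as.integer(sub(header, "\\1", heads[[i]]))
    text <- trimws(bodies[[i]][-1])          # drop text before the first header
    keep <- !is.na(day) & day >= 1 & day <= n_days
    out[i, day[keep]] <- text[keep]
  }
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Shortened variants of the two strings in the question:
# the first has ":" fused to the DAY 6 header, the second has no DAY 5 header.
x <- c(
  "= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result. = DAY 5 CHALLENGES = Paste ...:= DAY 6 CHALLENGES = Bla.",
  "= DAY 1 CHALLENGES = very high.= DAY 2 CHALLENGES = Rank understand.= DAY 3 CHALLENGES = buy....= DAY 4 CHALLENGES = result. Paste the link(s) DAY 5 ?= DAY 6 CHALLENGES = Bla."
)
res <- split_days(x)
res
```

With this approach a missing header leaves `NA` in its column, and the text that would have belonged to it stays attached to the preceding section. Any punctuation fused to a header (such as `:` or `?`) remains at the end of the preceding text.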
【Comments】:
-
You need to do better than this; please give us some sample input and the corresponding sample output.
-
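@StefanPetkov I have updated the post again. As I mentioned earlier, I have already spent some time on this post. If this does not work, you may not have provided all the patterns, as I suggested...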
-
I think it works, except for cases with ":= DAY x =", where it fails. But "?= DAY x =" now works fine.