【Question title】: Create a vector of unique values out of several columns with overlapping values
【Posted】: 2014-06-25 07:40:57
【Question description】:

In my data.frame, each row has three SUBJECT columns. I would like an additional column that holds a single subject per row. First, here is what my data looks like:

DATE <- c("1","2","3","4","5","6","7","1","2","3","4","5","6","7")
COMP <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
CLASS <- c("positive", "negative", "aneutral", "positive", "positive", "negative", "aneutral", "positive", "negative", "negative", "positive", "aneutral", "aneutral", "aneutral")
SUBJECT.1 <- c("LITIGATION","LAYOFF","POLLUTION","CHEMICAL DISASTER","PRESS RELEASE","PEOPLE","EMISSIONS","ENERGY","WASTE MANAGEMENT","EMPLOYEES","MANAGEMENT","PRESS RELEASE","HOTELS","POLLUTION")
SUBJECT.2 <- c("POLLUTION","EMPLOYEES","NUCLEAR","FUELS","STOCK OPTION PLAN","EXECUTIVES","CO2","SOLAR","POLLUTION","EXECUTIVES","PRESS RELEASE","CELEBRITIES","CELEBRITIES","LITIGATION")
SUBJECT.3 <- c("ENVIRONMENT","JOB REDUCTIONS","POWER PLANTS","POLLUTION","EMPLOYEES","FRAUD","CLIMATE CHANGE","SUSTAINABILITY","HAZARDOUS WASTE","BONUS PAY","LITIGATION","EMISSIONS","SCANDALS","SCANDALS")
CONTROLVAR <- c("11","13","13","14","13","14","12","11","13","13","14","13","14","12")

mydf <- data.frame(DATE, COMP, RET, CLASS, SUBJECT.1, SUBJECT.2, SUBJECT.3, CONTROLVAR, stringsAsFactors=FALSE)

mydf

#    DATE COMP   RET    CLASS         SUBJECT.1         SUBJECT.2       SUBJECT.3 CONTROLVAR
# 1     1    A -2.00 positive        LITIGATION         POLLUTION     ENVIRONMENT         11
# 2     2    A  1.10 negative            LAYOFF         EMPLOYEES  JOB REDUCTIONS         13
# 3     3    A  3.00 aneutral         POLLUTION           NUCLEAR    POWER PLANTS         13
# 4     4    A  1.40 positive CHEMICAL DISASTER             FUELS       POLLUTION         14
# 5     5    A -0.20 positive     PRESS RELEASE STOCK OPTION PLAN       EMPLOYEES         13
# 6     6    A  0.60 negative            PEOPLE        EXECUTIVES           FRAUD         14
# 7     7    A  0.10 aneutral         EMISSIONS               CO2  CLIMATE CHANGE         12
# 8     1    B -0.21 positive            ENERGY             SOLAR  SUSTAINABILITY         11
# 9     2    B -1.20 negative  WASTE MANAGEMENT         POLLUTION HAZARDOUS WASTE         13
# 10    3    B  0.90 negative         EMPLOYEES        EXECUTIVES       BONUS PAY         13
# 11    4    B  0.30 positive        MANAGEMENT     PRESS RELEASE      LITIGATION         14
# 12    5    B -0.10 aneutral     PRESS RELEASE       CELEBRITIES       EMISSIONS         13
# 13    6    B  0.30 aneutral            HOTELS       CELEBRITIES        SCANDALS         14
# 14    7    B -0.12 aneutral         POLLUTION        LITIGATION        SCANDALS         12

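Since I want to include the subject as (mutually exclusive) dummy variables in a later regression, I need a single column SUBJECT with exactly one subject per row. I want to focus on the subjects LITIGATION, POLLUTION and LAYOFF.

I want to check each SUBJECT column, from left to right, for LITIGATION, POLLUTION and LAYOFF.

If the first column contains one of the three subjects LITIGATION, POLLUTION or LAYOFF, that subject is chosen. If the first column holds a different subject, I check the second column, and so on. If none of the three subject columns contains LITIGATION, POLLUTION or LAYOFF, the subject should be called OTHER. In addition, some subjects should be grouped. In this example, EMISSIONS should be treated as POLLUTION.

The output should look like this: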

#    DATE COMP   RET    CLASS         SUBJECT.1         SUBJECT.2       SUBJECT.3    SUBJECT CONTROLVAR
# 1     1    A -2.00 positive        LITIGATION         POLLUTION     ENVIRONMENT LITIGATION         11
# 2     2    A  1.10 negative            LAYOFF         EMPLOYEES  JOB REDUCTIONS     LAYOFF         13
# 3     3    A  3.00 aneutral         POLLUTION           NUCLEAR    POWER PLANTS  POLLUTION         13
# 4     4    A  1.40 positive CHEMICAL DISASTER             FUELS       POLLUTION  POLLUTION         14
# 5     5    A -0.20 positive     PRESS RELEASE STOCK OPTION PLAN       EMPLOYEES      OTHER         13
# 6     6    A  0.60 negative            PEOPLE        EXECUTIVES           FRAUD      OTHER         14
# 7     7    A  0.10 aneutral         EMISSIONS               CO2  CLIMATE CHANGE  POLLUTION         12
# 8     1    B -0.21 positive            ENERGY             SOLAR  SUSTAINABILITY      OTHER         11
# 9     2    B -1.20 negative  WASTE MANAGEMENT         POLLUTION HAZARDOUS WASTE  POLLUTION         13
# 10    3    B  0.90 negative         EMPLOYEES        EXECUTIVES       BONUS PAY      OTHER         13
# 11    4    B  0.30 positive        MANAGEMENT     PRESS RELEASE      LITIGATION LITIGATION         14
# 12    5    B -0.10 aneutral     PRESS RELEASE       CELEBRITIES       EMISSIONS  POLLUTION         13
# 13    6    B  0.30 aneutral            HOTELS       CELEBRITIES        SCANDALS      OTHER         14
# 14    7    B -0.12 aneutral         POLLUTION        LITIGATION        SCANDALS  POLLUTION         12

Thanks!

【Question comments】:

    Tags: r variables dataframe unique


    【Solution 1】:
    keep <- c("LITIGATION", "POLLUTION", "LAYOFF", "EMISSIONS")
    mydf$SUBJECT <- "OTHER"
    # Go right to left so that a match in an earlier column overwrites a later one.
    for (x in c("SUBJECT.3", "SUBJECT.2", "SUBJECT.1"))
      mydf$SUBJECT[mydf[, x] %in% keep] <- mydf[mydf[, x] %in% keep, x]
    mydf$SUBJECT[mydf$SUBJECT == "EMISSIONS"] <- "POLLUTION"
    

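    【Comments】:

    • @David Arenburg: Thanks for your answer! Is it possible to group using only parts of words? Maybe a bad example, but say I want every value containing "EMISS", "WAST" or "POLLU" to end up as POLLUTION. Can that be built into your approach somehow? I'm sure `grepl()` would do the trick, but I don't know exactly how...
    • I don't have time to modify this right now (at work); I can take a look later tonight. In the meantime, have a look at @agstudy's answer, which uses regular expressions.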
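The partial-word grouping asked about in the comment above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the `groups` pattern table and the `classify()` helper are hypothetical names introduced here, and the data frame is a small subset of the question's SUBJECT columns, repeated so the snippet runs on its own.

```r
# Small subset of the question's SUBJECT columns (repeated here for a self-contained run).
subj <- data.frame(
  SUBJECT.1 = c("LITIGATION", "CHEMICAL DISASTER", "WASTE MANAGEMENT", "PRESS RELEASE", "HOTELS"),
  SUBJECT.2 = c("POLLUTION", "FUELS", "SOLAR", "CELEBRITIES", "CELEBRITIES"),
  SUBJECT.3 = c("ENVIRONMENT", "POLLUTION", "HAZARDOUS WASTE", "EMISSIONS", "SCANDALS"),
  stringsAsFactors = FALSE
)

# Hypothetical pattern table: regular expression -> group label.
groups <- c("LITIGATION"       = "LITIGATION",
            "LAYOFF"           = "LAYOFF",
            "EMISS|WAST|POLLU" = "POLLUTION")

# Map one column to group labels; NA where no pattern matches.
classify <- function(x) {
  out <- rep(NA_character_, length(x))
  for (p in names(groups)) {
    hit <- is.na(out) & grepl(p, x)
    out[hit] <- groups[[p]]
  }
  out
}

subj$SUBJECT <- "OTHER"
# Right to left, so a match in an earlier column overwrites a later one.
for (col in c("SUBJECT.3", "SUBJECT.2", "SUBJECT.1")) {
  lab <- classify(subj[[col]])
  subj$SUBJECT[!is.na(lab)] <- lab[!is.na(lab)]
}

subj$SUBJECT
```

Note that with partial matching, values such as WASTE MANAGEMENT and HAZARDOUS WASTE are now counted as POLLUTION as well, which is broader than the exact matching used in the answer.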
    【Solution 2】:

    Here is a solution in a few lines that simply chains three ifelse() calls together.

    important <- c('LITIGATION','POLLUTION','LAYOFF','EMISSIONS')

    # Take the first column (left to right) that holds an important subject.
    mydf$SUBJECT <- ifelse( mydf$SUBJECT.1 %in% important, mydf$SUBJECT.1,
                    ifelse( mydf$SUBJECT.2 %in% important, mydf$SUBJECT.2,
                    ifelse( mydf$SUBJECT.3 %in% important, mydf$SUBJECT.3,'OTHER')))
    

    mydf$SUBJECT[mydf$SUBJECT=='EMISSIONS'] <- 'POLLUTION'

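    【Comments】:

    • It looks like you need to change EMISSIONS to POLLUTION first.
    • @BeginneR I change EMISSIONS to POLLUTION at the end, although admittedly, when I added the extra line at the end (once I saw that cptn wanted to lump pollution together with emissions), I forgot to add it to the important vector.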
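The "recode first" variant suggested in the first comment could look roughly like this. It is a sketch rather than the answerer's code: the `subj` and `recoded` names are introduced here for illustration, and the three SUBJECT columns are copied from the question so the snippet is self-contained. On this data it should give the same SUBJECT column as recoding at the end.

```r
# The three SUBJECT columns from the question.
SUBJECT.1 <- c("LITIGATION","LAYOFF","POLLUTION","CHEMICAL DISASTER","PRESS RELEASE","PEOPLE","EMISSIONS",
               "ENERGY","WASTE MANAGEMENT","EMPLOYEES","MANAGEMENT","PRESS RELEASE","HOTELS","POLLUTION")
SUBJECT.2 <- c("POLLUTION","EMPLOYEES","NUCLEAR","FUELS","STOCK OPTION PLAN","EXECUTIVES","CO2",
               "SOLAR","POLLUTION","EXECUTIVES","PRESS RELEASE","CELEBRITIES","CELEBRITIES","LITIGATION")
SUBJECT.3 <- c("ENVIRONMENT","JOB REDUCTIONS","POWER PLANTS","POLLUTION","EMPLOYEES","FRAUD","CLIMATE CHANGE",
               "SUSTAINABILITY","HAZARDOUS WASTE","BONUS PAY","LITIGATION","EMISSIONS","SCANDALS","SCANDALS")
subj <- data.frame(SUBJECT.1, SUBJECT.2, SUBJECT.3, stringsAsFactors = FALSE)

important <- c("LITIGATION", "POLLUTION", "LAYOFF")

# Recode first: treat EMISSIONS as POLLUTION in a working copy of the columns.
recoded <- subj
recoded[recoded == "EMISSIONS"] <- "POLLUTION"

# Then take the first important subject, scanning left to right.
subj$SUBJECT <- with(recoded,
  ifelse(SUBJECT.1 %in% important, SUBJECT.1,
  ifelse(SUBJECT.2 %in% important, SUBJECT.2,
  ifelse(SUBJECT.3 %in% important, SUBJECT.3, "OTHER"))))

table(subj$SUBJECT)
```

Recoding up front keeps the `important` vector limited to the three target subjects, so grouped synonyms never need to be listed in it.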