【Question title】: Wrangling dataset by picking out Rotten Tomatoes movie ratings from a column
【Posted】: 2019-11-03 11:25:07
【Question description】:

I have this sample dataset:

structure(list(Title = c("Isn't It Romantic", "Isn't It Romantic", 
"Isn't It Romantic", "Isn't It Romantic", "Isn't It Romantic", 
"Isn't It Romantic", "Gully Boy", "Gully Boy", "Gully Boy", "Gully Boy", 
"Gully Boy", "Gully Boy", "The Wandering Earth", "The Wandering Earth", 
"The Wandering Earth", "The Wandering Earth", "The Wandering Earth", 
"The Wandering Earth", "How to Train Your Dragon: The Hidden World", 
"How to Train Your Dragon: The Hidden World", "How to Train Your Dragon: The Hidden World", 
"How to Train Your Dragon: The Hidden World", "How to Train Your Dragon: The Hidden World", 
"How to Train Your Dragon: The Hidden World", "American Woman", 
"American Woman", "Us", "Us", "Us", "Us", "Us", "Us", "The Wolf's Call", 
"The Wolf's Call", "Avengers: Endgame", "Avengers: Endgame", 
"Avengers: Endgame", "Avengers: Endgame", "Avengers: Endgame", 
"Avengers: Endgame", "The Silence", "The Silence", "The Silence", 
"The Silence", "The Silence", "The Silence", "My Little Pony: Equestria Girls: Spring Breakdown", 
"My Little Pony: Equestria Girls: Spring Breakdown"), Ratings = c("Internet Movie Database", 
"5.9/10", "Rotten Tomatoes", "68%", "Metacritic", "60/100", "Internet Movie Database", 
"8.4/10", "Rotten Tomatoes", "100%", "Metacritic", "65/100", 
"Internet Movie Database", "6.4/10", "Rotten Tomatoes", "74%", 
"Metacritic", "62/100", "Internet Movie Database", "7.6/10", 
"Rotten Tomatoes", "91%", "Metacritic", "71/100", "Rotten Tomatoes", 
"57%", "Internet Movie Database", "7.1/10", "Rotten Tomatoes", 
"94%", "Metacritic", "81/100", "Internet Movie Database", "7.6/10", 
"Internet Movie Database", "8.7/10", "Rotten Tomatoes", "94%", 
"Metacritic", "78/100", "Internet Movie Database", "5.2/10", 
"Rotten Tomatoes", "23%", "Metacritic", "25/100", "Internet Movie Database", 
"7.7/10")), row.names = c(NA, -48L), class = c("tbl_df", "tbl", 
"data.frame"))

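The Ratings column holds up to 3 different types of rating for each movie (IMDb, Rotten Tomatoes, and Metacritic). Each rating takes two rows, the provider name followed by its value, so a movie with all three takes up 6 rows.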

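I want to tidy this dataset so that each movie gets a new column called rottentomatoes_rating whose value is that rating. So, in my sample dataset, Isn't It Romantic would have 68% under rottentomatoes_rating, Gully Boy would have 100% under rottentomatoes_rating, and so on.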

For movies that have no Rotten Tomatoes rating, I'd like to put NA under rottentomatoes_rating.

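I've thought about using spread from tidyr, but I can't figure out how to do it, because in my case the variable and the value are both in the same column!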

【Comments】:

    Tags: r tidyr


    【Solution 1】:

    If the data is formatted the same way throughout the whole dataset, the following code should work:

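    library(dplyr)

    df %>% group_by(Title) %>% 
      # match() on the grouped column looks up the label within each movie;
      # the score sits on the row right after the "Rotten Tomatoes" label
      slice(match("Rotten Tomatoes", Ratings) + 1) %>%
      rename(rottentomatoes_rating = Ratings)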
    

    This gives:

    # A tibble: 2 x 6
    # Groups:   Title [2]
      Title             Year  Rated     Released   Runtime rottentomatoes_rating
      <chr>             <chr> <chr>     <date>     <chr>   <chr>                
    1 Gully Boy         2019  Not Rated 2019-02-14 153 min 100%                 
    2 Isn't It Romantic 2019  PG-13     2019-02-13 89 min  68%     
    

    As for the NAs: if the raw data always has the RT score on the row right after the provider string, then it should give you NA by default.
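
One way to make the NA behaviour explicit is to collapse each movie to a single row with summarise() rather than slice(). This is only a sketch and not part of the answer above: it assumes a cut-down `df` holding just Title and Ratings, and it relies on base R returning NA when a vector is indexed with NA.

```r
library(dplyr)

# Cut-down sample (assumption: only Title and Ratings are needed here)
df <- tibble(
  Title = c(rep("Gully Boy", 6), rep("The Wolf's Call", 2)),
  Ratings = c("Internet Movie Database", "8.4/10", "Rotten Tomatoes", "100%",
              "Metacritic", "65/100", "Internet Movie Database", "7.6/10")
)

rt <- df %>%
  group_by(Title) %>%
  summarise(
    # match() returns NA when the label is absent, and Ratings[NA + 1] is NA
    rottentomatoes_rating = Ratings[match("Rotten Tomatoes", Ratings) + 1]
  )

print(rt)
```

Every title keeps exactly one row, so movies without a Rotten Tomatoes entry stay in the result with NA instead of being dropped.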

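    【Comments】:

    • Hi all, I've edited my sample dataset to include other movies that don't have a Rotten Tomatoes rating. Would you mind taking a look and adjusting your answer? Currently, your answer doesn't cover movies without a Rotten Tomatoes rating. Thanks!
    【Solution 2】:

    sumshyftw's answer is good.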

    But if you just want the Rotten Tomatoes percentage, here is a data.table version:

    library(data.table)
    # keep only the rows whose Ratings value contains a percent sign
    dt <- dt[dt$Ratings %like% "%",]
    dt <- setnames(dt, "Ratings", "rottentomatoes_rating")
    

    Output:

    # A tibble: 2 x 6
      Title             Year  Rated     Released   Runtime rottentomatoes_rating
      <chr>             <chr> <chr>     <date>     <chr>   <chr>                
    1 Isn't It Romantic 2019  PG-13     2019-02-13 89 min  68%                  
    2 Gully Boy         2019  Not Rated 2019-02-14 153 min 100%  
    

    I used %like% "%" because I assumed the full data looks like your example.
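
Because the filter keeps only the percentage rows, a movie with no Rotten Tomatoes entry disappears from the result. The sketch below shows one possible way to bring those movies back with NA. It is an illustration rather than the answer's method: it uses base R's grepl() in place of %like%, and the cut-down `df` and the names `rt`, `all_titles` and `rt_full` are assumptions.

```r
# Cut-down sample (assumption: only Title and Ratings are needed here)
df <- data.frame(
  Title = c(rep("Gully Boy", 6), rep("The Wolf's Call", 2)),
  Ratings = c("Internet Movie Database", "8.4/10", "Rotten Tomatoes", "100%",
              "Metacritic", "65/100", "Internet Movie Database", "7.6/10"),
  stringsAsFactors = FALSE
)

# Keep only the percentage rows; fixed = TRUE treats "%" as a literal character
rt <- df[grepl("%", df$Ratings, fixed = TRUE), ]
names(rt)[names(rt) == "Ratings"] <- "rottentomatoes_rating"

# Left-join onto the full list of titles so movies without a score get NA
all_titles <- data.frame(Title = unique(df$Title), stringsAsFactors = FALSE)
rt_full <- merge(all_titles, rt, by = "Title", all.x = TRUE)

print(rt_full)
```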

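    【Comments】:

    • Hi all, I've edited my sample dataset to include other movies that don't have a Rotten Tomatoes rating. Would you mind taking a look and adjusting your answer? Currently, your answer doesn't cover movies without a Rotten Tomatoes rating. Thanks!
    • @JayBaik Hi, I'd suggest using AntoniosK's answer; it should fit your needs perfectly. It gives you <NA> wherever there is no Rotten Tomatoes value. My data.table version is a simple answer if you, or someone else, only want the rows where a Rotten Tomatoes % exists. Have a nice day!
    【Solution 3】:

    Using data.table to get all the metrics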

    # using data.table
    library(data.table)
    dt <- as.data.table(df)
    
    # groups the data set with by, and extracts the Ratings
    # makes use of logic that the odd indices hold the name of the provider,
    # the even ones hold the values. Only works if this holds.
    # It can probably be optimised a bit. dcast converts from long to required wide
    # format
    splitRatings <- function(Ratings){
      # e.g. Ratings=dt$Ratings[1:6]
      N <- length(Ratings)
      split_dt <- data.table(DB=Ratings[1:N %% 2 == 1],
                             Values=Ratings[1-(1:N %% 2) == 1])
      out <- dcast(split_dt, .~DB, value.var = "Values")
      out[, ".":=NULL]
      out
    }
    
    # applies the function based on the by clause, returning the table embedded
    dt2 <- dt[, splitRatings(Ratings), by=.(Title, Year, Rated, Released, Runtime)]
    
    # convert back
    out <- as.data.frame(dt2)
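
Since this approach only works when provider names sit on odd positions and values on even ones, it may be worth checking that assumption before reshaping. A minimal base R sketch, with the variable names `providers`, `values` and `layout_ok` made up for illustration:

```r
# Ratings for a single movie, copied from the sample data
ratings <- c("Internet Movie Database", "5.9/10",
             "Rotten Tomatoes", "68%",
             "Metacritic", "60/100")

idx <- seq_along(ratings)
providers <- ratings[idx %% 2 == 1]  # odd positions: provider names
values    <- ratings[idx %% 2 == 0]  # even positions: rating values

# Sanity check: provider names contain no digits, every value contains one
layout_ok <- !any(grepl("[0-9]", providers)) && all(grepl("[0-9]", values))

print(data.frame(providers, values))
print(layout_ok)
```

If `layout_ok` comes back FALSE for some movie, the odd/even split would pair providers with the wrong values.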
    

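    【Comments】:

    • Hi all, I've edited my sample dataset to include other movies that don't have a Rotten Tomatoes rating. Would you mind taking a look and adjusting your answer? Currently, your answer doesn't cover movies without a Rotten Tomatoes rating. Thanks!
    • I have something, but I'll post it as a new solution since it's completely different.
    【Solution 4】:

    Assuming your dataset is called dt, you can use this process to get a tidy version of the dataset: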

    library(tidyverse)
    
    # specify indexes of Rating companies
    ids = seq(1, nrow(dt), 2)
    
    # get rows of Rating companies
    dt %>% slice(ids) %>%
      # combine with the rating values
      cbind(dt %>% slice(-ids) %>% select(RatingsValue = Ratings)) %>%
      # reshape dataset
      spread(Ratings, RatingsValue)
    
    #                Title Year     Rated   Released Runtime Internet Movie Database Metacritic Rotten Tomatoes
    # 1         Gully Boy 2019 Not Rated 2019-02-14 153 min                  8.4/10     65/100            100%
    # 2 Isn't It Romantic 2019     PG-13 2019-02-13  89 min                  5.9/10     60/100             68%
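
In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The same pairing idea could be sketched as below. This is a variant, not the code from the answer: it assumes a cut-down `dt` with only Title and Ratings, and the names `long` and `wide` are made up.

```r
library(tidyr)

# Cut-down sample (assumption: only Title and Ratings are needed here)
dt <- data.frame(
  Title = c(rep("Gully Boy", 6), rep("The Wolf's Call", 2)),
  Ratings = c("Internet Movie Database", "8.4/10", "Rotten Tomatoes", "100%",
              "Metacritic", "65/100", "Internet Movie Database", "7.6/10"),
  stringsAsFactors = FALSE
)

# Odd rows hold the provider names, even rows hold the values
ids <- seq(1, nrow(dt), 2)
long <- data.frame(
  Title    = dt$Title[ids],
  Provider = dt$Ratings[ids],
  Value    = dt$Ratings[-ids],
  stringsAsFactors = FALSE
)

# One column per provider; missing provider/movie combinations become NA
wide <- pivot_wider(long, names_from = Provider, values_from = Value)

print(wide)
```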
    

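    【Comments】:

    • Hi all, I've edited my sample dataset to include other movies that don't have a Rotten Tomatoes rating. Would you mind taking a look and adjusting your answer? Currently, your answer doesn't cover movies without a Rotten Tomatoes rating. Thanks!
    • Looks like you haven't tried it yet :) My approach will give you NAs wherever you don't have a rating value.
    【Solution 5】:

    A new version that fills in NA where a rating is missing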

    # using data.table
    library(data.table)
    dt <- as.data.table(df)
    
    # Index will hold whether the row is a Provider eg Rotten Tomatoes, or a value
    dt[, Index:=rep(c("Provider", "Value"), .N/2)]
    # Need an index to bind these together
    dt[, Provider.Id:=rep(1:(.N/2), each=2), by=Title]
    dt[1:6,]
    
    # segment out the Provider & Values in to columns
    out <- dcast(dt, Title+Provider.Id~Index, value.var = "Ratings")
    # drop the helper id; the Provider column is still needed for the next dcast
    out[, Provider.Id := NULL]
    
    # now convert to full wide format 
    out_df <- as.data.frame(dcast(out, Title~Provider, value.var="Value", fill=NA))
    out_df
    

    【Comments】:

      【Solution 6】:

      Here's one version.

      library(tidyverse)

      df %>% 
        mutate(Value = ifelse(str_detect(Ratings, "\\d"), Ratings, NA)) %>% 
        fill(Value, .direction = "up") %>% 
        filter(!str_detect(Ratings, "\\d")) %>% 
        spread(Ratings, Value)
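
To see why the pipeline works, it can help to stop after the fill() step and look at the intermediate table. The sketch below does that on a cut-down `df` (an assumption, holding only Title and Ratings); the name `step` is made up. Note that fill() is not grouped here, so the trick relies on every provider row being followed directly by its value row.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Cut-down sample (assumption: only Title and Ratings are needed here)
df <- tibble(
  Title = c(rep("Gully Boy", 6), rep("The Wolf's Call", 2)),
  Ratings = c("Internet Movie Database", "8.4/10", "Rotten Tomatoes", "100%",
              "Metacritic", "65/100", "Internet Movie Database", "7.6/10")
)

step <- df %>%
  # rows containing a digit are values; provider rows get NA for now
  mutate(Value = ifelse(str_detect(Ratings, "\\d"), Ratings, NA)) %>%
  # each provider row picks up the value from the row below it
  fill(Value, .direction = "up")

print(step)
```

After this step each provider row carries its own score, so filtering out the value rows and spreading Ratings against Value gives one column per provider.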
      

      【Comments】:
