Generating sequence data for a recommender system in R

[Question title]: Generating sequence data for recommender system in R
[Posted]: 2018-04-01 21:00:29
[Question description]:

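I'm trying to build a recommender system that recommends electives to new students based on their core courses and on historical student data (which contains both core courses and electives).

My data looks like the table below:

I generated a cross-tab as shown in Table 2 (without the Term_Code ordering).

I want to generate sequence data as shown in Table 3 (the Course_Num:Grade combinations should be arranged in Term_Code order).

Any help is much appreciated. Thanks in advance!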

[Comments]:

aggregate(yes~Student_Num,transform(df1[2],yes=do.call(paste,c(df1[3:4],sep=":"))),paste)
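
The one-liner above is dense and keeps rows in whatever order they already have. As a minimal base R sketch of the same idea, split into steps and with an explicit sort by Term_Code: the data values and the `Token` and `sequences` names below are made up for illustration and are not from the question.

```r
# Hypothetical data in the shape of the question's Table 1.
df1 <- data.frame(
  Term_Code   = c(2002, 2001, 2003, 2001, 2002),
  Student_Num = c(1, 1, 1, 2, 2),
  Course_Num  = c(1000, 1002, 1003, 1003, 1002),
  Grade       = c("A", "B", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# Sort by student, then by term, so each student's rows are chronological.
df1 <- df1[order(df1$Student_Num, df1$Term_Code), ]

# Build the "Course_Num:Grade" token for every row.
df1$Token <- paste(df1$Course_Num, df1$Grade, sep = ":")

# Collapse the tokens into one string per student; collapse is passed on to paste().
sequences <- aggregate(Token ~ Student_Num, data = df1,
                       FUN = paste, collapse = ", ")
names(sequences)[2] <- "Sequence"

print(sequences)
```

Sorting before aggregating is what puts each student's courses in term order; without the `order()` step the sequence simply follows the row order of the input.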

Tags: r associations cluster-analysis sequence recommendation-engine

[Solution 1]:

It may be easier to start from Table 1 (df1 in the example below).

library(dplyr)
set.seed(46)

# Random example data in the shape of Table 1
df1 <- data.frame(Term_Code = sample(2001:2003, 7, TRUE),
                 Student_Num = sample(1:3, 7, TRUE),
                 Course_Num = sample(1000:1003, 7, TRUE),
                 Grade = sample(LETTERS[1:4], 7, TRUE), stringsAsFactors = FALSE)

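# Printed view of the data with a per-row Course_Num:Grade token in Sequence
# (the exact values depend on the seed and the R version):
# A tibble: 7 x 5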
# Groups:   Student_Num [3]
#  Term_Code Student_Num Course_Num Grade Sequence
#      <int>       <int>      <int> <chr> <chr>   
#1      2001           2       1003 A     1003:A  
#2      2001           3       1002 D     1002:D  
#3      2002           3       1003 A     1003:A  
#4      2002           1       1000 A     1000:A  
#5      2001           1       1002 B     1002:B  
#6      2002           2       1002 B     1002:B  
#7      2003           1       1003 A     1003:A

df1 %>% 
    group_by(Student_Num) %>% 
    summarise(Sequence = paste(Course_Num, Grade, sep = ':', collapse = ', '))

# A tibble: 3 x 2
#  Student_Num Sequence              
#        <int> <chr>                 
#1           1 1000:A, 1002:B, 1003:A
#2           2 1003:A, 1002:B        
#3           3 1002:D, 1003:A
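
The result above follows the row order of df1, not Term_Code, which is what the question asks for. A minimal sketch of one way to get term order, assuming dplyr is installed: sort with `arrange()` before collapsing. The rows are the same as in the printed example, typed in by hand so the result does not depend on the random seed; the `sequences` name is just for illustration.

```r
library(dplyr)

# Same rows as the printed example, entered by hand for reproducibility.
df1 <- data.frame(
  Term_Code   = c(2001, 2001, 2002, 2002, 2001, 2002, 2003),
  Student_Num = c(2, 3, 3, 1, 1, 2, 1),
  Course_Num  = c(1003, 1002, 1003, 1000, 1002, 1002, 1003),
  Grade       = c("A", "D", "A", "A", "B", "B", "A"),
  stringsAsFactors = FALSE
)

sequences <- df1 %>%
  arrange(Student_Num, Term_Code) %>%   # chronological order within each student
  group_by(Student_Num) %>%
  summarise(Sequence = paste(Course_Num, Grade, sep = ":", collapse = ", "))

print(sequences)
```

With this data, student 1's sequence becomes 1002:B, 1000:A, 1003:A, because the 2001 course now comes before the 2002 one; students 2 and 3 are unchanged since their rows were already in term order.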

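[Discussion]:

Thank you, Renu! I had been thinking about this approach a lot. The method is clean and concise. I really appreciate it!

[Solution 2]:

Using the tidyverse suite of packages: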

library(tidyverse)

# The pipe operator (%>%) passes its left-hand side as the first argument of the next function.
# It lets us read the steps in order instead of as nested calls.
# tibble() replaces the deprecated data_frame().
df1 <- tibble(
  term_code = c(200701, 200701, 200707, 200701, 200801, 200807, 200707, 200701), 
  student_number = rep(1:3, c(4, 2, 2)),
  course_number = c(1000, 2200, 1100, 4200, 2000, 1100, 2000, 4100),
  grade = c('A','B', 'B-','C','A', 'B','C','E')
)

df1 %>%
  unite(Sequence,c(course_number, grade), sep = ":") %>%
  group_by(student_number) %>%
  summarize(
    Sequence = paste(Sequence, collapse = ", ")
  )

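If you are not familiar with the pipe operator or the other functions I'm using, I would run the pipeline one call at a time so you can see what each step does (all of them are documented at https://www.tidyverse.org/). For example,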

df1 %>%
  unite(Sequence,c(course_number, grade), sep = ":")

[Discussion]:

Thank you, Melissa! This is very helpful. Thanks for the tidyverse link too!

[Solution 3]:

Using reshape2 and the %>% operator from dplyr:

df <- read.csv(text="
Student_Num,1000,1100,2000,2200,4100,4200
1,A,B-,,B,,C
2,,B,A,,,
3,,,C,,E,
", stringsAsFactors = FALSE)


library(reshape2)
library(dplyr)

melt(df, id.vars = "Student_Num",  value.name = 'Grade') %>%
  mutate(variable = substr(variable, 2, 5)) %>%
  filter(Grade != "") %>%
  group_by(Student_Num) %>%
  summarize(Sequence = paste0(variable, ":", Grade, collapse = ","))

#  Student_Num Sequence                    
#        <int> <chr>                       
# 1           1 1000:A,1100:B-,2200:B,4200:C
# 2           2 1100:B,2000:A               
# 3           3 2000:C,4100:E
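
For comparison, here is a sketch of the same reshaping with tidyr's `pivot_longer()` instead of `melt()`, assuming tidyr and dplyr are installed. This is a swapped-in alternative, not the answerer's method. Reading the CSV with `check.names = FALSE` is an added assumption that keeps the course numbers as column names, so the `substr()` clean-up is not needed. As in the answer above, the wide table carries no Term_Code, so the sequence follows column order rather than term order.

```r
library(dplyr)
library(tidyr)

# check.names = FALSE keeps "1000" as a column name instead of "X1000".
df <- read.csv(text = "Student_Num,1000,1100,2000,2200,4100,4200
1,A,B-,,B,,C
2,,B,A,,,
3,,,C,,E,
", stringsAsFactors = FALSE, check.names = FALSE)

sequences <- df %>%
  # One row per (student, course) pair; course numbers come from the column names.
  pivot_longer(-Student_Num, names_to = "Course_Num", values_to = "Grade") %>%
  filter(Grade != "") %>%               # drop courses the student did not take
  group_by(Student_Num) %>%
  summarise(Sequence = paste0(Course_Num, ":", Grade, collapse = ","))

print(sequences)
```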

[Discussion]:

Thank you @epi99! I was trying to implement something similar. Thanks a lot for your input!