【发布时间】:2017-08-29 14:17:32
【问题描述】:
我正在做一个历史项目,我们需要了解每个月的员工数量。对于数据集中的每个人,我都有他们受雇的时期。不同职位的 fname_code 代码。例如,Edmont Privat 博士在如下所示的各个时期具有两种不同的功能:
pname fname_code begin_date end_date
1 Dr. Edmond Privat 3 1921-09-02 1921-10-07
2 Dr. Edmond Privat 2 1921-12-07 1922-03-06
3 Joseph Louis Marie Charles Avenol 1 1923-02-01 1933-07-01
4 Joseph Louis Marie Charles Avenol 1 1933-07-01 1940-08-31
5 Dr. G. G. Kullmann 2 1931-03-30 1938-12-15
我的想法是将此信息转换为主题/日期期间数据框/矩阵,其中 N 表示此人此时不在公司,而数字表示他们已被雇用以及他们的级别。这是我想到的一个例子:
1944-07-01 1944-08-01 1944-09-01 1944-10-01
Albert Dan Meurig Evans N N N N
Genevieve Jeanne Leonie Mayor N 2 3 3
我已经做了一个可以完成这项工作的东西 - 它已经完成了上述几行 - 但是,代码定义不适合虚伪的程序员(很多前循环!)。我的问题是,你们中是否有经验丰富的程序员对提高速度有建议,或者以完全不同的方式实现我的目标。 我试过玩一些 dplyr 函数,但我对它们的经验太少,无法让它们工作。我还考虑过创建一个 if 条件来处理一个人刚被雇用一段时间的情况,因为在这种情况下不需要 forloop - 但我不确定在哪里以最佳方式实施它。
我的计算灾难的逻辑是查看数据框和就业数据框的月份间隔之间是否存在重叠:
library(lubridate)
library(tidyverse)
#creating sequence of dates for columns
start_date <- as.Date("1919-01-01")
end_date <- as.Date("1948-12-30")
dates <- seq.Date(start_date, end_date, by ="month")
#dates as columns and names on columns
test.df <- matrix(ncol =length(dates), nrow = nlevels(mdl_df$pname))
test.df <- as.data.frame(test.df)
colnames(test.df) <- dates
rownames(test.df) <- levels(mdl_df$pname)
for (name in 1:nlevels(mdl_df$pname)){
#subsetting the rows for each person
person_rows <- mdl_df %>% filter( mdl_df$pname == rownames(test.df)[name])
for (date in 1:(length(dates)-1)) {
#Creating a month interval consisting of the time between two adjecent months
interval1 <- interval(ymd(colnames(test.df)[date]),ymd(colnames(test.df)[date+1]))
for (row in 1:nrow(person_rows)) {
#check if overlap between df month interval and employment intervals.
interval2 <- interval(ymd(person_rows$begin_date[row]),ymd(person_rows$end_date[row]))
if (int_overlaps(interval1, interval2)){
#checking if df period and work period overlap. If so rank is inserted otherwise N is entered
test.df[name,date] <- test_rows$fname_code[row]
break
}else{
test.df[name,date] <- "N"
}
}
}
}
该数据集包含大约 3000 名员工,我的计算机大约需要 6-7 小时才能完成这项工作。在接下来的几周内,我需要在各种数据集上多次运行并重新运行脚本,因此非常感谢任何帮助!
编辑:数据集前 50 行的 dput 输出。
> dput(droplevels(head(mdl_df, 50)))
structure(list(pname = structure(c(7L, 7L, 24L, 24L, 8L, 19L,
16L, 16L, 16L, 4L, 34L, 11L, 17L, 12L, 23L, 10L, 14L, 14L, 14L,
14L, 14L, 32L, 5L, 22L, 29L, 3L, 13L, 25L, 2L, 6L, 26L, 18L,
21L, 27L, 27L, 28L, 20L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 15L, 31L,
33L, 30L, 30L, 1L), .Label = c("A. Gordon Bagnall", "Bertil Gotthard Ohlin",
"Birgit Nissen", "Bryan Fullerton Adams", "C.H. Wykes", "Christian Olsen",
"Dr. Edmond Privat", "Dr. G. G. Kullmann", "Eugène Henri René Vigier",
"Ewan P. Wallis-Jones", "Francis Yeats-Brown", "Francisco Walker-Linares",
"Frank Horsfall Nixon", "Frank Paul Walters", "Franklin Urteaga",
"Gerald Heguerty Furtado Abraham", "Gladys Wade", "Guillaume Théodore Conrad Zwerner",
"Henri Bonnet", "Haakon Vigander", "Ignacio J. Valdes", "Ingvad Nielsen",
"Jessie Irene Wall", "Joseph Louis Marie Charles Avenol", "Julian Nogueira",
"Konni Zilliacus", "Luis Varela-Obregoso", "Marc Veillet-Lavallee",
"Maria Nielsen", "Peter Martin Anker", "Pierre Achille Louis Eugène Quesnay",
"Pierre Henry Watier", "Prof. Fred Alexander", "Robert André Felix Bach"
), class = "factor"), fname_code = c(3L, 2L, 1L, 1L, 2L, 2L,
2L, 0L, 2L, 4L, 2L, 2L, 2L, 4L, 2L, 2L, 3L, 2L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 6L,
0L, 0L, 1L, 2L, 3L, 6L, 2L, 2L, 2L, 2L, 2L, 2L), begin_date = structure(c(-17653,
-17557, -17136, -13333, -14157, -17897, -18050, -13789, -8962,
-15010, -11810, -15372, -14003, -14855, -16047, -12900, -18494,
-18254, -14245, -13333, -11172, -12008, -18398, -14360, -15002,
-11802, -17883, -12862, -14245, -17136, -18248, -14975, -13989,
-15494, -15372, -14108, -14738, -18201, -17849, -17849, -11657,
-10592, -10579, -10130, -11436, -16849, -13631, -14033, -11161,
-12620), class = "Date"), end_date = structure(c(-17618, -17468,
-13333, -10715, -11340, -14243, -13789, -11223, -8624, -11178,
-10797, -17543, -13982, -8555, -15628, -12879, -18254, -14245,
-13333, -11172, -10809, -11822, -18255, -14339, -14988, -11781,
-17078, -11158, -13958, -16590, -11401, -14610, -13968, -15434,
-15007, -13920, -14717, -17849, -8524, -8524, -8524, -8524, -8524,
-8524, -11415, -15707, -13613, -11161, -8555, -12614), class = "Date")),
.Names = c("pname", "fname_code", "begin_date", "end_date"), row.names = c(NA, 50L), class = "data.frame")
运行 Rstudio v.1.0.136
附件:
[1] dplyr_0.7.1 purrr_0.2.2.2 readr_1.1.1 tidyr_0.6.3 tibble_1.3.3 ggplot2_2.2.1
[7] tidyverse_1.1.1 lubridate_1.6.0
【问题讨论】:
-
您可以使用
dput以便我们重现您的数据集吗? -
当然,现在作为编辑添加。
-
需要添加包依赖。
-
是的,先生,信息也添加为编辑:运行 Rstudio v.1.0.136。附加包:[1] dplyr_0.7.1 purrr_0.2.2.2 readr_1.1.1 tidyr_0.6.3 tibble_1.3.3 ggplot2_2.2.1 [7] tidyverse_1.1.1 lubridate_1.6.0
-
我的意思是您需要包含
library代码,这样您的代码才能重现。
标签: r datetime optimization plyr data-conversion