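This is a harder task than you might think. First, create a new project in RStudio. Then create a files directory inside the project directory and collect all of your files there.
After that, you can run the script below.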
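library(tidyverse)
library(fs)
library(data.table)

readFile = function(fileName){
  # Read the file with one line per row, all in a single column V1
  lines = fread(file = fileName, sep = NULL, header = FALSE)

  # Lines 1-5 are "name: value" header fields
  tibble(txt = lines$V1[1:5]) %>% #1
    separate(txt, c("name", "value"), sep = ": ", extra = "merge") %>% #2
    bind_rows(
      tibble(
        name = paste0("Keywords", 1:6),
        # Line 6 holds the keywords; split into at most 6 pieces, pad with NA
        value = lines$V1[6] %>%
          str_match("(^.*): (.*)") %>% .[,3] %>%
          str_split(", ", 6) %>% .[[1]] %>% .[1:6]) #3
    ) %>% #4
    bind_rows(
      tibble(
        name = "Abstract",
        # Everything from line 8 onward is the abstract (line 7 is skipped)
        value = paste(lines$V1[8:nrow(lines)], collapse = " ")) #5
    ) %>% #6
    pivot_wider(names_from = name, values_from = value) #7
}

files = dir_ls("files")

# Append one row per file
df = tibble()
for(file in files){
  df = df %>% bind_rows(readFile(file))
}
df

df %>% write_csv("Result.csv")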
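Since you are a beginner, let me explain step by step how it works.
My files directory contains a file named file2.txt. Here are its contents. (The script skips line 7 and reads the abstract from line 8, so the actual file presumably has a blank line between the Keywords line and the abstract.)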
Presenter: Ronald Beginer 2
Title: Exploiting
Format: Lecture
Session: 2_mode
Date and time: 03-14-2009 8:30am
Keywords: Method, Two-Mode Data, QCA, Method, Two-Mode Data, QCA, Method, Two-Mode Data, QCA
An innovative ...bla bla bla and other.
An innovative ...bla bla bla and other.
Now let me show you how my readFile function works when it reads this file.
First, I read the whole file into the variable lines.
lines = fread(file = fileName, sep = NULL, header = FALSE)
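To see what lines looks like after this call, here is a minimal sketch. The temporary file and its three lines are made up for illustration; the only thing taken from the answer is the `fread` call with `sep = NULL` and `header = FALSE`.

```r
library(data.table)

# Hypothetical sample file written to a temporary location
path = tempfile(fileext = ".txt")
writeLines(c("Presenter: Ronald Beginer 2",
             "Title: Exploiting",
             "Format: Lecture"), path)

# sep = NULL tells fread not to split columns, so each line becomes one row of V1
lines = fread(file = path, sep = NULL, header = FALSE)

nrow(lines)   # number of lines read
lines$V1[2]   # the second line as a plain string
```

Because every line ends up in the single column V1, the rest of the function can address lines by position, for example `lines$V1[1:5]` for the header fields.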
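Then I turn it into a tibble. The outputs of the subsequent steps are shown below (see the numbered comments in the code).
Output of step 1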
# A tibble: 5 x 1
txt
<chr>
1 Presenter: Ronald Beginer 2
2 Title: Exploiting
3 Format: Lecture
4 Session: 2_mode
5 Date and time: 03-14-2009 8:30am
Output of step 2
# A tibble: 5 x 2
name value
<chr> <chr>
1 Presenter Ronald Beginer 2
2 Title Exploiting
3 Format Lecture
4 Session 2_mode
5 Date and time 03-14-2009 8:30am
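One thing to watch in step 2: a value that itself contains ": " (a title with a subtitle, say) is split into more than two pieces. The sketch below uses two made-up header lines to compare the default behaviour of `separate` with `extra = "merge"`, which is one way to keep such a value intact.

```r
library(tibble)
library(tidyr)

# Two made-up header lines; the second has ": " inside its value
header = tibble(txt = c("Format: Lecture", "Title: Networks: A Primer"))

# Default behaviour: the extra piece is dropped (with a warning)
dropped = suppressWarnings(
  separate(header, txt, c("name", "value"), sep = ": ")
)

# extra = "merge" splits only at the first ": " and keeps the rest together
merged = separate(header, txt, c("name", "value"), sep = ": ", extra = "merge")

dropped$value[2]
merged$value[2]
```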
Now note that for step 4 we have to prepare a separate tibble containing exactly six keywords. That tibble is created in step 3.
Output of step 3
# A tibble: 6 x 2
name value
<chr> <chr>
1 Keywords1 Method
2 Keywords2 Two-Mode Data
3 Keywords3 QCA
4 Keywords4 Method
5 Keywords5 Two-Mode Data
6 Keywords6 QCA, Method, Two-Mode Data, QCA
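The "exactly six" part comes from two details of step 3: `str_split` with a limit of 6 puts any remaining text into the sixth piece, and indexing with `1:6` pads a shorter result with NA. A small sketch of that logic, where `splitKeywords` is a hypothetical helper name and the input lines are examples:

```r
library(stringr)

# Hypothetical helper mirroring step 3: always returns n keyword slots
splitKeywords = function(line, n = 6){
  # The greedy first group means this captures the text after the last ": "
  keywords = str_match(line, "(^.*): (.*)")[, 3]
  # At most n pieces; the nth piece keeps whatever is left over
  pieces = str_split(keywords, ", ", n)[[1]]
  # Indexing past the end of the vector pads with NA
  pieces[1:n]
}

short = splitKeywords("Keywords: Method, Two-Mode Data, QCA")
long  = splitKeywords("Keywords: a, b, c, d, e, f, g, h")
```

Here `short` has three keywords followed by three NA values, and the last element of `long` is "f, g, h", matching the Keywords6 row above.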
Output of step 4
# A tibble: 11 x 2
name value
<chr> <chr>
1 Presenter Ronald Beginer 2
2 Title Exploiting
3 Format Lecture
4 Session 2_mode
5 Date and time 03-14-2009 8:30am
6 Keywords1 Method
7 Keywords2 Two-Mode Data
8 Keywords3 QCA
9 Keywords4 Method
10 Keywords5 Two-Mode Data
11 Keywords6 QCA, Method, Two-Mode Data, QCA
Similarly, for step 6 we need to create a separate tibble and append it to the rest. We create that tibble in step 5.
Output of step 5
# A tibble: 1 x 2
name value
<chr> <chr>
1 Abstract An innovative ...bla bla bla and other. An innovative ...bla bla bla and other.
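Step 5 is a plain `paste` with `collapse = " "`. The sketch below, using a made-up character vector in place of `lines$V1`, also shows an edge case that is an assumption on my part rather than something from the question: if a file had no abstract at all, `8:length(x)` would count down instead of being empty, and `tail(x, -7)` is one way to avoid that.

```r
# Made-up file contents: five header lines, keywords, a blank line, two abstract lines
lines = c("h1", "h2", "h3", "h4", "h5", "Keywords: x", "",
          "First abstract line.", "Second abstract line.")

# Same idea as step 5: join everything from line 8 onward with single spaces
abstract = paste(lines[8:length(lines)], collapse = " ")

# A file with no abstract: 8:7 counts down, so the index picks line 8 (NA) and line 7
noAbstract = lines[1:7]
counted = paste(noAbstract[8:length(noAbstract)], collapse = " ")

# tail(x, -7) drops the first seven elements and returns an empty vector here
safe = paste(tail(noAbstract, -7), collapse = " ")
```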
Output of step 6
# A tibble: 12 x 2
name value
<chr> <chr>
1 Presenter Ronald Beginer 2
2 Title Exploiting
3 Format Lecture
4 Session 2_mode
5 Date and time 03-14-2009 8:30am
6 Keywords1 Method
7 Keywords2 Two-Mode Data
8 Keywords3 QCA
9 Keywords4 Method
10 Keywords5 Two-Mode Data
11 Keywords6 QCA, Method, Two-Mode Data, QCA
12 Abstract An innovative ...bla bla bla and other. An innovative ...bla bla bla and other.
In the last step, we pivot the tibble into wide format, so that each file becomes a single row.
Output of step 7
# A tibble: 1 x 12
Presenter Title Format Session `Date and time` Keywords1 Keywords2 Keywords3 Keywords4 Keywords5 Keywords6 Abstract
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Ronald Beginer 2 Exploiting Lecture 2_mode 03-14-2009 8:30am Method Two-Mode Data QCA Method Two-Mode Data QCA, Method, Two-Mode Data, QCA An innovative ...bla bla bla a~
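The reshaping in step 7 can be tried on its own with a smaller name/value tibble. This is only a sketch with three made-up rows; it spells out `names_from` and `values_from` so the call does not rely on argument position.

```r
library(tibble)
library(tidyr)

# A shortened name/value table in the same shape as the output of step 6
long = tibble(
  name  = c("Presenter", "Title", "Keywords1"),
  value = c("Ronald Beginer 2", "Exploiting", "Method")
)

# Each distinct name becomes a column; the values fill a single row
wide = pivot_wider(long, names_from = name, values_from = value)
wide
```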
The rest is simple. Bind together the tibbles obtained this way for each file and save the result to a csv file.
The csv file
Presenter,Title,Format,Session,Date and time,Keywords1,Keywords2,Keywords3,Keywords4,Keywords5,Keywords6,Abstract
Ronald Beginer 1,Exploiting,Lecture,2_mode,03-14-2009 8:30am,Method,Two-Mode Data,QCA,NA,NA,NA,An innovative ...bla bla bla and other.
Ronald Beginer 2,Exploiting,Lecture,2_mode,03-14-2009 8:30am,Method,Two-Mode Data,QCA,Method,Two-Mode Data,"QCA, Method, Two-Mode Data, QCA",An innovative ...bla bla bla and other. An innovative ...bla bla bla and other.
Ronald Beginer 3,Exploiting,Lecture,2_mode,03-14-2009 8:30am,Method,Two-Mode Data,QCA,NA,NA,NA,An innovative ...bla bla bla and other. An innovative ...bla bla bla and other. An innovative ...bla bla bla and other. An innovative ...bla bla bla and other. An innovative ...bla bla bla and other.
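As a side note, the per-file tibbles do not have to be appended one at a time: `bind_rows` also accepts a list of tibbles. A minimal sketch, where `fakeReadFile` is a hypothetical stand-in for readFile so the example runs without any files on disk:

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for readFile(): returns a one-row tibble per "file"
fakeReadFile = function(id){
  tibble(Presenter = paste("Ronald Beginer", id), Title = "Exploiting")
}

# Build a list of one-row tibbles, then stack them in a single call
df = bind_rows(lapply(1:3, fakeReadFile))
df
```

With the real function this would read `bind_rows(lapply(files, readFile))`, which should give the same rows as the loop in the script.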
P.S.
When writing a question on StackOverflow, never post your data as an image!!