Joining several DBF files with R: answers

[Question title]: Joining several DBF files with R
[Posted]: 2015-02-01 09:50:33
[Question]:

I am no stranger to R, and I have been trying to merge several data.tables read from DBF files.

The files are structured as follows:

Survey
| ID | ITEMID | ITEMQTY |          DESCCODE | PROVIDERID |
|----|--------|---------|-------------------|------------|
|  1 |      1 |      50 |        sku:247504 |          1 |
|  1 |      2 |       3 | Item discontinued |          1 |
|  1 |      3 |     400 | Item discontinued |          3 |
|  2 |      1 |     500 |      Storage item |          2 |
|  3 |      1 |     500 |    something else |          3 |

Item
| ID |               ITEMNAME | ITEMPRICE |
|----|------------------------|-----------|
|  1 |            Kolashampan |         4 |
|  2 | Arepas by Dr. Colombia |         5 |
|  3 |               Biscotti |         2 |

Provider
| ID |       PROVIDERNAME | LOCATIONID |            PRIMARYCONTACT |
|----|--------------------|------------|---------------------------|
|  1 | Salvadoran Imports |       9056 |             Dra. Castillo |
|  2 |   Rolo Importadora |         46 |              Dra. Coquita |
|  3 |       Il Italianni |         64 | Il Ittalianni call center |

What I want to achieve is a basic inner join of the 3 files. In SQL the result looks like this:

| ID |          DESCCODE |               ITEMNAME | TOTAMOUNT |       PROVIDERNAME |
|----|-------------------|------------------------|-----------|--------------------|
|  1 | Item discontinued | Arepas by Dr. Colombia |        15 | Salvadoran Imports |
|  1 |        sku:247504 |            Kolashampan |       200 | Salvadoran Imports |
|  1 | Item discontinued |               Biscotti |       800 |       Il Italianni |
|  2 |      Storage item |            Kolashampan |      2000 |   Rolo Importadora |
|  3 |    something else |            Kolashampan |      2000 |       Il Italianni |

It is obtained with this query:

select
    s.id,
    s.descCode,
    i.itemname,
    (i.itemprice*s.itemqty) as totAmount,
    p.providername
from
    survey s,
    item i,
    provider p
where
    s.itemid = i.id
    and s.providerid = p.id
order by s.id

This is my code:

library("shapefiles")
library("data.table")
library("reshape2")
survey <- read.dbf( file.choose(), header="true" )
survey$id <- as.factor( survey$numeric )
print(survey$header$num.records)

item <- read.dbf( file.choose(), header="true" )
item$id <- as.factor( item$numeric )
print(item$header$num.records)

provider <- read.dbf( file.choose(), header="true" )
provider$id <- as.factor( provider$numeric )
print(provider$header$num.records)

setDT(survey, giveNames=FALSE, keep.rownames=FALSE)
setkey(survey, survey$id)

setDT(item, giveNames=FALSE, keep.rownames=FALSE)
setkey(item, item$id)

setDT(provider, giveNames=FALSE, keep.rownames=FALSE)
setkey(provider, provider$id)

merge(survey,item,by="itemid")
merge(survey,provider,by="providerid")

write.dbf(survey[, id, desccode, itemname, itemqty*itemprice, provider, with = FALSE], "joinedFile.dbf")

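From here on, these are the problems I run into:

With setDT, I get this error: All elements in argument 'x' to 'setDT' must be of the same length
Using dtSurvey
However, even with the last two approaches, I always get an error when trying to use setkey: x is not a data.table
As an additional question, can the merge be done with something like survey[item]?

Thanks a lot.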

Edit

These are the dput outputs of the files.

Survey

structure(list(dbf = structure(list(N_ID_ = structure(c(1L, 1L, 
1L, 2L, 3L), .Label = c("1", "2", "3"), class = "factor"), N_ITEMID_ = structure(c(1L, 
2L, 3L, 1L, 1L), .Label = c("1", "2", "3"), class = "factor"), 
    N_ITEMQTY_ = structure(c(3L, 1L, 2L, 4L, 4L), .Label = c("3", 
    "400", "50", "500"), class = "factor"), N_________ = structure(c(2L, 
    1L, 1L, 4L, 3L), .Label = c("Item discontinued", "sku:247504", 
    "something else", "Storage item"), class = "factor"), N_PROVIDER = structure(c(1L, 
    1L, 3L, 2L, 3L), .Label = c("1", "2", "3"), class = "factor")), .Names = c("N_ID_", 
"N_ITEMID_", "N_ITEMQTY_", "N_________", "N_PROVIDER"), row.names = c(NA, 
-5L), class = "data.frame", data_types = c("C", "C", "C", "C", 
"C")), header = structure(list(file.version = 3L, file.year = 14L, 
    file.month = 12L, file.day = 3L, num.records = 5L, header.length = 193L, 
    record.length = 53L, fields = structure(list(NAME = structure(c(2L, 
    3L, 4L, 1L, 5L), .Label = c("N_________", "N_ID_", "N_ITEMID_", 
    "N_ITEMQTY_", "N_PROVIDER"), class = "factor"), TYPE = structure(c(1L, 
    1L, 1L, 1L, 1L), .Label = "C", class = "factor"), LENGTH = c(5, 
    8, 9, 19, 12), DECIMAL = c(0L, 0L, 0L, 0L, 0L)), .Names = c("NAME", 
    "TYPE", "LENGTH", "DECIMAL"), row.names = c(NA, -5L), class = "data.frame")), .Names = c("file.version", 
"file.year", "file.month", "file.day", "num.records", "header.length", 
"record.length", "fields")), id = structure(integer(0), .Label = character(0), class = "factor")), .Names = c("dbf", 
"header", "id"))

Item

structure(list(dbf = structure(list(N_ID_ = structure(1:3, .Label = c("1", 
"2", "3"), class = "factor"), N_________ = structure(c(3L, 1L, 
2L), .Label = c("Arepas by Dr. Colombia", "Biscotti", "Kolashampan"
), class = "factor"), N_ITEMPRIC = structure(c(2L, 3L, 1L), .Label = c("2", 
"4", "5"), class = "factor")), .Names = c("N_ID_", "N_________", 
"N_ITEMPRIC"), row.names = c(NA, -3L), class = "data.frame", data_types = c("C", 
"C", "C")), header = structure(list(file.version = 3L, file.year = 14L, 
    file.month = 12L, file.day = 3L, num.records = 3L, header.length = 129L, 
    record.length = 40L, fields = structure(list(NAME = structure(c(2L, 
    1L, 3L), .Label = c("N_________", "N_ID_", "N_ITEMPRIC"), class = "factor"), 
        TYPE = structure(c(1L, 1L, 1L), .Label = "C", class = "factor"), 
        LENGTH = c(5, 24, 11), DECIMAL = c(0L, 0L, 0L)), .Names = c("NAME", 
    "TYPE", "LENGTH", "DECIMAL"), row.names = c(NA, -3L), class = "data.frame")), .Names = c("file.version", 
"file.year", "file.month", "file.day", "num.records", "header.length", 
"record.length", "fields")), id = structure(integer(0), .Label = character(0), class = "factor")), .Names = c("dbf", 
"header", "id"))

Provider

structure(list(dbf = structure(list(N_ID_ = structure(1:3, .Label = c("1", 
"2", "3"), class = "factor"), N_______PR = structure(c(3L, 2L, 
1L), .Label = c("Il Italianni", "Rolo Importadora", "Salvadoran Imports"
), class = "factor"), N_LOCATION = structure(c(3L, 1L, 2L), .Label = c("46", 
"64", "9056"), class = "factor"), N_________ = structure(1:3, .Label = c("Dra. Castillo", 
"Dra. Coquita", "Il Ittalianni call center"), class = "factor")), .Names = c("N_ID_", 
"N_______PR", "N_LOCATION", "N_________"), row.names = c(NA, 
-3L), class = "data.frame", data_types = c("C", "C", "C", "C"
)), header = structure(list(file.version = 3L, file.year = 14L, 
    file.month = 12L, file.day = 3L, num.records = 3L, header.length = 161L, 
    record.length = 64L, fields = structure(list(NAME = structure(c(3L, 
    2L, 4L, 1L), .Label = c("N_________", "N_______PR", "N_ID_", 
    "N_LOCATION"), class = "factor"), TYPE = structure(c(1L, 
    1L, 1L, 1L), .Label = "C", class = "factor"), LENGTH = c(5, 
    20, 12, 27), DECIMAL = c(0L, 0L, 0L, 0L)), .Names = c("NAME", 
    "TYPE", "LENGTH", "DECIMAL"), row.names = c(NA, -4L), class = "data.frame")), .Names = c("file.version", 
"file.year", "file.month", "file.day", "num.records", "header.length", 
"record.length", "fields")), id = structure(integer(0), .Label = character(0), class = "factor")), .Names = c("dbf", 
"header", "id"))

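[Comments]:

Please dput your sample data.
To be clear: after calling read.dbf(...), you have survey, item and provider as data frames. You need to post the output of dput(survey) etc. in your question. If the data is too large, post the output of dput(head(survey,20)). Posting the SQL output is completely useless.
Also, when converting the ID to a factor you reference columns such as survey$numeric, but there is no such column?
Sorry, I forgot to comment that I have already added the dputs (._.)

Tags: r merge dataframe data.table

[Solution 1]:

Basically, after all this time I learned more R, dplyr in particular, and that let me do this: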

library(dplyr)
# read.dbf() from shapefiles returns a list with $dbf and $header, as used below;
# foreign::read.dbf() has no header argument and returns a plain data frame.
library(shapefiles)

survey <- read.dbf(file.choose(), header = TRUE)
item <- read.dbf(file.choose(), header = TRUE)
provider <- read.dbf(file.choose(), header = TRUE)

# Clean the data frame names and discard everything but the dbf element
names(survey$dbf) <- c("ID", "ITEMID", "ITEMQTY", "DESCCODE", "PROVIDERID")
survey <- survey$dbf

names(item$dbf) <- c("ID", "ITEMNAME", "ITEMPRICE")
item <- item$dbf

names(provider$dbf) <- c("ID", "PROVIDERNAME", "LOCATIONID", "PRIMARYCONTACT")
provider <- provider$dbf

survey %>%
    inner_join(item, by = c("ITEMID" = "ID")) %>% # join item df
    inner_join(provider, by = c("PROVIDERID" = "ID")) %>% # join provider df
    mutate(ITEMQTY = as.character(ITEMQTY), ITEMPRICE = as.character(ITEMPRICE)) %>% # convert from factor to character
    mutate(TOT_AMOUNT = as.numeric(ITEMQTY) * as.numeric(ITEMPRICE)) %>% # create new column
    select(ID, DESCCODE, ITEMNAME, TOT_AMOUNT, PROVIDERNAME) %>% # select the fields I need
    arrange(ID, DESCCODE) # sort data
# This output can be assigned to a new variable.
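
The question also asked whether the join can be written in the survey[item] style of data.table. Judging from the dput output, the object returned by read.dbf() is a list with elements dbf, header and id of different lengths, which would explain the setDT error; converting only the dbf element (for example with as.data.table(survey$dbf)) should avoid it. The following is a minimal sketch rather than a tested solution for the real files. It assumes the tables are rebuilt inline from the values in the question, that quantities and prices are already numeric instead of factors, and that the ID columns of item and provider are renamed to ITEMID and PROVIDERID so the join columns share a name.

```r
library(data.table)

# Inline stand-ins for survey$dbf, item$dbf and provider$dbf.
# Assumption: numeric columns are already numeric (the real files hold factors).
survey <- data.table(
  ID         = c(1L, 1L, 1L, 2L, 3L),
  ITEMID     = c(1L, 2L, 3L, 1L, 1L),
  ITEMQTY    = c(50, 3, 400, 500, 500),
  DESCCODE   = c("sku:247504", "Item discontinued", "Item discontinued",
                 "Storage item", "something else"),
  PROVIDERID = c(1L, 1L, 3L, 2L, 3L)
)
# Assumption: ID is renamed to ITEMID / PROVIDERID to match the survey columns.
item <- data.table(
  ITEMID    = 1:3,
  ITEMNAME  = c("Kolashampan", "Arepas by Dr. Colombia", "Biscotti"),
  ITEMPRICE = c(4, 5, 2)
)
provider <- data.table(
  PROVIDERID   = 1:3,
  PROVIDERNAME = c("Salvadoran Imports", "Rolo Importadora", "Il Italianni")
)

# X[Y, on = ...] looks up each row of Y in X; nomatch = NULL keeps only
# matching rows, which makes it an inner join. With on=, setkey() is optional.
joined <- item[survey, on = "ITEMID", nomatch = NULL]
joined <- provider[joined, on = "PROVIDERID", nomatch = NULL]

# Add the computed column by reference, then sort and keep the needed fields.
joined[, TOTAMOUNT := ITEMQTY * ITEMPRICE]
result <- joined[order(ID, DESCCODE),
                 .(ID, DESCCODE, ITEMNAME, TOTAMOUNT, PROVIDERNAME)]
print(result)
```

If keys are preferred over on=, setkey() expects bare column names, as in setkey(item, ITEMID), not an expression such as item$id.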

[Comments]: