How can I parse this data using XML? (answers)

【Question title】: How can I parse this data using XML?
【Posted】: 2016-11-15 20:32:11
【Question description】:

I have a dataset that can be downloaded from http://mips.helmholtz-muenchen.de/proj/ppi/. At the bottom of that page it says "you can get the complete dataset".

I then tried the XML package:

library(XML)
doc <- xmlTreeParse("path to/allppis.xml", useInternalNodes = TRUE)
root <- xmlRoot(doc)

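But the result appears to be empty.

What do I want?

When I open the allppis.xml file downloaded from that site, I want to extract specific lines, those that start with <fullName> and end with </fullName>, into a txt file.

For example, if I open the file I can see this: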

<fullName>S100A8;CAGA;MRP8; calgranulin A (migration inhibitory factor-related protein 8)</fullName>

From that, I want this:

Proteins                   description 
S100A8;CAGA;MRP8     calgranulin A (migration inhibitory factor-related protein 8)

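【Comments on the question】:

You need to download and unzip the file before you can parse it. This shows a way. So try temp <- tempfile() ; download.file("http://mips.helmholtz-muenchen.de/proj/ppi/data/mppi.gz", temp) ; unz(temp, "allppis.xml"), then doc <- xmlTreeParse(temp, useInternal = TRUE) ; root <- xmlRoot(doc)
This package may also be useful: bioconductor.org/packages/release/bioc/html/RpsiXML.html
@user20650 Now when I just type doc I can see the XML in it, but where is it saved? Can you help me get the exact output I want?
OK, so you can download it now. I don't know how to parse this, hence only the comment ^^ to help with the download. Have you checked whether RpsiXML has a schema for it?
@user20650 Yes, I am familiar with that package. Most of these packages are written for publication and I can't get into them. Still, I really appreciate your help, and I'll wait to see whether someone can help me with the parsing.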
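
The read step from the comments can be tried offline. The following is a minimal sketch, not the commenter's method: it assumes the download is saved under a name ending in .gz, and it uses a tiny hand-written file as a hypothetical stand-in for the real archive. xml2's read_xml() documents that local paths ending in .gz are uncompressed automatically. Note that base R's unz() targets zip archives, while gzfile() handles gzip.

```r
library(xml2)

# Hypothetical stand-in for the downloaded archive: a tiny gzip-compressed XML file.
gz_path <- tempfile(fileext = ".xml.gz")
con <- gzfile(gz_path, open = "wt")
writeLines(
  "<entrySet><fullName>TRP3; calcium influx channel protein</fullName></entrySet>",
  con
)
close(con)

# read_xml() decompresses local paths ending in .gz by itself.
doc <- read_xml(gz_path)

# The stand-in file declares no namespace, so a plain XPath name is enough here.
full_names <- xml_text(xml_find_all(doc, ".//fullName"))
print(full_names)
```

For the real file, the same idea would be to pass a destination path ending in .gz to download.file() and hand that path to read_xml().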

Tags: r xml

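【Solution 1】:

I think you want something like this (the question isn't very clear, IMO). I also think the main problem is the default namespace, which is definitely a pain: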

library(xml2)
library(purrr)
library(dplyr)
library(stringi)

doc <- read_xml("allppis.xml")

ns <- xml_ns_rename(xml_ns(doc), d1="x")

xml_find_all(doc, ".//x:proteinInteractor/x:names/x:fullName", ns) %>% 
  xml_text() %>% 
  stri_split_fixed("; ", n=2, simplify=TRUE) %>% 
  as.data.frame(stringsAsFactors = FALSE) %>%  # as_data_frame() is deprecated in current tibble/dplyr
  as_tibble() %>% 
  setNames(c("Proteins", "Description")) %>% 
  mutate(Proteins=trimws(Proteins),
         Description=trimws(Description))
## # A tibble: 3,628 × 2
##             Proteins                                                    Description
##                <chr>                                                          <chr>
## 1   S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
## 2  S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 3  S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 4   S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
## 5   S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
## 6  S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 7  S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 8   S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
## 9               TRP3                                 calcium influx channel protein
## 10            IP3R-3                  inositol 1,4,5-trisphosphate receptor, type 3
## # ... with 3,618 more rows

You'll need to clean it up a bit (View() the resulting data frame to see what I mean).
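
To see why the default namespace matters, here is a small sketch on a made-up fragment. The namespace URI below is a placeholder, not the one declared in allppis.xml. An unprefixed XPath name only matches elements in no namespace, which is one way a query can come back empty even though the document is full of data.

```r
library(xml2)

# Made-up fragment; the namespace URI is a placeholder (assumption).
xml_string <- '<entrySet xmlns="http://example.org/ppi">
  <proteinInteractor>
    <names><fullName>TRP3; calcium influx channel protein</fullName></names>
  </proteinInteractor>
</entrySet>'
doc <- read_xml(xml_string)

# Unprefixed name: matches only elements without a namespace, so nothing is found.
no_prefix <- xml_find_all(doc, ".//fullName")

# xml2 registers the default namespace under the prefix "d1"; use it in the XPath.
with_prefix <- xml_find_all(doc, ".//d1:fullName", xml_ns(doc))

# Alternative: strip the default namespace (modifies doc in place), then use plain names.
xml_ns_strip(doc)
after_strip <- xml_find_all(doc, ".//fullName")

print(c(length(no_prefix), length(with_prefix), length(after_strip)))
```

Renaming d1 to x with xml_ns_rename(), as in the answer, only changes the prefix you type in the XPath. Stripping the namespace is simpler to write but alters the parsed document.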

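【Comments】:

Thank you very much! I have a few concerns. 1: sometimes I cannot see the protein ID, only a description. Would it be possible to have the `db=` and `id=` for each protein in another column? I am definitely accepting your answer. Thanks again.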
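
One possible way to approach that follow-up is sketched below, under loud assumptions: the xref/primaryRef element layout, the db and id values, and the namespace URI are all made up for illustration and need checking against the real file. The idea is to walk the interactors one by one with xml_find_first(), which yields NA when a child is absent, so every row stays aligned.

```r
library(xml2)

# Made-up fragment: element layout, attribute values and namespace URI are assumptions.
xml_string <- '<entrySet xmlns="http://example.org/ppi">
  <proteinInteractor>
    <names><fullName>S100A8;CAGA;MRP8; calgranulin A</fullName></names>
    <xref><primaryRef db="exampleDB" id="ID0001"/></xref>
  </proteinInteractor>
  <proteinInteractor>
    <names><fullName>calcium influx channel protein</fullName></names>
  </proteinInteractor>
</entrySet>'
doc <- read_xml(xml_string)
ns <- xml_ns_rename(xml_ns(doc), d1 = "x")

# One row per interactor; xml_find_first() keeps a placeholder where a child is missing.
interactors <- xml_find_all(doc, ".//x:proteinInteractor", ns)
full_name <- xml_text(xml_find_first(interactors, "./x:names/x:fullName", ns))
ref <- xml_find_first(interactors, "./x:xref/x:primaryRef", ns)

# Split on the first "; "; entries without it are treated as description only.
has_sep <- grepl("; ", full_name, fixed = TRUE)
proteins <- ifelse(has_sep, sub("; .*$", "", full_name), NA_character_)
description <- ifelse(has_sep, sub("^.*?; ", "", full_name, perl = TRUE), full_name)

result <- data.frame(
  Proteins = proteins,
  Description = description,
  db = xml_attr(ref, "db"),  # NA where no primaryRef was found
  id = xml_attr(ref, "id"),
  stringsAsFactors = FALSE
)
print(result)

# Write a tab-separated text file, as asked in the question (temporary path for the demo).
out_path <- tempfile(fileext = ".txt")
write.table(result, out_path, sep = "\t", quote = FALSE, row.names = FALSE)
```

Treating a name without "; " as a bare description is a guess about the data. It would misfile an entry that consists of protein IDs only.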