【问题标题】:Memory leak when using package XML on Windows在 Windows 上使用包 XML 时出现内存泄漏
【发布时间】:2014-07-04 23:20:02
【问题描述】:

阅读了Memory leaks parsing XML in r(包括链接的帖子)和this R Help 上的帖子,并且考虑到又过了一段时间,我仍然认为这是一个未解决的问题,值得关注,因为XML 包被广泛使用在整个 R 宇宙中。

因此,请将此视为后续帖子和/或参考,并希望能提供信息丰富但简明扼要的问题说明

问题

解析 XML/HTML 文档以便之后可以使用XPath 搜索它们需要内部使用 C 指针 (AFAIU)。而且似乎至少在 MS Windows 上(我在 Windows 8.1、64 位上运行)这些引用没有被垃圾收集器正确识别。因此消耗的内存没有被正确释放,这会导致 R 进程在某些时候冻结。

迄今为止的主要发现

在我看来,XML:free 和/或 gc 在通过 xmlParsehtmlParse 解析 XML/HTML 文档并随后处理它们时识别/不识别所有所涉及的内存与xpathApply 或类似:

OS 任务 (Rterm.exe) 报告的内存使用量正在增加非常快,而报告的 R 进程的内存为“从 R 内部看到” (函数memory.size)适度增加(相比之下,即)。请参阅下面的大量解析周期前后的列表元素 mem_rmem_osratio

总而言之,加上推荐的所有内容(freermgc),当调用xmlParse 等时,内存使用量仍然总是增加。只是多少的问题。所以恕我直言,肯定还有一些东西不能正常工作。


插图

我从 Duncan's Omegahat git repository 借用了分析代码。

一些准备工作:

Sys.setenv("LANGUAGE"="en")   
require("compiler")
require("XML")

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] compiler  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] XML_3.98-1.1

我们需要的功能:

getTaskMemoryByPid <- cmpfun(function(
    pid=Sys.getpid()
) {
    cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
    mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
    mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
    mem
}, options=list(suppressAll=TRUE))  

memoryLeak <- cmpfun(function(
    x=system.file("exampleData", "mtcars.xml", package="XML"),
    n=10000,
    use_text=FALSE,
    xpath=FALSE,
    free_doc=FALSE,
    clean_up=FALSE,
    detailed=FALSE
) {
    if(use_text) {
        x <- readLines(x)
    }
    ## Before //
    mem_os  <- getTaskMemoryByPid()
    mem_r   <- memory.size()
    prof_1  <- memory.profile()
    mem_before <- list(mem_r=mem_r,
        mem_os=mem_os, ratio=mem_os/mem_r)

    ## Per run //
    mem_perrun <- lapply(1:n, function(ii) {
        doc <- xmlParse(x, asText=use_text)
        if (xpath) {
            res <- xpathApply(doc=doc, path="/blah", fun=xmlValue)
            rm(res)
        }
        if (free_doc) {
            free(doc)
        }
        rm(doc)
        out <- NULL
        if (detailed) {
            out <- list(
                profile=memory.profile(),
                size=memory.size()
            )
        } 
        out
    })
    has_perrun <- any(sapply(mem_perrun, length) > 0)
    if (!has_perrun) {
        mem_perrun <- NULL
    } 

    ## Garbage collect //
    mem_gc <- NULL
    if(clean_up) {
        gc()
        tmp <- gc()
        mem_gc <- list(gc_mb=tmp["Ncells", "(Mb)"])
    }

    ## After //
    mem_os  <- getTaskMemoryByPid()
    mem_r   <- memory.size()
    prof_2  <- memory.profile()
    mem_after <- list(mem_r=mem_r,
        mem_os=mem_os, ratio=mem_os/mem_r)
    list(
        before=mem_before, 
        perrun=mem_perrun, 
        gc=mem_gc, 
        after=mem_after, 
        comparison_r=data.frame(
            before=prof_1, 
            after=prof_2, 
            increase=round((prof_2/prof_1)-1, 4)
        ),
        increase_r=(mem_after$mem_r/mem_before$mem_r)-1,
        increase_os=(mem_after$mem_os/mem_before$mem_os)-1
    )
}, options=list(suppressAll=TRUE))  

结果

场景 1

速览:垃圾收集已启用,XML 文档被解析 n 次,但通过 xpathApply 搜索

注意 OS 内存与 R 内存的比率:

之前:1.364832

之后:1.322702

res <- memoryLeak(clean_up=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-1.rdata"))

> res
$before
$before$mem_r
[1] 37.42

$before$mem_os
[1] 51.072

$before$ratio
[1] 1.364832


$perrun
NULL

$gc
$gc$gc_mb
[1] 45


$after
$after$mem_r
[1] 63.21

$after$mem_os
[1] 83.608

$after$ratio
[1] 1.322702


$comparison_r
            before  after increase
NULL             1      1   0.0000
symbol        7387   7392   0.0007
pairlist    190383 390633   1.0518
closure       5077  55085   9.8499
environment   1032  51032  48.4496
promise       5226 105226  19.1351
language     54675  54791   0.0021
special         44     44   0.0000
builtin        648    648   0.0000
char          8746   8763   0.0019
logical       9081   9084   0.0003
integer      22804  22807   0.0001
double        2773   2783   0.0036
complex          1      1   0.0000
character    44522  94569   1.1241
...              0      0      NaN
any              0      0      NaN
list         19946  19951   0.0003
expression       1      1   0.0000
bytecode     16049  16050   0.0001
externalptr   1487   1487   0.0000
weakref        391    391   0.0000
raw            392    392   0.0000
S4            1392   1392   0.0000

$increase_r
[1] 0.6892036

$increase_os
[1] 0.6370614

场景 2

简单的事实:垃圾收集已启用,free 被显式调用,XML 文档被解析 n 次,但没有通过 xpathApply 搜索。

注意 OS 内存与 R 内存的比率:

之前:1.315249

之后:1.222143

res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-2.rdata"))
> res

$before    
$before$mem_r
[1] 63.48

$before$mem_os
[1] 83.492

$before$ratio
[1] 1.315249


$perrun
NULL

$gc
$gc$gc_mb
[1] 69.3


$after
$after$mem_r
[1] 95.92

$after$mem_os
[1] 117.228

$after$ratio
[1] 1.222143


$comparison_r
            before  after increase
NULL             1      1   0.0000
symbol        7454   7454   0.0000
pairlist    392455 592466   0.5096
closure      55104 105104   0.9074
environment  51032 101032   0.9798
promise     105226 205226   0.9503
language     55592  55592   0.0000
special         44     44   0.0000
builtin        648    648   0.0000
char          8847   8848   0.0001
logical       9141   9141   0.0000
integer      23109  23111   0.0001
double        2802   2807   0.0018
complex          1      1   0.0000
character    94775 144781   0.5276
...              0      0      NaN
any              0      0      NaN
list         20174  20177   0.0001
expression       1      1   0.0000
bytecode     16265  16265   0.0000
externalptr   1488   1487  -0.0007
weakref        392    391  -0.0026
raw            393    392  -0.0025
S4            1392   1392   0.0000

$increase_r
[1] 0.5110271

$increase_os
[1] 0.4040627

场景 3

简单的事实:垃圾收集已启用,free 被显式调用,XML 文档被解析n 次,并且每次都通过xpathApply 搜索

注意 OS 内存与 R 内存的比率:

之前:1.220429

之后:13.15629 (!)

res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, xpath=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-3.rdata"))
res
$before
$before$mem_r
[1] 95.94

$before$mem_os
[1] 117.088

$before$ratio
[1] 1.220429


$perrun
NULL

$gc
$gc$gc_mb
[1] 93.4


$after
$after$mem_r
[1] 124.64

$after$mem_os
[1] 1639.8

$after$ratio
[1] 13.15629


$comparison_r
            before  after increase
NULL             1      1   0.0000
symbol        7454   7460   0.0008
pairlist    592458 793042   0.3386
closure     105104 155110   0.4758
environment 101032 151032   0.4949
promise     205226 305226   0.4873
language     55592  55882   0.0052
special         44     44   0.0000
builtin        648    648   0.0000
char          8847   8867   0.0023
logical       9142   9162   0.0022
integer      23109  23112   0.0001
double        2802   2832   0.0107
complex          1      1   0.0000
character   144775 194819   0.3457
...              0      0      NaN
any              0      0      NaN
list         20174  20177   0.0001
expression       1      1   0.0000
bytecode     16265  16265   0.0000
externalptr   1488   1487  -0.0007
weakref        392    391  -0.0026
raw            393    392  -0.0025
S4            1392   1392   0.0000

$increase_r
[1] 0.2991453

$increase_os
[1] 13.00485

我也尝试了不同的版本。好吧,我试过试试 ;-)

来自源代码,来自 omegahat.org

仅供参考:最新的 Rtools 3.1 已安装并包含在 Windows PATH 中(例如,从源代码安装 stringr 效果很好)。

> install.packages("XML", repos="http://www.omegahat.org/R", type="source")
trying URL 'http://www.omegahat.org/R/src/contrib/XML_3.98-1.tar.gz'
Content type 'application/x-gzip' length 1543387 bytes (1.5 Mb)
opened URL
downloaded 1.5 Mb

* installing *source* package 'XML' ...
Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
Warning: running command 'sh ./configure.win' had status 1
ERROR: configuration failed for package 'XML'
* removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
* restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'

The downloaded source packages are in
    'C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\downloaded_packages'
Warning messages:
1: running command '"R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" CMD INSTALL -l "R:\home\apps\lsqmapps\apps\r\R-3.1.0\library" C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/downloaded_packages/XML_3.98-1.tar.gz' had status 1 
2: In install.packages("XML", repos = "http://www.omegahat.org/R",  :
  installation of package 'XML' had non-zero exit status

Github

我没有遵循 github repo 上 README 中的建议,因为它指向的 this directory 仅包含版本 3.94-0tar.gz(而我们在 CRAN 上位于 3.98-1.1)。

尽管说 gihub 存储库不是标准的 R 包结构,但我还是用 install_github 尝试了它 - 并且失败了 ;-)

require("devtools")
> install_github(repo="XML", username="omegahat")
Installing github repo XML/master from omegahat
Downloading master.zip from https://github.com/omegahat/XML/archive/master.zip
Installing package from C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/master.zip
Installing XML
"R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" --vanilla CMD INSTALL  \
  "C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\devtools15c82d7c2b4c\XML-master"  \
  --library="R:/home/apps/lsqmapps/apps/r/R-3.1.0/library" --with-keep.source  \
  --install-tests 

* installing *source* package 'XML' ...
Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
Warning: running command 'sh ./configure.win' had status 1
ERROR: configuration failed for package 'XML'
* removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
* restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
Error: Command failed (1)

【问题讨论】:

  • 这不是 github 问题而不是 Stack Overflow 问题吗?至少作者看到它的机会更大。
  • 嗯,你说得有道理;-) 没想到那么远。但我已经就此联系过 Duncan Temple Lang。
  • 伟大的调查。如果问题得到确认和/或解决,请发布答案,看看这在哪里结束会很有趣。
  • @tonytonov 谢谢,伙计。当然,我会及时通知您!
  • @Rappster 您的调查是否取得了更多进展?我通过单步调试和调试源代码尝试了自己的测试,但在调试模式下我一直遇到错误,所以我没有走得太远。

标签: xml windows r parsing memory-leaks


【解决方案1】:

虽然它仍处于起步阶段(只有几个月大!),并且有一些怪癖,但 Hadley Wickham 编写了一个用于 XML 解析的库,xml2,可以在 Github 上找到 https://github.com/hadley/xml2 .它仅限于读取而不是写入 XML,但是对于解析 XML,我一直在试验,看起来它可以完成这项工作,而不会出现 xml 包的内存泄漏!它提供的功能包括:

  • read_xml() 读取 XML 文件
  • xml_children()获取一个节点的子节点
  • xml_text() 获取标签内的文本
  • xml_attrs() 获取节点属性和值的字符向量,可以使用as.list() 将其转换为命名列表

请注意,您仍然需要确保在处理完 XML 节点对象后rm(),并使用gc() 强制进行垃圾回收,但内存实际上确实会释放到 O/S (免责声明:仅在 Windows 7 上测试,但这似乎是最“内存泄漏”的平台)。

希望这对某人有所帮助!

【讨论】:

  • 太棒了!!!我在 Hadley 的一些 GitHub 问题/cmets 中偶然发现了xml2,并默默地希望这意味着他正在用 R 重新实现 XML 解析器。很高兴听到它实际上就是这样,并非常感谢指针!
【解决方案2】:

按照上面 Matthew Wise 对使用 xml2 的回答,我发现真正释放内存的函数是 xml_remove(),后跟 gc(),而不是 rm()

【讨论】:

  • 感谢您的跟进!
【解决方案3】:

自从我发布这个问题后,什么都没发生,所以我想我会再次尝试引起注意。

这是我的调查的略微更新版本

预赛

require("rvest")
require("XML")

功能

getTaskMemoryByPid <- function(
  pid = Sys.getpid()
) {
  cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
  mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
  mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
  mem
}  

getCurrentMemoryStatus <- function() {
  mem_os  <- getTaskMemoryByPid()
  mem_r   <- memory.size()
  prof_1  <- memory.profile()
  list(r = mem_r, os = mem_os, ratio = mem_os/mem_r)
}

memoryLeak <- function(
  x = system.file("exampleData", "mtcars.xml", package="XML"),
  n = 10000,
  use_text = FALSE,
  xpath = FALSE,
  free_doc = FALSE,
  clean_up = FALSE,
  detailed = FALSE,
  use_rvest = FALSE,
  user_agent = httr::user_agent("Mozilla/5.0")
) {
  if(use_text) {
    x <- readLines(x)
  }
  ## Before //
  prof_1  <- memory.profile()
  mem_before <- getCurrentMemoryStatus()

  ## Per run //
  mem_perrun <- lapply(1:n, function(ii) {
    doc <- if (!use_rvest) {
      xmlParse(x, asText = use_text)
    } else {
      if (file.exists(x)) {
      ## From disk //        
        rvest::html(x)  
      } else {
      ## From web //
        rvest::html_session(x, user_agent)  
      }
    }
    if (xpath) {
      res <- xpathApply(doc = doc, path = "/blah", fun = xmlValue)
      rm(res)
    }
    if (free_doc) {
      free(doc)
    }
    rm(doc)
    out <- NULL
    if (detailed) {
      out <- list(
        profile = memory.profile(),
        size = memory.size()
      )
    } 
    out
  })
  has_perrun <- any(sapply(mem_perrun, length) > 0)
  if (!has_perrun) {
    mem_perrun <- NULL
  } 

  ## Garbage collect //
  mem_gc <- NULL
  if(clean_up) {
    gc()
    tmp <- gc()
    mem_gc <- list(gc_mb = tmp["Ncells", "(Mb)"])
  }

  ## After //
  prof_2  <- memory.profile()
  mem_after <- getCurrentMemoryStatus()

  ## Return value //
  if (detailed) {
    list(
      before = mem_before, 
      perrun = mem_perrun, 
      gc = mem_gc, 
      after = mem_after, 
      comparison_r = data.frame(
        before = prof_1, 
        after = prof_2, 
        increase = round((prof_2/prof_1)-1, 4)
      ),
      increase_r = (mem_after$r/mem_before$r)-1,
      increase_os = (mem_after$os/mem_before$os)-1
    )
  } else {
    list(
      before_after = data.frame(
        r = c(mem_before$r, mem_after$r),
        os = c(mem_before$os, mem_after$os)
      ),
      increase_r = (mem_after$r/mem_before$r)-1,
      increase_os = (mem_after$os/mem_before$os)-1
    )
  }
}

请求任何内容之前的内存状态

getCurrentMemoryStatus()

生成额外的离线示例内容

s <- html_session("http://had.co.nz/")
tmp <- capture.output(httr::content(s$response))
write(tmp, file = "hadley.html")
# html("hadley.html")

s <- html_session(
  "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd",
  httr::user_agent("Mozilla/5.0"))
tmp <- capture.output(httr::content(s$response))
write(tmp, file = "amazon.html")
# html("amazon.html")

getCurrentMemoryStatus()

分析

################
## Mtcars.xml ##
################

res <- memoryLeak(n = 50000, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.1.rdata")
save(res, file = fpath)

res <- memoryLeak(n = 50000, clean_up = TRUE, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.2.rdata")
save(res, file = fpath)

res <- memoryLeak(n = 50000, clean_up = TRUE, free_doc = TRUE, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.3.rdata")
save(res, file = fpath)

###################
## www.had.co.nz ##
###################

## Offline //
res <- memoryLeak(x = "hadley.html", n = 50000, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "hadley.html", n = 50000, clean_up = TRUE, 
  detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "hadley.html", n = 50000, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.3.rdata")
save(res, file = fpath)

## Online (PLEASE USE "POLITE" VALUE FOR `n`!!!) //
.url <- "http://had.co.nz/"
res <- memoryLeak(x = .url, n = 50, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.3.rdata")
save(res, file = fpath)

####################
## www.amazon.com ##
####################

## Offline //
res <- memoryLeak(x = "amazon.html", n = 50000, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "amazon.html", n = 50000, clean_up = TRUE, 
  detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "amazon.html", n = 50000, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.3.rdata")
save(res, file = fpath)

## Online (PLEASE USE "POLITE" VALUE FOR `n`!!!) //
.url <- "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd"
res <- memoryLeak(x = .url, n = 50, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.3.rdata")
save(res, file = fpath)

【讨论】:

    猜你喜欢
    • 1970-01-01
    • 2013-11-09
    • 2013-11-02
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2011-04-24
    相关资源
    最近更新 更多