[Question title]: Exception handling RSelenium switchToFrame() Error: ElementNotVisible
[Posted]: 2019-06-02 17:22:49
[Question]:

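I am trying to implement exception handling in RSelenium and need help. Note that I have checked that scraping this page is permitted, using the robotstxt package.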

library(RSelenium)
library(XML)
library(janitor)
library(lubridate)
library(magrittr)
library(dplyr)

remDr <- remoteDriver(
  remoteServerAddr = "192.168.99.100",
  port = 4445L
)
remDr$open()

# Open TightVNC to follow along as RSelenium drives the browser

# navigate to the main page
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=690408156")

# look for table element
tableElem <- remDr$findElement(using = "id", "pageswitcher-content")

# switch to table
remDr$switchToFrame(tableElem)

# parse html for first table
doc <- htmlParse(remDr$getPageSource()[[1]])
table_tmp <- readHTMLTable(doc)
table_tmp <- table_tmp[[1]][-2, -1]
table_tmp <- table_tmp[-1, ]
colnames(table_tmp) <- c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")
table_tmp$city <- rep("montreal", nrow(table_tmp))
table_tmp$date <- rep(Sys.Date() - 5, nrow(table_tmp))

# switch back to the main/outer frame
remDr$switchToFrame(NULL)

# I found the elements I want to manipulate with Inspector mode in a browser
webElems <- remDr$findElements(using = "css", ".switcherItem") # Month/Year tabs at the bottom
arrowElems <- remDr$findElements(using = "css", ".switcherArrows") # Arrows to scroll left and right at the bottom

# Create NULL object to be used in for loop
big_df <- NULL
for (i in seq(length(webElems))) {

  # choose the i'th Month/Year tab
  webElem <- webElems[[i]]
  webElem$clickElement()

  tableElem <- remDr$findElement(using = "id", "pageswitcher-content") # The inner table frame

  # switch to table frame
  remDr$switchToFrame(tableElem)
  Sys.sleep(3)
  # parse html with XML package
  doc <- htmlParse(remDr$getPageSource()[[1]])
  Sys.sleep(3)
  # Extract data from HTML table in HTML document
  table_tmp <- readHTMLTable(doc)
  Sys.sleep(3)
  # put this into a format you can use
  table <- table_tmp[[1]][-2, -1]
  table <- table[-1, ]
  # rename the columns
  colnames(table) <- c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")
  # add city name to a column
  table$city <- rep("Montreal", nrow(table))

  # add the Month/Year this table was extracted from
  today <- Sys.Date() %m-% months(i + 1)
  table$date <- today

  # concatenate each table together
  big_df <- dplyr::bind_rows(big_df, table)

  # Switch back to main frame
  remDr$switchToFrame(NULL)

  ################################################
  ###   I should use exception handling here   ###
  ################################################


}

When the browser reaches the January 2018 table, it can no longer find the next webElems element and throws an error:

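Selenium message: Element is not currently visible and so may not be interacted with
Build info: version: '2.53.1', revision: 'a36b8b1', time: '2016-06-30 17:37:03'
System info: host: '617e51cbea11', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '4.14.79-boot2docker', java.version: '1.8.0_91'
Driver info: driver.version: unknown

Error: Summary: ElementNotVisible
Detail: An element command could not be completed because the element is not visible on the page.
class: org.openqa.selenium.ElementNotVisibleException
Further Details: run errorDetails method
In addition: There were 50 or more warnings (use warnings() to see the first 50)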

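So far I have handled this naively by including the code below at the end of the for loop. This is not a good idea for two reasons: 1) the scroll speed is hard to calibrate, and the approach would fail on other (longer) Google pages; 2) the for loop eventually fails when it tries to click the right arrow while already at the end, so it never downloads the last few tables.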

# click the right arrow to scroll right
arrowElem <- arrowElems[[1]]
# once you "click" the element it is "held down" - there is no way to "unclick" it to prevent it from scrolling too far
# I currently make sure it only scrolls a short distance - via Sys.sleep() before switching to outer frame
arrowElem$clickElement()
# give it "just enough time" to scroll right
Sys.sleep(0.3)
# switch back to outer frame to re-start the loop
remDr$switchToFrame(NULL)

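I would like to handle this exception by executing arrowElem$clickElement() whenever this error occurs. I think one would normally use tryCatch() for this; however, this is my first time learning exception handling. I thought I could wrap the remDr$switchToFrame(tableElem) part of the for loop in it, but that does not work: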

tryCatch({
        suppressMessages({
            remDr$switchToFrame(tableElem)
        })
    },
    error = function(e) {
        arrowElem <- arrowElems[[1]]
        arrowElem$clickElement()
        Sys.sleep(0.3)
        remDr$switchToFrame(NULL)
    }
)

[Discussion]:

    Tags: r exception-handling try-catch rselenium


    [Answer 1]:

    I gave it a try. When handling exceptions, I like to use something of this form:

    check <- try(expression, silent = TRUE) # or suppressMessages(try(expression, silent = TRUE))
    if (any(class(check) == "try-error")) {
      # do stuff
    }
    

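    I find it convenient to use, and it usually works fine, including with Selenium. The problem here, however, is that clicking the arrow once always takes me to the last visible sheet, skipping everything in between.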
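
    To illustrate how the try() form behaves inside a for loop, here is a minimal sketch that runs without a Selenium server. The function click_tab() is a hypothetical stand-in for a call such as webElem$clickElement(), and the rule that tabs beyond the third are "not visible" is an assumption made only for the example. It uses inherits(), which is equivalent to the class comparison above.

```r
# Hypothetical stand-in for a Selenium call such as webElem$clickElement():
# it raises an error for tabs that are "not visible" (here: index > 3).
click_tab <- function(i) {
  if (i > 3) stop("ElementNotVisible: tab ", i, " is not visible")
  paste("clicked tab", i)
}

clicked <- character(0)  # results of successful clicks
failed  <- integer(0)    # indices where the call raised an error

for (i in 1:5) {
  check <- try(click_tab(i), silent = TRUE)
  if (inherits(check, "try-error")) {
    # a recovery action would go here (e.g. scroll, retry, or skip)
    failed <- c(failed, i)
    next
  }
  clicked <- c(clicked, check)
}

print(clicked)
print(failed)
```

    The loop keeps going after a failure because the error is captured in check rather than propagated. What the recovery branch should do is a separate question.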


    Alternative solution

    So here is an alternative that solves the task of *scraping the tables*, rather than the exception handling in the sense above.

    Code

    # Alternative: ------------------------------------------------------------
    
    remDr <- RSelenium::remoteDriver(
      remoteServerAddr = "192.168.99.100",
      port = 4445L
    )
    remDr$open(silent = TRUE)
    # navigate to the main page
    # needs to be done once before looping, else content is not available
    remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=690408156")
    
    
    # I. Preliminaries:
    # 
    # 1. build the links to all spreadsheets
    # 2. define the function create_table
    # 
    # 1.
    # get page source
    html <- remDr$getPageSource()[[1]]
    # split it line by line
    html <- unlist(strsplit(html, '\n'))
    # restrict to script section
    script <- grep('^\\s*var\\s+gidMatch', html, value = TRUE)
    # split the script by semi-colon
    script <- unlist(strsplit(script, ';'))
    # retrieve information
    sheet_months <- gsub('.*name:.{2}(.*?).{1},.*', '\\1', 
                         grep('\\{name\\s*\\:', script, value = TRUE), perl = TRUE)
    sheet_gid <- gsub('.*gid:.{2}(.*?).{1},.*', '\\1', 
                      grep('gid\\s*:', script, value = TRUE), perl = TRUE)
    sheet_url <- paste0('https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pubhtml/sheet?headers%5Cx3dfalse&gid=',
                        sheet_gid)
    #
    # 2. 
    # table yielding function
    # just for readability in the loop
    create_table <- function (remDr) {
      # parse html with XML package
      doc <- XML::htmlParse(remDr$getPageSource()[[1]])
      Sys.sleep(3)
      # Extract data from HTML table in HTML document
      table_tmp <- XML::readHTMLTable(doc)
      Sys.sleep(3)
      # put this into a format you can use
      table <- table_tmp[[1]][-2, -1]
      # add a check-up for size mismatch
      table_fields <- as.character(t(table[1,]))
      if (! any(grepl("size", tolower(table_fields)))) {
        table <- table[-1, ]
        # rename the columns
        colnames(table) <- c("team_name", "start_time", "end_time", "total_time", "puzzels_solved")
        table$team_size <- NA_integer_
        table <- table[,c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")]
      } else {
        table <- table[-1, ]
        # rename the columns
        colnames(table) <- c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")
      }
      # add city name to a column
      table$city <- rep("Montreal", nrow(table))
      
      # add the Month/Year this table was extracted from
      today <- Sys.Date()
      lubridate::month(today) <- lubridate::month(today)+1
      table$date <- today
      
      # returns the table
      table
    }
    
    # II. Scraping the content
    # 
    # 1. selenium to generate the pages
    # 2. use create_table to extract the table
    # 
    big_df <- NULL
    for (k in seq_along(sheet_url)) {
      # 1. navigate to the page
      remDr$navigate(sheet_url[k])
      # remDr$screenshot(display = TRUE) maybe one wants to see progress
      table <- create_table(remDr)
      
      # 2. concatenate each table together
      big_df <- dplyr::bind_rows(big_df, table)
      
      # inform progress 
      cat(paste0('\nGathered table for: \t', sheet_months[k]))
    }
    
    # close session
    remDr$close()
    

    Result

    Here you can see the head and tail of big_df:

    head(big_df)
    #                             team_name team_size start_time end_time total_time puzzels_solved     city       date
    # 1                     Tortoise Tortes         5      19:00    20:05       1:05              5 Montreal 2019-02-20
    # 2 Mulholland Drives Over A Smelly Cat         4       7:25     8:48       1:23              5 Montreal 2019-02-20
    # 3                          B.R.O.O.K.         2       7:23     9:05       1:42              5 Montreal 2019-02-20
    # 4                            Motivate         4      18:53    20:37       1:44              5 Montreal 2019-02-20
    # 5                  Fighting Mongooses         3       6:31     8:20       1:49              5 Montreal 2019-02-20
    # 6                            B Lovers         3       6:40     8:30       1:50              5 Montreal 2019-02-20
    tail(big_df)
    #                             team_name team_size start_time end_time total_time puzzels_solved     city       date
    # 545                          Ale Mary      <NA>       6:05     7:53       1:48              5 Montreal 2019-02-20
    # 546                        B.R.O.O.K.      <NA>      18:45    20:37       1:52              5 Montreal 2019-02-20
    # 547                        Ridler Co.      <NA>       6:30     8:45       2:15              5 Montreal 2019-02-20
    # 548                        B.R.O.O.K.      <NA>      18:46    21:51       3:05              5 Montreal 2019-02-20
    # 549        Rotating Puzzle Collective      <NA>      18:45    21:51       3:06              5 Montreal 2019-02-20
    # 550                         Fire Team      <NA>      19:00    22:11       3:11              5 Montreal 2019-02-20
    

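    Short explanation

    1. To carry out this task, I first generate the links to all spreadsheets in the document. To do so:

      • navigate to the document once
      • extract the page source
      • use regex to extract the sheet months and URLs (via the gid number)
    2. Once that is done, loop over the URLs, collecting and binding the tables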
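
    The link-building step can be sketched on a toy input, without a browser. The script fragment below is hypothetical (the real page source may be laid out differently), the second gid is made up, and the base URL is a placeholder, so the patterns are simplified compared with the ones in the code above.

```r
# Hypothetical script fragment; the real page source may differ.
script_line <- paste0(
  'items.push({name: "June 2019", gid: "690408156"});',
  'items.push({name: "May 2019", gid: "123456789"});'
)

# Split into statements and keep those that describe a sheet.
pieces <- unlist(strsplit(script_line, ";", fixed = TRUE))
pieces <- grep("name\\s*:", pieces, value = TRUE)

# Capture the quoted value that follows each key.
sheet_months <- sub('.*name:\\s*"([^"]*)".*', "\\1", pieces)
sheet_gid    <- sub('.*gid:\\s*"([^"]*)".*', "\\1", pieces)

# Placeholder base URL (an assumption for this example only).
base_url  <- "https://example.com/pubhtml/sheet?gid="
sheet_url <- paste0(base_url, sheet_gid)

print(data.frame(sheet_months, sheet_gid, sheet_url))
```

    Each resulting URL can then be visited on its own, which avoids clicking tabs or arrows altogether.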

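    Also, for readability, I created a small function called create_table, which returns the table in the proper format. It is mostly the code that was inside the loop. I only added a safeguard for the number of columns (some spreadsheets have no team_size field; in those cases I set it to NA_integer_).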
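
    The column safeguard can be shown on toy data. The sketch below is a variation, not the exact check used in create_table: it compares column names instead of inspecting the header row, and harmonise_columns() is a hypothetical helper introduced only for this example.

```r
expected <- c("team_name", "team_size", "start_time", "end_time",
              "total_time", "puzzels_solved")

# Hypothetical helper: add any expected column that is missing (filled with NA)
# and return the columns in a fixed order so the tables can be row-bound.
harmonise_columns <- function(df, expected) {
  for (col in setdiff(expected, names(df))) {
    df[[col]] <- rep(NA_integer_, nrow(df))
  }
  df[, expected, drop = FALSE]
}

# Toy tables: one with team_size, one without.
with_size <- data.frame(team_name = "A", team_size = 4L, start_time = "19:00",
                        end_time = "20:05", total_time = "1:05",
                        puzzels_solved = 5L, stringsAsFactors = FALSE)
without_size <- with_size[, names(with_size) != "team_size"]

combined <- rbind(harmonise_columns(with_size, expected),
                  harmonise_columns(without_size, expected))
print(combined)
```

    Filling the missing column before binding keeps every table in the same shape, so the rows line up regardless of which sheet they came from.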

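    [Comments]:

    • Thanks a lot for the help, @nate. Would an exception-handling form like the one you posted above work within for-loops, or should the for-loop itself be wrapped in the try() function, after # do stuff? I guess that would not work in this case, because the page scrolls automatically once the right arrow is clicked, even when I try to stop it with: arrowElem$clickElement() Sys.sleep(0.3) remDr$switchToFrame(NULL)
    • @MatthewJ.Oldach In my case the exception handling technically worked (inside the loop). The problem is that arrowElem$clickElement always sends me to the end, no matter what Sys.sleep is, so the loop ran through roughly the first 12 webElems and then jumped straight to the last 12. Constructing a try works, but it would not achieve the main goal. That is why I went in a different direction. I was thinking that another workaround would be to find the position of the arrow and hover the mouse over it (rather than click) for a moment.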