[Question title]: Collecting table data from an .asp webpage with a for loop using RSelenium
[Posted]: 2014-03-11 00:54:03
[Question]:

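I am trying to collect village-level Indian census data from http://www.censusindia.gov.in/Census_Data_2001/Village_Directory/View_data/Village_Profile.aspx

Using RSelenium, I can navigate and select different values in the four drop-down menus with the following code: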

library(RSelenium)
library(selectr)
library(XML)  # provides readHTMLTable(), used at the end of this script

# Start the Selenium server
RSelenium::checkForServer()
RSelenium::startServer() # if needed
remDr <- remoteDriver$new()
remDr$open()
remDr$setImplicitWaitTimeout(3000)
remDr$navigate("http://www.censusindia.gov.in/Census_Data_2001/Village_Directory/View_data/Village_Profile.aspx")

#Finding and changing the menus
stateElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpState") 
stateElem$sendKeysToElement(list(key = "down_arrow", key = "enter"))

districtElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpDistrict") 
districtElem$sendKeysToElement(list(key = "enter"))
districtElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpDistrict") 
districtElem$sendKeysToElement(list(key = "down_arrow", key = "enter"))

subdistrictElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpSubDistrict") 
subdistrictElem$sendKeysToElement(list(key = "enter"))
subdistrictElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpSubDistrict")
subdistrictElem$sendKeysToElement(list(key = "down_arrow", key = "enter"))

villageElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpVillage") 
villageElem$sendKeysToElement(list(key = "enter"))
villageElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpVillage") 
villageElem$sendKeysToElement(list(key = "down_arrow", key = "enter"))

submitElem <- remDr$findElement(using = "name", "ctl00$Body_Content$btnSubmit") 
remDr$executeScript("arguments[0].click();", list(submitElem))

table <- readHTMLTable(remDr$getPageSource()[[1]], which=8)

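Now the bigger problem. I need to run this code for all villages in India (villages in selected states). Computation time is not an issue: I have a dedicated computer lab and plan to split the work across multiple machines.

However, I need to figure out how many districts there are in each state, how many subdistricts in each district, and how many villages in each subdistrict, so that I can run this through a nested for loop.

The framework I have in mind looks like this: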

    num_states <- "code grabbing this from the options list"
    for(r in 1:length(num_states)){
      num_dist <- "code grabbing number of districts from the options list"
      stateElem_code_block[r]

      for(k in 1:length(num_dist)){
        num_subdist <- "code grabbing number of subdistricts from the options list" 
        districtElem_code_block[k]

        for(m in 1:length(num_subdist)){
         num_vill <- "code grabbing number of village from the options list" 
         subdistrictElem_code_block[m]

         for(i in 1:length(num_vill)){
          villageElem_code_block[i]
          submitElem <- remDr$findElement(using = "name", "ctl00$Body_Content$btnSubmit") 
          remDr$executeScript("arguments[0].click();", list(submitElem))
          table <- readHTMLTable(remDr$getPageSource()[[1]], which=8)
          tables <- rbind(tables, table)  # append inside the loop so no village is lost
          }
        }
      }
     }

Sorry for the long post... I hope this makes sense. Any help is greatly appreciated.

Edit: I solved the first problem myself....

[Comments]:

    Tags: asp.net r selenium


    [Answer 1]:

    First I would define a function that changes the drop-down lists:

    changeFun <- function(value, elementName, targetName){
      changeElem <- remDr$findElement(using = "name", elementName)
      script <- paste0("arguments[0].value = '", value, "'; arguments[0].onchange();")
      remDr$executeScript(script, list(changeElem))
      targetElem <- remDr$findElement(using = "name", targetName) 
      target <- xmlParse(targetElem$getElementAttribute("outerHTML")[[1]])
      targetCodes <- sapply(querySelectorAll(target, "option"), xmlGetAttr, "value")[-1]
      target <- sapply(querySelectorAll(target, "option"), xmlValue)[-1]
      list(target, targetCodes)
    }
    

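    This script sets the value in the drop-down list and triggers the onchange event using JavaScript. This keeps interaction with the website to a minimum. You may also want to run a headless browser such as phantomJS instead of Firefox; see "RSelenium: Driving OS/Browsers local and remote" for details on how to run phantomjs.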
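The option parsing that changeFun performs on the element's outerHTML can be tried offline. The following is a minimal sketch, assuming the XML package is installed; the markup, the "Select State" placeholder label, the second state entry, and the helper name parseOptions are made up for illustration and are not taken from the census site.

```r
library(XML)

# Hypothetical <select> markup, standing in for the outerHTML the driver returns.
html <- paste0(
  '<select name="ctl00$Body_Content$drpState">',
  '<option value="0">Select State</option>',
  '<option value="35">Andaman and Nicobar Islands</option>',
  '<option value="28">Andhra Pradesh</option>',
  '</select>'
)

# Return option labels and codes, dropping the first (placeholder) entry,
# which mirrors the [-1] used in changeFun.
parseOptions <- function(html) {
  doc <- htmlParse(html, asText = TRUE)
  codes <- unlist(xpathSApply(doc, "//option", xmlGetAttr, "value"))
  labels <- unlist(xpathSApply(doc, "//option", xmlValue))
  list(labels = labels[-1], codes = codes[-1])
}

opts <- parseOptions(html)
print(opts$codes)
```

This sketch uses XPath through the XML package instead of selectr's querySelectorAll; either route should yield the same labels and codes for markup like this.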

    remDr <- remoteDriver$new()
    remDr$open()
    remDr$setImplicitWaitTimeout(3000)
    remDr$navigate("http://www.censusindia.gov.in/Census_Data_2001/Village_Directory/View_data/Village_Profile.aspx")
    
    #STATES
    stateElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpState") 
    states <- stateElem$getElementAttribute("outerHTML")[[1]]
    stateCodes <- sapply(querySelectorAll(xmlParse(states), "option"), xmlGetAttr, "value")[-1]
    states <- sapply(querySelectorAll(xmlParse(states), "option"), xmlValue)[-1]
    
    state <- list()
    for(x in seq_along(stateCodes)){
      district <- changeFun(stateCodes[[x]], "ctl00$Body_Content$drpState", "ctl00$Body_Content$drpDistrict")
      subdistrict <- lapply(district[[2]], function(y){
        subdistrict <- changeFun(y, "ctl00$Body_Content$drpDistrict", "ctl00$Body_Content$drpSubDistrict")
        village <- lapply(subdistrict[[2]], function(z){
          village <- changeFun(z, "ctl00$Body_Content$drpSubDistrict", "ctl00$Body_Content$drpVillage")
          village}
        )
        list(subdistrict, village)}
      ) 
      state[[x]] <- list(district, subdistrict)
    }
    
    
    #
    

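    state will now contain all the states, districts, subdistricts and villages together with their codes. I only ran x = 1, i.e. the state of Andaman and Nicobar Islands. For example, here is the data for the Nicobar district.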

    > state[[1]][[2]][[2]]
    [[1]]
    [[1]][[1]]
    [1] "Car Nicobar" "Nancowry"   
    
    [[1]][[2]]
    [1] "0001" "0002"
    
    
    [[2]]
    [[2]][[1]]
    [[2]][[1]][[1]]
     [1] "Arong"        "Big Lapati"   "Chuckchucha"  "IAF Camp"     "Kakana"      
     [6] "Kimois"       "Kinmai"       "Kinyuka"      "Malacca"      "Mus"         
    [11] "Perka"        "Sawai"        "Small Lapati" "Tamaloo"      "Tapoiming"   
    [16] "Teetop"      
    
    [[2]][[1]][[2]]
     [1] "00036000" "00037000" "00036800" "00036300" "00036200" "00036100"
     [7] "00037200" "00036700" "00036400" "00035700" "00036500" "00035900"
    [13] "00037100" "00036600" "00036900" "00035800"
    
    
    [[2]][[2]]
    [[2]][[2]][[1]]
      [1] "7 km Farm"                    "Akupa"                       
      [3] "Al-Hit-Touch/Balu Basti"      "Alexandera River"            
      [5] "Alhiat"                       "Alhitoth/Alhiloth"           
      [7] "Alipa/Alips"                  "Alkaipoh/Alkripoh"           
      [9] "Aloora"                       "Aloorang"                    
     [11] "Alreak"                       "Alsama"                      
     [13] "Altaful"                      "Altheak"                     
     [15] "Alukian/Alhukheck"            "Anul/Anula"                  
     [17] "Atkuna/Alkun"                 "Bahua"                       
     [19] "Banderkari/Pulu"              "Bengali"                     
     [21] "Berainak/Badnak"              "Bompoka Island"              
     [23] "Bumpal"                       "Campbell Bay"                
     [25] "Champin"                      "Chanel/Chanol"               
     [27] "Changua/Changup"              "Chaw Nallaha"                
     [29] "Chingen"                      "Chonghipoh"                  
     [31] "Chongkamong"                  "Chota Inak"                  
     [33] "Chukmachi"                    "Dairkurat"                   
     [35] "Dakhiyon (FC)"                "Danlet"                      
     [37] "Daring"                       "Dogmar River"                
     [39] "Elahi/Ilhoya"                 "Enam"                        
     [41] "Galathia River (FC)"          "Gandhi Nagar"                
     [43] "Govinda Nagar"                "Hakonhala"                   
     [45] "Halnatai/Hoinatai"            "Hin-Pou-Chi"                 
     [47] "Hindra"                       "Hinnunga"                    
     [49] "Hintona"                      "Hitlat"                      
     [51] "Hockook"                      "Hoin incl. Ikuia"            
     [53] "Hoipoh"                       "Hontona"                     
     [55] "Hutnyak"                      "In-Hig-Loi"                  
     [57] "Indira Point"                 "Inlock/Infock"               
     [59] "Inod"                         "Inroak/Chinlak"              
     [61] "Itoi"                         "Jansin"                      
     [63] "Jhoola"                       "Joginder Nagar"              
     [65] "Kakana"                       "Kalara"                      
     [67] "Kalasi"                       "Kamorta/Kalatapu"            
     [69] "Kamriak"                      "Kanahinot"                   
     [71] "Kapanga"                      "Kasintung"                   
     [73] "Katahu"                       "Katahuwa"                    
     [75] "Kavatinpeu/Karahinpoh"        "Kiyang"                      
     [77] "Knot"                         "Koe"                         
     [79] "Kokeon"                       "Kondul"                      
     [81] "Kopenheat"                    "Kuikua"                      
     [83] "Kuitasuk"                     "Kulatapangia"                
     [85] "Kumikia"                      "Kupinga"                     
     [87] "Lanuanga"                     "Lapat"                       
     [89] "Lawful"                       "Laxmi Nagar"                 
     [91] "Luxi"                         "Makhahu/Makachua"            
     [93] "Malacca"                      "Mapayala"                    
     [95] "Maru"                         "Masala Tapu"                 
     [97] "Mavatapis/Maratapia"          "Mildera"                     
     [99] "Minlana/Minlan"               "Minyuk"                      
    [101] "Mohreak/Kohreakap"            "Munak incl. Ponioo/Moul"     
    [103] "Mus"                          "Navy Dera"                   
    [105] "Neang"                        "Neeche Tapu"                 
    [107] "Not yet named (at 27.9 km)-A" "Nyicalang"                   
    [109] "Olinchi/Bombay"               "Olinpon/Alhinpon"            
    [111] "Ongulongho"                   "Patatiya"                    
    [113] "Payak"                        "Payuha"                      
    [115] "Pehayo"                       "Pilpilow"                    
    [117] "Pulloullo/Puloulo"            "Pulobaha"                    
    [119] "Pulobaha/Pathathifen"         "Pulobed"                     
    [121] "Pulobed/Lababu"               "Pulobha/Pulobahan"           
    [123] "Pulobhabi"                    "Pulokunji"                   
    [125] "Pulomilo"                     "Pulopanja"                   
    [127] "Pulopucca"                    "Pulotalia/Pulotohio"         
    [129] "Raihion"                      "Ramzoo"                      
    [131] "Ranganathan Bay"              "Reakomlong"                  
    [133] "Renguang"                     "Safedbalu"                   
    [135] "Safedbalu"                    "Sanaya"                      
    [137] "Sastri Nagar"                 "Shompen hut"                 
    [139] "Shompen Village-A"            "Shompen Village-B"           
    [141] "Sonomkuwa"                    "Tahaila"                     
    [143] "Tani"                         "Tapani/Tapainy"              
    [145] "Tapiang"                      "Tapong incl. Kabila"         
    [147] "Tavinkin/Tavakin"             "Tillang Chong Island"        
    [149] "Tomae/Inmae"                  "Trinket"                     
    [151] "Vijoy Nagar"                  "Vikas Nagar"                 
    [153] "Vyavtapu"                     "W.B.Katchal/Hindra"          
    
    [[2]][[2]][[2]]
      [1] "00053600" "00048500" "00043500" "00050800" "00037500" "00039800"
      [7] "00042600" "00039700" "00038000" "00037900" "00044200" "00042100"
     [13] "00041900" "00043400" "00046200" "00048300" "00041100" "00050300"
     [19] "00045700" "00038900" "00047000" "00039000" "00045000" "00054000"
     [25] "00043700" "00045400" "00046000" "00054400" "00052900" "00039500"
     [31] "00037400" "00046900" "00038400" "00050700" "00052400" "00051900"
     [37] "00045200" "00051400" "00050000" "00038100" "00053000" "00053200"
     [43] "00053900" "00041400" "00041800" "00052200" "00042800" "00043800"
     [49] "00044300" "00039300" "00047700" "00049200" "00040800" "00040500"
     [55] "00040200" "00052600" "00052800" "00048600" "00050100" "00044000"
     [61] "00044100" "00039200" "00039100" "00053500" "00047200" "00038300"
     [67] "00038800" "00046800" "00040100" "00038700" "00042200" "00051700"
     [73] "00050600" "00039900" "00042000" "00049000" "00046300" "00051800"
     [79] "00052300" "00050400" "00051500" "00047500" "00037600" "00040600"
     [85] "00040000" "00042300" "00043300" "00042700" "00054600" "00053300"
     [91] "00038200" "00048400" "00043600" "00040900" "00045300" "00046100"
     [97] "00039400" "00042400" "00048200" "00038600" "00047400" "00046600"
    [103] "00042900" "00054500" "00043100" "00044600" "00053700" "00047300"
    [109] "00049700" "00044900" "00040300" "00052100" "00043000" "00046500"
    [115] "00050200" "00044500" "00049100" "00052700" "00048800" "00051000"
    [121] "00050500" "00049400" "00052000" "00051100" "00048100" "00049900"
    [127] "00052500" "00048700" "00037700" "00046700" "00054100" "00041500"
    [133] "00051300" "00047600" "00038500" "00039600" "00053100" "00053800"
    [139] "00051200" "00051600" "00041600" "00037300" "00041200" "00043900"
    [145] "00047900" "00043200" "00041700" "00037800" "00045900" "00047800"
    [151] "00053400" "00047100" "00040700" "00042500"
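The nested list printed above is awkward to loop over when it is time to submit the form for each village. A possible way to flatten one state's entry into a data frame with one row per village is sketched below. The helper name flattenState and the column names are hypothetical, and the toy entry only copies a few values from the output above so the sketch runs without a browser.

```r
# Flatten one element of `state` into a data frame with one row per village.
# `entry` has the shape list(district, perDistrict) built by the loop above.
flattenState <- function(stateCode, entry) {
  district <- entry[[1]]       # list(names, codes)
  perDistrict <- entry[[2]]    # one element per district
  rows <- list()
  for (d in seq_along(district[[2]])) {
    subdistrict <- perDistrict[[d]][[1]]   # list(names, codes)
    villages <- perDistrict[[d]][[2]]      # one element per subdistrict
    for (s in seq_along(subdistrict[[2]])) {
      village <- villages[[s]]             # list(names, codes)
      if (length(village[[2]]) == 0) next  # skip subdistricts with no villages
      rows[[length(rows) + 1]] <- data.frame(
        state = stateCode,
        district = district[[2]][d],
        subdistrict = subdistrict[[2]][s],
        village = village[[2]],
        villageName = village[[1]],
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

# Toy entry with a single district and two villages per subdistrict.
entry <- list(
  list("Nicobar", "02"),
  list(
    list(
      list(c("Car Nicobar", "Nancowry"), c("0001", "0002")),
      list(
        list(c("Arong", "Big Lapati"), c("00036000", "00037000")),
        list(c("7 km Farm", "Akupa"), c("00053600", "00048500"))
      )
    )
  )
)

villageTable <- flattenState("35", entry)
print(villageTable)
```

Each row then carries the four codes needed for one form submission.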
    

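    There are 600,000 villages in India :O so it is probably best to keep the state level as a for loop. Once you have the four necessary codes, you can get the village data by submitting the form for each village individually. For example, this is part of the form that is posted to get the village details for

    State: Andaman and Nicobar Islands

    District: Nicobar

    Subdistrict: Car Nicobar

    Village: Arong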

    ctl00$Body_Content$btnSub...    Submit
    ctl00$Body_Content$drpDis...    02
    ctl00$Body_Content$drpSta...    35
    ctl00$Body_Content$drpSub...    0001
    ctl00$Body_Content$drpVil...    00036000
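As a rough sketch, the four codes can be assembled into a named list of form fields like the one shown above. The full field names here are an assumption: they are the element names used with findElement earlier in this thread, since the listing above truncates them. The helper name buildVillageForm is hypothetical. A real POST to an ASP.NET page would most likely also need the hidden state fields (such as __VIEWSTATE) read from the loaded page, which this sketch does not handle.

```r
# Build the visible form fields for one village.
# Field names are assumed to match the element names used with findElement above.
buildVillageForm <- function(state, district, subdistrict, village) {
  list(
    "ctl00$Body_Content$drpState" = state,
    "ctl00$Body_Content$drpDistrict" = district,
    "ctl00$Body_Content$drpSubDistrict" = subdistrict,
    "ctl00$Body_Content$drpVillage" = village,
    "ctl00$Body_Content$btnSubmit" = "Submit"
  )
}

# Codes for Arong, taken from the form listing above.
form <- buildVillageForm("35", "02", "0001", "00036000")
str(form)
```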
    

    Update:

    Out of interest, I ran x = 1, the state of Andaman and Nicobar Islands, using phantomJS and with a slightly modified changeFun:

    changeFun <- function(value, elementName, targetName){
      changeElem <- remDr$findElement(using = "name", elementName)
      script <- paste0("arguments[0].value = '", value, "'; arguments[0].onchange();")
      remDr$executeScript(script, list(changeElem))
      targetCodes <- c()
      while(length(targetCodes) == 0){
        targetElem <- remDr$findElement(using = "name", targetName) 
        target <- xmlParse(targetElem$getElementAttribute("outerHTML")[[1]])
        targetCodes <- sapply(querySelectorAll(target, "option"), xmlGetAttr, "value")[-1]
        target <- sapply(querySelectorAll(target, "option"), xmlValue)[-1]
        if(length(targetCodes) == 0){
          Sys.sleep(0.5)
        }else{
          out <- list(target, targetCodes)
        }
      }
      return(out)
    }
    

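    Fetching the data took 3 seconds, whereas Firefox took 43 seconds to fetch the same data.

    [Comments]:

    • Thanks! This is awesome. Also, thanks for the RSelenium package. It has completely changed the data collection for my dissertation.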
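The while loop in the modified changeFun keeps polling until the target drop-down has options, so it would never return if a list stays empty. One way to bound the wait is sketched below. pollUntil and fakeFetch are hypothetical names, and the fake lookup stands in for the findElement and parsing steps so the example runs without a browser.

```r
# Call fetch() repeatedly until it returns a non-empty result,
# and stop with an error once `timeout` seconds have passed.
pollUntil <- function(fetch, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- fetch()
    if (length(result) > 0) return(result)
    if (Sys.time() >= deadline) stop("timed out waiting for a non-empty result")
    Sys.sleep(interval)
  }
}

# Stand-in for the drop-down lookup: empty on the first two calls, then populated.
calls <- 0
fakeFetch <- function() {
  calls <<- calls + 1
  if (calls < 3) character(0) else c("0001", "0002")
}

codes <- pollUntil(fakeFetch, timeout = 5, interval = 0.1)
print(codes)
```

Inside changeFun, the fetch argument would be a function that re-reads the target element and returns its option codes.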
    [Answer 2]:

    I was able to find the number of districts in each state using the following code:

    districtElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpDistrict") 
    districtElem$sendKeysToElement(list(key = 'enter')) 
    districtElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpDistrict") 
    stuff <- districtElem$describeElement()$text
    dist_num <- length(unlist(strsplit(stuff, "\\n")))-1 
    dist_num
    

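    The lengths for the other nested loops can be derived in the same way. It is certainly inefficient, but it is still a solution.

    Still learning more efficient approaches for this kind of project....

    [Comments]: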
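The counting step can be wrapped in a small helper and reused for the subdistrict and village menus. This is a sketch that assumes, as in the code above, that the element text lists one option per line with a placeholder entry first. The helper name countOptions and the sample text are made up for illustration.

```r
# Count the real options in a drop-down's text: one option per line,
# minus the leading placeholder entry.
countOptions <- function(text) {
  lines <- trimws(unlist(strsplit(text, "\n", fixed = TRUE)))
  lines <- lines[nzchar(lines)]   # ignore blank lines
  max(length(lines) - 1, 0)
}

# Made-up sample text with a placeholder and two districts.
stuff <- "Select District\nAndamans\nNicobars"
dist_num <- countOptions(stuff)
print(dist_num)
```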
