【Question title】: R: Trouble appending rows to a data frame from web scraping in R [duplicate]
【Posted】: 2017-06-16 09:40:14
【Question】:

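I have a data frame with 7 rows and 1 column containing links to a website. I am trying to extract data from each of these links and store it in a single data frame, but I am unable to append the rows. I also check whether a link has no records (by inspecting an HTML attribute on that page); if so, I skip it and move on to the next link. I am also trying to get data from multiple pages of each link.

Here is reproducible data: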

text1="http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom="
text3="&proptype="
text4="Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment"
text5="&cityName=Thane&BudgetMin="
text6="&BudgetMax="

bhk=c("1","2","3","4","5",">5")
budg_min=c("5-Lacs","10-Lacs","20-Lacs","30-Lacs","40-Lacs","50-Lacs","60-Lacs","70-Lacs","80-Lacs","90-Lacs","1-Crores","1.2-Crores","1.4-Crores","1.6-Crores","1.8-Crores","2-Crores","2.3-Crores","2.6-Crores","3-Crores","3.5-Crores","4-Crores","4.5-Crores","5-Crores","10-Crores","20-Crores")
budg_max=c("5-Lacs","10-Lacs","20-Lacs","30-Lacs","40-Lacs","50-Lacs","60-Lacs","70-Lacs","80-Lacs","90-Lacs","1-Crores","1.2-Crores","1.4-Crores","1.6-Crores","1.8-Crores","2-Crores","2.3-Crores","2.6-Crores","3-Crores","3.5-Crores","4-Crores","4.5-Crores","5-Crores","10-Crores","20-Crores")
eg <- expand.grid(bhk = bhk, budg_min = budg_min, budg_max = budg_max)
eg <- eg[as.integer(eg$budg_min) <= as.integer(eg$budg_max),]
uuu <- sprintf("%s%s%s%s%s%s%s%s", text1,eg[,1],text3,text4,text5,eg[,2],text6,eg[,3])
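# uuu is a character vector, so it is indexed with one subscript, not [rows, cols]
uuu_df1=data.frame(x=uuu[1:7], stringsAsFactors=FALSE)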
dput(uuu_df1)

I have tried 3 approaches, but none of them seems to work.

Solution #1

urlList <- llply(uuu_df1[,1], function(url){     

  this_pg <- read_html(url)

  results_count <- this_pg %>% 
    xml_find_first(".//span[@id='resultCount']") %>% 
    xml_text() %>%
    as.integer()

  if(results_count > 0){

    cards <- this_pg %>% 
      xml_find_all('//div[@class="SRCard"]')

    df <- ldply(cards, .fun=function(x){
      y <- data.frame(wine = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),
                      excerpt = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
                      locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
                      society = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .))
      return(y)
    })

  } else {
    df <- NULL
  }

  return(df)   
}, .progress = 'text')
names(urlList) <- uuu_df1[,1]

a=bind_rows(urlList)

The code above gives me this error: Error in if (results_count > 0) { : missing value where TRUE/FALSE needed

Solution #2

urlList <- lapply(uuu_df1[,1], function(url){     

  UrlPage <- html(as.character(url))
  ImgNode <- UrlPage %>% html_node("div.noResultHead")
  u <- paste("No", word(string = as(ImgNode, "character"), start=4, end=5), sep=" ")

  cat(".")        
  pg <- read_html(url)

  if(u!="No Results Found!") {
    df <- data.frame(wine=html_text(html_nodes(pg, ".agentNameh")),
                     excerpt=html_text(html_nodes(pg, ".postedOn")),
                     locality=html_text(html_nodes(pg,".localityFirst")),
                     society=html_text(html_nodes(pg,'.labValu .stop-propagation:nth-child(1)')),
                     stringsAsFactors=FALSE)
  } else {
    # ASSIGN EMPTY DATAFRAME (FOR CONSISTENT STRUCTURE)
    df <- data.frame(wine=character(), excerpt=character(), locality=character(), society=character())
  }
  # RETURN NAMED LIST
  return(list(UrlPage=UrlPage, ImgNode=ImgNode, u=u, df=df))    
})

# ROW BIND ONLY DATAFRAME ELEMENT FROM LIST
wines <- map_df(urlList, function(u) u$df)

The code above returns an empty data frame.

Solution #3

uuu_df1=data.frame(x=uuu[1:7], stringsAsFactors=FALSE)
wines=data.frame()
url_test=c()
UrlPage_test=c()
u=c()
ImgNode=c()
pg=c()

for(i in 1:dim(uuu_df1)[1]) {

  url_test[i]=as.character(uuu_df1[i,])
  UrlPage_test[i] <- html(url_test[i])
  ImgNode[i] <- UrlPage_test[i] %>% html_node("div.noResultHead")
  u[i]=ImgNode[i]
  u[i]=as(u[i],"character")
  u[i]=paste("No",word(string = u, start = 4, end = 5),sep = " ")

  if(u[i]=="No Results Found!") next
  {
    map_df(1:5, function(i) # here 1:5 is number of webpages of a website 
    {

      # simple but effective progress indicator
      cat(".")

      pg[i] <- read_html(sprintf(url_test[i], i))

      data.frame(wine=html_text(html_nodes(pg[i], ".agentNameh")),
                 excerpt=html_text(html_nodes(pg[i], ".postedOn")),
                 locality=html_text(html_nodes(pg[i],".localityFirst")),
                 society=html_text(html_nodes(pg[i],'.labValu .stop-propagation:nth-child(1)')),
                 stringsAsFactors=FALSE)

    }) -> wines

  }}

The code above also throws an error:

Error in UseMethod("xml_find_first") : 
  no applicable method for 'xml_find_first' applied to an object of class "list"
In addition: Warning messages:
1: 'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated") 
2: In UrlPage_test[i] <- html(url_test[i]) :
  number of items to replace is not a multiple of replacement length

Any suggestions on how to correct any of these approaches so that it meets my requirements? Thanks in advance.

【Discussion】:

    Tags: r for-loop dataframe web-scraping rvest


    【Answer 1】:

    Solution #1

    The message "missing value where TRUE/FALSE needed" is raised when the condition of an if statement evaluates to NA, for example:

    if (NA > 0) {
        # do something
    }
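
To see where the NA most likely comes from: when xml_find_first() finds no matching node, xml_text() returns NA, and as.integer() keeps it as NA. Below is a minimal base-R sketch that reproduces the error without touching the network. The hard-coded NA is an assumption about what a page without a resultCount span produces in the question's code.

```r
# Assumed value: what as.integer(xml_text(<missing node>)) would yield.
results_count <- as.integer(NA_character_)

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  {
    if (results_count > 0) "has results" else "no results"
  },
  error = function(e) conditionMessage(e)
)

print(is.na(results_count))
print(msg)
```

The comparison results_count > 0 is itself NA, and if() refuses to choose a branch for an NA condition.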
    

    So replace your if condition

    if(results_count > 0)

    with

    if(!is.na(results_count) && (results_count > 0))
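
As a quick check of the guarded condition, here is a small sketch that wraps it in a helper and tries the three cases that matter. The name has_results is hypothetical and does not appear in the question; the input values are made up for illustration.

```r
# Hypothetical helper: TRUE only when the count is known and positive.
# && short-circuits, so the comparison is never evaluated for NA.
has_results <- function(results_count) {
  !is.na(results_count) && results_count > 0
}

has_results(as.integer("94"))           # count parsed from the page
has_results(as.integer(NA_character_))  # node missing, so the count is NA
has_results(0L)                         # page reports zero results
```

With this guard, a page that lacks the count node falls into the else branch and returns NULL, which bind_rows() then skips.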
    

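    【Comments】:

    • Excellent, @herbaman, one line of code saved my day. It works well. Thanks a lot, I appreciate your effort!
    • If you check the 7th record only, the link shows that it has 94 records, but if you run the code for just that 7th record, the resulting data frame contains only 30 records instead of 94. Why is that?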
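
The thread does not answer the last comment. One plausible explanation, not confirmed here, is that each request returns only one page of listings, so further pages have to be fetched and appended. The sketch below shows that append pattern in base R. The function fetch_page is a hypothetical stand-in that fakes 94 listings served 30 per page; a real version would call read_html() on the URL of each page, and how the site addresses its pages is not known from the post.

```r
# Hypothetical stand-in for the scraper: returns the listings on one page.
fetch_page <- function(page, total = 94L, per_page = 30L) {
  first <- (page - 1L) * per_page + 1L
  if (first > total) {
    # Past the last page: return an empty frame with the same columns.
    return(data.frame(id = integer(), locality = character(),
                      stringsAsFactors = FALSE))
  }
  ids <- seq(first, min(page * per_page, total))
  data.frame(id = ids, locality = paste("locality", ids),
             stringsAsFactors = FALSE)
}

# Collect one data frame per page in a list, then bind once at the end.
pages <- list()
page <- 1L
repeat {
  df <- fetch_page(page)
  if (nrow(df) == 0) break  # stop at the first empty page
  pages[[page]] <- df
  page <- page + 1L
}

all_rows <- do.call(rbind, pages)
nrow(all_rows)  # 94 with the fake data above
```

Binding once at the end avoids growing a data frame inside the loop, which is where the appending in Solution #3 went wrong.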