【Question title】: Gathering data using R - multiple URLs
【Posted】: 2026-01-27 15:50:01
【Question description】:

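I have a data frame with several columns and rows. Some cells contain information, while others are filled with NA and should be replaced with actual data.

Each row represents a specific instrument, and the columns hold various details about that instrument. The last column of the data frame holds a URL for each instrument, which will then be used to fetch the data for the empty columns: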

 Issuer  NIN or ISIN           Type Nominal Value # of Bonds Issue Volume Start Date End Date
1 NBRK KZW1KD079112 discount notes            NA         NA           NA         NA       NA
2 NBRK KZW1KD079146 discount notes            NA         NA           NA         NA       NA
3 NBRK KZW1KD079153 discount notes            NA         NA           NA         NA       NA
4 NBRK KZW1KD089137 discount notes            NA         NA           NA         NA       NA

 URL
1 http://www.kase.kz/en/gsecs/show/NTK007_1911
2 http://www.kase.kz/en/gsecs/show/NTK007_1914
3 http://www.kase.kz/en/gsecs/show/NTK007_1915
4 http://www.kase.kz/en/gsecs/show/NTK008_1913

For example, with the following code I can get the details of the first instrument, in the NBRK KZW1KD079112 row:

sp = readHTMLTable(newd$URL[[1]])
sp[[4]]

This gives the following:

                                                V1                                                                  V2
1                                     Trading code                                                         NTK007_1911
2                               List of securities                                                            official
3                              System of quotation                                                               price
4                                Unit of quotation                                   nominal value percentage fraction
5                               Quotation currency                                                                 KZT
6                               Quotation accuracy                                                        4 characters
7                       Trade lists admission date                                                            04/21/17
8                               Trade opening date                                                            04/24/17
9                       Trade lists exclusion date                                                            04/28/17
10                                        Security                                                                <NA>
11                                     Bond's name short-term notes of the National Bank of the Republic of Kazakhstan
12                                            NSIN                                                        KZW1KD079112
13                   Currency of issue and service                                                                 KZT
14               Nominal value in issue's currency                                                              100.00
15                      Number of registered bonds                                                       1,929,319,196
16                     Number of bonds outstanding                                                       1,929,319,196
17                               Issue volume, KZT                                                     192,931,919,600
18 Settlement basis (days in month / days in year)                                                        actual / 365
19                       Date of circulation start                                                            04/21/17
20                          Circulation term, days                                                                   7
21              Register fixation date at maturity                                                            04/27/17
22                        Principal repayment date                                                            04/28/17
23                                    Paying agent                          Central securities depository JSC (Almaty)
24                                       Registrar                          Central securities depository JSC (Almaty)

From this, I only need to keep:

14               Nominal value in issue's currency                                                              100.00
16                     Number of bonds outstanding                                                       1,929,319,196
17                               Issue volume, KZT                                                     192,931,919,600
19                       Date of circulation start                                                            04/21/17
22                        Principal repayment date                                                            04/28/17

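I would then copy the required data into the initial data frame and move on to the next row... The data frame has more than 100 rows and will keep changing.

Any help would be appreciated.

Update:

It looks like the data I need is not always in sp[[4]]. Sometimes it is in sp[[7]], and in the future it may be somewhere else entirely. Is there a way to search the scraped tables and identify the specific table to use for further data collection? Currently I use: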

sp = readHTMLTable(newd$URL[[1]])
sp[[4]]

【Question discussion】:

    Tags: r xml loops web-scraping


    【Solution 1】:
    library(XML)
    library(reshape2)
    library(dplyr)
    
    name = c(
    "NBRK KZW1KD079112 discount notes",                                           
    "NBRK KZW1KD079146 discount notes",                                        
    "NBRK KZW1KD079153 discount notes",                                         
    "NBRK KZW1KD089137 discount notes")                                           
    
    URL = c(
    "http://www.kase.kz/en/gsecs/show/NTK007_1911",
    "http://www.kase.kz/en/gsecs/show/NTK007_1914",
    "http://www.kase.kz/en/gsecs/show/NTK007_1915",
    "http://www.kase.kz/en/gsecs/show/NTK008_1913")
    
    # data
    instruments <- data.frame(name, URL, stringsAsFactors = FALSE)
    
    # define the columns wanted and the mapping to desired name
    # extend to all wanted columns
    wanted <- c("Nominal value in issue's currency" = "Nominal Value",
                "Number of bonds outstanding" = "# of Bonds Issue")
    
    # function returns a data frame of wanted columns for given URL
    getValues <- function (name, url) {
      # get the table and rename columns
      sp = readHTMLTable(url, stringsAsFactors = FALSE)
      df <- sp[[4]]
      names(df) <- c("full_name", "value")
    
      # filter and remap wanted columns
      result <- df[df$full_name %in% names(wanted),]
      result$column_name <- sapply(result$full_name, function(x) {wanted[[x]]})
    
      # add the identifier to every row
      result$name <- name
      return (result[,c("name", "column_name", "value")])
    }
    
    # invoke function for each name/URL pair - returns list of data frames
    columns <- apply(instruments[,c("name", "URL")], 1, function(x) {getValues(x[["name"]], x[["URL"]])})
    
    # bind using dplyr::bind_rows to make a tall data frame
    tall <- bind_rows(columns)
    
    # make wide using dcast from reshape2; value.var names the column holding the cell values
    wide <- dcast(tall, name ~ column_name, value.var = "value")
    
    wide
    
    #                               name # of Bonds Issue Nominal Value
    # 1 NBRK KZW1KD079112 discount notes    1,929,319,196        100.00
    # 2 NBRK KZW1KD079146 discount notes    1,575,000,000        100.00
    # 3 NBRK KZW1KD079153 discount notes      701,390,693        100.00
    # 4 NBRK KZW1KD089137 discount notes    1,380,368,000        100.00
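
The scraped values arrive as character strings ("1,929,319,196", "04/21/17"), so they may need converting before being copied into the original data frame. A minimal sketch, assuming the dates are month/day/two-digit year as in the sample output; the column names and the helpers `to_number` and `to_date` are hypothetical, and the values are the ones shown for the first instrument in the question:

```r
# Values copied from the first instrument shown in the question.
scraped <- data.frame(
  nominal_value = "100.00",
  n_bonds       = "1,929,319,196",
  issue_volume  = "192,931,919,600",
  start_date    = "04/21/17",
  end_date      = "04/28/17",
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip thousands separators, then convert to numeric.
to_number <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Hypothetical helper: assumes month/day/two-digit-year dates.
to_date <- function(x) as.Date(x, format = "%m/%d/%y")

clean <- data.frame(
  nominal_value = to_number(scraped$nominal_value),
  n_bonds       = to_number(scraped$n_bonds),
  issue_volume  = to_number(scraped$issue_volume),
  start_date    = to_date(scraped$start_date),
  end_date      = to_date(scraped$end_date)
)

str(clean)
```

The same helpers could be applied column by column to `wide` once all wanted fields have been added to `wanted`.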
    
    

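    【Discussion】:

    • Thank you very much. Your code works, but when I try to run it for all instruments I get the following error: Error in `$<-.data.frame`(`*tmp*`, "name", value = "KZW1KD919127") : replacement has 1 row, data has 0. Any idea why this happens?
    • Hmm... I just checked this particular instrument and found that the data I need is not in sp[[4]] but in sp[[7]]. Is it possible to account for such cases?
    • A not-so-nice hack would be something like df <- if_else(name == "foo", sp[[4]], sp[[7]]). I generally prefer dplyr's if_else over base ifelse because it preserves the class. A better approach is to learn library(rvest), because its html_nodes function supports CSS selectors, which can target the 'id' attribute in the HTML rather than the position.
    • Yes, well... XML seemed simpler since I don't know CSS. SelectorGadget was not much help with rvest. Anyway, thank you. I will investigate further.
    • Now using rvest, I modified this bit: getValues <- function (name, url) { df <- url %>% read_html() %>% html_nodes("table.top") %>% html_table(); df = as.data.frame(df); names(df) <- c("full_name", "value") Thanks again very much!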
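
For readers who want to stay with the XML package, the sp[[4]] versus sp[[7]] problem discussed above can also be approached by searching the list of tables for a known row label instead of relying on position. The following is only a sketch: `find_table` is a hypothetical helper, and the mock lists stand in for what `readHTMLTable()` might return so that the example runs without network access.

```r
# Hypothetical helper: return the first table whose first column contains
# `label`, or NULL if no table does.
find_table <- function(tables, label = "Nominal value in issue's currency") {
  for (tbl in tables) {
    if (is.data.frame(tbl) && ncol(tbl) >= 2 &&
        label %in% as.character(tbl[[1]])) {
      return(tbl)
    }
  }
  NULL
}

# Mock pages (assumed structure): the details table sits at a different
# position in each list, as with sp[[4]] and sp[[7]].
page_a <- list(
  data.frame(V1 = "menu", V2 = "x", stringsAsFactors = FALSE),
  data.frame(V1 = c("Trading code", "Nominal value in issue's currency"),
             V2 = c("NTK007_1911", "100.00"),
             stringsAsFactors = FALSE)
)
page_b <- c(list(NULL), page_a)  # same tables, shifted by one position

details_a <- find_table(page_a)
details_b <- find_table(page_b)
identical(details_a, details_b)
```

Inside `getValues`, `df <- sp[[4]]` could then become `df <- find_table(sp)`, with an early return when the result is NULL or when no wanted rows are found. The "replacement has 1 row, data has 0" error is what `result$name <- name` produces when `result` has zero rows, so that check would avoid it.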