【Posted】: 2026-01-27 15:50:01
【Question】:
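I have a data frame with several columns and rows. Some columns contain information; others are filled with NA and need to be replaced with real data.
Each row represents a particular instrument, and the columns hold various details of the instrument in that row. The last column of the data frame holds each instrument's URL, which will then be used to fetch the data for the empty columns: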
  Issuer  NIN or ISIN           Type Nominal Value # of Bonds Issue Volume Start Date End Date
1   NBRK KZW1KD079112 discount notes            NA         NA           NA         NA       NA
2   NBRK KZW1KD079146 discount notes            NA         NA           NA         NA       NA
3   NBRK KZW1KD079153 discount notes            NA         NA           NA         NA       NA
4   NBRK KZW1KD089137 discount notes            NA         NA           NA         NA       NA
                                           URL
1 http://www.kase.kz/en/gsecs/show/NTK007_1911
2 http://www.kase.kz/en/gsecs/show/NTK007_1914
3 http://www.kase.kz/en/gsecs/show/NTK007_1915
4 http://www.kase.kz/en/gsecs/show/NTK008_1913
For example, with the following code I can get the details of the first instrument, the one in the NBRK KZW1KD079112 row:
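library(XML)  # provides readHTMLTable()
sp = readHTMLTable(newd$URL[[1]])  # list with one element per table found on the page
sp[[4]]  # the table that holds the instrument details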
This gives the following:
V1 V2
1 Trading code NTK007_1911
2 List of securities official
3 System of quotation price
4 Unit of quotation nominal value percentage fraction
5 Quotation currency KZT
6 Quotation accuracy 4 characters
7 Trade lists admission date 04/21/17
8 Trade opening date 04/24/17
9 Trade lists exclusion date 04/28/17
10 Security <NA>
11 Bond's name short-term notes of the National Bank of the Republic of Kazakhstan
12 NSIN KZW1KD079112
13 Currency of issue and service KZT
14 Nominal value in issue's currency 100.00
15 Number of registered bonds 1,929,319,196
16 Number of bonds outstanding 1,929,319,196
17 Issue volume, KZT 192,931,919,600
18 Settlement basis (days in month / days in year) actual / 365
19 Date of circulation start 04/21/17
20 Circulation term, days 7
21 Register fixation date at maturity 04/27/17
22 Principal repayment date 04/28/17
23 Paying agent Central securities depository JSC (Almaty)
24 Registrar Central securities depository JSC (Almaty)
Of these, I only need to keep:
14 Nominal value in issue's currency 100.00
16 Number of bonds outstanding 1,929,319,196
17 Issue volume, KZT 192,931,919,600
19 Date of circulation start 04/21/17
22 Principal repayment date 04/28/17
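One way to pick out these five rows is to look them up by their label in the first column rather than by row number, and to convert the strings to numbers and dates on the way. The following is only a minimal sketch: `details` is a hand-built stand-in for the table shown above, and `get_field`, `to_number`, and `to_date` are hypothetical helper names, not functions from the XML package. It also assumes the dates on the page are in mm/dd/yy form, as they appear in the output above.

```r
# Stand-in for the two-column table returned by readHTMLTable()
# (values copied from the output shown above; only a few rows included).
details <- data.frame(
  V1 = c("Trading code",
         "Nominal value in issue's currency",
         "Number of bonds outstanding",
         "Issue volume, KZT",
         "Date of circulation start",
         "Principal repayment date"),
  V2 = c("NTK007_1911", "100.00", "1,929,319,196",
         "192,931,919,600", "04/21/17", "04/28/17"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the V2 value whose V1 label matches exactly,
# or NA when the label is not present in the table.
get_field <- function(tbl, label) {
  hit <- which(as.character(tbl$V1) == label)
  if (length(hit) == 0) NA_character_ else as.character(tbl$V2[hit[1]])
}

# Strip thousands separators before converting to numeric.
to_number <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Assumption: dates are formatted as mm/dd/yy.
to_date <- function(x) as.Date(x, format = "%m/%d/%y")

nominal <- to_number(get_field(details, "Nominal value in issue's currency"))
n_bonds <- to_number(get_field(details, "Number of bonds outstanding"))
volume  <- to_number(get_field(details, "Issue volume, KZT"))
start   <- to_date(get_field(details, "Date of circulation start"))
end     <- to_date(get_field(details, "Principal repayment date"))
```

Looking rows up by label means the code does not depend on them staying at positions 14, 16, 17, 19, and 22.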
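I would then copy the required data into the initial data frame and move on to the next row... The data frame has more than 100 rows and will keep changing.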
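The row-by-row copying described here could be sketched as a loop over the URL column. This is an illustration under assumptions, not a tested scraper: `fill_details` and `mock_fetch` are hypothetical names, the page is replaced by a mock so the example runs without network access, and the values are stored as character strings exactly as they appear on the page.

```r
# Data frame column -> label used on the page (labels taken from the table above).
fields <- c("Nominal Value" = "Nominal value in issue's currency",
            "# of Bonds"    = "Number of bonds outstanding",
            "Issue Volume"  = "Issue volume, KZT",
            "Start Date"    = "Date of circulation start",
            "End Date"      = "Principal repayment date")

# Hypothetical helper: fill the NA columns of `df` one row at a time.
# `fetch` is any function that takes a URL and returns the two-column
# details table (columns V1 and V2).
fill_details <- function(df, fetch) {
  # The empty columns start out as logical NA; make them character first.
  for (col in names(fields)) df[[col]] <- as.character(df[[col]])
  for (i in seq_len(nrow(df))) {
    tbl <- tryCatch(fetch(df$URL[i]), error = function(e) NULL)
    if (is.null(tbl)) next  # page could not be read: leave this row as NA
    for (col in names(fields)) {
      hit <- which(as.character(tbl$V1) == fields[[col]])
      if (length(hit) > 0) df[i, col] <- as.character(tbl$V2[hit[1]])
    }
  }
  df
}

# Two rows of the data frame from the question, with the empty columns as NA.
newd <- data.frame(
  Issuer = c("NBRK", "NBRK"),
  "Nominal Value" = NA, "# of Bonds" = NA, "Issue Volume" = NA,
  "Start Date" = NA, "End Date" = NA,
  URL = c("http://www.kase.kz/en/gsecs/show/NTK007_1911",
          "http://www.kase.kz/en/gsecs/show/NTK007_1914"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Mock fetcher: knows only the first instrument and fails for anything else.
mock_fetch <- function(url) {
  if (!grepl("NTK007_1911$", url)) stop("page not available in this mock")
  data.frame(V1 = unname(fields),
             V2 = c("100.00", "1,929,319,196", "192,931,919,600",
                    "04/21/17", "04/28/17"),
             stringsAsFactors = FALSE)
}

filled <- fill_details(newd, mock_fetch)
```

Against the real site, `fetch` might be something like `function(url) XML::readHTMLTable(url)[[4]]`. Because failed pages are skipped rather than aborting the loop, the function can be rerun later to retry the rows that are still NA.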
I would appreciate any help.
Update:
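It looks like the data I need is not always in sp[[4]]. Sometimes it is in sp[[7]], and in the future it may be somewhere else entirely. Is there a way to search the scraped tables for this information and identify the specific table to use for collecting the data? The current code always takes the fourth table: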
sp = readHTMLTable(newd$URL[[1]])
sp[[4]]
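A possible way to avoid the hard-coded index is to scan every table in the list and take the first one whose first column contains the labels of interest. The sketch below assumes those labels stay stable on the page; `find_table` is a hypothetical helper, and `sp` here is a small mock list standing in for the result of readHTMLTable().

```r
# Hypothetical helper: index of the first table whose first column
# contains every label in `labels`, or NA if no table qualifies.
find_table <- function(tables, labels) {
  has_labels <- vapply(tables, function(tbl) {
    # readHTMLTable() can return NULL for some tables, so check the type first.
    is.data.frame(tbl) && ncol(tbl) >= 2 &&
      all(labels %in% as.character(tbl[[1]]))
  }, logical(1))
  idx <- which(has_labels)
  if (length(idx) == 0) NA_integer_ else idx[[1]]
}

# Mock list of scraped tables: a NULL entry, an unrelated table,
# and the table that holds the wanted rows.
sp <- list(
  NULL,
  data.frame(V1 = c("Trading code", "Quotation currency"),
             V2 = c("NTK007_1911", "KZT"),
             stringsAsFactors = FALSE),
  data.frame(V1 = c("NSIN", "Nominal value in issue's currency",
                    "Issue volume, KZT"),
             V2 = c("KZW1KD079112", "100.00", "192,931,919,600"),
             stringsAsFactors = FALSE)
)

wanted <- c("Nominal value in issue's currency", "Issue volume, KZT")
idx <- find_table(sp, wanted)
details <- if (is.na(idx)) NULL else sp[[idx]]
```

Returning NA when nothing matches lets the caller decide whether to skip the instrument or stop with an error.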
【Discussion】:
Tags: r xml loops web-scraping