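Posted: 2020-06-10 12:25:25
Question:
I have a list of URLs that point to txt files. Each txt file is a mix of plain text and tables. I want to extract the tables and load them into a data frame. Here is an example URL: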
https://www.sec.gov/Archives/edgar/data/1000275/0001140361-13-007449.txt
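In each txt file, a table starts with <TABLE> and ends with </TABLE>. I want to combine all of the tables. I tried read.delim, but I don't know how to apply it to the tables only. An example of the expected output is shown below. Any guidance on how to proceed would be much appreciated.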
Example of current df:
+----+--------------------------------------------------------------------------+
| ID | URL |
+----+--------------------------------------------------------------------------+
| 1 | https://www.sec.gov/Archives/edgar/data/1000097/0000919574-13-001835.txt |
| 2 | https://www.sec.gov/Archives/edgar/data/1000275/0001140361-13-007449.txt |
| 3 | https://www.sec.gov/Archives/edgar/data/1000742/0000898432-13-000218.txt |
+----+--------------------------------------------------------------------------+
Example of txt file from url:
text text text
text text text
text text text
<TABLE>
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| NAME OF ISSUER | TITLE OF CLASS | CUSIP | VALUE (x1000 | SHRS OR PRN AMT | SH/PRN | PUT/CALL | INVESTMENT DISCRETION | OTHER MNGRS | VOTING AUTHORITY |
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| ABBVIE INC | COM | 00287Y109 | 1,547 | 45,300 | SHS | | Shared-Defined | 1/2/3 | 45,300 |
| ABERCROMBIE & FITCH | CL A | 002896207 | 4,797 | 100,000 | SHS | | Shared-Defined | 1/2/3 | 100,000 |
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
</TABLE>
<TABLE>
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| NAME OF ISSUER | TITLE OF CLASS | CUSIP | VALUE (x1000 | SHRS OR PRN AMT | SH/PRN | PUT/CALL | INVESTMENT DISCRETION | OTHER MNGRS | VOTING AUTHORITY |
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| ABBVIE INC | COM | 00287Y109 | 1,547 | 45,300 | SHS | | Shared-Defined | 1/2/3 | 45,300 |
| ABERCROMBIE & FITCH | CL A | 002896207 | 4,797 | 100,000 | SHS | | Shared-Defined | 1/2/3 | 100,000 |
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
</TABLE>
Expected output:
+----+----------------+----------------+-------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| ID | NAME OF ISSUER | TITLE OF CLASS | CUSIP | VALUE (x1000 | SHRS OR PRN AMT | SH/PRN | PUT/CALL | INVESTMENT DISCRETION | OTHER MNGRS | VOTING AUTHORITY |
+----+----------------+----------------+-------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| 1 | x | x | x | x | x | x | x | x | x | x |
| 1 | x | x | x | x | x | x | x | x | x | x |
| 1 | x | x | x | x | x | x | x | x | x | x |
| 2 | x | x | x | x | x | x | x | x | x | x |
| 2 | x | x | x | x | x | x | x | x | x | x |
| 2 | x | x | x | x | x | x | x | x | x | x |
+----+----------------+----------------+-------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
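Comments:
- Well, the first step is to locate the blocks of text between <TABLE> and </TABLE>. How have you done that? Then you need to parse the cell definitions within each block. Give us something to work with!
- Unfortunately, that is the part I'm stuck on too. I looked online and tried several approaches, including fread, read.pattern and readLines, but I couldn't get any of them to work as expected.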
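The two steps named in the first comment (find the <TABLE> ... </TABLE> blocks, then parse the cells in each block) could be sketched in base R roughly as follows. This is only a sketch under assumptions: the function names `extract_table_blocks`, `parse_pipe_table` and `tables_from_lines` are made up for illustration, and the rows are assumed to be pipe-delimited exactly as in the example above. Real filings may use fixed-width columns instead, in which case the parsing step would need something like read.fwf.

```r
# Step 1: return a list of character vectors, one per <TABLE>...</TABLE> block.
extract_table_blocks <- function(lines) {
  starts <- grep("<TABLE>", lines, fixed = TRUE)
  ends   <- grep("</TABLE>", lines, fixed = TRUE)
  # Assumes tags are well paired and not nested.
  stopifnot(length(starts) == length(ends), all(starts < ends))
  # Keep only the lines strictly between each opening and closing tag.
  Map(function(s, e) lines[seq_len(e - s - 1) + s], starts, ends)
}

# Step 2: turn one block of "| a | b |" rows into a data frame.
# Assumption: the first pipe row is the header; "+---+" border lines are ignored.
parse_pipe_table <- function(block) {
  rows <- grep("^[[:space:]]*[|]", block, value = TRUE)
  if (length(rows) < 2) return(NULL)  # need a header and at least one data row
  cells <- lapply(rows, function(r) {
    parts <- strsplit(trimws(r), "|", fixed = TRUE)[[1]]
    trimws(parts[-1])  # drop the empty piece before the leading pipe
  })
  header <- cells[[1]]
  body   <- cells[-1]
  stopifnot(all(lengths(body) == length(header)))
  df <- as.data.frame(do.call(rbind, body), stringsAsFactors = FALSE)
  names(df) <- header
  df
}

# Combine every table found in one file and tag the rows with the file's ID.
# rbind() will fail if the tables do not share the same column names.
tables_from_lines <- function(lines, id) {
  parsed <- lapply(extract_table_blocks(lines), parse_pipe_table)
  parsed <- Filter(Negate(is.null), parsed)
  if (length(parsed) == 0) return(NULL)
  data.frame(ID = id, do.call(rbind, parsed),
             check.names = FALSE, stringsAsFactors = FALSE)
}

# Small in-memory stand-in for readLines(url), shaped like the example above.
demo <- c(
  "text text text",
  "<TABLE>",
  "+----------------+-----------+",
  "| NAME OF ISSUER | CUSIP |",
  "+----------------+-----------+",
  "| ABBVIE INC | 00287Y109 |",
  "+----------------+-----------+",
  "</TABLE>",
  "<TABLE>",
  "| NAME OF ISSUER | CUSIP |",
  "| ABERCROMBIE & FITCH | 002896207 |",
  "</TABLE>"
)
result <- tables_from_lines(demo, id = 1)
print(result)
```

For the real list, the same function could be applied to `readLines(url)` for each row of the URL data frame, with the per-file results stacked using `do.call(rbind, ...)`. Numeric columns such as the share counts come back as character strings (with thousands separators) and would still need converting.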
Tags: r dataframe text-extraction readr