R: reading old 13F txt files from SEC Edgar database using R edgar package

【Question title】: R: reading old 13F txt files from SEC Edgar database using R edgar package
【Posted】: 2021-10-06 06:06:00
【Question description】:

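Hello, I am trying to use the R edgar package to read 13F filings from the SEC EDGAR database.

The challenge is that the filings I am looking at are old (~2000): https://www.sec.gov/edgar/browse/?CIK=1087699

They are in a messy txt format, unlike today's 13F filings, and cannot be read with the readtxt function.

An example file is here: https://www.sec.gov/Archives/edgar/data/1087699/000108769999000001/0001087699-99-000001.txt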

library(edgar)

F13 <- getFilings(
  cik.no = "0001087699",
  form.type = "13F-HR",
  1999,
  quarter = c(1, 2, 3),
  useragent = "myname@gmail.com"
)

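I tried this, and R just tells me it is busy and keeps downloading forever, even though it is not a large txt file, so something is wrong. When it finally finishes, it says that no filing information was found for the given CIK and form type, yet I am clearly looking at the filing. If the edgar package is not designed to handle this, what should I do?

My ultimate goal is to get the filing into a nice data frame, with columns for ticker and price and one row per stock. Please help.

Is there any scraping approach available? I highlighted the lines with Inspect in Chrome, but they look strange to me (sorry, I am not good at scraping at all).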

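【Question discussion】:

The package probably points to the complete submission files on EDGAR. If that is the case, these are the full back-end files that power the rendered HTML pages you normally navigate in the browser. You could try a scraping package such as rvest, although that is not really recommended. Alternatively, you could develop your own scraper and preprocessing functions to strip out all the unwanted text. This is what I have done so far.
@FrancescoGrossetti Yes, unfortunately I am not good at scraping...

Tags: r web-scraping txt edgar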

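【Solution 1】:

I parsed the file you provided as an example. I first copied the data from the filing into a txt file. The file copied.txt needs to be in the current working directory. This should give you an idea of how to proceed.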

library(tidyverse)

# Note: the patterns below assume Windows-style "\r\n" line endings
# (see the discussion for the changes needed when the file only has "\n")
df <- read_file("copied.txt") %>%
  # extract only the text of the holdings table
  (function(x) {
    tbl_beg <- str_locate(x, "Managers Sole")[2] + 1
    tbl_end <- str_locate(x, "\r\n</TABLE>")[1]
    str_sub(x, tbl_beg, tbl_end)
  }) %>%
  # trim unwanted characters from the start and end of the extracted string
  str_sub(start = 4, end = -3) %>%
  # split into individual lines
  str_split('\"\r\n\"') %>% unlist() %>%
  # remove a stray line break inside a record
  str_remove("\r\n") %>%
  # replace the space-separated text with an underscored version,
  # because the rows are split into columns on spaces below
  str_replace_all("Sole   Managers Sole", " Managers_Sole") %>%
  # collapse repeated whitespace
  str_squish() %>%
  # reverse each line so the split runs from the right: the company name
  # contains spaces, so it must end up as the last (unsplit) piece
  stringi::stri_reverse() %>%
  str_split(pattern = " ", n = 6, simplify = TRUE) %>%
  # restore the original character order within each field
  apply(MARGIN = 2, FUN = stringi::stri_reverse) %>%
  as_tibble() %>%
  # restore the original column order
  select(c(6:1)) %>%
  set_names(nm = c("name_of_issuer", "title_of_cl", "cusip_number", "fair_market_value", "shares",  "shares_of_princip_mngrs"))

# A tibble: 47 x 6
   name_of_issuer   title_of_cl cusip_number fair_market_value shares  shares_of_princip_mngrs
   <chr>            <chr>       <chr>        <chr>             <chr>   <chr>                  
 1 America Online   COM         02364J104    2,940,000         20,000  Managers_Sole          
 2 Anheuser Busch   COM         35229103     3,045,000         40,000  Managers_Sole          
 3 At Home          COM         45919107     787,500           5,000   Managers_Sole          
 4 AT&T             COM         1957109      5,985,937         75,000  Managers_Sole          
 5 Bank Toyko       COM         65379109     700,000           50,000  Managers_Sole          
 6 Bay View Capital COM         07262L101    14,958,437        792,500 Managers_Sole          
 7 Broadcast.com    COM         111310108    2,954,687         25,000  Managers_Sole          
 8 Chase Manhattan  COM         16161A108    10,578,750        130,000 Managers_Sole          
 9 Chase Manhattan  4/85C       16161A9DQ    59,375            500     Managers_Sole          
10 Cisco Systems    COM         17275R102    4,930,312         45,000  Managers_Sole
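
All columns in the result above are character vectors. As a possible follow-up step, here is a minimal base-R sketch for turning the value and share columns into numbers. It assumes the thousands separators are commas, as in the output shown. `holdings`, `to_number` and `value_per_share` are hypothetical names used only for illustration, and the two rows are copied from the output above.

```r
# Hypothetical stand-in for the first two rows of the tibble shown above
holdings <- data.frame(
  name_of_issuer    = c("America Online", "Anheuser Busch"),
  fair_market_value = c("2,940,000", "3,045,000"),
  shares            = c("20,000", "40,000"),
  stringsAsFactors  = FALSE
)

# Strip the thousands separators, then convert to numeric
to_number <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

holdings$fair_market_value <- to_number(holdings$fair_market_value)
holdings$shares            <- to_number(holdings$shares)

# Illustrative derived column (not part of the filing itself):
# reported value divided by the number of shares
holdings$value_per_share <- holdings$fair_market_value / holdings$shares
```

Whether the derived column can be read as a price depends on the units the filer used for the value column, which should be checked against the filing.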

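【Discussion】:

Hello, thank you for your help. Strangely, when I saved the same file from the SEC as txt and read it, the code did not work. The error was "Error: Can't subset columns that don't exist. x Locations 6, 5, 4, 3, and 2 don't exist. ℹ There is only 1 column."
But I went through your code line by line and found why it did not work. In tbl_end <- str_locate(x, "\r\n</TABLE>")[1], the culprit is the \r. I do not know what \r is, although it does appear in the stringr cheat sheet, but I could see what you were trying to do, so I took the \r out and \\n works. Next, str_sub(start = 4, end = -3) needs to be str_sub(start = 3, end = -3) so that the first row reads "America Online" without losing its first letter.
Can you explain \r? Other than that, your idea and code work after the patch. I really like your approach, and the result looks clean and tidy. I can adapt it to other filings whose format has changed (yes, the SEC is not consistent at all). Please accept my humble token of thanks.
@ML33M Sorry to hear the code did not work as expected. At the operating-system level, a text file contains data, and each line ends with a newline character \n (line feed). Some operating systems, such as Windows, also add \r, a carriage return (that is, the cursor returns to the start of the line). Since I am on Windows, the file contains these extra \r characters for me. I assume your operating system is different, so you only have \n. Honestly, I was not aware of this, so I posted the code that worked for me.
No worries my friend, you are right, my OS is Mac. Now I have learned something new, which is good for me. Frankly, the SEC has been a pain; their filings are not consistent. The one I just fixed failed on the next quarter's report... now I am back to hunting for where it breaks, haha.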
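
One way to sidestep the line-ending difference discussed above is to normalize the text before parsing, so the same patterns work on Windows, macOS and Linux. A minimal base-R sketch follows. `normalize_newlines` is a hypothetical helper name and `raw_text` is a made-up sample, not the real filing.

```r
# Made-up sample with Windows-style line endings (not the real filing)
raw_text <- "\"America Online COM\"\r\n\"At Home COM\"\r\n</TABLE>"

# Convert CRLF, and then any remaining lone CR, to LF
normalize_newlines <- function(x) {
  x <- gsub("\r\n", "\n", x, fixed = TRUE)
  gsub("\r", "\n", x, fixed = TRUE)
}

clean_text <- normalize_newlines(raw_text)

# After normalization the end of the table can be located with "\n</TABLE>"
# regardless of the OS the file was saved on
tbl_end <- regexpr("\n</TABLE>", clean_text, fixed = TRUE)
```

An alternative that leaves the text untouched is to make the regular expression tolerant of both forms, for example "\r?\n</TABLE>". Either way, fixed character offsets such as start = 4 still depend on how many characters precede the first record, so they may need adjusting.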

【Solution 2】:

You can use the httr package to request the page:

> install.packages("httr")
# follow instructions etc

Then, in the R shell (you may need to restart it):

> httr::GET("https://www.sec.gov/Archives/edgar/data/1087699/000108769999000001/0001087699-99-000001.txt")

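This downloads the file successfully. I am not fluent enough in R to parse the text, but it looks straightforward: split the text on <TABLE>, split the rows on newlines, and split each row into columns on whitespace.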
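
The steps described in this answer (request the page, split on <TABLE>, then on newlines, then on whitespace) could be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch_filing_text` and `parse_holdings` are hypothetical helper names, the sample text is made up and merely imitates the layout of an old filing, the column names are chosen for the example, and columns are assumed to be separated by at least two spaces, which real filings may not honor.

```r
# Download a filing as one character string (requires the httr package).
# The user agent mirrors the useragent argument used in the question.
fetch_filing_text <- function(url, user_agent) {
  resp <- httr::GET(url, httr::user_agent(user_agent))
  httr::stop_for_status(resp)
  httr::content(resp, as = "text", encoding = "UTF-8")
}

# Parse the first <TABLE> block into a data frame of character columns
parse_holdings <- function(filing_text) {
  # 1. Keep only the text between <TABLE> and </TABLE>
  after_open <- strsplit(filing_text, "<TABLE>", fixed = TRUE)[[1]][2]
  table_text <- strsplit(after_open, "</TABLE>", fixed = TRUE)[[1]][1]

  # 2. Split into rows on newlines (tolerating \r\n) and drop empty rows
  rows <- strsplit(table_text, "\r?\n")[[1]]
  rows <- rows[nzchar(trimws(rows))]

  # 3. Drop the header row, then split each row on runs of two or more
  #    spaces so that single spaces inside issuer names are preserved
  cells <- strsplit(trimws(rows[-1]), " {2,}")
  holdings <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
  names(holdings) <- c("name_of_issuer", "title_of_class", "cusip", "value", "shares")
  holdings
}

# Made-up fragment imitating the layout (assumption, not the real file)
sample_text <- paste(
  "<TABLE>",
  "Name of Issuer   Class  CUSIP      Value      Shares",
  "America Online   COM    02364J104  2,940,000  20,000",
  "AT&T             COM    1957109    5,985,937  75,000",
  "</TABLE>",
  sep = "\n"
)

holdings <- parse_holdings(sample_text)
```

For a real filing, the text returned by fetch_filing_text(url, "myname@gmail.com") would be passed to parse_holdings instead of sample_text. If rows wrap onto a second line, they would need to be rejoined before step 3.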

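【Discussion】:

Hi, sorry, I know you are on the right track, but as I said, I really do not know how to scrape. Also, when I copy-pasted the file into a text editor, I realized the column spacing may be messed up: line 2 is actually the tail of line 1, and the columns just wrapped for some reason.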