Read a random sample from a URL: answers

【Question title】: Read a random sample from URL
【Posted】: 2018-04-02 20:18:31
【Question description】:

I want to read a random sample of a CSV-format file from a URL.

So far:

library(tidyverse)
library(data.table)

# load dataset from url, skip the first 16 rows
# then *after* reading it completely, use dplyr function
# for sampling. quite dumb, I want to do it while 
# reading the file

df <- read.csv('http://datashaping.com/passwords.txt', header = F, skip = 16) %>%
  sample_frac(.01) %>% 
  rename(password = V1)

Then I tried the following, as suggested in several posts:

df <- fread("shuf -n 10 http://datashaping.com/passwords.txt", skip = 16, header = F)

But this doesn't work for me. The error:

shuf: 'http://datashaping.com/passwords.txt': No such file or directory
Error in fread("shuf -n 10 http://datashaping.com/passwords.txt", skip = 16,  : 
  File is empty: /dev/shm/file1ab1608b13cf

Also, fread seems quite slow.

Any ideas? Benchmarks?

EDIT

I tried to benchmark read.csv() against fread():

benchmark("read.csv" = {
            df <- read.csv('http://datashaping.com/passwords.txt', header = F, skip = 16)
            df <- df %>%
                sample_n(10) %>% 
                rename(password = V1)
          }, {
          df <- fread("wget -S -O - http://datashaping.com/passwords.txt | shuf -n10") 
          },
          replications = 100,
          columns = c("test", "replications", "elapsed",
                      "relative", "user.self", "sys.self"))

Warning message in fread("wget -S -O - http://datashaping.com/passwords.txt | shuf -n10"):
“Stopped reading at empty line 9 but text exists afterwards (discarded): 08090728”
Warning message in fread("wget -S -O - http://datashaping.com/passwords.txt | shuf -n10"):
“Stopped reading at empty line 6 but text exists afterwards (discarded): 0307737205”

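【Comments on the question】:

What exactly does "doesn't work for me" mean? Are you on a Linux machine? shuf expects a file, not a URL. What if you fetch the file with wget and stream it into shuf, like fread("wget -S -O - http://datashaping.com/passwords.txt | shuf -n10") (making sure to drop the skip)? Is more time spent downloading the file or reading/filtering it?
@MrFlick wget doesn't work for me either. I guess he is on a Windows machine, as am I. This is interesting: stackoverflow.com/questions/47172355/…
@xxxvinxxx Is there a reason you want to shuffle while reading? (Is the file too large?) You could reuse the syntax from your first example: df = fread('http://datashaping.com/passwords.txt', header = F, skip = 16)%>% sample_frac(.001) %>% rename(password = V1)
@MaxFt I assume the dplyr approach first downloads everything into memory and only then applies the random sampling. I was wondering whether there is a way to do it while the data is being downloaded, but that may not be possible...
@MrFlick I updated my question to add a benchmark against the other version. Besides taking too long to run, it ends with an error message that I really don't understand.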
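
The idea raised in the comments, sampling while the lines stream in rather than after loading everything, could be sketched with reservoir sampling over a base R connection. This is only an illustration, not something proposed in the thread: `reservoir_sample`, `k`, `skip`, and `chunk_size` are hypothetical names, and the demonstration runs on an in-memory connection instead of the real URL. Note that every line still has to be read (so the whole file is still downloaded); what changes is that only `k` lines are kept in memory.

```r
# Reservoir sampling (Algorithm R) over a connection, read in chunks.
# All names here are hypothetical and chosen for illustration only.
reservoir_sample <- function(con, k, skip = 0, chunk_size = 10000) {
  if (!isOpen(con)) {
    open(con, "r")
    on.exit(close(con))                    # close only what we opened
  }
  if (skip > 0) readLines(con, n = skip)   # discard the leading header lines
  reservoir <- character(0)
  seen <- 0                                # number of data lines read so far
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0) break          # end of input
    for (line in chunk) {
      seen <- seen + 1
      if (seen <= k) {
        reservoir[seen] <- line            # first k lines fill the reservoir
      } else {
        j <- sample.int(seen, 1)           # uniform draw from 1..seen
        if (j <= k) reservoir[j] <- line   # keep the line with probability k / seen
      }
    }
  }
  reservoir
}

# Offline demonstration: 3 fake header lines followed by 1000 fake passwords.
demo_lines <- c(paste("header", 1:3), paste0("pw", 1:1000))
demo_con <- textConnection(demo_lines)
set.seed(1)                                # only to make the draw repeatable
smp <- reservoir_sample(demo_con, k = 10, skip = 3)
close(demo_con)
```

Under the same assumptions, the call for the file in the question would presumably look like `reservoir_sample(url("http://datashaping.com/passwords.txt"), k = 10, skip = 16)`; this has not been run against that URL, and the per-line loop in plain R is likely to be slow for roughly two million lines.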

Tags: r dplyr data.table

【Solution 1】:

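It looks like the file is not a CSV, and the data starts at line 15. I'm on Windows 10 right now and this works perfectly for me (it reads the entire file, not a random sample):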

> test <- fread("http://datashaping.com/passwords.txt",skip=15)
trying URL 'http://datashaping.com/passwords.txt'
Content type 'text/plain' length 20163417 bytes (19.2 MB)
downloaded 19.2 MB

Read 2069847 rows and 1 (of 1) columns from 0.019 GB file in 00:00:03

It gives the data.table structure as expected:

> str(test)
Classes ‘data.table’ and 'data.frame':  2069847 obs. of  1 variable:
 $ #: chr  "07606374520" "piontekendre" "rambo144" "primoz123" ...
 - attr(*, ".internal.selfref")=<externalptr>

You can access all the data like this (using with=FALSE to refer to the column by number):

> test[,1,with=FALSE]
                    #
      1:  07606374520
      2: piontekendre
      3:     rambo144
      4:    primoz123
      5:      sal1387
     ---             
2069843:     26778982
2069844:      brazer1
2069845:   usethisone
2069846:  scare222273
2069847:     anto1962

You can access individual passwords like this:

> test[1,1,with=FALSE]
             #
1: 07606374520
> test[5,1,with=FALSE]
         #
1: sal1387
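
Since the call above reads the whole file, a random sample can then be drawn from the resulting table. The following is a minimal sketch of that step, assuming the data.table package is installed; it uses a five-row stand-in built from the values printed above instead of downloading the file, and the new column name `password` is taken from the question rather than from this answer.

```r
library(data.table)

# Small stand-in for the table returned by fread(); the column is named "#"
# as in the str(test) output above.
test <- data.table(`#` = c("07606374520", "piontekendre", "rambo144",
                           "primoz123", "sal1387"))

setnames(test, "#", "password")   # give the column a usable name
set.seed(42)                      # only to make the draw repeatable
smp <- test[sample(.N, 3)]        # 3 rows drawn without replacement
```

On the full table the same indexing expression would apply, for example `test[sample(.N, 10)]` for ten rows.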

【Discussion】: