【Question title】: Splitting a SAS dataset into mutual-exclusive by row in parallel?
【Posted】: 2017-11-16 23:13:06
【Question】:

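There is an SO question on splitting a large dataset into smaller ones, but with the advent of proc ds2, surely there must be a way to do this using threads?

I have written the data step below to split a dataset into &chunks. chunks. I tried to write the same thing in proc ds2, but it fails. I am quite new to proc ds2, so a simple explanation aimed at someone with a good understanding of the data step would be ideal.

Data step code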

%macro output_chunks(in, out, by, chunks);
data %do i = 1 %to &chunks.;
    &out.&i.(compress=char drop = i)
    %end;
;
    set &in.;
    by &by.;
    retain i 0;
    if first.&by. then do;
        i = i + 1;      
        if i = &chunks.+1 then i = 1;
    end;

    %do i = 1 %to &chunks.;
        if i = &i. then do;
            output  &out.&i.;
        end;
    %end;
run;
%mend;
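
For illustration, a call to the macro might look like the following. This is a sketch rather than part of the original post: it assumes sashelp.class as input, the output names are hypothetical, and it sorts first because the BY statement inside the macro requires input ordered by the BY variable.

```sas
/* The macro's BY statement needs the input sorted by the BY variable. */
proc sort data=sashelp.class out=work.class_sorted;
  by sex;
run;

/* Writes work.class_chunk1 and work.class_chunk2 (hypothetical names). */
/* Each BY group goes to exactly one chunk, assigned round-robin in     */
/* sorted order, so the chunks are mutually exclusive by row.           */
%output_chunks(work.class_sorted, work.class_chunk, sex, 2)
```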

proc ds2 code

proc ds2; 
  thread split/overwrite=yes; 
    method run(); 
      set in_data; 
      thisThread=_threadid_; 
      /* can make below into macro but I can't seem to get it to work */
      if thisThread = 1 then do;
        output ds1;
      end;
      else if thisThread = 2 then do;
        output ds2;
      end;
    end; 
    method term();
      put '**Thread' _threadid_ 'processed'  count 'rows:';
    end;
  endthread; 
  run; 
quit;

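【Comments】:

To clarify: do you specifically want a sas-ds2 solution here? Or are you looking for a general solution that is more efficient than the regular data step? (For example, there is a way to do this with hash tables that is more efficient than the solutions suggested in that thread.)
Note: I added a hash solution to that question.
Do you have SAS/CONNECT installed? That might be another possible approach.

Tags: sas sas-ds2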

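【Solution 1】:

So, you are right in the sense that DS2 could help here. However, I suspect it is a bit more complicated than that.

DS2 will happily handle the data step, but what is more challenging is writing to several different datasets. That is because there is no good way to construct the output dataset names without using the macro language, and as far as I know the macro language does not play well with threads (though I am no expert here).

Here is an example that uses threads: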

PROC DS2;

     thread in_thread/overwrite=yes;
         dcl bigint count;
         drop count;
         method init();
             count=0;
         end;
         method run();
             set in_data;
             count+1;
             output;
         end;
         method term();
             put 'Thread' _threadid_ ' processed' count 'observations.';
         end;
     endthread;
     run;

     data out_data/overwrite=yes;
         dcl thread in_thread t_in; /* instance of the thread */
         method run();
           set from t_in threads=4;
           output;
         end;
     enddata;
     run;
quit;

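But this only writes out a single dataset, and if you change threads=4 to 1 it does not actually take any longer. Both are acceptable speed-wise, though actually slower than a regular data step (taking roughly 1.8 times as long for me). When reading SAS datasets, DS2 uses a much slower method to access the data behind the scenes than the base SAS data step does; DS2's speed gains really come into play when you are working in an RDBMS via SQL or similar.

However, there is no good way to drive the output in parallel. Here is the version above turned into 4 datasets. Note that the actual choice of where to output is made in the main, non-threaded data step...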

PROC DS2;

     thread in_thread/overwrite=yes;
         dcl bigint count;
         dcl bigint thisThread;
         drop count;
         method init();
             count=0;
         end;
         method run();
             set in_data;
             count+1;
             thisThread = _threadid_;
             output;
         end;
         method term();
             put 'Thread' _threadid_ ' processed' count 'observations.';
         end;
     endthread;
     run;

     data a b c d/overwrite=yes;
         dcl thread in_thread t_in; /* instance of the thread */
         method run();
           set from t_in threads=4;
           select(thisThread);
             when (1) output a;
             when (2) output b;
             when (3) output c;
             when (4) output d;
             otherwise;
           end;
         end;
     enddata;
     run;
quit;

So it is actually a lot slower than the non-threaded version. Oops!

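Really, your problem is that disk i/o is the bottleneck, not CPU. Your CPU is doing almost no work here. DS2 might be able to help in some edge cases, where you have a very fast SAN that allows a lot of simultaneous writes, but ultimately reading those million records takes X time and writing a million records takes X time as well, based on your i/o constraints, and the possibility of parallelizing does not help.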
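
One way to check whether a step is I/O-bound on your own system is to compare real time with CPU time in the log. A minimal sketch, assuming the input is the in_data dataset used above; work.copy_test is a throwaway name chosen for illustration:

```sas
/* FULLSTIMER adds detailed resource statistics (user and system CPU */
/* time, memory) to the notes the log prints for each step.          */
options fullstimer;

/* Plain copy of the input: one full read plus one full write. */
data work.copy_test;   /* hypothetical output name */
  set in_data;
run;

/* Restore the default, shorter timing notes. */
options nofullstimer;
```

If the real time reported in the log is far larger than the user plus system CPU time, the step is spending most of its time waiting on disk, and adding threads is unlikely to help.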

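I suspect a hash table would help a lot, and it certainly could be used with DS2 here; see my new answer on the other question linked in the OP for the data step version. DS2 probably would not make that solution any faster, and more likely slower; but if you wanted to, you could implement roughly the same thing in DS2, and then the child threads would be able to write their own output without involving the main thread.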
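
As a rough illustration of what a hash-based split can look like in a regular data step, here is a sketch. It is not the answer referred to above, and it rests on assumptions: sashelp.cars is used as sample input, the split key is origin, and the output names (work.cars_sorted, work.cars_<origin>) are hypothetical. The point is that the hash object's OUTPUT method takes the dataset name as a run-time expression, so the outputs do not have to be listed on the DATA statement.

```sas
/* Sort so that each key value forms one contiguous BY group. */
proc sort data=sashelp.cars out=work.cars_sorted;
  by origin;
run;

data _null_;
  /* DECLARE with parentheses is executable: a fresh hash instance */
  /* is created on every data step iteration, one per BY group.    */
  declare hash h(multidata: 'y');
  h.defineKey('origin');
  h.defineData(all: 'y');   /* keep every dataset variable */
  h.defineDone();

  /* Load all rows of the current BY group into the hash. */
  do until (last.origin);
    set work.cars_sorted;
    by origin;
    h.add();
  end;

  /* Write the group to its own dataset; the name is built at run */
  /* time, so the key values must be valid SAS name fragments.    */
  h.output(dataset: cats('work.cars_', origin));

  /* Free the memory before the next group is loaded. */
  h.delete();
run;
```

Each group is held in memory before it is written, so this approach suits groups that fit comfortably in RAM.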

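Where DS2 would help is if you were doing this in Teradata or another tool where you can use SAS's in-database accelerator to execute this code on the database side. That would make things much more efficient. Then you could use something like my code above, or better yet a hash solution.

【Comments】:

【Solution 2】:

An example of a user-defined DS2 package that splits a dataset using the HoH (hash of hashes) approach. The biggest drawback is that, because the usefulness of variable lists in DS2 is very limited, the datasets cannot be named by key value without a lot of fudging; as a result I chose a simpler naming convention: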

data cars;
set sashelp.cars;
run;

proc ds2;

package hashSplit / overwrite=yes;

declare package hash  h  ();
declare package hash  hs ();
declare package hiter hi;

/**
  * create a child multidata hash object
  */
private method mHashSub(varlist k, varlist d) returns package hash;
  hs = _new_ [this] hash();
  hs.keys(k);
  hs.data(d);
  hs.multidata('Y');
  hs.defineDone();
  return hs;
end;

/**
  * constructor, create the parent and child hash objects
  */
method hashSplit(varlist k);
  h = _new_ [this] hash();
  h.keys(k);
  h.definedata('hs');
  h.defineDone();
end;

/**
  * adds key values to parent hash, if necessary
  * adds key values and data values to child hash
  * consolidates the FIND, ADD and nested ADD methods
  */
method add(varlist k, varlist d);
  declare double rc;

  rc = h.find();
  if rc ^= 0 then do;
    hs = mHashSub(k, d);
    h.add();
  end;
  hs.add();
end;

/**
  * outputs the child hashes to data sets with a fixed naming convention
  *
  * SAS needs to add more support for using variable lists with functions/methods besides hash
  */
method output();
  declare double rc;
  declare int i;

  hi = _new_ hiter('h');

  rc = hi.first();
  do i = 1 to h.num_items by 1 while (rc = 0);
    hs.output(catx('_', 'hashSplit', i));
    rc = hi.next();
  end;
end;

endpackage;
run;
quit;

/**
  * example of using the hashSplit package
  */
proc ds2;
data _null_;
varlist k [origin];
varlist d [_all_];
declare package hashSplit split(k);

method run();
  set cars;
  split.add(k, d);
end;

method term();
  split.output();
end;
enddata;
run;
quit;

【Comments】:

Does this actually use threads? If so, does it improve performance?