Converting SFrames into an input dataset

【Question title】: Converting SFrames into input dataset Sframes
【Posted】: 2016-06-15 14:39:43
【Question】:

I have a very clumsy way of converting my input logs into an input dataset. I have an SFrame sf with the following format:

user_id     int
timestamp   datetime.datetime
action      int
reasoncode  str

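The action column takes 9 values, from 1 to 9.

So each user_id can perform several different actions, and each of them more than once.

I am trying to get all unique user_id values from sf and build an op_sf in the following way: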

y = 225

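def calc_class(a, x):
    # Days between the reference date `dte` (defined elsewhere, not shown) and each action.
    diffd = a['timestamp'].apply(lambda t: (dte - t).days)
    g = 0  # actions older than y days
    b = 0  # actions within the last y days
    for i in diffd:
        if i > y:
            g += 1
        else:
            b += 1
    if b >= x:
        return 4
    elif b != 0:
        return 3
    elif g >= 0:  # always true, so the final branch is unreachable as written
        return 2
    else:
        return 1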

l1 = []
ids = sf['user_id'].unique()

for idd in ids:
 temp = sf[sf['user_id']== idd]
 zero1 = temp[temp['action'] == 1]
 zero2 = temp[temp['action'] == 2]
 zero3 = temp[temp['action'] == 3]
 zero4 = temp[temp['action'] == 4]
 zero5 = temp[temp['action'] == 5]
 zero6 = temp[temp['action'] == 6]
 zero7 = temp[temp['action'] == 7]
 zero8 = temp[temp['reasoncode'] == 'xyz']
 zero9 = temp[temp['reasoncode'] == 'abc']
 # I'm getting clas1 to clas9 from the function calc_class for each action.
 # clas1 to clas9 are integers ranging from 1 to 4.
 clas1 = calc_class(zero1,2)
 clas2 = calc_class(zero2,2)
 clas3 = calc_class(zero3,2)
 clas4 = calc_class(zero4,2)
 clas5 = calc_class(zero5,2)
 clas6 = calc_class(zero6,2)
 clas7 = calc_class(zero7,2)
 clas8 = calc_class(zero8,2)
 clas9 = calc_class(zero9,2)
 l1.append([idd,clas1,clas2,clas3,clas4,clas5*(-1),clas6*(-1),clas7*(-1),clas8*(-1),clas9])

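I would like to know whether this is the fastest way to do it. Specifically, is it possible to do the same thing without generating the zero1 to zero9 SFrames?

An example: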

user_id timestamp action reasoncode 
574 23/09/15 12:43  1   None
574 23/09/15 11:15  2   None
574 06/10/15 11:20  2   None
574 06/10/15 11:21  3   None
588 04/11/15 10:00  1   None
588 05/11/15 10:00  1   None
555 15/12/15 13:00  1   None
585 22/12/15 17:30  1   None
585 15/01/16 07:44  7   xyz
588 06/01/16 08:10  7   abc

The l1 corresponding to the sf above:

574 1   2   2   0   0   0   0   0   0
588 3   0   0   0   0   0   0   0   3
555 3   0   0   0   0   0   0   0   0
585 3   0   0   0   0   0   0   3   0

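【Comments】:

From this code snippet it is hard to understand what you are trying to do. Could you describe your goal in more detail and show a small example of the input and output data?
You seem to have some complicated logic here. As @papayawarrior suggested, a simpler example with some sample data would help. That said, judging from the logic that generates all the "zero" SFrames, I don't see any reason for you to loop over every unique ID. You can probably avoid generating all of them by using apply with a function that contains this logic (e.g. if action == 6, use x6; else if it is 7, use x7) and applying it to the whole SFrame. You can also convert the timestamps for the whole SFrame at once.
@EvanSamanas Can you give an example?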

Tags: performance graphlab sframe

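【Solution 1】:

I think your logic is relatively complex, but it is still more efficient to use column-wise operations on the whole dataset than to extract a subset of rows for each user. The key tools are SFrame.groupby, SFrame.apply, SFrame.unstack, and SFrame.unpack. The API docs are here: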

https://dato.com/products/create/docs/generated/graphlab.SFrame.html
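To make the roles of those tools a bit more concrete, here is a rough sketch of the same reshaping idea in plain Python 3, with dictionaries standing in for groupby, unstack and unpack. It does not use GraphLab at all; the sample rows and the helper name pivot_counts are made up for illustration, and it only counts rows rather than coding them.

```python
from collections import Counter, defaultdict

def pivot_counts(rows):
    """Count rows per (user, action), then spread the actions into columns."""
    # Step 1 (roughly groupby + COUNT): one entry per (user, action) pair.
    counts = Counter((r['user'], r['action']) for r in rows)

    # Step 2 (roughly unstack): gather each user's {action: count} dictionary.
    per_user = defaultdict(dict)
    for (user, action), n in counts.items():
        per_user[user][action] = n

    # Step 3 (roughly unpack + fillna): one column per action, 0 where missing.
    actions = sorted({action for _, action in counts})
    table = {user: [d.get(a, 0) for a in actions] for user, d in per_user.items()}
    return table, actions

# Hypothetical sample rows, not taken from a real log.
rows = [{'user': 574, 'action': 1}, {'user': 574, 'action': 2},
        {'user': 574, 'action': 2}, {'user': 588, 'action': 1},
        {'user': 588, 'action': 7}]
table, actions = pivot_counts(rows)
print(actions)     # [1, 2, 7]
print(table[574])  # [1, 2, 0]
print(table[588])  # [1, 0, 1]
```

The point of the sketch is that the whole dataset is scanned once, and the per-user, per-action layout falls out of the grouping keys rather than out of repeated filtering.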

Here is a solution that uses slightly simpler data than your example, and simpler logic for coding old vs. new actions.

# Set up and make the data
import graphlab as gl
import datetime as dt

sf = gl.SFrame({'user': [574, 574, 574, 588, 588, 588],
                'timestamp': [dt.datetime(2015, 9, 23), dt.datetime(2015, 9, 23),
                              dt.datetime(2015, 10, 6), dt.datetime(2015, 11, 4),
                              dt.datetime(2015, 11, 5), dt.datetime(2016, 1, 6)],
                'action': [1, 2, 3, 1, 1, 7]})

# Count old vs. new actions.
sf['days_elapsed'] = (dt.datetime.today() - sf['timestamp']) / (3600 * 24)
sf['old_threshold'] = sf['days_elapsed'] > 225

aggregator = {'total_count': gl.aggregate.COUNT('user'),
              'old_count': gl.aggregate.SUM('old_threshold')}
grp = sf.groupby(['user', 'action'], aggregator)

# Code the actions according to old vs. new. Use your own logic here.
grp['action_code'] = grp.apply(
                       lambda x: 2 if x['total_count'] > x['old_count'] else 1)
grp = grp[['user', 'action', 'action_code']]

# Reshape the results into columns.
sf_new = (grp.unstack(['action', 'action_code'], new_column_name='action_code')
             .unpack('action_code'))

# Fill in zeros for entries with no actions.
for c in sf_new.column_names():
    sf_new[c] = sf_new[c].fillna(0)

print(sf_new)

+------+---------------+---------------+---------------+---------------+
| user | action_code.1 | action_code.2 | action_code.3 | action_code.7 |
+------+---------------+---------------+---------------+---------------+
| 588  |       2       |       0       |       0       |       2       |
| 574  |       1       |       1       |       1       |       0       |
+------+---------------+---------------+---------------+---------------+
[2 rows x 5 columns]
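If you want the thresholds from the question rather than the simplified old-vs-new coding, the same single-pass idea might look roughly like the sketch below. It is plain Python 3, not GraphLab, and it rests on several assumptions: the reference date (2016-06-15) is hypothetical, the names classify, code_from_counts and REASON_BUCKET are invented here, buckets with no rows are coded 0, and the sign flips applied to some columns in the question are left out.

```python
import datetime as dt
from collections import defaultdict

Y_DAYS = 225                          # the question's threshold `y`
REASON_BUCKET = {'xyz': 8, 'abc': 9}  # the question's reasoncode filters

def code_from_counts(old, recent, x=2):
    """The question's calc_class rules, for a group with at least one row."""
    if recent >= x:
        return 4
    if recent != 0:
        return 3
    return 2  # the group contains only old rows

def classify(rows, ref_date):
    """Return {user_id: [codes for buckets 1..9]}, with 0 for empty buckets."""
    # counts[(user, bucket)] = [old_count, recent_count], filled in one pass.
    counts = defaultdict(lambda: [0, 0])
    for user, ts, action, reason in rows:
        is_recent = (ref_date - ts).days <= Y_DAYS
        buckets = [action] if 1 <= action <= 7 else []
        if reason in REASON_BUCKET:
            buckets.append(REASON_BUCKET[reason])
        for b in buckets:
            counts[(user, b)][1 if is_recent else 0] += 1

    result = defaultdict(lambda: [0] * 9)
    for (user, b), (old, recent) in counts.items():
        result[user][b - 1] = code_from_counts(old, recent)
    return dict(result)

# Rows for two users from the question's example table.
rows = [(574, dt.datetime(2015, 9, 23, 12, 43), 1, None),
        (574, dt.datetime(2015, 9, 23, 11, 15), 2, None),
        (574, dt.datetime(2015, 10, 6, 11, 20), 2, None),
        (574, dt.datetime(2015, 10, 6, 11, 21), 3, None),
        (588, dt.datetime(2015, 11, 4, 10, 0), 1, None),
        (588, dt.datetime(2015, 11, 5, 10, 0), 1, None),
        (588, dt.datetime(2016, 1, 6, 8, 10), 7, 'abc')]
codes = classify(rows, ref_date=dt.datetime(2016, 6, 15))  # hypothetical date
print(codes[574])  # [2, 2, 2, 0, 0, 0, 0, 0, 0]
print(codes[588])  # [4, 0, 0, 0, 0, 0, 3, 0, 3]
```

A row can land in two buckets (its action and its reasoncode), which mirrors how zero7 and zero9 can both pick up the same row in the question's loop. In GraphLab the counting step would correspond to the groupby aggregation above, with code_from_counts playing the role of the lambda passed to apply.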

【Comments】:

Awesome! I learned a lot! I didn't know about unpack(). Thank you very much.