Answers: creating a dummy variable with 200k unique values

【Question title】: Creating Dummy Variable with 200k unique value
【Posted】: 2021-04-21 09:25:42
【Question description】:

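I am trying to create dummy variables for a categorical dataset, but Python runs out of RAM because there are too many unique values to encode. It is a large dataset with 500k rows and 200k unique values. Is it possible to create dummy variables with 200k unique values?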

【Question discussion】:

Tags: python jupyter-notebook data-science

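【Solution 1】:

Doing this does indeed require a lot of RAM.

As far as programmatic solutions I can think of:

Dimensionality reduction: if there is some relationship among your 200K categories, you may be able to reduce them (for example, a hierarchy of levels lets you group categories and run the analysis level by level, e.g. lvl1 = 10 categories, lvl2 = 100, and so on). May I ask: what type of data do you have which contains 200K unique category values?
Split the dataset and merge the results: I use numpy for this below. You end up with smaller subsets, each encoded against all 200K categories (even if some categories are absent from a given subset). You then need to decide how to process those subsets further.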

Somehow the import statements broke the formatting, so I am putting them here separately:

import numpy as np
import random

And the rest of the code:

def np_one_hot_encode(n_categories: int, arr: np.ndarray):
    # Performs one-hot encoding of arr based on n_categories.
    # arr must hold integer category codes in the range [0, n_categories).
    # Allows encoding smaller chunks of a bigger array
    # even if a chunk does not contain every category,
    # while still producing n_categories columns for each chunk.
    result = np.zeros((arr.size, n_categories))
    result[np.arange(arr.size), arr] = 1
    return result


# Testing our encoding function
# even if our input array doesn't contain all categories
# the output does cater for all categories
encoded = np_one_hot_encode(3, np.array([1, 0]))
print('test np_one_hot_encode\n', encoded)
assert np.array_equal(encoded, np.array([[0, 1, 0], [1, 0, 0]]))

# Generating 500K rows with 200K unique categories present at least once
total = int(5e5)
nunique = int(2e5)
uniques = list(range(0, nunique))
random.shuffle(uniques)
values = uniques+(uniques*2)[:total-nunique]
print('Rows count', len(values))
print('Uniques count', len(set(values)))

# Produces subsets of the data in (~500K/50 x nuniques) shape:
n_chunks = 50
for i, chunk in enumerate(np.array_split(values, n_chunks)):
    print('chunk', i, 'shape', chunk.shape)
    encoded = np_one_hot_encode(nunique, chunk)
    print('encoded', encoded.shape)

And the output:

test np_one_hot_encode
[[0. 1. 0.]
[1. 0. 0.]]
Rows count 500000
Uniques count 200000
chunk 0 shape (10000,)
encoded (10000, 200000)
chunk 1 shape (10000,)
encoded (10000, 200000)

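Distributed processing: use tools such as Dask or Spark, so that you can work on subsets of the data.
Database: the other solution I can think of is to normalize your model into a database (relational or a "big" flat data model), where you can use indexes to filter and process part of the data (only some rows and some categories), which lets you handle smaller outputs in memory.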
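To make the dimensionality-reduction option concrete, here is a minimal sketch, assuming pandas is available and that you can build a mapping from each fine-grained category to a coarser group. The column names, category values and the `to_level1` mapping are hypothetical and only for illustration.

```python
import pandas as pd

# Hypothetical fine-grained categories (stand-ins for the real 200K values).
df = pd.DataFrame(
    {"category": ["acme_sales", "acme_hr", "globex_sales", "acme_sales"]}
)

# Hypothetical mapping from each fine-grained value to a coarser level-1 group.
to_level1 = {"acme_sales": "acme", "acme_hr": "acme", "globex_sales": "globex"}

# Replace every fine-grained value by its group.
df["level1"] = df["category"].map(to_level1)

# One dummy column per coarse group instead of one per fine-grained value.
dummies = pd.get_dummies(df["level1"], dtype="uint8")
print(dummies.columns.tolist())
print(dummies.shape)
```

Whether such a grouping exists at all depends on the data; without one, this option does not apply.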

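But in the end there is no magic: if ultimately you're trying to load an N x M matrix into memory with N=500K and M=200K, it will take the RAM it needs to take, and there is no way around that. The most likely gains will therefore come from dimensionality reduction or from a completely different approach to processing the data (e.g. distributed computing).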
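To put rough numbers on that, the sketch below computes the nominal size of a dense array from its shape and element size. The helper name `dense_bytes` is made up for this illustration. `float64` is what `np.zeros` produces by default, and `uint8` is shown only as a smaller alternative.

```python
import numpy as np


def dense_bytes(n_rows: int, n_cols: int, dtype) -> int:
    # Nominal size of a dense n_rows x n_cols array of the given dtype.
    return n_rows * n_cols * np.dtype(dtype).itemsize


n_rows, n_cols = 500_000, 200_000

full_f64 = dense_bytes(n_rows, n_cols, np.float64)   # whole matrix, default dtype
full_u8 = dense_bytes(n_rows, n_cols, np.uint8)      # whole matrix, 1 byte per cell
chunk_f64 = dense_bytes(10_000, n_cols, np.float64)  # one of 50 chunks, default dtype

print('full float64:', full_f64 / 1e9, 'GB')
print('full uint8:', full_u8 / 1e9, 'GB')
print('chunk float64:', chunk_f64 / 1e9, 'GB')
```

So even a single chunk of 10,000 rows is large when encoded densely as floats, which is worth keeping in mind when choosing the number of chunks.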

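【Discussion】:

Thank you for your answer, I think it is really helpful. As for the data, those 200k unique values are strings.
You're welcome. String type of data: could you elaborate?
The 200k unique values are company names and jobs.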
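Since the values are strings, they first need to be turned into integer codes before an index-based encoder like the one in the answer can use them. A minimal sketch of that step follows, assuming numpy and scipy are installed; the company names are invented. It also stores the result as a SciPy sparse matrix, which is a different technique from the dense chunks in the answer and only helps if whatever consumes the matrix accepts sparse input.

```python
import numpy as np
from scipy import sparse

# Hypothetical string categories (stand-ins for company names).
names = np.array(["Acme", "Globex", "Acme", "Initech"])

# uniques is sorted; codes[i] is the position of names[i] in uniques.
uniques, codes = np.unique(names, return_inverse=True)

n_rows, n_categories = codes.size, uniques.size

# Sparse one-hot matrix: exactly one stored value per row,
# at column codes[i] of row i.
one_hot = sparse.csr_matrix(
    (np.ones(n_rows, dtype=np.uint8), (np.arange(n_rows), codes)),
    shape=(n_rows, n_categories),
)

print(uniques)
print(codes)
print(one_hot.shape, one_hot.nnz)
```

With one stored value per row, the number of stored entries grows with the row count rather than with rows times categories.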