PostgreSQL: how to generate a random number of rows of fake data with foreign key constraints? Answers

【Question title】: PostgreSQL -- how to generate random number of rows of fake data with foreign key constraints?
【Posted】: 2021-02-17 21:58:29
【Question description】:

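Note: I searched for other questions and answers that address this problem, but could not find one that really matches my case and provides a complete solution.

I am trying to generate random synthetic data with SQL to test my database schema. While it is easy to generate a bunch of random values in PostgreSQL with random(), it is not so easy to generate random data sets that preserve the constraints and characteristics of the data I expect to see. Specifically, I have the following tables: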

CREATE TABLE suites(
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT
);

INSERT INTO suites(name)
SELECT 'suite' || g FROM generate_series(1,50) g;

CREATE TABLE tests(
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT
);

INSERT INTO tests(name)
SELECT 'test' || g FROM generate_series(1,100) g;

CREATE TABLE tests_in_suites(
suite_id BIGINT,
test_id BIGINT,
PRIMARY KEY (suite_id, test_id)
);


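I want to populate tests_in_suites with random values such that each suite contains a random number (between 3 and 7) of tests, selected uniformly from tests. I want the selection to be random and uniform, and to avoid cycles and other repeating patterns. I tried the following approach: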

SELECT s.id, t.id FROM
(select id from suites) s,
(SELECT id FROM tests ORDER BY random() LIMIT 2 + ceil(random() * 5)) t
ORDER BY s.id, t.id;


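But it always picks the same number of tests for every suite, and the tests picked are identical, because the optimizer treats the subquery t as a constant: it does not depend on s, so it is evaluated only once. I tried to introduce a dependency on the suite currently being considered, but PostgreSQL complains that the value I am trying to use is not accessible: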

SELECT s.id, t.id FROM
(select id from suites) s,
(SELECT id FROM tests ORDER BY random() LIMIT 2 + ceil(random() * 5 + s.id*0)) t
ORDER BY s.id, t.id;

ERROR:  invalid reference to FROM-clause entry for table "s"
LINE 3: ...s ORDER BY random() LIMIT 2 + ceil(random() * 5 + s.id*0)) t
                                                             ^
HINT:  There is an entry for table "s", but it cannot be referenced from this part of the query.


How can I generate random data without falling victim to the optimizer or to invalid data dependencies in my query?

【Question comments】:

Tags: sql postgresql random foreign-keys data-generation

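【Solution 1】:

I want to populate tests_in_suites with random values such that each suite contains a random number (between 3 and 7) of tests, selected uniformly from tests

This sounds like a good use case for a lateral join...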

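INSERT INTO tests_in_suites(suite_id, test_id)
SELECT suites.id, t.id
FROM suites
CROSS JOIN LATERAL (
    -- suites.id is selected only to make the subquery depend on the outer row
    SELECT id, suites.id AS suite_ref FROM tests
    ORDER BY random() LIMIT 3 + floor(random() * 5)::int  -- 3 to 7 rows, each count equally likely
) t;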

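A lateral join re-evaluates the joined subquery for each row of the table on the left-hand side of the join, which is what we want here. But if the joined subquery does not look like a dependent subquery, Postgres optimizes it and evaluates it only once. So I added suites.id to the joined subquery to make it look as if it really depends on the current row of suites.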
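
To see the per-row re-evaluation in isolation, here is a minimal sketch that needs no tables. The aliases g, n, r and x are made up for illustration, and the `+ g.n * 0` term exists only to reference the outer row, mirroring the trick in the query above.

```sql
-- Each outer row g.n should get its own random value, because the
-- lateral subquery references g.n and is therefore re-run per row.
SELECT g.n, r.x
FROM generate_series(1, 3) AS g(n)
CROSS JOIN LATERAL (SELECT random() + g.n * 0 AS x) AS r;
```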

There is probably also a way to do this with array_agg() and unnest().
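
One possible shape for that alternative is sketched below. It is not the answerer's code, only an illustration. It assumes the tables from the question and relies on the `s.id * 0` term to make the scalar subquery correlated, so that it is evaluated once per suite instead of once overall.

```sql
-- For each suite: shuffle all test ids into an array, keep the first
-- 3 to 7 elements, then expand the slice back into rows.
INSERT INTO tests_in_suites(suite_id, test_id)
SELECT s.id,
       unnest((
           SELECT (array_agg(t.id ORDER BY random()))
                  [1 : 3 + floor(random() * 5)::int + (s.id * 0)::int]
           FROM tests t
       ))
FROM suites s;
```

Because each suite's slice comes from a single shuffle of distinct ids, this version should not produce duplicate (suite_id, test_id) pairs. Like the lateral join, it sorts the whole tests table once per suite.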

【Comments】:

Oh, awesome! I didn't know lateral joins were a thing, and your code is shorter and cleaner, thanks!

【Solution 2】:

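The solution I found was inspired by several recipes I saw online (especially the use of row_number() to select rows at random), but it includes an insight of my own: I have not seen this approach used anywhere else.

The key idea is to break the hard task of generating random rows into a series of simpler tasks, each of which only generates random integers. Then, to generate the rows, I use a recursive CTE, and finally JOIN on a window function (row_number()) to merge the rows into my result table.

The following solution has been tested on PostgreSQL 10 and 12, but it should work on any version that supports recursive CTEs and window functions. It should also be easy to adapt to any other RDBMS that supports these.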

-- For each suite, add a random number (between 3 and 7) of tests
-- mapped. Because it's difficult to join a random number of rows
-- between two tables in SQL without violating data dependency rules
-- and/or having the optimiser lift it out into a constant, repeating
-- pattern, instead we do it in several steps:
--
-- * For each suite ID, generate a random number between 3 and 7
--   representing the number of tests we want to include
-- * Then, using a recursive CTE, for each suite ID generate rows,
--   each with a random integer no larger than the number of
--   tests. Limit the number of rows to the small integer generated in
--   the previous step
-- * Join the table generated in the above CTE with tests on row
--   number, using the random int generated as the row number to
--   pick. This gives us a table containing three values: suite_id,
--   test_id, random row number. By extracting only the IDs, we have
--   now generated the values to insert into tests_in_suites
INSERT INTO tests_in_suites
-- "+ id*0" serves to ensure the optimiser sees a dependency on the
-- current row and doesn't lift the random() out as a constant
WITH s(id, n_tests) AS (SELECT id, 2 + ceil(random() * 5) + id*0 FROM suites),
cnt AS (SELECT COUNT(*) FROM tests),
t AS (SELECT id, row_number() over (ORDER BY random()) AS rn FROM tests),
sr AS (SELECT * FROM
       (WITH RECURSIVE subtests(sid, n, rn) AS (
             SELECT s.id, n_tests + 1, NULL::bigint FROM s
             UNION
             SELECT sid, n - 1, ceil(random() * (SELECT * FROM cnt))::bigint
             FROM subtests
             WHERE n > 1)
        SELECT * FROM subtests) x
        WHERE rn IS NOT NULL
        ORDER BY sid)
SELECT sid, t.id FROM sr JOIN t USING(rn)
ORDER BY sid, t.id
-- The above will generate a few duplicate pairs. They're not a
-- big deal, so just skip them (a suite can then end up with fewer
-- tests than were drawn for it)
ON CONFLICT DO NOTHING;


SELECT seen, total, seen / total::double precision as "fraction used" FROM
        (SELECT count(*) AS seen FROM (SELECT DISTINCT test_id FROM tests_in_suites) t) x,
        (SELECT count(*) AS total FROM tests) y;

SELECT suite_id, count(suite_id) FROM tests_in_suites GROUP BY suite_id;

SELECT * FROM tests_in_suites;
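
The schema as posted declares no foreign keys on tests_in_suites, and skipping duplicates can leave a suite short of tests, so two extra sanity checks may be worth running after either solution. This is a sketch against the tables from the question. The alias tis is made up here.

```sql
-- Suites whose number of mapped tests falls outside the 3..7 target,
-- including suites that ended up with no tests at all.
SELECT s.id AS suite_id, count(tis.test_id) AS n_tests
FROM suites s
LEFT JOIN tests_in_suites tis ON tis.suite_id = s.id
GROUP BY s.id
HAVING count(tis.test_id) NOT BETWEEN 3 AND 7;

-- Mapping rows that point at a missing suite or test; this should be
-- empty if the data would satisfy foreign keys on both columns.
SELECT tis.suite_id, tis.test_id
FROM tests_in_suites tis
LEFT JOIN suites s ON s.id = tis.suite_id
LEFT JOIN tests  t ON t.id = tis.test_id
WHERE s.id IS NULL OR t.id IS NULL;
```

If the second query returns no rows, the foreign keys can be added afterwards with ALTER TABLE ... ADD FOREIGN KEY without the existing data violating them.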


【Comments】: