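How to compare huge table (100 million rows) data between Oracle and SQL Server: answers

【Question title】: How to compare huge table (100 million rows) data between Oracle and SQL Server
【Posted】: 2021-04-27 03:26:32
【Question description】:

I have a process that populates an Oracle table with more than 100 million rows. The table structure is as follows: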

**ORACLE_TABLE**
id|contractdatetime|Attr3|Attr4|Attr5

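The combination (id, contractdatetime) is unique in this table, which is populated by an external process.

There are only about 30,000 distinct id values. Each id has its own set of unique contractdatetime values: id alone is not unique, but the combination (id, contractdatetime) is.

Now another process populates an identical table in SQL Server: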

**SQLSERVER_TABLE**
id|contractdatetime|Attr3|Attr4|Attr5

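I am looking for the best way to check whether the data in the two tables is the same. I was wondering whether I could group by contractid and compute a hash that somehow aggregates all the other attributes in Oracle. If I can do the same in SQL Server, I would be able to compare the results (about 30,000 rows) in Excel itself.

I have searched Stack Overflow, but could not find an equivalent of MD5_XOR or anything similar that would help implement the approach described at this link: http://www.db-nemec.com/MD5/CompareTablesUsingMD5Hash.html

Other options, such as using a linked server, would be harder to get approved.

Is there a good way to solve this?

【Comments on the question】: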

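Hashing isn't foolproof, since it can easily produce collisions. Why do you need two copies of the same 100-million-row table? Can't they reference the same copy somehow?
@AaronBertrand They are both loaded by two separate processes (with the same logic). The task is to check whether the data matches. For example, we do compare record counts as a first-level check, and if those match, we would go on to compare using a hash or some other method.
But why are you loading two copies? And why do you think combining 3 values into one on both sides will be faster than simply comparing all three values on both sides?
If the data matches, the new process is meant to replace the first one. The Oracle process is the legacy process, and the new process is the one consumers will use.
If you want to make sure the new process produces the same results, I would compare the columns independently rather than spend time trying to find some reliable hashing method. IMHO, even if you could trust it, it wouldn't pay off.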

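Tags: sql sql-server database oracle

【Solution 1】:

For a quick, high-level comparison between the Oracle and SQL Server tables, you can aggregate the results of the hashing functions STANDARD_HASH (Oracle) and HASHBYTES (SQL Server).

Oracle code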

--Create a simple table.
create table table1
(
    id number,
    contractdatetime date,
    Attr3 varchar2(100),
    Attr4 varchar2(100),
    Attr5 varchar2(100)
);

--Insert 4 rows, the first three will be identical between databases,
--the last row will be different.
insert into table1 values (1, date '2000-01-01', 'a', 'a', 'a');
insert into table1 values (2, date '2000-01-01', 'b', 'b', 'b');
insert into table1 values (2, date '2000-01-02', null, null, null);
insert into table1 values (3, date '2000-01-02', 'Oracle', 'Oracle', 'Oracle');
commit;

select
    id,
    --Format the number
    trim(to_char(
        --Sum per group.
        sum(
            --Convert to a number.
            to_number(
                --Get the first 14 hex characters (7 bytes). This seems to be the maximum that SQL Server can handle
                --before it runs into math errors.
                substr(
                    --Hash the value.
                    standard_hash(
                        --Concatenate the values using (hopefully) unique strings to separate the
                        --columns and represent NULLs (because the hashing functions treat nulls differently.)
                        nvl(to_char(contractdatetime, 'YYYY-MM-DD HH24:MI:SS'), 'null') || 
                        '-1-' || nvl(attr3, 'null') || '-2-' || nvl(attr4, 'null') || '-3-' || nvl(attr5, 'null')
                        , 'MD5')
                    , 1, 14)
                , 'xxxxxxxxxxxxxxxxxxxx'))
        , '99999999999999999999')) hash
from table1
group by id
order by 1;

SQL Server code

create table table1
(
    id numeric,
    contractdatetime datetime,
    Attr3 varchar(100),
    Attr4 varchar(100),
    Attr5 varchar(100)
);

insert into table1 values (1, cast('2000-01-01 00:00:00.000' as datetime), 'a', 'a', 'a');
insert into table1 values (2, cast('2000-01-01 00:00:00.000' as datetime), 'b', 'b', 'b');
insert into table1 values (2, cast('2000-01-02 00:00:00.000' as datetime), null, null, null);
insert into table1 values (3, cast('2000-01-02 00:00:00.000' as datetime), 'SQL Server', 'SQL Server', 'SQL Server');
--No explicit COMMIT is needed: SQL Server autocommits each statement by default,
--and a COMMIT without an open transaction raises an error.

select
    id,
    --Sum per group.
    sum(
        --Convert the 7 bytes to a number.
        convert(bigint, convert(varbinary, 
            --Get the first 7 bytes (14 hex characters), matching the Oracle query.
            substring(
                --Hash the same concatenated string that the Oracle query builds.
                hashbytes('MD5',
                    isnull(convert(varchar(19), contractdatetime, 20), 'null') +
                    '-1-' + isnull(attr3, 'null') + '-2-' + isnull(attr4, 'null') + '-3-' + isnull(attr5, 'null'))
                , 1, 7)
            , 1))) hash
from table1
group by id
order by 1;
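
As a cross-check that does not depend on either database, the same scheme can be sketched in Python: build the delimited string, take its MD5, keep the first 7 bytes as an integer, and sum per id. This is only a sketch under assumptions: `row_hash` and `hash_by_id` are hypothetical helper names, the three attribute columns are taken to be Attr3, Attr4, and Attr5, and the data is assumed to be plain ASCII.

```python
import hashlib
from collections import defaultdict
from datetime import datetime


def row_hash(contract_dt, attr3, attr4, attr5, n_bytes=7):
    """MD5 of the delimited row string, truncated to n_bytes, as an integer."""
    def text(value):
        # NULLs are replaced by the literal string 'null', as in the queries.
        return "null" if value is None else value

    stamp = "null" if contract_dt is None else contract_dt.strftime("%Y-%m-%d %H:%M:%S")
    joined = stamp + "-1-" + text(attr3) + "-2-" + text(attr4) + "-3-" + text(attr5)
    # Assumption: ASCII-only data, so both databases would hash the same bytes.
    digest = hashlib.md5(joined.encode("ascii")).digest()
    return int.from_bytes(digest[:n_bytes], "big")


def hash_by_id(rows):
    """Sum the truncated row hashes per id (hypothetical helper)."""
    totals = defaultdict(int)
    for row_id, contract_dt, attr3, attr4, attr5 in rows:
        totals[row_id] += row_hash(contract_dt, attr3, attr4, attr5)
    return dict(totals)


oracle_rows = [
    (1, datetime(2000, 1, 1), "a", "a", "a"),
    (2, datetime(2000, 1, 1), "b", "b", "b"),
    (2, datetime(2000, 1, 2), None, None, None),
    (3, datetime(2000, 1, 2), "Oracle", "Oracle", "Oracle"),
]
# Same first three rows, a different fourth row, and a different row order.
sqlserver_rows = [
    (3, datetime(2000, 1, 2), "SQL Server", "SQL Server", "SQL Server"),
    (2, datetime(2000, 1, 2), None, None, None),
    (2, datetime(2000, 1, 1), "b", "b", "b"),
    (1, datetime(2000, 1, 1), "a", "a", "a"),
]

oracle_hashes = hash_by_id(oracle_rows)
sqlserver_hashes = hash_by_id(sqlserver_rows)
for row_id in sorted(oracle_hashes):
    print(row_id, oracle_hashes[row_id] == sqlserver_hashes[row_id])
```

Because the per-row values are summed, the result does not depend on row order, so neither side needs to be sorted before aggregating.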

Results

As expected, the hashes for the first two groups are identical, and the hash for the third group is different.

Oracle:

ID  HASH
1   50696302970576522
2   69171702324546493
3   50787287321473273

SQL Server:

ID  HASH
1   50696302970576522
2   69171702324546493
3   7440319042693061

Here are the Oracle fiddle and the SQL Server fiddle.
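
With one hash per id on each side, the final comparison is a key-by-key diff of roughly 30,000 pairs. A minimal Python sketch, using the result values shown above; `mismatched_ids` is a hypothetical helper, and in practice the two dictionaries would be loaded from the exported query results:

```python
def mismatched_ids(left, right):
    """Return the sorted ids whose hash differs or that exist on only one side."""
    all_ids = set(left) | set(right)
    return sorted(i for i in all_ids if left.get(i) != right.get(i))


# Per-id hashes copied from the Oracle and SQL Server results in this answer.
oracle = {1: 50696302970576522, 2: 69171702324546493, 3: 50787287321473273}
sql_server = {1: 50696302970576522, 2: 69171702324546493, 3: 7440319042693061}

print(mismatched_ids(oracle, sql_server))  # prints [3]
```

Only the ids reported here would then need a column-by-column comparison.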

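Problems

I think this solution only works if the databases use similar character sets, or if the data only uses 7-bit ASCII characters, which are usually encoded the same way in different character sets.
There is a high (perhaps unreasonably high) chance of hash collisions. MD5 hashes are not good enough to protect against cryptographic attacks, but they are good enough for comparing data sets. The problem is that I had to truncate the hash with a substring to make the math work in SQL Server. This may be my fault for not knowing SQL Server well enough: a BIGINT should support about 19 digits of precision, but my math only works up to 14 digits. I probably have a conversion bug somewhere. If you run into too many collisions or overflow problems, you may need to adjust the "14" and "7" numbers. (It is 14 in Oracle because SUBSTR counts the displayed hexadecimal characters. It is 7 in SQL Server because SUBSTRING counts bytes, and each hexadecimal character represents half a byte.)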
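
Both caveats can be checked with a little arithmetic. The Python sketch below is not part of the original answer and `max_safe_rows` is a hypothetical helper. It shows that the same text hashes differently once its byte encoding differs, that 14 hex characters and 7 bytes select the same prefix, and how many worst-case truncated values fit into a signed 64-bit sum. It assumes the SQL Server SUM stays in BIGINT, whose maximum is 2^63 - 1.

```python
import hashlib

# 1. Character sets: ASCII text has the same bytes in UTF-8 and Latin-1,
#    while accented text does not, so its MD5 differs between encodings.
ascii_text = "abc"
accented_text = "caf\u00e9"
same_for_ascii = (hashlib.md5(ascii_text.encode("utf-8")).digest()
                  == hashlib.md5(ascii_text.encode("latin-1")).digest())
same_for_accented = (hashlib.md5(accented_text.encode("utf-8")).digest()
                     == hashlib.md5(accented_text.encode("latin-1")).digest())
print(same_for_ascii, same_for_accented)  # prints True False

# 2. Truncation: 14 hex characters and 7 bytes are the same prefix.
digest = hashlib.md5(b"example").digest()
from_hex = int(digest.hex()[:14], 16)           # how the Oracle query counts
from_bytes = int.from_bytes(digest[:7], "big")  # how the SQL Server query counts
print(from_hex == from_bytes)  # prints True

# 3. Overflow headroom: rows per id that can be summed in the worst case
#    before the total could exceed a signed 64-bit integer.
BIGINT_MAX = 2 ** 63 - 1


def max_safe_rows(n_bytes):
    """Worst-case number of n_bytes-wide values whose sum still fits in BIGINT."""
    largest_value = 256 ** n_bytes - 1
    return BIGINT_MAX // largest_value


print(max_safe_rows(7))  # prints 128
print(max_safe_rows(6))  # prints 32768
```

Under these assumptions, a 7-byte prefix is only guaranteed safe for up to 128 rows per id. With about 100 million rows spread over about 30,000 ids (roughly 3,300 rows per id), a 6-byte prefix (12 hex characters in Oracle) or a wider numeric type on the SQL Server side may be needed, at the cost of a somewhat higher collision probability.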

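【Comments】:

This is very helpful. One question: why is the substring 7 characters in SQL Server but 14 characters in Oracle?
@GeorgeJoseph I added an explanation: Oracle counts the displayed hexadecimal characters, while SQL Server counts bytes, and each byte corresponds to two hexadecimal characters.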