【Question title】: Need to extract millions of rows from multiple tables in a MySQL database
【Posted】: 2021-12-12 00:17:27
【Question】:

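I have to extract millions of rows from 21 SQL tables to get a list of strings. My query also contains GROUP BY clauses, joins, and so on. I need to run it in chunks, in a memory-friendly way. I am using Python 2. Can anyone suggest a solution? Here is my query, just as an example.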

query = """
        SELECT
            v.id AS varid,
            v.chrom AS chrom,
            v.vcf_pos AS vcf_pos,
            v.vcf_ref AS vcf_ref,
            v.vcf_alt AS vcf_alt,
            group_concat(distinct term.term) as HPO_terms,
            group_concat(distinct term.name) as HPO_names,
            group_concat(distinct pp.patientid,"-",pp.person_status,"-",pp.affected_status) as family_label,
          
            if(vcc.AF_Pat between 0 AND 1, vcc.AF_Pat, NULL) AS AF_Pat,
            replace(vcc.analysistypelist,',',';') AS analysistypelist,
            vcc2.HomPatCount AS AC,
            vcc2.HetPatCount AS HAC,
            vcc2.TotalPatCount AS TAAC,
            if(vcc2.AF_Pat between 0 AND 1, vcc2.AF_Pat, NULL) AS AF_Assay,
            vcc3.HomUnaffCount AS HCC,
            vcc3.HetUnaffCount AS HCC1,
            vcc3.TotalUnaffCount AS TTCC,
            if(vcc3.AF_healthy between 0 AND 1, vcc3.AF_healthy, NULL) AS AF_Control,
            g.gene_name as gene_name,
            t.tx_name as tx_name,
            ta.*,
            va.*,
            vc.*,
            ga.*,
            group_concat(
                concat_ws(':', ifnull(g.gene_name,'.'), ifnull(t.tx_name,'.'), ifnull(ta.hgvsc,'.'), ifnull(ta.hgvsp,'.'))
                SEPARATOR '|'
            ) as `AllTranscriptAnnotations`
        FROM {} AS v
            LEFT JOIN table1 vcc ON vcc.variant_id=v.id
            LEFT JOIN table2 vcc3 ON vcc3.variant_id=v.id
            LEFT JOIN table3 va on v.id=va.variant_id and va.status='active'
            LEFT JOIN table4 vc on v.id=vc.variant_id
            LEFT JOIN table5 ta on v.id=ta.variant_id and ta.status='active'
            LEFT JOIN table6 t on t.id=ta.transcript_id and t.status='active'
            LEFT JOIN table7 g on g.id=t.gene_id and g.status='active'
            .
            .
            .
            LEFT JOIN table21 pt on pt.term_id=term.id
        GROUP BY v.id,s.id
        HAVING 1 {}
        """.format(v1,sq)

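Here v1 and sq are search strings. I need to make the query above memory-efficient or otherwise optimized; the extraction currently takes more than 4 hours to complete. I am looking for something along the lines of divide and conquer.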

【Comments】:

  • Does v control the rows? That is, does every LEFT JOIN deliver exactly 1 row (optionally all NULLs)?

Tags: python mysql pymysql


【Solution 1】:

Don't use OFFSET; it gets slower and slower as you walk through the table.

I would consider chunking on v.id:

WHERE v.id >    0 AND v.id <= 1000   -- in first run
WHERE v.id > 1000 AND v.id <= 2000   -- 2nd
(etc)
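
The ranges above can be produced in a loop instead of being written by hand. The following is a minimal sketch, assuming v.id is an integer key and that the range condition has been added to the query's WHERE clause. The names id_ranges and run_in_chunks are hypothetical, and the cursor is assumed to be any DB-API cursor that takes %s parameters, as pymysql does. The code avoids Python 3-only syntax so that it also runs on Python 2.

```python
def id_ranges(min_id, max_id, chunk_size):
    """Yield (low, high) pairs covering min_id..max_id as low < id <= high."""
    low = min_id - 1
    while low < max_id:
        high = min(low + chunk_size, max_id)
        yield low, high
        low = high


def run_in_chunks(cursor, query, min_id, max_id, chunk_size=1000):
    """Run `query` once per id range and yield the rows one at a time.

    `query` is assumed to contain exactly two %s placeholders, e.g.
    "... WHERE v.id > %s AND v.id <= %s GROUP BY ...".
    """
    for low, high in id_ranges(min_id, max_id, chunk_size):
        cursor.execute(query, (low, high))
        # Only the rows of the current chunk are held in memory.
        for row in cursor.fetchall():
            yield row
```

Because run_in_chunks is a generator, the caller can write each row to a file or append to the string list as it arrives, and the chunk size can be tuned until a single chunk runs in a comfortable time.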

More on chunking: http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks

That link has tips for handling variants (such as large gaps in the ids).
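
One way to cope with gaps is to ask the database where each chunk ends rather than assuming the ids are dense. The sketch below is an illustration under assumptions, not a transcription of the linked article. The function name chunk_bounds and the table name t are hypothetical, and an in-memory SQLite database stands in for MySQL so the example runs on its own. With pymysql the placeholders would be %s instead of ?. The OFFSET used here is bounded by the chunk size, so its cost does not grow as the scan advances.

```python
import sqlite3


def chunk_bounds(conn, table, chunk_size):
    """Yield (low, high) id pairs so each chunk holds at most chunk_size rows.

    Rows of a chunk satisfy low < id <= high; high is None for the last chunk,
    meaning "no upper bound". `table` is interpolated into the SQL text, so it
    must be a trusted name, never user input.
    """
    cur = conn.cursor()
    cur.execute("SELECT MIN(id) FROM {}".format(table))
    first = cur.fetchone()[0]
    if first is None:  # empty table: nothing to chunk
        return
    find_end = ("SELECT id FROM {} WHERE id > ? "
                "ORDER BY id LIMIT 1 OFFSET ?").format(table)
    low = first - 1
    while True:
        # The id sitting chunk_size rows after `low` closes this chunk.
        cur.execute(find_end, (low, chunk_size - 1))
        row = cur.fetchone()
        if row is None:  # fewer than chunk_size rows remain
            yield low, None
            return
        yield low, row[0]
        low = row[0]


# Stand-in data with large gaps between ids.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t (id) VALUES (?)",
                 [(i,) for i in (1, 2, 3, 50, 51, 1000, 1001)])
bounds = list(chunk_bounds(conn, "t", 3))
# bounds == [(0, 3), (3, 1000), (1000, None)]
```

Each pair can then be plugged into the v.id range condition. When the row count is an exact multiple of the chunk size, the final pair describes an empty chunk, which is harmless.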

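Is GROUP_CONCAT() really needed? Its presence implies that at least some of the LEFT JOINs deliver more than one row. If so, which tables? (The answer may lead to an optimization that involves getting rid of both the GROUP_CONCAT and the GROUP BY.)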
