【Question title】: Avoid DBI duplicate primary key with exists operator
【Posted】: 2017-04-12 20:05:34
【Question】:

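I want to insert information from a FASTA file into a table in a MySQL database. I am using the Ensembl_id column as the primary key.

Some of my Ensembl_id values are not unique, so I tried to work around the problem with the exists operator. However, only 5 rows were inserted into the table, and only 1 of them had a duplicated Ensembl_id value.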

#!/usr/bin/perl -w

#usage script.pl <username> <password> <database_name> <mouse_genes> <mouse_transcripts>

use DBI;
use Data::Dumper;

my $user              = shift @ARGV or die $!;
my $password          = shift @ARGV or die $!;
my $database          = shift @ARGV or die $!;
my $mouse_genes       = shift @ARGV or die $!;
my $mouse_transcripts = shift @ARGV or die $!;

my $dbh = DBI->connect( "dbi:mysql:$database:localhost", "$user", "$password",
    { RaiseError => 1 } );
my %gene;

$/ = "\n>";

open( FILE, "gzip -d -c /data.dash/class2016/student/Mus_musculus.GRCm38.cdna.all.fa.gz |" )
        or die $!;

LOOP:
while ( <FILE> ) {

    my $line = $_;
    chomp $line;

    if ( $line =~ /[a-z]/ ) {

        my @array = split( "\t", $line );

        if ( m/gene:(\w+\d+\.\w+)/ ) {

            my $Ensembl_id = $1;

            if ( !exists $gene{$Ensembl_id} ) {
                $gene{$Ensembl_id} = 1;
            }
            else {
                next;
            }

            if ( m/gene_biotype:(\w+)/ ) {

                my $gene_biotype = $1;

                if ( m/gene_symbol:(\w+\D\d+)/ ) {

                    my $gene_symbol = $1;

                    if ( m/description:(\w+\s+\w+\s+\w+\s+)/ ) {

                        my $gene_description = $1;

                        if ( m/MGI:(\d+)/ ) {

                            my $MGI_accession = $1;
                            my $sth           = $dbh->prepare(
                                qq{insert into $mouse_genes (Ensembl_id,gene_biotype,gene_symbol,gene_description,MGI_accession) values ("$Ensembl_id","$gene_biotype","$gene_symbol","$gene_description","$MGI_accession")}
                            );
                            $sth->execute();
                            $sth->finish();

                            next LOOP;
                        }
                    }
                }
            }
        }
    }
}

close FILE;

$dbh->disconnect();

How can I use the exists operator to move on to the next record in the file when the primary key $Ensembl_id is a duplicate?

【Question comments】:

    Tags: perl hash exists perl-data-structures


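    【Answer 1】:

    I thought I had seen a question very similar to this one, but I can't find it.

    The solution is to forget about the hash and use the IGNORE keyword to avoid raising an error. The MySQL documentation says this:

    If you use the IGNORE keyword, errors that occur while executing the INSERT statement are ignored. For example, without IGNORE, a row that duplicates an existing UNIQUE index or PRIMARY KEY value in the table causes a duplicate-key error and the statement is aborted. With IGNORE, the row is discarded and no error occurs. Ignored errors may generate warnings instead, although duplicate-key errors do not.

    You should also use placeholders in your SQL statement, so it should look something like this

    Note that END_SQL must appear on a line of its own, with no whitespace before or after it. You may prefer to define the SQL statement at the top of the program to avoid breaking the indentation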

    my $sth = $dbh->prepare(<<END_SQL);
    INSERT IGNORE INTO $mouse_genes (
        Ensembl_id,
        gene_biotype,
        gene_symbol,
        gene_description,
        MGI_accession
    )
    VALUES ( ?, ?, ?, ?, ? )
    END_SQL
    
    $sth->execute($Ensembl_id, $gene_biotype, $gene_symbol, $gene_description, $MGI_accession);
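
    The terminator rule mentioned above can be checked without a database. The following is a minimal sketch that only builds and prints the SQL string; the table name in $table is a placeholder chosen for illustration, not something taken from the real schema.

```perl
use strict;
use warnings;

my $table = 'mouse_genes';    # hypothetical table name, for illustration only

# The here-document ends at the first line that contains END_SQL
# and nothing else: no leading indentation, no trailing spaces.
my $sql = <<END_SQL;
INSERT IGNORE INTO $table ( Ensembl_id, gene_biotype )
VALUES ( ?, ? )
END_SQL

# $table has been interpolated; the ? placeholders are left for DBI to bind
print $sql;
```

    If the terminator line is indented or followed by a space, Perl does not recognise it and reports that it cannot find the string terminator. From Perl 5.26 onwards the <<~END_SQL form allows an indented terminator, which may be an alternative if that version is available.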
    


    Update

    Your program could do with a lot of tidying up to make it easier to read. Here is how I would write it

    #!/usr/bin/perl
    
    use strict;
    use warnings 'all';
    
    # usage script.pl <username> <password> <database_name> <mouse_genes> <mouse_transcripts>
    
    use DBI;
    
    my $user              = shift @ARGV or die $!;
    my $password          = shift @ARGV or die $!;
    my $database          = shift @ARGV or die $!;
    my $mouse_genes       = shift @ARGV or die $!;
    my $mouse_transcripts = shift @ARGV or die $!; # Not used at present
    
    my $dbh = DBI->connect( "dbi:mysql:$database:localhost", $user, $password,
            { RaiseError => 1, PrintError => 0 } );
    
    my $sth = $dbh->prepare( <<END_SQL );
    INSERT IGNORE INTO $mouse_genes (
        Ensembl_id,
        gene_biotype,
        gene_symbol,
        gene_description,
        MGI_accession
    )
    VALUES ( ?, ?, ?, ?, ? )
    END_SQL
    
    my $cmd = 'gzip -d -c /data.dash/class2016/student/Mus_musculus.GRCm38.cdna.all.fa.gz';
    
    open my $cmd_fh, '-|', $cmd or die $!;
    
    $/ = "\n>";
    
    while ( <$cmd_fh> ) {
    
        next unless my ( $ensembl_id )       = /gene:(\w+\d+\.\w+)/;
        next unless my ( $gene_biotype )     = /gene_biotype:(\w+)/;
        next unless my ( $gene_symbol )      = /gene_symbol:(\w+\D\d+)/;
        next unless my ( $gene_description ) = /description:(\w+\s+\w+\s+\w+)\s/;
        next unless my ( $mgi_accession )    = /MGI:(\d+)/;
    
        $sth->execute( $ensembl_id, $gene_biotype, $gene_symbol, $gene_description, $mgi_accession );
    }
    
    $dbh->disconnect;
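
    The loop above relies on two Perl features: setting $/ so that each read returns a whole FASTA record, and a regex match in list context returning its captures (or an empty list on failure). The sketch below exercises both on an in-memory string instead of the gzip pipe. The records and the simplified patterns are made up for illustration and do not reflect the real Ensembl header layout.

```perl
use strict;
use warnings;

# Made-up FASTA-like records, for illustration only
my $fasta = <<'END_FASTA';
>T1 gene:GENE0001.1 gene_biotype:protein_coding
ACGT
>T2 gene:GENE0001.1 gene_biotype:protein_coding
GGCC
>T3 gene_biotype:pseudogene
TTAA
END_FASTA

# Read from the string as if it were a file
open my $fh, '<', \$fasta or die $!;

my @rows;
{
    local $/ = "\n>";    # each read now returns one complete record

    while ( <$fh> ) {
        # A failed match yields an empty list, so the assignment is false
        # and the record is skipped
        next unless my ( $gene_id ) = /gene:(\w+\.\d+)/;
        next unless my ( $biotype ) = /gene_biotype:(\w+)/;

        push @rows, "$gene_id $biotype";
    }
}

print "$_\n" for @rows;
```

    Here the third record is skipped because it has no gene: field, while the first two both yield the same gene ID. In the real program that second row is the one INSERT IGNORE would silently discard.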
    

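    【Comments】:

    • Thank you for editing my question and for the valuable feedback. When I use your code I get an error message about the END_SQL statement. Does that mean I can't use an END_SQL statement with MySQL tables in Perl DBI?
    • @OlhaKholod: You're welcome. I hope you understand and approve of my refactoring.
    • @OlhaKholod END_SQL is part of a here-document. Look here and search for the <<EOF section.
    • @OlhaKholod: Sorry, I missed the end of your comment. It depends on what the error message says, but did you read my note that the terminating END_SQL must have no whitespace before or after it on the same line? It isn't a special part of the language: in my $sth = $dbh->prepare( <<END_SQL ) I am specifying that the end of the string will be marked by a line containing only END_SQL, and you can put anything you like there. If that doesn't help, please tell me what the error message is.
    • @PerlDuck: No, I was thinking of a question about using a hash to avoid writing records with duplicate keys: the same premise as this one. I think it has been deleted.
    【Answer 2】: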

    I figured out how to use a hash to get around the duplicate keys:

    #!/usr/bin/perl -w
    
    #this script inserts sequences from Mus_musculus.GRCm38.cdna.all.fa.gz into mouse_genes table
    #usage lab5_2.pl <username> <password> <database_name> <mouse_genes> <mouse_transcripts>
    
    use DBI;
    
    use Data::Dumper;
    
    my $user = shift @ARGV or die $!;
    my $password = shift @ARGV or die $!;
    my $database = shift @ARGV or die $!;
    my $mouse_genes = shift @ARGV or die $!;
    
    my $dbh = DBI->connect("dbi:mysql:$database:localhost",
                           "$user",
                           "$password",
                           {RaiseError => 1}
                           );
    my %gene;
    
    $/ = "\n>";
    
    open (FILE, "gzip -d -c /data.dash/class2016/student/Mus_musculus.GRCm38.cdna.all.fa.gz |") or die $!;
    
     LOOP: while (<FILE>) {
         if (m/gene:(\w+\d+\.\d+)/) {
             my $Ensembl_id = $1;
             if ( !exists $gene{$Ensembl_id} ) {
                 $gene{$Ensembl_id} = 1;
                 if (m/gene_biotype:(\w+)/) {
                     my $gene_biotype = $1;
                     my $gene_symbol;
                     if (m/gene_symbol:(\w+\D\d+)/) {
                         $gene_symbol = $1;
                     }
                     if (! defined $gene_symbol) {
                         $gene_symbol = "NULL";
                     }
                     if (m/description:([^\[]*)/) {
                         my $gene_description = $1;
                         if (m/MGI:(\d+)/) {
                             my $MGI_accession = $1;
                             $sth = $dbh->prepare (qq{insert into mouse_genes (Ensembl_id, gene_biotype, gene_symbol, gene_description, MGI_accession) values ("$Ensembl_id","$gene_biotype","$gene_symbol","$gene_description","$MGI_accession")});
                             $sth->execute();
                             $sth->finish();
                             next LOOP;
                         }
                     }
                 }
             }
         }
     }
    close FILE;
    $dbh->disconnect ();
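
    The hash-based de-duplication used above can also be written more compactly. This is a minimal sketch of the common %seen idiom, separated from the database code; the IDs are made up for illustration.

```perl
use strict;
use warnings;

# Made-up gene IDs containing duplicates, for illustration only
my @ids = qw( GENE0001.1 GENE0002.3 GENE0001.1 GENE0003.2 GENE0002.3 );

my %seen;
my @unique;

for my $id ( @ids ) {
    # The post-increment returns the old count: 0 (false) the first time
    # an ID is met, so only first occurrences get past this line
    next if $seen{$id}++;

    push @unique, $id;
}

print "@unique\n";    # prints "GENE0001.1 GENE0002.3 GENE0003.2"
```

    This behaves like the exists test followed by an assignment, but checks and records the key in a single step. Note that it marks an ID as seen as soon as it is encountered, even if a later pattern fails to match and no row is inserted for it.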
    

    【Comments】:
