【Question title】: Slow execution of USQL
【Posted】: 2016-11-10 13:45:16
【Question description】:

I created a simple script that scores the similarity between two strings. The USQL script and the backing .NET code are below.

CN_Matcher.usql:

REFERENCE ASSEMBLY master.FuzzyString;

@searchlog =
        EXTRACT ID int,
                Input_CN string,
                Output_CN string
        FROM "/CN_Matcher/Input/sample.txt"
        USING Extractors.Tsv();

@CleansCheck =
    SELECT ID,Input_CN, Output_CN, CN_Validator.trial.cleanser(Input_CN) AS Input_CN_Cleansed,
           CN_Validator.trial.cleanser(Output_CN) AS Output_CN_Cleansed
    FROM @searchlog;

@CheckData =
    SELECT ID, Input_CN, Output_CN, Input_CN_Cleansed, Output_CN_Cleansed,
           CN_Validator.trial.Hamming(Input_CN_Cleansed, Output_CN_Cleansed) AS HammingScore,
           CN_Validator.trial.LevinstienDistance(Input_CN_Cleansed, Output_CN_Cleansed) AS LevinstienDistance,
           FuzzyString.ComparisonMetrics.JaroWinklerDistance(Input_CN_Cleansed, Output_CN_Cleansed) AS JaroWinklerDistance
    FROM @CleansCheck;

OUTPUT @CheckData
    TO "/CN_Matcher/CN_Full_Run.txt"
    USING Outputters.Tsv();

CN_Matcher.usql.cs:

using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace CN_Validator
{
    public static class trial
    {

        public static string cleanser(string val)
        {
            List<string> wordsToRemove = "l.p. registered pc bldg pllc lp. l.c. div. national l p l.l.c international r. limited school azioni joint co-op corporation corp., (corp) inc., societa company llp liability l.l.l.p llc bancorporation manufacturing c dst (inc) jv ltd. llc. technology ltd., s.a. mfg rllp incorporated per venture l.l.p c. p.l.l.c l.p.. p. partnership corp co-operative s.p.a tech schl bancorp association lllp n r ltd inc. l.l.p. p.c. co district int intl assn. sa inc l.p co, co. division lc intl. lp professional corp. a l. l.l.c. building r.l.l.p co.,".Split(' ').ToList();
            return string.Join(" ", val.ToLower().Split(' ').Except(wordsToRemove));
        }

        public static int Hamming(string source, string target)
        {   
            int distance = 0;
            if (source.Length == target.Length)
            {
                for (int i = 0; i < source.Length; i++)
                {
                    if (!source[i].Equals(target[i]))
                    {
                        distance++;
                    }
                }
                return distance;
            }
            else { return 99999; }
        }

        public static int LevinstienDistance(string source, string target)
        {
            int n = source.Length;
            int m = target.Length;
            int[,] d = new int[n + 1, m + 1]; // matrix
            int cost; // cost
            // Step 1
            if (n == 0) return m;
            if (m == 0) return n;
            for (int i = 0; i <= n; d[i, 0] = i++) ;
            for (int j = 0; j <= m; d[0, j] = j++) ;
            for (int i = 1; i <= n; i++)
            {
                for (int j = 1; j <= m; j++)
                {
                    cost = (target.Substring(j - 1, 1) == source.Substring(i - 1, 1) ? 0 : 1);
                    d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                              d[i - 1, j - 1] + cost);
                }
            }
            return d[n, m];
        }

    }
}

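I ran a sample batch of 100 inputs with parallelism set to 1 and priority set to 1000. The job completed in 1.6 minutes.

I then wanted to test the same job with 1000 inputs, again with parallelism 1 and priority 1000. Since 100 inputs took 1.6 minutes, I expected 1000 inputs to take roughly 20 minutes, but the job ran for more than 50 minutes and I could not see any progress.

So I submitted another job with 100 inputs, and it ran the same as last time. I then raised the parallelism to 3 and ran it again; it still had not completed after 1 hour.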

JOB_ID=07c0850d-0770-4430-a288-5cddcfc26699

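The main problem is that I cannot see any progress or status.

Please let me know if I am doing something wrong.

Is there any way to use a constructor in USQL? If I could, I would not need to perform the same cleansing steps again and again.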
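
On the constructor question, one ordinary C# option is a static readonly field, which is initialized once when the type is first used rather than on every call. The sketch below is an illustration, not a confirmed fix for the slow job: the class name `CleanserSketch`, the method name `Cleanse`, and the shortened word list are hypothetical, and it assumes the code-behind is loaded like any other .NET assembly, which I have not verified for U-SQL vertices.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

namespace CN_Validator
{
    // Hypothetical helper, not part of the original script.
    public static class CleanserSketch
    {
        // Built once per process instead of once per row.
        // Shortened list for illustration; the full list is in cleanser() above.
        private static readonly HashSet<string> WordsToRemove =
            new HashSet<string>("l.p. registered pc bldg pllc inc. ltd llc corp".Split(' '));

        public static string Cleanse(string val)
        {
            // Guard against null or empty cells in the input file.
            if (string.IsNullOrEmpty(val)) return string.Empty;

            // Where(...) keeps repeated words in their original order;
            // the original Except(...) also drops duplicate words.
            return string.Join(" ",
                val.ToLower().Split(' ').Where(w => !WordsToRemove.Contains(w)));
        }
    }
}
```

Note the one behavioral difference: `Except` is a set operation, so the original `cleanser` returns each remaining word only once, while this version preserves duplicates. Which is preferable depends on how the distance scores are meant to treat repeated words.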

【Discussion】:

    Tags: c# azure-data-lake u-sql


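    【Solution 1】:

    I assume you are using the file set syntax to specify the 1000 files? Unfortunately, the current default implementation of file sets does not scale well, and the compilation (preparation) phase will take a long time (as will execution). We currently have a better implementation in preview. Could you send me an email at usql at Microsoft dot com? I will tell you how to try out the preview implementation.

    Thanks, Michael

    【Discussion】:

    • Hi Michael, it is not 1000 files but a single file with 1000 inputs. I will send you an email. Thanks for the reply.
    【Solution 2】:

    I worked up a more set-based approach. For example, instead of keeping the words to remove in the code-behind file, keep them in a U-SQL table so that they are easy to add to: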

    CREATE TABLE IF NOT EXISTS dbo.wordsToRemove
    (
        word string,
    
        INDEX cdx_wordsToRemvoe CLUSTERED (word ASC) 
        DISTRIBUTED BY HASH (word)
    );
    
    INSERT INTO dbo.wordsToRemove ( word )
    SELECT word
    FROM (
    VALUES
        ( "l.p." ),
        ( "registered" ),
        ( "pc" ),
        ( "bldg" ),
        ( "pllc" ),
        ( "lp." ),
        ( "l.c." ),
        ( "div." ),
        ( "national" ),
        ( "l" ),
        ( "p" ),
        ( "l.l.c" ),
        ( "international" ),
        ( "r." ),
        ( "limited" ),
        ( "school" ),
        ( "azioni" ),
        ( "joint" ),
        ( "co-op" ),
        ( "corporation" ),
        ( "corp.," ),
        ( "(corp)" ),
        ( "inc.," ),
        ( "societa" ),
        ( "company" ),
        ( "llp" ),
        ( "liability" ),
        ( "l.l.l.p" ),
        ( "llc" ),
        ( "bancorporation" ),
        ( "manufacturing" ),
        ( "c" ),
        ( "dst" ),
        ( "(inc)" ),
        ( "jv" ),
        ( "ltd." ),
        ( "llc." ),
        ( "technology" ),
        ( "ltd.," ),
        ( "s.a." ),
        ( "mfg" ),
        ( "rllp" ),
        ( "incorporated" ),
        ( "per" ),
        ( "venture" ),
        ( "l.l.p" ),
        ( "c." ),
        ( "p.l.l.c" ),
        ( "l.p.." ),
        ( "p." ),
        ( "partnership" ),
        ( "corp" ),
        ( "co-operative" ),
        ( "s.p.a" ),
        ( "tech" ),
        ( "schl" ),
        ( "bancorp" ),
        ( "association" ),
        ( "lllp" ),
        ( "n" ),
        ( "r" ),
        ( "ltd" ),
        ( "inc." ),
        ( "l.l.p." ),
        ( "p.c." ),
        ( "co" ),
        ( "district" ),
        ( "int" ),
        ( "intl" ),
        ( "assn." ),
        ( "sa" ),
        ( "inc" ),
        ( "l.p" ),
        ( "co," ),
        ( "co." ),
        ( "division" ),
        ( "lc" ),
        ( "intl." ),
        ( "lp" ),
        ( "professional" ),
        ( "corp." ),
        ( "a" ),
        ( "l." ),
        ( "l.l.c." ),
        ( "building" ),
        ( "r.l.l.p" ),
        ( "co.," )
    ) AS words(word);
    

    Then, to do the comparison, I split the original phrases apart, removed the words we do not want, and put the phrases back together, like so:

    //DECLARE @inputFile string = "input/input.csv"; // 500 companies, Standard & Poor 500 companies from wikipedia
    DECLARE @inputFile string = "input/input2.csv"; // 850,000 companies, part 1 of extract from Companies House
    
    
    @searchlog =
        EXTRACT id int,
                Input_CN string,
                Output_CN string
        FROM @inputFile
        USING Extractors.Csv(silent : true);
        //USING Extractors.Csv(skipFirstNRows:1);
    
    
    // Split the input string to remove unwanted words
    @Input_CN =
        SELECT id,
               new SQL.ARRAY<string>(Input_CN.Split(' ')) AS splitWords
        FROM @searchlog;
    
    
    @Output_CN =
        SELECT id,
               new SQL.ARRAY<string>(Output_CN.Split(' ')) AS splitWords
        FROM @searchlog;
    
    
    // Remove unwanted words from input string
    @Input_CN =
        SELECT *
        FROM
        (
            SELECT o.id,
                   x.splitWord.ToLower() AS splitWord
            FROM @Input_CN AS o
                 CROSS APPLY
                     EXPLODE(splitWords) AS x(splitWord)
        ) AS y    
        ANTISEMIJOIN
            dbo.wordsToRemove AS w
        ON y.splitWord == w.word;
    
    // Remove unwanted words from output string
    @Output_CN =
        SELECT *
        FROM
        (
            SELECT o.id,
                   x.splitWord.ToLower() AS splitWord
            FROM @Output_CN AS o
                 CROSS APPLY
                     EXPLODE(splitWords) AS x(splitWord)
        ) AS y
        ANTISEMIJOIN
            dbo.wordsToRemove AS w
        ON y.splitWord == w.word;
    
    
    
    
    // Put the input string back together again
    @Input_CN =
        SELECT id,
               String.Join( " ", ARRAY_AGG (splitWord) ) AS Input_CN_Cleansed
        FROM @Input_CN
        GROUP BY id;
    
    
    @Output_CN =
        SELECT id,
               String.Join( " ", ARRAY_AGG (splitWord) ) AS Output_CN_Cleansed
        FROM @Output_CN
        GROUP BY id;
    
    
    
    @output =
        SELECT i.id,
               i.Input_CN_Cleansed,
               o.Output_CN_Cleansed,
               CN_Validator.trial.Hamming(i.Input_CN_Cleansed, o.Output_CN_Cleansed) AS HammingScore,
               CN_Validator.trial.LevinstienDistance(i.Input_CN_Cleansed, o.Output_CN_Cleansed) AS LevinstienDistance
        FROM @Input_CN AS i
             INNER JOIN
                 @Output_CN AS o
             ON i.id == o.id;
    
    
    
    OUTPUT @output
        TO "/output/output.csv"
        USING Outputters.Csv();
    

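    I found the performance to be similar, but the design is arguably more maintainable. In any case, my code ran over 850+k records in a few minutes rather than 50+ minutes, so maybe there is another issue. Note that I did not have the FuzzyString library, so it was not included in my tests - it could explain the difference.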
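
    Separately from the set-based rewrite, the edit-distance function in the code-behind allocates two one-character strings per matrix cell via `Substring`. A minimal sketch of the same dynamic-programming recurrence, comparing `char` values directly and keeping only two rows, is shown here. The names `DistanceSketch` and `Levenshtein` are hypothetical, and this is unlikely on its own to account for a 50-minute run; it only removes avoidable allocations.

```csharp
using System;

namespace CN_Validator
{
    // Hypothetical helper, not part of the original script.
    public static class DistanceSketch
    {
        public static int Levenshtein(string source, string target)
        {
            int n = source.Length;
            int m = target.Length;
            if (n == 0) return m;
            if (m == 0) return n;

            // Only the previous and current rows of the matrix are needed.
            int[] prev = new int[m + 1];
            int[] curr = new int[m + 1];
            for (int j = 0; j <= m; j++) prev[j] = j;

            for (int i = 1; i <= n; i++)
            {
                curr[0] = i;
                for (int j = 1; j <= m; j++)
                {
                    // Ordinal char comparison, no string allocation.
                    int cost = source[i - 1] == target[j - 1] ? 0 : 1;
                    curr[j] = Math.Min(
                        Math.Min(prev[j] + 1, curr[j - 1] + 1),
                        prev[j - 1] + cost);
                }
                // Swap rows so that prev holds the row just computed.
                int[] tmp = prev;
                prev = curr;
                curr = tmp;
            }
            return prev[m];
        }
    }
}
```

    It takes the same two string arguments as `LevinstienDistance`, so it could be called from the U-SQL script in the same way once the code-behind is rebuilt.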

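    If you get an update on this from Microsoft, please post back to this thread, or even mark it as the answer if you like.

    【Discussion】:

    • If I get this resolved, I will definitely post it here. Thanks for the code rework. Since normalization is not recommended in SQL, I wanted to do this on the .NET side, but your code looks maintainable, and it looks like you are using the full power of USQL.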