【问题标题】:Add new column with a specific row value in U-SQL在 U-SQL 中添加具有特定行值的新列
【发布时间】:2017-07-11 06:13:24
【问题描述】:

我必须从 U-SQL 中的文本文件中提取记录。第一行与其他行不同,它包含日期。我必须跳过第一行,但我必须从第一行复制日期并将其粘贴到所有行的新列中。因此,在我最终的 u-sql 输出查询中,每行的第一列将包含从文件第一行复制的相同数据。更多信息请查看附件图片文件。

请建议我正确的 u-sql 查询来完成这项任务。

【问题讨论】:

    标签: c# azure azure-data-lake u-sql


    【解决方案1】:

    这是另一种方式。我使用示例中的 SearchLog.tsv 来演示这一点。在文件的顶部,我添加了 FROM 01JAN17 TO 31JAN17 行。

    //Skip the first row and read all the other rows
    @searchlog = 
        EXTRACT UserId          int, 
                Start           DateTime, 
                Region          string, 
                Query           string, 
                Duration        int, 
                Urls            string, 
                ClickedUrls     string
        FROM @"/Samples/Data/SearchLogWithHeader.tsv"
        USING Extractors.Tsv(skipFirstNRows: 1);
    
    //Extract all the text in the same file but don't parse out the individual columns
    @searchlogAllText = 
        EXTRACT rowText string
        FROM @"/Samples/Data/SearchLogWithHeader.tsv"
        USING Extractors.Text(delimiter: '\n');
    
    //Find a pattern that works for you and use the .NET expressions that match the string
    @searchlogHeaderDate = 
        SELECT rowText.Split(' ')[1] AS FromDate FROM @searchlogAllText WHERE rowText.StartsWith("FROM");                                    
    
    @output = SELECT * FROM @searchlogHeaderDate CROSS JOIN @searchlog;
    
    OUTPUT @output 
        TO @"/Samples/Output/SearchLog_output.tsv"
        USING Outputters.Tsv();
    

    输入:

    FROM 01JAN17 TO 31JAN17
    399266  2/15/2012 11:53:16 AM   en-us   how to make nachos  73  www.nachos.com;www.wikipedia.com    NULL
    382045  2/15/2012 11:53:18 AM   en-gb   best ski resorts    614 skiresorts.com;ski-europe.com;www.travelersdigest.com/ski_resorts.htm   ski-europe.com;www.travelersdigest.com/ski_resorts.htm
    382045  2/16/2012 11:53:20 AM   en-gb   broken leg  74  mayoclinic.com/health;webmd.com/a-to-z-guides;mybrokenleg.com;wikipedia.com/Bone_fracture   mayoclinic.com/health;webmd.com/a-to-z-guides;mybrokenleg.com;wikipedia.com/Bone_fracture
    106479  2/16/2012 11:53:50 AM   en-ca   south park episodes 24  southparkstudios.com;wikipedia.org/wiki/Sout_Park;imdb.com/title/tt0121955;simon.com/mall   southparkstudios.com
    906441  2/16/2012 11:54:01 AM   en-us   cosmos  1213    cosmos.com;wikipedia.org/wiki/Cosmos:_A_Personal_Voyage;hulu.com/cosmos NULL
    351530  2/16/2012 11:54:01 AM   en-fr   microsoft   241 microsoft.com;wikipedia.org/wiki/Microsoft;xbox.com NULL
    640806  2/16/2012 11:54:02 AM   en-us   wireless headphones 502 www.amazon.com;reviews.cnet.com/wireless-headphones;store.apple.com www.amazon.com;store.apple.com
    304305  2/16/2012 11:54:03 AM   en-us   dominos pizza   60  dominos.com;wikipedia.org/wiki/Domino's_Pizza;facebook.com/dominos  dominos.com
    460748  2/16/2012 11:54:04 AM   en-us   yelp    1270    yelp.com;apple.com/us/app/yelp;wikipedia.org/wiki/Yelp,_Inc.;facebook.com/yelp  yelp.com
    354841  2/16/2012 11:59:01 AM   en-us   how to run  610 running.about.com;ehow.com;go.com   running.about.com;ehow.com
    354068  2/16/2012 12:00:33 PM   en-mx   what is sql 422 wikipedia.org/wiki/SQL;sqlcourse.com/intro.html;wikipedia.org/wiki/Microsoft_SQL    wikipedia.org/wiki/SQL
    674364  2/16/2012 12:00:55 PM   en-us   mexican food redmond    283 eltoreador.com;yelp.com/c/redmond-wa/mexican;agaverest.com  NULL
    347413  2/16/2012 12:11:55 PM   en-gr   microsoft   305 microsoft.com;wikipedia.org/wiki/Microsoft;xbox.com NULL
    848434  2/16/2012 12:12:35 PM   en-ch   facebook    10  facebook.com;facebook.com/login;wikipedia.org/wiki/Facebook facebook.com
    604846  2/16/2012 12:13:55 PM   en-us   wikipedia   612 wikipedia.org;en.wikipedia.org;en.wikipedia.org/wiki/Wikipedia  wikipedia.org
    840614  2/16/2012 12:13:56 PM   en-us   xbox    1220    xbox.com;en.wikipedia.org/wiki/Xbox;xbox.com/xbox360    xbox.com/xbox360
    656666  2/16/2012 12:15:55 PM   en-us   hotmail 691 hotmail.com;login.live.com;msn.com;en.wikipedia.org/wiki/Hotmail    NULL
    951513  2/16/2012 12:17:00 PM   en-us   pokemon 63  pokemon.com;pokemon.com/us;serebii.net  pokemon.com
    350350  2/16/2012 12:18:17 PM   en-us   wolfram 30  wolframalpha.com;wolfram.com;mathworld.wolfram.com;en.wikipedia.org/wiki/Stephen_Wolfram    NULL
    641615  2/16/2012 12:19:55 PM   en-us   kahn    119 khanacademy.org;en.wikipedia.org/wiki/Khan_(title);answers.com/topic/genghis-khan;en.wikipedia.org/wiki/Khan_(name) khanacademy.org
    321065  2/16/2012 12:20:03 PM   en-us   clothes 732 gap.com;overstock.com;forever21.com;footballfanatics.com/college_washington_state_cougars   footballfanatics.com/college_washington_state_cougars
    651777  2/16/2012 12:20:33 PM   en-us   food recipes    183 allrecipes.com;foodnetwork.com;simplyrecipes.com    foodnetwork.com
    666352  2/16/2012 12:21:03 PM   en-us   weight loss 630 en.wikipedia.org/wiki/Weight_loss;webmd.com/diet;exercise.about.com webmd.com/diet
    

    输出:

    "01JAN17"   399266  2012-02-15T11:53:16.0000000 "en-us" "how to make nachos"    73  "www.nachos.com;www.wikipedia.com"  "NULL"
    "01JAN17"   382045  2012-02-15T11:53:18.0000000 "en-gb" "best ski resorts"  614 "skiresorts.com;ski-europe.com;www.travelersdigest.com/ski_resorts.htm" "ski-europe.com;www.travelersdigest.com/ski_resorts.htm"
    "01JAN17"   382045  2012-02-16T11:53:20.0000000 "en-gb" "broken leg"    74  "mayoclinic.com/health;webmd.com/a-to-z-guides;mybrokenleg.com;wikipedia.com/Bone_fracture" "mayoclinic.com/health;webmd.com/a-to-z-guides;mybrokenleg.com;wikipedia.com/Bone_fracture"
    "01JAN17"   106479  2012-02-16T11:53:50.0000000 "en-ca" "south park episodes"   24  "southparkstudios.com;wikipedia.org/wiki/Sout_Park;imdb.com/title/tt0121955;simon.com/mall" "southparkstudios.com"
    "01JAN17"   906441  2012-02-16T11:54:01.0000000 "en-us" "cosmos"    1213    "cosmos.com;wikipedia.org/wiki/Cosmos:_A_Personal_Voyage;hulu.com/cosmos"   "NULL"
    "01JAN17"   351530  2012-02-16T11:54:01.0000000 "en-fr" "microsoft" 241 "microsoft.com;wikipedia.org/wiki/Microsoft;xbox.com"   "NULL"
    "01JAN17"   640806  2012-02-16T11:54:02.0000000 "en-us" "wireless headphones"   502 "www.amazon.com;reviews.cnet.com/wireless-headphones;store.apple.com"   "www.amazon.com;store.apple.com"
    "01JAN17"   304305  2012-02-16T11:54:03.0000000 "en-us" "dominos pizza" 60  "dominos.com;wikipedia.org/wiki/Domino's_Pizza;facebook.com/dominos"    "dominos.com"
    "01JAN17"   460748  2012-02-16T11:54:04.0000000 "en-us" "yelp"  1270    "yelp.com;apple.com/us/app/yelp;wikipedia.org/wiki/Yelp,_Inc.;facebook.com/yelp"    "yelp.com"
    "01JAN17"   354841  2012-02-16T11:59:01.0000000 "en-us" "how to run"    610 "running.about.com;ehow.com;go.com" "running.about.com;ehow.com"
    "01JAN17"   354068  2012-02-16T12:00:33.0000000 "en-mx" "what is sql"   422 "wikipedia.org/wiki/SQL;sqlcourse.com/intro.html;wikipedia.org/wiki/Microsoft_SQL"  "wikipedia.org/wiki/SQL"
    "01JAN17"   674364  2012-02-16T12:00:55.0000000 "en-us" "mexican food redmond"  283 "eltoreador.com;yelp.com/c/redmond-wa/mexican;agaverest.com"    "NULL"
    "01JAN17"   347413  2012-02-16T12:11:55.0000000 "en-gr" "microsoft" 305 "microsoft.com;wikipedia.org/wiki/Microsoft;xbox.com"   "NULL"
    "01JAN17"   848434  2012-02-16T12:12:35.0000000 "en-ch" "facebook"  10  "facebook.com;facebook.com/login;wikipedia.org/wiki/Facebook"   "facebook.com"
    "01JAN17"   604846  2012-02-16T12:13:55.0000000 "en-us" "wikipedia" 612 "wikipedia.org;en.wikipedia.org;en.wikipedia.org/wiki/Wikipedia"    "wikipedia.org"
    "01JAN17"   840614  2012-02-16T12:13:56.0000000 "en-us" "xbox"  1220    "xbox.com;en.wikipedia.org/wiki/Xbox;xbox.com/xbox360"  "xbox.com/xbox360"
    "01JAN17"   656666  2012-02-16T12:15:55.0000000 "en-us" "hotmail"   691 "hotmail.com;login.live.com;msn.com;en.wikipedia.org/wiki/Hotmail"  "NULL"
    "01JAN17"   951513  2012-02-16T12:17:00.0000000 "en-us" "pokemon"   63  "pokemon.com;pokemon.com/us;serebii.net"    "pokemon.com"
    "01JAN17"   350350  2012-02-16T12:18:17.0000000 "en-us" "wolfram"   30  "wolframalpha.com;wolfram.com;mathworld.wolfram.com;en.wikipedia.org/wiki/Stephen_Wolfram"  "NULL"
    "01JAN17"   641615  2012-02-16T12:19:55.0000000 "en-us" "kahn"  119 "khanacademy.org;en.wikipedia.org/wiki/Khan_(title);answers.com/topic/genghis-khan;en.wikipedia.org/wiki/Khan_(name)"   "khanacademy.org"
    "01JAN17"   321065  2012-02-16T12:20:03.0000000 "en-us" "clothes"   732 "gap.com;overstock.com;forever21.com;footballfanatics.com/college_washington_state_cougars" "footballfanatics.com/college_washington_state_cougars"
    "01JAN17"   651777  2012-02-16T12:20:33.0000000 "en-us" "food recipes"  183 "allrecipes.com;foodnetwork.com;simplyrecipes.com"  "foodnetwork.com"
    "01JAN17"   666352  2012-02-16T12:21:03.0000000 "en-us" "weight loss"   630 "en.wikipedia.org/wiki/Weight_loss;webmd.com/diet;exercise.about.com"   "webmd.com/diet"
    

    【讨论】:

      【解决方案2】:

      我能够使用普通 U-SQL(即不是自定义 UDO)和 sample file

      USING rx = System.Text.RegularExpressions.Regex;
      
      DECLARE @inputFilepath string = "input/input71.txt";
      DECLARE @outputFilepath string = "output/output71.csv";
      
      
      // Get the first line; use silent option to skip all other lines 
      // ie which have more than one column
      @file =
          EXTRACT headerLine string
          FROM @inputFilepath
          USING Extractors.Text(delimiter : '|', silent : true);
      
      
      // Get the start date from the header
      @header =
          SELECT
              headerLine,
              rx.Match(headerLine, @"FROM (?<startDate>\d{2}[A-Z]{3}\d{2}) TO (?<endDate>\d{2}[A-Z]{3}\d{2})").Groups["startDate"].ToString() AS startDate
          FROM @file
          WHERE headerLine.Contains("FROM");
      
      
      
      // Get the rest of the lines; skip the header row explicity
      // don't use 'silent' as it should not be required (as we're skipping header row)
      @body =
          EXTRACT runDate string,
                  col1 int,
                  col2 int,
                  col3 int
          FROM @inputFilepath
          USING Extractors.Text(delimiter : '|', skipFirstNRows : 1);
      
      
      @result =
          SELECT h.startDate, p.*
          FROM @header AS h
               CROSS JOIN
                   @body AS p;
      
      
      // Export as csv
      OUTPUT @result
      TO @outputFilepath
      USING Outputters.Csv(quoting:false);
      

      我的结果:

      这是一个演示 U-SQL 强大功能的简单示例,将 RegEx 与集合操作配对。看看类似的东西是否适合你。

      【讨论】: