[Question Title]: Find and extract a value after a specific string from a file using a bash shell script?
[Posted]: 2020-08-04 17:46:18
[Question]:

I have a file, file.txt, with the following details:

+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `dv.par_kst`( |
|   `col1` string,                                   |
|   `col2` string,                                   |
|   `col3` int,                                      |
|   `col4` int,                                      |
|   `col5` string,                                   |
|   `col6` float,                                    |
|   `col7` int,                                      |
|   `col8` string,                                   |
|   `col9` string,                                   |
|   `col10` int,                                     |
|   `col11` int,                                     |
|   `col12` string,                                  |
|   `col13` float,                                   |
|   `col14` string,                                  |
|   `col15` string)                                  |
| PARTITIONED BY (                                   |
|   `part_col1` int,                                 |
|   `part_col2` int)                                 |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'  |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
| LOCATION                                           |
|   'hdfs://nameservicets1/dv/hdfsdata/par_kst' |
| TBLPROPERTIES (                                    |
|   'spark.sql.create.version'='2.2 or prior',       |
|   'spark.sql.sources.schema.numPartCols'='2',      |
|   'spark.sql.sources.schema.numParts'='1',         |
|   'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"col1","type":"string","nullable":true,"metadata":{}},{"name":"col2","type":"string","nullable":true,"metadata":{}},{"name":"col3","type":"integer","nullable":true,"metadata":{}},{"name":"col4","type":"integer","nullable":true,"metadata":{}},{"name":"col5","type":"string","nullable":true,"metadata":{}},{"name":"col6","type":"float","nullable":true,"metadata":{}},{"name":"col7","type":"integer","nullable":true,"metadata":{}},{"name":"col8","type":"string","nullable":true,"metadata":{}},{"name":"col9","type":"string","nullable":true,"metadata":{}},{"name":"col10","type":"integer","nullable":true,"metadata":{}},{"name":"col11","type":"integer","nullable":true,"metadata":{}},{"name":"col12","type":"string","nullable":true,"metadata":{}},{"name":"col13","type":"float","nullable":true,"metadata":{}},{"name":"col14","type":"string","nullable":true,"metadata":{}},{"name":"col15","type":"string","nullable":true,"metadata":{}},{"name":"part_col1","type":"integer","nullable":true,"metadata":{}},{"name":"part_col2","type":"integer","nullable":true,"metadata":{}}]}',  |
|   'spark.sql.sources.schema.partCol.0'='part_col1',  |
|   'spark.sql.sources.schema.partCol.1'='part_col2',  |
|   'transient_lastDdlTime'='1587487456')            |
+----------------------------------------------------+

I want to extract the PARTITIONED BY details from the file above.

Desired output:

part_col1 , part_col2

The PARTITIONED BY columns are not fixed, meaning some other files may contain 3 or more of them, so I want to extract all the PARTITIONED BY columns.

That is, all values between PARTITIONED BY and ROW FORMAT SERDE, with the spaces, the "`" characters and the data types removed!

Can you help me solve this?

[Question Comments]:

    Tags: linux shell perl unix sh


    [Solution 1]:
    sed -nr '/PARTITIONED BY/,/ROW FORMAT SERDE/p' file.txt | sed -nr '/`/p' | cut -d '`' -f 2 | xargs -n 1 echo -n " "
    

    [Discussion]:

    • Also, the records are not in file.txt; I have to run something like: par_col=$(beeline --silent -u "$BEELINE_URL" -e "$sql") where sql="show create table dvs_wk.par_kst". par_col holds the output shown above, but when I run: result=$(sed -n '/PARTITIONED BY/,/ROW FORMAT SERDE/p' $par_col | sed -n '/`/p' | cut -d '`' -f 2 | xargs -n 1 echo -n " ") it gives me an error.
    • The first sed prints all the lines between PARTITIONED BY and ROW FORMAT SERDE (including them), then the second sed prints only the lines containing a "`" character; the cut command splits each such line into columns on "`" and prints the second column (your name); finally xargs grabs all the names and prints them with a space as separator. Maybe not the best pipeline, but it works for your example.
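Regarding the error in the first comment: it most likely happens because `$par_col` was passed to sed as a file name, while sed expects the text on stdin when it lives in a shell variable. A minimal sketch, where the sample DDL is a hypothetical stand-in for the real beeline output:

```shell
# Hypothetical stand-in for $(beeline --silent -u "$BEELINE_URL" -e "$sql");
# only the relevant lines of the real output are reproduced here.
par_col='| PARTITIONED BY (
|   `part_col1` int,
|   `part_col2` int)
| ROW FORMAT SERDE'

# Feed the variable on stdin with a here-string; a bare $par_col argument
# would be treated as a (non-existent) file name, hence the error.
result=$(sed -n '/PARTITIONED BY/,/ROW FORMAT SERDE/p' <<< "$par_col" \
  | grep '`' | cut -d '`' -f 2 | xargs)
echo "$result"    # part_col1 part_col2
```

`xargs` without a command joins its input with single spaces via `echo`, which also avoids the `echo -n` portability issue of the original pipeline.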
    [Solution 2]:
    use strict;
    use warnings;
    
    # Slurp the whole input (file argument or stdin) into one string.
    my $text = do { local $/; <> };
    
    my @partitioned;
    
    # Grab the content of the PARTITIONED BY (...) group, then pull
    # out every name that sits between backticks.
    if ($text =~ /PARTITIONED BY\s*\(([^()]*)\)/) {
        my $fullcontent = $1;
        push @partitioned, $1 while $fullcontent =~ /`([^`]+)`/g;
    }
    
    print join ", ", @partitioned;
    

    Output:

    part_col1, part_col2

    [Discussion]:

      [Solution 3]:

      When the layout of your result does not matter, you can ask sed to consider the lines between a start and an end marker, and only print such a line when a field can be found between 2 backticks.

      sed -rn '/PARTITIONED BY/,/ROW FORMAT/s/.*`(.*)`.*/\1/p' file.txt
      

      The results can be combined into one line when required:

      printf "%s , " $(sed -rn '/PARTITIONED BY/,/ROW FORMAT/s/.*`(.*)`.*/\1 /p' file.txt) |
         sed 's/ , $/\n/'
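An equivalent variant (my substitution, not part of the answer): let `paste` do the joining and widen the comma afterwards, so no trailing separator needs to be trimmed. The sample file below reproduces only the relevant lines of the question's file.txt:

```shell
# Trimmed sample of the question's file.txt.
cat > file.txt <<'EOF'
| PARTITIONED BY (                                   |
|   `part_col1` int,                                 |
|   `part_col2` int)                                 |
| ROW FORMAT SERDE                                   |
EOF

# Extract the names, join them on one line with paste, then widen
# the comma to match the " , " separator of the desired output.
sed -rn '/PARTITIONED BY/,/ROW FORMAT/s/.*`(.*)`.*/\1/p' file.txt \
  | paste -sd ',' - \
  | sed 's/,/ , /g'    # part_col1 , part_col2
```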
      

      [Discussion]:

        [Solution 4]:

        A small perl script:

        • Read the whole file into the $data variable
        • Select everything between PARTITIONED BY (....)
        • Collect only the elements between ` into an array
        • Print the result joined with ,
        use strict;
        use warnings;
        use feature 'say';
        
        my $data = do { local $/; <> };
        my $re   = 'PARTITIONED BY \((.*?)\)';
        
        $data =~ /$re/sg;
        
        my @part = $1 =~ /`(.*?)`/sg;
        
        say join ', ', @part;
        

          [Discussion]:
