【Question Title】: Data insert issue
【Posted】: 2021-02-02 17:58:21
【Question Description】:

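I ran into this problem when I load a CSV file through my HQL code and run it on HDFS. After the insert, the partition part contains NULL values and some columns are dropped. I have tried many different ways of inserting the data, but I still get these strange symbols and missing columns, as if Hive could not read the CSV file. The output was shown in a screenshot in the original post. Here is the code: `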

CREATE database covid_db;

use covid_db;


SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=500;
set hive.exec.max.dynamic.partitions.pernode=500;


CREATE TABLE IF NOT EXISTS covid_db.covid_staging 
(
 Country                            STRING,
 Total_Cases                        DOUBLE,
 New_Cases                          DOUBLE,
 Total_Deaths                       DOUBLE,
 New_Deaths                         DOUBLE,
 Total_Recovered                    DOUBLE,
 Active_Cases                       DOUBLE,
 Serious                            DOUBLE,
 Tot_Cases                          DOUBLE,
 Deaths                             DOUBLE,
 Total_Tests                        DOUBLE,
 Tests                              DOUBLE,
 CASES_per_Test                     DOUBLE,
 Death_in_Closed_Cases              STRING,
 Rank_by_Testing_rate               DOUBLE,
 Rank_by_Death_rate                 DOUBLE,
 Rank_by_Cases_rate                 DOUBLE,
 Rank_by_Death_of_Closed_Cases      DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED by ','
STORED as TEXTFILE
LOCATION '/user/cloudera/ds/COVID_HDFS_LZ'
tblproperties ("skip.header.line.count"="1", "serialization.null.format" = "''");

CREATE EXTERNAL TABLE IF NOT EXISTS covid_db.covid_ds_partitioned 
(
 Country                            STRING,
 Total_Cases                        DOUBLE,
 New_Cases                          DOUBLE,
 Total_Deaths                       DOUBLE,
 New_Deaths                         DOUBLE,
 Total_Recovered                    DOUBLE,
 Active_Cases                       DOUBLE,
 Serious                            DOUBLE,
 Tot_Cases                          DOUBLE,
 Deaths                             DOUBLE,
 Total_Tests                        DOUBLE,
 Tests                              DOUBLE,
 CASES_per_Test                     DOUBLE,
 Death_in_Closed_Cases              STRING,
 Rank_by_Testing_rate               DOUBLE,
 Rank_by_Death_rate                 DOUBLE,
 Rank_by_Cases_rate                 DOUBLE,
 Rank_by_Death_of_Closed_Cases      DOUBLE
)
PARTITIONED BY (COUNTRY_NAME STRING)
STORED as TEXTFILE
LOCATION '/user/cloudera/ds/COVID_HDFS_PARTITIONED';

FROM
covid_db.covid_staging
INSERT INTO TABLE covid_db.covid_ds_partitioned PARTITION(COUNTRY_NAME)
SELECT *,Country WHERE Country is not null;


CREATE EXTERNAL TABLE covid_db.covid_final_output 
(
 TOP_DEATH                          STRING,
 TOP_TEST                           STRING
)
PARTITIONED BY (COUNTRY_NAME STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED by ','
STORED as TEXTFILE
LOCATION '/user/cloudera/ds/COVID_FINAL_OUTPUT';

`

【Question Discussion】:

    Tags: hive hdfs hql


    【Solution 1】:

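    First: you are inspecting the file contents, but the partition column is not stored in the data files; it is stored in the metadata. Dynamically created partitions also get their own directories, named in key=value format. So the last column you see in a file is not the partition column, it is Rank_by_Death_of_Closed_Cases.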
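
    One way to confirm this is to look at the partition metadata and directories rather than at the raw files. The statements below are a minimal sketch using the table name and location from the question; the `dfs -ls` line assumes a session (such as the Hive CLI) where `dfs` commands are permitted, and the directory names are expected to look like country_name=<value> because Hive lowercases column names.

```sql
-- List the partitions registered in the metastore for the partitioned table.
SHOW PARTITIONS covid_db.covid_ds_partitioned;

-- The partition column can be selected like any other column,
-- even though it is not written into the data files.
SELECT Country, Rank_by_Death_of_Closed_Cases, COUNTRY_NAME
FROM covid_db.covid_ds_partitioned
LIMIT 5;

-- List the key=value partition directories under the table location
-- (assumes dfs commands are allowed in this session).
dfs -ls /user/cloudera/ds/COVID_HDFS_PARTITIONED;
```

    If the partitions are listed and COUNTRY_NAME comes back populated in the SELECT, the partitioning itself worked and only the file format needs attention.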

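    Second: the DDL of the second table specifies neither a field delimiter nor a NULL format. The default field delimiter is '\001' (Ctrl-A), which is most likely the strange symbol you see in the files. You can specify a delimiter, for example TAB (\t), and the NULL representation you want: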

    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    NULL DEFINED AS ''
    STORED AS TEXTFILE;
    

    However, if you want to be able to distinguish NULL from an empty string, it is better not to redefine the NULL format.
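
    Applied to the table from the question, this could look like the sketch below. It assumes the table can be dropped and reloaded from covid_staging, uses TAB as the delimiter, and leaves the NULL format at its default. Dropping an EXTERNAL table removes only the metadata, so any Ctrl-A-delimited files already under the location may need to be cleaned up separately.

```sql
-- Dropping an EXTERNAL table removes only its metadata; files under LOCATION stay.
DROP TABLE IF EXISTS covid_db.covid_ds_partitioned;

CREATE EXTERNAL TABLE covid_db.covid_ds_partitioned
(
 Country                            STRING,
 Total_Cases                        DOUBLE,
 New_Cases                          DOUBLE,
 Total_Deaths                       DOUBLE,
 New_Deaths                         DOUBLE,
 Total_Recovered                    DOUBLE,
 Active_Cases                       DOUBLE,
 Serious                            DOUBLE,
 Tot_Cases                          DOUBLE,
 Deaths                             DOUBLE,
 Total_Tests                        DOUBLE,
 Tests                              DOUBLE,
 CASES_per_Test                     DOUBLE,
 Death_in_Closed_Cases              STRING,
 Rank_by_Testing_rate               DOUBLE,
 Rank_by_Death_rate                 DOUBLE,
 Rank_by_Cases_rate                 DOUBLE,
 Rank_by_Death_of_Closed_Cases      DOUBLE
)
PARTITIONED BY (COUNTRY_NAME STRING)
-- Explicit delimiter so the output files no longer use the default Ctrl-A.
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/cloudera/ds/COVID_HDFS_PARTITIONED';

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Reload from staging; the last selected column feeds the dynamic partition.
INSERT OVERWRITE TABLE covid_db.covid_ds_partitioned PARTITION (COUNTRY_NAME)
SELECT s.*, s.Country
FROM covid_db.covid_staging s
WHERE s.Country IS NOT NULL;
```

    The delimiter only changes how the files are written and read. The partition value still lives in the directory name and the metastore, not in the file.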

    【Discussion】:
