【Question title】: SQOOP export HDFS to MYSQL db
【Posted】: 2021-11-08 02:27:40
【Question】:

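I am trying to export data from HDFS to a MySQL database. I have found various solutions, but none of them worked; I even tried removing the WINDOWS-1251 characters from the file.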

As a short summary: I am doing this in VirtualBox with the Hortonworks image.

My Hive table in the default database:

CREATE EXTERNAL TABLE `airqualitydata`(
  `sensor_id` VARCHAR(100),
  `sensor_type` VARCHAR(100), 
  `location` VARCHAR(100), 
  `lat` VARCHAR(100), 
  `lon` VARCHAR(100), 
  `timestamp` timestamp, 
  `p1` VARCHAR(100), 
  `durp1` VARCHAR(100), 
  `ratiop1` VARCHAR(100), 
  `p2` VARCHAR(100), 
  `durp2` VARCHAR(100), 
  `ratiop2` VARCHAR(100))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LOCATION 'hdfs://sandbox-hdp.hortonworks.com:8020/hadoop/airqualitydata'
TBLPROPERTIES ("skip.header.line.count"="1");

The file contained in /hadoop/airqualitydata on HDFS (with the win1251 characters removed, just to be sure).

Note that this data can be viewed by querying SELECT * FROM airqualitydata in Hive.

sensor_id;sensor_type;location;lat;lon;timestamp;P1;durP1;ratioP1;P2;durP2;ratioP2
9710;SDS011;4894;43.226;27.934;2021-09-09T00:00:12;70;;;20;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:02:41;83;;;0.93;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:05:14;0.80;;;0.73;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:07:42;0.50;;;0.50;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:10:10;57;;;0.80;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:12:39;0.40;;;0.40;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:15:07;0.70;;;0.70;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:17:35;2;;;0.47;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:20:04;90;;;0.63;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:22:34;0.57;;;0.57;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:25:01;0.73;;;0.60;;

MySQL database and table:

CREATE DATABASE airquality CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
CREATE TABLE `airqualitydata`(
  `sensor_id` VARCHAR(100), 
  `sensor_type` VARCHAR(100), 
  `location` VARCHAR(100), 
  `lat` VARCHAR(100), 
  `lon` VARCHAR(100), 
  `timestamp` timestamp, 
  `p1` VARCHAR(100), 
  `durp1` VARCHAR(100), 
  `ratiop1` VARCHAR(100), 
  `p2` VARCHAR(100), 
  `durp2` VARCHAR(100), 
  `ratiop2` VARCHAR(100)
);

Sqoop CLI call:

sqoop export --connect "jdbc:mysql://localhost:3306/airquality?useUnicode=true&characterEncoding=WINDOWS-1251" --username root --password hortonworks1 --export-dir hdfs://sandbox-hdp.hortonworks.com:8020/hadoop/airqualitydata --table airqualitydata --input-fields-terminated-by "\073" --input-lines-terminated-by "\n" -m 1

I also tried removing ?useUnicode=true&characterEncoding=WINDOWS-1251, without success. I cannot access the logs at the URL shown in the terminal either, so this failure is all I get:

Warning: /usr/hdp/2.6.5.0-292/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
21/09/12 04:04:40 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.6.5.0-292
21/09/12 04:04:40 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
21/09/12 04:04:40 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
21/09/12 04:04:40 INFO tool.CodeGenTool: Beginning code generation
Sun Sep 12 04:04:40 UTC 2021 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
21/09/12 04:04:40 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `airqualitydata` AS t LIMIT 1
21/09/12 04:04:40 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `airqualitydata` AS t LIMIT 1
21/09/12 04:04:40 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.6.5.0-292/hadoop-mapreduce
Note: /tmp/sqoop-raj_ops/compile/41fba9933b913b974b70403656a13287/airqualitydata.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
21/09/12 04:04:42 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-raj_ops/compile/41fba9933b913b974b70403656a13287/airqualitydata.jar
21/09/12 04:04:42 INFO mapreduce.ExportJobBase: Beginning export of airqualitydata
21/09/12 04:04:43 INFO client.RMProxy: Connecting to ResourceManager at sandbox-hdp.hortonworks.com/172.18.0.2:8032
21/09/12 04:04:43 INFO client.AHSProxy: Connecting to Application History server at sandbox-hdp.hortonworks.com/172.18.0.2:10200
21/09/12 04:04:50 INFO input.FileInputFormat: Total input paths to process : 1
21/09/12 04:04:50 INFO input.FileInputFormat: Total input paths to process : 1
21/09/12 04:04:50 INFO mapreduce.JobSubmitter: number of splits:1
21/09/12 04:04:51 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1631399426919_0028
21/09/12 04:04:51 INFO impl.YarnClientImpl: Submitted application application_1631399426919_0028
21/09/12 04:04:51 INFO mapreduce.Job: The url to track the job: http://sandbox-hdp.hortonworks.com:8088/proxy/application_1631399426919_0028/
21/09/12 04:04:51 INFO mapreduce.Job: Running job: job_1631399426919_0028
21/09/12 04:04:59 INFO mapreduce.Job: Job job_1631399426919_0028 running in uber mode : false
21/09/12 04:04:59 INFO mapreduce.Job:  map 0% reduce 0%
21/09/12 04:05:03 INFO mapreduce.Job:  map 100% reduce 0%
21/09/12 04:05:04 INFO mapreduce.Job: Job job_1631399426919_0028 failed with state FAILED due to: Task failed task_1631399426919_0028_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0

21/09/12 04:05:04 INFO mapreduce.Job: Counters: 8
        Job Counters
                Failed map tasks=1
                Launched map tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=2840
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=2840
                Total vcore-milliseconds taken by all map tasks=2840
                Total megabyte-milliseconds taken by all map tasks=710000
21/09/12 04:05:04 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
21/09/12 04:05:04 INFO mapreduce.ExportJobBase: Transferred 0 bytes in 21.2627 seconds (0 bytes/sec)
21/09/12 04:05:04 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
21/09/12 04:05:04 INFO mapreduce.ExportJobBase: Exported 0 records.
21/09/12 04:05:04 ERROR mapreduce.ExportJobBase: Export job failed!
21/09/12 04:05:04 ERROR tool.ExportTool: Error during export: Export job failed!
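The client output above only says that the map task failed; the underlying exception is in the task's container log. When the tracking URL is not reachable, the log can usually be pulled on the command line instead. Below is a minimal sketch, assuming YARN log aggregation is enabled on the sandbox and that the Sqoop output was saved to a file; the function names and the log file name are made up for illustration.

```shell
# Extract the first YARN application ID from Sqoop client output on stdin.
app_id_from_log() {
  grep -oE 'application_[0-9]+_[0-9]+' | head -n 1
}

# Print the aggregated container logs for the application named in a saved
# Sqoop log file. Assumes log aggregation is enabled and the job has finished.
show_task_logs() {
  local app_id
  app_id="$(app_id_from_log < "$1")"
  if [ -z "$app_id" ]; then
    echo "no application ID found in $1" >&2
    return 1
  fi
  yarn logs -applicationId "$app_id"
}
```

For example, after running the export with `2>&1 | tee sqoop_export.log`, calling `show_task_logs sqoop_export.log` should print the map task's stack trace, which normally names the row or column that could not be parsed or inserted.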

Any pointers would be helpful, thanks!

Edit #1: Following the suggestions in the comments, I used:

sqoop export --connect jdbc:mysql://localhost:3306/airquality  --table airqualitydata  --username root --password hortonworks1 --hcatalog-database default --hcatalog-table airqualitydata --verbose

or, in general form (for anyone reproducing this):

sqoop export --connect jdbc:mysql://<host:port>/<mysql db> --table <mysql table> --username <mysql_user> --password <mysqlpass> --hcatalog-database <hive_db> --hcatalog-table <hive_table> --verbose

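With this, the data does get loaded into MySQL. However, the header row is loaded too. Also, when the export is run twice (I believed it should overwrite the data), the data ends up in the table twice.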

+-----------+-------------+----------+--------+--------+---------------------+------+-------+---------+------+-------+---------+
| sensor_id | sensor_type | location | lat    | lon    | timestamp           | p1   | durp1 | ratiop1 | p2   | durp2 | ratiop2 |
+-----------+-------------+----------+--------+--------+---------------------+------+-------+---------+------+-------+---------+
| sensor_id | sensor_type | location | lat    | lon    | 2021-09-12 05:55:49 | P1   | durP1 | ratioP1 | P2   | durP2 | ratioP2 |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 70   |       |         | 20   |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 83   |       |         | 0.93 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.80 |       |         | 0.73 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.50 |       |         | 0.50 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 57   |       |         | 0.80 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.40 |       |         | 0.40 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.70 |       |         | 0.70 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 2    |       |         | 0.47 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 90   |       |         | 0.63 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.57 |       |         | 0.57 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.73 |       |         | 0.60 |       |         |
| sensor_id | sensor_type | location | lat    | lon    | 2021-09-12 05:58:02 | P1   | durP1 | ratioP1 | P2   | durP2 | ratioP2 |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 70   |       |         | 20   |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 83   |       |         | 0.93 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.80 |       |         | 0.73 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.50 |       |         | 0.50 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 57   |       |         | 0.80 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.40 |       |         | 0.40 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.70 |       |         | 0.70 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 2    |       |         | 0.47 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 90   |       |         | 0.63 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.57 |       |         | 0.57 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.73 |       |         | 0.60 |       |         |
+-----------+-------------+----------+--------+--------+---------------------+------+-------+---------+------+-------+---------+

The data in Hive is fine (there is no header row there). What could be causing this?

I also get an exception, although the job completes overall. Does it matter?

21/09/12 05:57:41 INFO mapreduce.Job: Running job: job_1631399426919_0035
21/09/12 05:57:55 INFO mapreduce.Job: Job job_1631399426919_0035 running in uber mode : false
21/09/12 05:57:55 INFO mapreduce.Job:  map 0% reduce 0%
21/09/12 05:58:03 INFO mapreduce.Job:  map 100% reduce 0%
21/09/12 05:58:05 INFO mapreduce.Job: Job job_1631399426919_0035 completed successfully
21/09/12 05:58:06 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=345759
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2597
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=2
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
        Job Counters
                Launched map tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=4966
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=4966
                Total vcore-milliseconds taken by all map tasks=4966
                Total megabyte-milliseconds taken by all map tasks=1241500
        Map-Reduce Framework
                Map input records=12
                Map output records=12
                Input split bytes=1800
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=211
                CPU time spent (ms)=3490
                Physical memory (bytes) snapshot=217477120
                Virtual memory (bytes) snapshot=1972985856
                Total committed heap usage (bytes)=51380224
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0
21/09/12 05:58:06 INFO mapreduce.ExportJobBase: Transferred 2.5361 KB in 62.3328 seconds (41.6635 bytes/sec)
21/09/12 05:58:06 INFO mapreduce.ExportJobBase: Exported 12 records.
21/09/12 05:58:06 INFO mapreduce.ExportJobBase: Publishing HCatalog export job data to Listeners
21/09/12 05:58:06 WARN mapreduce.PublishJobData: Unable to publish export data to publisher org.apache.atlas.sqoop.hook.SqoopHook
java.lang.ClassNotFoundException: org.apache.atlas.sqoop.hook.SqoopHook
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at org.apache.sqoop.mapreduce.PublishJobData.publishJobData(PublishJobData.java:46)
        at org.apache.sqoop.mapreduce.ExportJobBase.runExport(ExportJobBase.java:457)
        at org.apache.sqoop.manager.SqlManager.exportTable(SqlManager.java:931)
        at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:81)
        at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:100)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:225)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:243)
21/09/12 05:58:06 DEBUG util.ClassLoaderStack: Restoring classloader: sun.misc.Launcher$AppClassLoader@4232c52b

【Comments】:

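  • Could you use the table and database names, like --hcatalog-database mydb --hcatalog-table airquality, and remove `--export-dir`? Also make sure the data types and lengths in Hive and MySQL are equivalent or similar.
  • Also, you can run with --verbose to get more information from Sqoop.
  • Hi @KoushikRoy, I just updated the post. It appends the data to the table (so running it twice gives me duplicates, which I don't think is intended and would at least like to fix), and there is an exception, but the job completes overall. You may want to post this as an answer so I can accept it. If you have any ideas about the other issues, I'd appreciate that too. Thanks for your time!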

Tags: mysql hive hdfs sqoop


【Solution 1】:

To solve your first problem: use --hcatalog-database mydb --hcatalog-table airquality and remove the --export-dir argument.

A Sqoop export cannot replace existing data; it only appends. Before loading the main table, issue a sqoop eval statement to truncate it.

sqoop eval --connect conn_parameters --username xx --password yy --query "truncate table mytab;"
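As a rough sketch of how the truncate and the export could be chained so that a rerun does not duplicate rows: the connection values below are taken from the question, while the variable and function names are made up for illustration. It uses -P (prompt for the password) rather than --password, as the Sqoop warning in the question suggests, so it will prompt once per command.

```shell
# Connection settings from the question; adjust for your environment.
JDBC_URL="jdbc:mysql://localhost:3306/airquality"
MYSQL_USER="root"
MYSQL_TABLE="airqualitydata"

# Build the TRUNCATE statement that is passed to `sqoop eval --query`.
truncate_sql() {
  printf 'TRUNCATE TABLE `%s`' "$1"
}

# Empty the target table, then export from the Hive table via HCatalog.
# The export only runs if the truncate succeeded.
reload_table() {
  sqoop eval --connect "$JDBC_URL" --username "$MYSQL_USER" -P \
    --query "$(truncate_sql "$MYSQL_TABLE")" &&
  sqoop export --connect "$JDBC_URL" --username "$MYSQL_USER" -P \
    --table "$MYSQL_TABLE" \
    --hcatalog-database default --hcatalog-table airqualitydata
}
```

Calling `reload_table` then replaces the table contents on each run. Note that the truncate and the export are two separate steps, so if the export fails the table is left empty.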

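You can also use an update statement to update the table. https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
Now, for your header problem, I think the original table may contain a header row. I'm not sure what the data in the original table looks like. Check whether the source table is defined correctly in Hive.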

【Comments】:

  • The original data is not in the Hive table, but I can easily match and remove it with a script. That's good enough for me, thanks again! : ]
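One possible shape for such a script is sketched below. The likely reason the header shows up at all is that an HCatalog-based export reads the underlying file directly, and the skip.header.line.count table property does not seem to be applied on that path (the job counters in the question show 12 input records for 11 data rows). The function names and the paths passed in are hypothetical.

```shell
# Drop the first line (the header) from whatever arrives on stdin.
strip_header() {
  tail -n +2
}

# Copy an HDFS file to a new location without its header line.
# `hdfs dfs -put - <dst>` reads the content to upload from stdin.
rewrite_without_header() {
  local src="$1" dst="$2"
  hdfs dfs -cat "$src" | strip_header | hdfs dfs -put - "$dst"
}
```

If the header-free copy replaces the original file, the skip.header.line.count property should be removed from the Hive table as well; otherwise Hive queries would skip the first real data row.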