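【Posted】: 2020-02-21 18:31:46
【Question】:
Edit: the problem appears to be that the standard datetime library fails to convert pre-epoch datetimes to timestamps on Windows.
See the following minimal example: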
import datetime
#this works
datetime.datetime(1973,1,23,0).timestamp()
#this produces OSError: [Errno 22] Invalid argument
datetime.datetime(1953,1,23,0).timestamp()
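For comparison, here is a minimal sketch of computing the same kind of value with plain timedelta arithmetic, which does not go through the platform's C time functions. The helper name utc_timestamp is hypothetical, and it assumes a naive datetime should be read as UTC, whereas timestamp() reads a naive datetime as local time, so the two results can differ by the local UTC offset.

```python
from datetime import datetime, timezone

# Reference point for the arithmetic; timezone-aware so subtraction is unambiguous.
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def utc_timestamp(dt):
    """Return seconds since the Unix epoch for dt.

    Assumption: a naive datetime is interpreted as UTC, not as local time.
    """
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    # Pure datetime arithmetic; negative results are fine for pre-epoch dates.
    return (dt - EPOCH_UTC).total_seconds()


print(utc_timestamp(datetime(1973, 1, 23, 0)))
print(utc_timestamp(datetime(1953, 1, 23, 0)))
```

The second call returns a negative number of seconds instead of raising, because no operating-system time conversion is involved.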
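Problem
When I convert a Pandas dataframe containing datetime64[ns] dates to an Apache Spark dataframe, I get a bunch of warnings about Exception ignored in: 'pandas._libs.tslibs.tzconversion._tz_convert_tzlocal_utc' (full stack trace below), and the pre-epoch dates are changed to the epoch. Why does this happen, and how can I prevent it?
Software versions
Windows 10, Python 3.7.6, pyspark 2.4.5, pandas 1.0.1
Code to reproduce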
#imports
import pandas as pd
from datetime import datetime
from pyspark.sql import SparkSession
#set up spark
spark = SparkSession.builder.getOrCreate()
#create dataframe
df = pd.DataFrame({'Dates': [datetime(2019,3,29), datetime(1953,2,20)]})
#data types
df.dtypes
"""
Result:
Dates datetime64[ns]
dtype: object
"""
#try to convert to spark
sparkdf = spark.createDataFrame(df)
Stack trace
Exception ignored in: 'pandas._libs.tslibs.tzconversion._tz_convert_tzlocal_utc'
Traceback (most recent call last):
File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\_common.py", line 144, in fromutc
return f(self, dt)
File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\_common.py", line 258, in fromutc
dt_wall = self._fromutc(dt)
File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\_common.py", line 222, in _fromutc
dtoff = dt.utcoffset()
File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\tz.py", line 222, in utcoffset
if self._isdst(dt):
File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\tz.py", line 291, in _isdst
dstval = self._naive_is_dst(dt)
File "C:\Users\jbishop\AppData\Roaming\Python\Python37\site-packages\dateutil\tz\tz.py", line 260, in _naive_is_dst
return time.localtime(timestamp + time.timezone).tm_isdst
OSError: [Errno 22] Invalid argument
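The last frame of the trace is a call to time.localtime with a value derived from the pre-epoch timestamp. The sketch below isolates that call so it can be probed on its own. The helper name local_is_dst is hypothetical, and the assumption that the platform C library rejects negative values on Windows is inferred from the trace above, not confirmed.

```python
import time


def local_is_dst(timestamp):
    """Return time.localtime(timestamp).tm_isdst, or None if the platform rejects the value."""
    try:
        return time.localtime(timestamp).tm_isdst
    except (OSError, OverflowError, ValueError):
        # Some platforms refuse timestamps outside the range their C library supports.
        return None


pre_epoch = -534556800  # 1953-01-23 00:00:00 UTC, expressed as seconds since the epoch

print(local_is_dst(86400))      # one day after the epoch
print(local_is_dst(pre_epoch))  # may be None where negative timestamps are unsupported
```

On a platform that accepts negative timestamps, both calls return a DST flag; a None for the second call would point at the same limitation the OSError reports.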
Resulting dataframe
sparkdf.show()
+-------------------+
| Dates|
+-------------------+
|2019-03-29 00:00:00|
|1970-01-01 00:00:00|
+-------------------+
Data types
sparkdf.printSchema()
root
|-- Dates: timestamp (nullable = true)
【Discussion】:
Tags: python python-3.x pandas date apache-spark