【Question Title】: Python Pandas - Cannot Merge Multiple DataFrame returning NaN
【Posted】: 2016-09-21 09:29:25
【Question】:

I'm trying to merge multiple CSV files into one large DataFrame, joining them on the date column. However, some of the CSV files are missing dates, and those entries need to be recorded as blank or NA.

Searching around led me to believe that pandas in Python would be a viable solution.

My code is as follows:

import pandas as pd

AvgPrice = pd.read_csv('csv/BAVERAGE-USD-Bitcoin24hPrice.csv', index_col=False)
AvgPrice = AvgPrice.iloc[:,(0,1)]
AvgPrice.columns.values[1] = 'Price'

TransVol = pd.read_csv('csv/BCHAIN-ETRAV-BitcoinEstimatedTransactionVolume.csv', index_col=False)
TransVol.columns.values[1] = 'TransactionVolume'

TotalBTC = pd.read_csv('csv/BCHAIN-TOTBC-TotalBitcoins.csv', index_col=False)
TotalBTC.columns.values[1] = 'TotalBTC'

USDExchVol = pd.read_csv('csv/BCHAIN-TRVOU-BitcoinUSDExchangeTradeVolume.csv', index_col=False)
USDExchVol.columns.values[1] = 'USDExchange Volume'

df1 = pd.merge(TransVol, AvgPrice, on='Date', how='outer')
df2 = pd.merge(USDExchVol, TotalBTC, on='Date', how='outer')

df_test = pd.merge(AvgPrice, TransVol, on='Date', how='outer')

The CSV files are located here: https://drive.google.com/folderview?id=0B8xdmDmZgtJbVkhCcjZkZUhaajg&usp=sharing

The result of df_test:

            Date   Price  TransactionVolume
0     2016-05-10  459.30                NaN
1     2016-05-09  462.49                NaN
2     2016-05-08  461.85                NaN
3     2016-05-07  460.86                NaN
4     2016-05-06  453.51                NaN
5     2016-05-05  449.31                NaN

whereas df1 looks fine:

            Date  TransactionVolume   Price
0     2016-05-10           275352.0  459.30
1     2016-05-09           256585.0  462.49
2     2016-05-08           152045.0  461.85
3     2016-05-07           245115.0  460.86
4     2016-05-06           264882.0  453.51
5     2016-05-05           273005.0  449.31

I have no idea why the rightmost columns of df2 and df_test are filled with NaN. This is stopping me from merging df1 and df2 into one large DataFrame.

Any help would be greatly appreciated, as I've already spent hours on this without success.

【Discussion】:

    Tags: python csv pandas merge


    【Solution 1】:

    You have to add the parameters names and usecols to read_csv, and then it works fine:

    import pandas as pd
    
    AvgPrice = pd.read_csv('csv/BAVERAGE-USD-Bitcoin24hPrice.csv', 
                           index_col=False, 
                           parse_dates=['Date'],
                           usecols=[0,1],
                           header=0, 
                           names=['Date','Price'])
    
    TransVol = pd.read_csv('csv/BCHAIN-ETRAV-BitcoinEstimatedTransactionVolume.csv', 
                           index_col=False, 
                           parse_dates=['Date'],
                           header=0, 
                           names=['Date','TransactionVolume'])
    
    
    TotalBTC = pd.read_csv('csv/BCHAIN-TOTBC-TotalBitcoins.csv', 
                           index_col=False, 
                           parse_dates=['Date'],
                           header=0, 
                           names=['Date','TotalBTC'])
    
    
    USDExchVol = pd.read_csv('csv/BCHAIN-TRVOU-BitcoinUSDExchangeTradeVolume.csv', 
                           index_col=False,
                           parse_dates=['Date'],
                           header=0, 
                           names=['Date','USDExchange Volume'])
    
    df1 = pd.merge(TransVol, AvgPrice, on='Date', how='outer')
    df2 = pd.merge(USDExchVol, TotalBTC, on='Date', how='outer')
    df_test = pd.merge(AvgPrice, TransVol, on='Date', how='outer')
    
    print (df1.head())
    print (df2.head())
    print (df_test.head())
            Date  TransactionVolume   Price
    0 2016-05-10           275352.0  459.30
    1 2016-05-09           256585.0  462.49
    2 2016-05-08           152045.0  461.85
    3 2016-05-07           245115.0  460.86
    4 2016-05-06           264882.0  453.51
            Date  USDExchange Volume    TotalBTC
    0 2016-05-10        2.158373e+06  15529625.0
    1 2016-05-09        1.438420e+06  15525825.0
    2 2016-05-08        6.679933e+05  15521275.0
    3 2016-05-07        1.825475e+06  15517400.0
    4 2016-05-06        1.908048e+06  15513525.0
            Date   Price  TransactionVolume
    0 2016-05-10  459.30           275352.0
    1 2016-05-09  462.49           256585.0
    2 2016-05-08  461.85           152045.0
    3 2016-05-07  460.86           245115.0
    4 2016-05-06  453.51           264882.0
    

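    With the column names read in correctly, the four frames can be combined into the single large DataFrame the question asks for. A minimal sketch, using small inline frames as stand-ins for df1 and df2 above (the sample values are assumptions taken from the printed output):

```python
import pandas as pd

# Small stand-ins for df1 and df2 built in the solution above
df1 = pd.DataFrame({'Date': pd.to_datetime(['2016-05-10', '2016-05-09']),
                    'TransactionVolume': [275352.0, 256585.0],
                    'Price': [459.30, 462.49]})
df2 = pd.DataFrame({'Date': pd.to_datetime(['2016-05-10', '2016-05-08']),
                    'USDExchange Volume': [2.158373e6, 6.679933e5],
                    'TotalBTC': [15529625.0, 15521275.0]})

# An outer merge keeps every date from either frame; dates missing on one
# side get NaN, which is exactly the blank/NA behaviour the question wants
df_all = pd.merge(df1, df2, on='Date', how='outer')
print(df_all)
```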
    Edit, based on the comments:

    I think you can convert the Date column to months with to_period, and then use groupby with mean:

    print (df1.Date.dt.to_period('M'))
    0      2016-05
    1      2016-05
    2      2016-05
    3      2016-05
    4      2016-05
    5      2016-05
    6      2016-05
    7      2016-05
    ...
    ...
    print (df1.groupby( df1.Date.dt.to_period('M') ).mean() )
             TransactionVolume       Price
    Date                                  
    2011-05       1.605518e+05    7.272273
    2011-06       1.739163e+05   17.914583
    2011-07       6.647129e+04   14.100645
    2011-08       1.050460e+05   10.089677
    2011-09       9.562243e+04    5.933667
    2011-10       9.120232e+04    3.638065
    2011-11       8.927442e+05    2.690333
    2011-12       1.092328e+06    3.463871
    2012-01       1.168704e+05    6.105161
    2012-02       1.465859e+05    5.115517
    ...
    ...
    

    If the order matters, add the parameter sort=False:

    print (df1.groupby( df1.Date.dt.to_period('M') , sort=False).mean() )
             TransactionVolume       Price
    Date                                  
    2016-05       2.511146e+05  454.544000
    2016-04       2.747255e+05  435.102333
    2016-03       3.142206e+05  418.208710
    2016-02       3.402811e+05  404.091379
    2016-01       2.548778e+05  412.671935
    2015-12       3.857985e+05  423.402903
    2015-11       4.290366e+05  349.200333
    2015-10       3.134802e+05  266.007097
    2015-09       2.572308e+05  235.310345
    2015-08       2.737384e+05  253.951613
    ...
    ...
    
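    The TimeGrouper asked about in the comments is deprecated in current pandas; the same monthly mean can be written with pd.Grouper, which is equivalent to grouping on Date.dt.to_period('M'). A sketch with sample data (note the monthly frequency alias is 'ME' from pandas 2.2 onwards and 'M' before that, so the code falls back if needed):

```python
import pandas as pd

df1 = pd.DataFrame({
    'Date': pd.to_datetime(['2016-05-10', '2016-05-09', '2016-04-30']),
    'Price': [459.30, 462.49, 454.88],
})

# Group rows into month-end buckets and average each bucket
try:
    monthly = df1.groupby(pd.Grouper(key='Date', freq='ME')).mean()
except ValueError:  # older pandas only accepts 'M'
    monthly = df1.groupby(pd.Grouper(key='Date', freq='M')).mean()

print(monthly)
```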

    【Discussion】:

    • Thanks a lot :) One more off-topic thing I'm finding difficult: say I want to summarize df1 and find the average for each month. How would I write that, df1.groupby( TimeGrouper('1M') ) or similar? I'm not sure what the correct syntax is.
    • You mean year and month together, right? Or the average over all Januaries, all Februaries?
    • Sweet, exactly what I wanted. Thanks a lot!
    • Interesting. Your solution was accepted on StackOverflow, but there is no green mark. See row 11: stackoverflow.com/users/2901002/jezrael?tab=reputation. Maybe a bug. Did you accept it or not?
    【Solution 2】:

    There is a subtle bug here: you are renaming the columns by assigning directly to the column values array of each df:

    AvgPrice.columns.values[1] = 'Price'
    

    If you try TransVol.info(), it raises a KeyError on TransactionVolume.
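    The root cause is that columns.values exposes the Index's underlying NumPy array, and mutating it in place bypasses the Index API, so the rename is not applied consistently. A minimal sketch of the two approaches (the exact failure mode of the first can vary by pandas version, so only the supported path is checked):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2016-05-10'], 'Value': [275352.0]})

# Mutating the raw array behind the column Index is not a supported rename;
# depending on the pandas version, lookups, merges or info() may still see
# the old label and fail
df.columns.values[1] = 'TransactionVolume'

# A supported rename goes through the Index API and stays consistent:
df2 = pd.DataFrame({'Date': ['2016-05-10'], 'Value': [275352.0]})
df2 = df2.rename(columns={'Value': 'TransactionVolume'})
print(df2.columns.tolist())
```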

    If you use rename instead, it works:

    In [35]:
    AvgPrice = pd.read_csv(r'c:\data\BAVERAGE-USD-Bitcoin24hPrice.csv', index_col=False)
    AvgPrice = AvgPrice.iloc[:,(0,1)]
    AvgPrice.rename(columns={'24h Average':'Price'}, inplace=True)

    TransVol = pd.read_csv(r'c:\data\BCHAIN-ETRAV-BitcoinEstimatedTransactionVolume.csv', index_col=False)
    TransVol.rename(columns={'Value':'TransactionVolume'}, inplace=True)

    TotalBTC = pd.read_csv(r'c:\data\BCHAIN-TOTBC-TotalBitcoins.csv', index_col=False)
    TotalBTC.rename(columns={'Value':'TotalBTC'}, inplace=True)

    USDExchVol = pd.read_csv(r'c:\data\BCHAIN-TRVOU-BitcoinUSDExchangeTradeVolume.csv', index_col=False)
    USDExchVol.rename(columns={'Value':'USDExchange Volume'}, inplace=True)

    df1 = pd.merge(TransVol, AvgPrice, on='Date', how='outer')
    df2 = pd.merge(USDExchVol, TotalBTC, on='Date', how='outer')

    df_test = pd.merge(AvgPrice, TransVol, on='Date', how='outer')
    df_test
    Out[35]:
                Date   Price  TransactionVolume
    0     2016-05-10  459.30           275352.0
    1     2016-05-09  462.49           256585.0
    2     2016-05-08  461.85           152045.0
    3     2016-05-07  460.86           245115.0
    4     2016-05-06  453.51           264882.0
    5     2016-05-05  449.31           273005.0
    6     2016-05-04  449.32           370911.0
    7     2016-05-03  447.93           252534.0
    8     2016-05-02  448.00           249926.0
    9     2016-05-01  452.87           170791.0
    10    2016-04-30  454.88           190470.0
    11    2016-04-29  451.88           278893.0
    12    2016-04-28  445.80           329924.0
    13    2016-04-27  461.92           335750.0
    14    2016-04-26  465.91           344162.0
    15    2016-04-25  460.32           307790.0
    16    2016-04-24  455.53           188499.0
    17    2016-04-23  449.13           203792.0
    18    2016-04-22  447.73           291487.0
    19    2016-04-21  445.28           316159.0
    20    2016-04-20  438.98           302380.0
    21    2016-04-19  432.35           275994.0
    22    2016-04-18  429.76           245313.0
    23    2016-04-17  431.93           186607.0
    24    2016-04-16  432.86           200628.0
    25    2016-04-15  429.06           281389.0
    26    2016-04-14  426.21           274524.0
    27    2016-04-13  425.50           309995.0
    28    2016-04-12  426.15           341372.0
    29    2016-04-11  422.91           264357.0
    ...          ...     ...                ...
    1798  2011-05-18    7.14            80290.0
    1799  2011-05-17    7.52           138205.0
    1800  2011-05-16    7.77            62341.0
    1801  2011-05-15    6.74           272130.0
    1802  2011-05-14    7.86           656162.0
    1803  2011-05-13    7.48           324020.0
    1804  2011-05-12    5.83           101674.0
    1805  2011-05-11    5.35           114243.0
    1806  2011-05-10    4.74           104592.0
    1807  2015-09-03     NaN           256023.0
    1808  2015-02-03     NaN           213538.0
    1809  2015-01-07     NaN           256344.0
    1810  2014-11-21     NaN           161082.0
    1811  2014-10-17     NaN           142251.0
    1812  2014-09-28     NaN            92933.0
    1813  2014-09-09     NaN           111317.0
    1814  2014-08-05     NaN           136298.0
    1815  2014-08-03     NaN            49181.0
    1816  2014-08-01     NaN           166173.0
    1817  2014-06-03     NaN           124768.0
    1818  2014-06-02     NaN            87513.0
    1819  2014-05-09     NaN            80315.0
    1820  2013-10-27     NaN           107717.0
    1821  2013-09-17     NaN           137920.0
    1822  2011-06-25     NaN           110463.0
    1823  2011-06-24     NaN           106146.0
    1824  2011-06-23     NaN           475995.0
    1825  2011-06-22     NaN           122507.0
    1826  2011-06-21     NaN           114264.0
    1827  2011-06-20     NaN           836861.0
    
    [1828 rows x 3 columns]
    
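    The NaN rows at the bottom of this output are dates present in only one of the two input files, which an outer merge keeps by design. merge's indicator parameter makes it easy to check which side each row came from; a small sketch with inline data:

```python
import pandas as pd

left = pd.DataFrame({'Date': ['2016-05-10', '2016-05-09'],
                     'Price': [459.30, 462.49]})
right = pd.DataFrame({'Date': ['2016-05-10', '2011-06-20'],
                      'TransactionVolume': [275352.0, 836861.0]})

# indicator=True adds a _merge column valued 'both', 'left_only'
# or 'right_only' for each row of the result
merged = pd.merge(left, right, on='Date', how='outer', indicator=True)
print(merged[['Date', '_merge']])
```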

    【Discussion】:
