How to use PyCall in Julia to convert Python output to a Julia DataFrame

【Question title】: How to use PyCall in Julia to convert Python output to Julia DataFrame
【Posted】: 2017-07-25 18:19:58
【Question】:

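I would like to retrieve some data from quandl and analyse it in Julia. Unfortunately, there is no official API available for this (yet). I am aware of this solution, but it is still quite limited in functionality and does not follow the same syntax as the original Python API.

I thought it would be smart to use PyCall to retrieve the data through the official Python API from within Julia. This does yield an output, but I am not sure how to convert it to a format I can use within Julia (ideally a DataFrame).

I have tried the following.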

using PyCall, DataFrames
@pyimport quandl

data = quandl.get("WIKI/AAPL", returns = "pandas");

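Julia converts this output to a Dict{Any,Any}. When using returns = "numpy" instead of returns = "pandas", I end up with a PyObject rec.array.

How can I get data to be a Julia DataFrame, as quandl.jl would return it? Note that quandl.jl is not an option for me because it does not support automatic retrieval of multiple assets and lacks several other features, so I have to use the Python API.

Thank you for any suggestions!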

【Comments on the question】:

Tags: python dataframe julia quandl

【Solution 1】:

Here is one option:

First, extract the column names from your data object:

julia> colnames = map(Symbol, data[:columns])
12-element Array{Symbol,1}:
 :Open                
 :High                
 :Low                 
 :Close               
 :Volume              
 Symbol("Ex-Dividend")
 Symbol("Split Ratio")
 Symbol("Adj. Open")  
 Symbol("Adj. High")  
 Symbol("Adj. Low")   
 Symbol("Adj. Close") 
 Symbol("Adj. Volume")
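
Several of these names are printed as Symbol("...") because they contain spaces, dots, or hyphens and therefore cannot be written as :name literals. A minimal sketch in plain Julia, using a hypothetical raw_names vector as a stand-in for data[:columns]:

```julia
# Hypothetical stand-in for the column names returned by data[:columns].
raw_names = ["Open", "Ex-Dividend", "Adj. Close"]

# Symbol(...) accepts any string, including ones that are not valid identifiers.
colsyms = map(Symbol, raw_names)

# A name that is a valid identifier equals the corresponding :literal.
colsyms[1] == :Open                    # true
# Other names can only be written with the Symbol("...") constructor.
colsyms[2] == Symbol("Ex-Dividend")    # true
```

This is why the later steps index columns with the symbols stored in colnames rather than with hand-written literals.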

Then pour all your columns into a DataFrame:

julia> y = DataFrame(Any[Array(data[c]) for c in colnames], colnames)

6×12 DataFrames.DataFrame
│ Row │ Open  │ High  │ Low   │ Close │ Volume   │ Ex-Dividend │ Split Ratio │
├─────┼───────┼───────┼───────┼───────┼──────────┼─────────────┼─────────────┤
│ 1   │ 28.75 │ 28.87 │ 28.75 │ 28.75 │ 2.0939e6 │ 0.0         │ 1.0         │
│ 2   │ 27.38 │ 27.38 │ 27.25 │ 27.25 │ 785200.0 │ 0.0         │ 1.0         │
│ 3   │ 25.37 │ 25.37 │ 25.25 │ 25.25 │ 472000.0 │ 0.0         │ 1.0         │
│ 4   │ 25.87 │ 26.0  │ 25.87 │ 25.87 │ 385900.0 │ 0.0         │ 1.0         │
│ 5   │ 26.63 │ 26.75 │ 26.63 │ 26.63 │ 327900.0 │ 0.0         │ 1.0         │
│ 6   │ 28.25 │ 28.38 │ 28.25 │ 28.25 │ 217100.0 │ 0.0         │ 1.0         │

│ Row │ Adj. Open │ Adj. High │ Adj. Low │ Adj. Close │ Adj. Volume │
├─────┼───────────┼───────────┼──────────┼────────────┼─────────────┤
│ 1   │ 0.428364  │ 0.430152  │ 0.428364 │ 0.428364   │ 1.17258e8   │
│ 2   │ 0.407952  │ 0.407952  │ 0.406015 │ 0.406015   │ 4.39712e7   │
│ 3   │ 0.378004  │ 0.378004  │ 0.376216 │ 0.376216   │ 2.6432e7    │
│ 4   │ 0.385453  │ 0.38739   │ 0.385453 │ 0.385453   │ 2.16104e7   │
│ 5   │ 0.396777  │ 0.398565  │ 0.396777 │ 0.396777   │ 1.83624e7   │
│ 6   │ 0.420914  │ 0.422851  │ 0.420914 │ 0.420914   │ 1.21576e7   │

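Thanks to @Matt B. for the suggestions to simplify the code.

The problem with the above is that the column types inside the data frame are Any. To make it more efficient, here are a few functions that get the job done: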

# first, guess the Julia equivalent of type of the object
function guess_type(x::PyCall.PyObject)
  string_dtype = x[:dtype][:name]
  julia_string = string(uppercase(string_dtype[1]), string_dtype[2:end])

  return eval(parse("$julia_string"))
end

# convert an individual column, falling back to an Any array if the guess was wrong
function convert_column(x)
  y = try
    Array{guess_type(x)}(x)
  catch
    Array(x)
  end
  return y
end

# put everything together into a single function
function convert_pandas(df)
  # read the column names from the argument df, not from the global data
  colnames = map(Symbol, df[:columns])
  y = DataFrame(Any[convert_column(df[c]) for c in colnames], colnames)

  return y
end

The above, when applied to your data, gives the same column names as before, but with the correct Float64 column types:

y = convert_pandas(data);
showcols(y)
9147×12 DataFrames.DataFrame
│ Col # │ Name        │ Eltype  │ Missing │
├───────┼─────────────┼─────────┼─────────┤
│ 1     │ Open        │ Float64 │ 0       │
│ 2     │ High        │ Float64 │ 0       │
│ 3     │ Low         │ Float64 │ 0       │
│ 4     │ Close       │ Float64 │ 0       │
│ 5     │ Volume      │ Float64 │ 0       │
│ 6     │ Ex-Dividend │ Float64 │ 0       │
│ 7     │ Split Ratio │ Float64 │ 0       │
│ 8     │ Adj. Open   │ Float64 │ 0       │
│ 9     │ Adj. High   │ Float64 │ 0       │
│ 10    │ Adj. Low    │ Float64 │ 0       │
│ 11    │ Adj. Close  │ Float64 │ 0       │
│ 12    │ Adj. Volume │ Float64 │ 0       │

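【Comments】:

It seems you cannot index into a Dict{Any,Any} object using symbols. I tried using strings instead; I think this may have changed in recent versions, but the type conversion should work once I figure that out. Array(data[colname]) returns MethodError: Cannot convert an object of type Dict{Any,Any} to an object of type Array{T,N}. I am using version 0.5.0.
I did convert the column names to symbols, using colnames = map(x -> Symbol(String(x)), data[:columns][:values]). Did you do that too? It works fine on my machine, and I am also on 0.5.
What versions of PyCall and DataFrames are you using? This should work just fine. It can be a little simpler and also add the column names: cols = map(Symbol, data[:columns]); DataFrame(Any[Array(data[c]) for c in cols], cols)
@Constantin I edited the answer using the suggestions above and added functions to make the conversion simpler.
map(Symbol, data[:columns]) returns ERROR: KeyError: key :columns not found. I tried map(Symbol, keys(data)), which returns an array of Symbols that I cannot use to index the Dict. I tried colnames = keys(data) to index the Dict with strings instead of symbols, but that again gives me the MethodError above. I am puzzled why this works on your machine but not on mine. Do you have any additional packages loaded?

【Solution 2】:

You are running into a difference between Python/Pandas versions. I happen to have two configurations readily available: Pandas 0.18.0 in Python 2 and Pandas 0.19.1 in Python 3. The answer @niczky12 provided works well in the first configuration, but I see your Dict{Any,Any} behavior in the second. Basically, something changed between those two configurations such that PyCall detects a mapping-like interface on Pandas objects and then exposes that interface as a dictionary through an automatic conversion. There are two options here:

Work with the dictionary interface: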

data = quandl.get("WIKI/AAPL", returns = "pandas")
cols = keys(data)
df = DataFrame(Any[collect(values(data[c])) for c in cols], map(Symbol, cols))

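Explicitly disable the auto-conversion and use the PyCall interface to extract the columns, as niczky12 demonstrated in the other answer. Note that data[:Open] will auto-convert to a mapped dictionary, whereas data["Open"] will just return a PyObject.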
```
data = pycall(quandl.get, PyObject, "WIKI/AAPL", returns = "pandas")
cols = data[:columns]
df = DataFrame(Any[Array(data[c]) for c in cols], map(Symbol, cols))
```

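Note, though, that in both cases the all-important date index is not included in the resulting data frame. You almost certainly want to add it as a column: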

df[:Date] = collect(data[:index])

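【Comments】:

There is actually a third option, too: disable the auto-conversion and then wrap the PyObject with the Pandas.jl library. That will not get you into the JuliaStats ecosystem, but it makes it easier to use the Pandas analysis functions from within Julia.
Good spot! I had a feeling the problem was due to Python 2 vs. Python 3. Also, I agree that the date index should be included.

【Solution 3】:

There is an API. Just use Quandl.jl: https://github.com/milktrader/Quandl.jl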

using Quandl
data = quandlget("WIKI/AAPL")

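This has the additional advantage of returning the data in a useful Julia format (TimeArray) that has appropriate methods defined for working with such data.

【Comments】:

Thanks for your answer. I am aware of this option (see my question), but it is an unofficial API and its functionality is very limited. For example, I did not manage to selectively retrieve multiple series and return them in a date-matched DataFrame object. The official Python API supports this, so I would like to build a wrapper for it rather than use the unofficial API.
Oh, I did not realize that was what you linked to. OK, then I hope you can make use of the competent advice you got in the other answers.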