【Question title】: Pandas JSON Nesting
【Posted】: 2018-05-05 23:13:48
【Question】:

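I currently have a table scraped from a URL in a Pandas DataFrame. The goal is to emit nested JSON, and with groupby() and a lambda function I have almost got what I am looking for. I have been learning this as I go, so the code may not be great.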

{
"Field (Discovery)": "33/9-6 DELTA",
"NPDID information carrier": 44576,
"MonthlyProduction": [
  {
    "yyyymm": "2009.07.0",
    "Oil - saleable [mill Sm3]": 0.00025,
    "Gas - saleable [bill Sm3]": 0,
    "NGL - saleable [mill Sm3]": -0.00004,
    "Condensate - saleable [mill Sm3]": 0,
    "Oil equivalents - saleable [mill Sm3]": 0.00021,
    "Water - wellbores [mill Sm3]": 0.00051
  }

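What I am looking for is how to break the nested part of the JSON down further, so that the columns below "yyyymm" and their values are nested as follows: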

{
"Field (Discovery)": "33/9-6 DELTA",
"NPDID information carrier": 44576,
"MonthlyProduction": [
  {
    "yyyymm": "2009.07.0",
    "Oil - saleable": [
        {
          "Value": 0.00025,
          "Unit": "mill Sm3"
        }
    ],
    "Gas - saleable": [
        {
          "Value": 0,
          "Unit": "bill Sm3"
        }
    ],
    "NGL - saleable": -0.00004, etc
    "Condensate - saleable [mill Sm3]": 0, etc

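The transformation asked for here comes down to pulling the unit out of the square brackets in each column header. A minimal sketch of that single step, assuming every production column is named like `name [unit]`; `split_header` is a hypothetical helper and not part of the original code:

```python
import re

# Hypothetical helper: "Oil - saleable [mill Sm3]" -> ("Oil - saleable", "mill Sm3").
# Assumes the unit is the bracketed part at the end of the header.
_HEADER = re.compile(r"^(?P<name>.*?)\s*\[(?P<unit>[^\]]*)\]\s*$")

def split_header(header):
    match = _HEADER.match(header)
    if match is None:
        # No bracketed unit (e.g. "yyyymm"): return the header unchanged.
        return header, None
    return match.group("name"), match.group("unit")

name, unit = split_header("Oil - saleable [mill Sm3]")
# One nested entry in the shape the question asks for.
nested = {name: [{"Value": 0.00025, "Unit": unit}]}
print(nested)
```
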
Code:

import requests
from bs4 import BeautifulSoup
import json
from datetime import datetime as dt
import datetime
import pandas as pd

starttime = dt.now()

#Agent detail to prevent scraping bot detection
user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36')
header = {'User-Agent' : user_agent }


# Webpage connection
html = ("http://factpages.npd.no/ReportServer?/FactPages/TableView/"
        "field_production_monthly&rs:Command=Render&rc:Toolbar=false&"
        "rc:Parameters=f&Top100=False&IpAddress=108.171.128.174&CultureCode=en")

r=requests.get(html, headers=header)
c=r.content
soup=BeautifulSoup(c,"html.parser")

table = soup.find('table', attrs={'class':'a133'})

#Pandas dataframe 
df = pd.read_html(str(table), header=0)[0]
df['yyyymm'] = df['Year'].map(str)+df['Month'].map(str)
#df['NPDID information carrier'].astype(int)

df.info()

result = (df.groupby(["Field (Discovery)", "NPDID information carrier"],
                     as_index=False)
            .apply(lambda x: x[['yyyymm',
                                'Oil - saleable [mill Sm3]',
                                'Gas - saleable [bill Sm3]',
                                'NGL - saleable [mill Sm3]',
                                'Condensate - saleable [mill Sm3]',
                                'Oil equivalents - saleable [mill Sm3]',
                                'Water - wellbores [mill Sm3]']].to_dict('records'))
            .reset_index()
            .rename(columns={0: 'MonthlyProduction'})
            .to_json(orient='records'))

#print(result)
#print(json.dumps(json.loads(result), indent=2, sort_keys=True))

#Time
runtime = dt.now() - starttime
print(runtime)
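
A side note on the `yyyymm` column: a value such as "2009.07.0" is what plain string concatenation produces when `Year` and `Month` hold floats ("2009.0" + "7.0"). Whether the scraped table really parses them as floats is an assumption here. If it does, and a compact key is wanted, a zero-padded version could be built along these lines, shown on a small made-up frame:

```python
import pandas as pd

# Made-up sample; assumes Year and Month arrive as float columns.
sample = pd.DataFrame({"Year": [2009.0, 2009.0], "Month": [7.0, 12.0]})

# What the original concatenation yields for float columns.
naive = sample["Year"].map(str) + sample["Month"].map(str)

# Cast to int first, then zero-pad the month to two digits.
padded = (sample["Year"].astype(int).astype(str)
          + sample["Month"].astype(int).astype(str).str.zfill(2))

print(naive.tolist())
print(padded.tolist())
```

The cast to `int` would fail on missing values, so rows with a missing year or month would need handling first.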

【Question comments】:

    Tags: python json pandas nested pandas-groupby


    【Solution 1】:

    I think you need:

    # define column names
    c1 = ["Field (Discovery)", "NPDID information carrier"]
    c2 = ['Oil - saleable [mill Sm3]',
          'Gas - saleable [bill Sm3]',
          'NGL - saleable [mill Sm3]',
          'Condensate - saleable [mill Sm3]',
          'Oil equivalents - saleable [mill Sm3]',
          'Water - wellbores [mill Sm3]']

    # turn each value into a ({'Unit': ...}, {'Value': ...}) pair
    def f(x):
        # the unit is the text inside the square brackets of the column name
        a = x.name.split('[')[1].strip(']')
        return list(zip([{'Unit': a}]*len(x), x))

    # note: applymap is deprecated since pandas 2.1 in favour of DataFrame.map
    df[c2] = df[c2].applymap(lambda x: {'Value': x}).apply(f)

    # rename columns to drop the trailing `[unit]`
    d = dict(zip(df[c2].columns, df[c2].columns.str.split(r'\s+\[').str[0]))
    df = df.rename(columns=d)

    # a slightly improved version of your solution
    j = (df.groupby(c1)
           .apply(lambda x: x[['yyyymm'] + list(d.values())].to_dict('records'))
           .reset_index(name='MonthlyProduction')
           .to_json(orient='records'))
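
To see what `f` builds in this first version, it may help to run the same steps on a single column. The sketch below uses a made-up two-row Series named like one of the scraped headers; only the shape of the result matters:

```python
import pandas as pd

# Made-up two-row column named like the scraped table's headers.
col = pd.Series([0.00025, 0.0], name="Oil - saleable [mill Sm3]")

unit = col.name.split('[')[1].strip(']')       # "mill Sm3"
wrapped = col.map(lambda v: {'Value': v})       # each value -> {'Value': v}
pairs = list(zip([{'Unit': unit}] * len(wrapped), wrapped))

# Each cell is a 2-tuple of dicts: one holding the unit, one holding the value.
print(pairs[0])
```

Because each cell is a pair of separate dicts, the JSON output carries `Unit` and `Value` as two objects in an array rather than as one object.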
    

    Edit:

    def f(x):
        # the unit is the text inside the square brackets of the column name
        a = x.name.split('[')[1].strip(']')
        # one dict per value, with unit and value side by side
        return [{'Unit': a, 'Value': i} for i in x]

    df[c2] = df[c2].apply(f)

    # rename columns to drop the trailing `[unit]`
    d = dict(zip(df[c2].columns, df[c2].columns.str.split(r'\s+\[').str[0]))
    df = df.rename(columns=d)
    #print (df.head())


    # a slightly improved version of your solution
    j = (df.groupby(c1)
           .apply(lambda x: x[['yyyymm'] + list(d.values())].to_dict('records'))
           .reset_index(name='MonthlyProduction')
           .to_json(orient='records'))
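
For trying the merged `Unit`/`Value` shape without scraping anything, here is a self-contained sketch on made-up rows. It swaps `groupby().apply()` for a plain Python loop over the groups, which is a different technique from the answer above; the names `nest`, `keys` and `values` are hypothetical:

```python
import json
import pandas as pd

# Made-up rows standing in for the scraped table.
df = pd.DataFrame({
    "Field (Discovery)": ["33/9-6 DELTA", "33/9-6 DELTA"],
    "NPDID information carrier": [44576, 44576],
    "yyyymm": ["2009.07.0", "2009.08.0"],
    "Oil - saleable [mill Sm3]": [0.00025, 0.0],
    "Gas - saleable [bill Sm3]": [0.0, 0.0],
})
keys = ["Field (Discovery)", "NPDID information carrier"]
values = ["Oil - saleable [mill Sm3]", "Gas - saleable [bill Sm3]"]

def nest(frame):
    out = []
    for _, group in frame.groupby(keys, sort=False):
        rows = group.to_dict("records")
        months = []
        for row in rows:
            month = {"yyyymm": row["yyyymm"]}
            for col in values:
                # Assumes headers look like "name [unit]".
                name, unit = col.split(" [")
                month[name] = [{"Unit": unit.rstrip("]"), "Value": row[col]}]
            months.append(month)
        # The key columns are constant within a group, so take them from row 0.
        record = {k: rows[0][k] for k in keys}
        record["MonthlyProduction"] = months
        out.append(record)
    return out

nested = nest(df)
print(json.dumps(nested, indent=2))
```

The loop keeps the DataFrame free of dict-valued cells, which can make the intermediate steps easier to inspect, at the cost of being slower than a vectorised approach on large tables.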
    

    【Comments】:

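    • Super, glad I could help!
    • I have been using it and now understand how it works. Is there an easy way to change the lambda that returns the list with "Unit" and appends "Value", so that instead of "MonthlyProduction": [{"yyyymm": "2009.03.0", "Oil - saleable": [ { "Unit": "mill Sm3" }, {"Value": 80 } ] it would be "MonthlyProduction": [{"yyyymm": "2009.03.0", "Oil - saleable": [{"unit":"mill Sm3","value":"80"}
    • Yes, sure. Just change list(zip([{'Unit': a}]*len(x),x)) to list(zip([{'unit': a}]*len(x),x)), and df[c2].applymap(lambda x: {'Value': x}).apply(f) to df[c2].applymap(lambda x: {'value': x}).apply(f)
    • Sorry, I was not very clear. I meant the dictionaries themselves: going from [ { "Unit": "mill Sm3" }, {"Value": 80 } ] to [{"Unit":"mill Sm3","Value":"80"}, so that "Unit" and "Value" sit in the same dictionary. My description may not be technically correct. I guess it happens somewhere around applymap(lambda x: {'Value': x}).apply(f)
    • I think it works well now, please check the edited answer.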