【Question title】: Python: Saving AJAX response data to .json and saving it to a pandas DataFrame
【Posted】: 2019-03-17 16:42:30
【Question description】:

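Hello, and thank you for taking the time to read this.

I want to extract company information from a particular stock exchange and save it to a pandas DataFrame. Each company has its own web page, and each page is identified by the "KodeEmiten" value at the end of its URL. These codes are stored in a column of the first DataFrame: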
df = pd.DataFrame.from_dict(data['data'])

My goal now is to use these codes to call each company's page individually and create a JSON file for each one.

for i in range (len(df)): 
 requests.get(f'https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfilesDetail?emitenType=&kodeEmiten={df.loc[i, "KodeEmiten"]}').json()  

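Although this works, I cannot save the results to a new DataFrame: I get "list index out of range" and incorrect-keyword errors. The XHR response contains far more information than I actually need, and I think the differing structures cause the errors when I try to save them to a new DataFrame. I am really only interested in the data under these XHR headers:
AnakPerusahaan, Direktur, Komisaris, PemegangSaham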

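So my question is a bit of a two-in-one:
a) How can I extract the information from just those XHR headers (they are all tables)?
b) How can I save them to a new DataFrame (or even a list, I don't mind)?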

import requests
import pandas as pd
import json
import time

# gets broad data of main page of the stock exchange
sxow = requests.get('https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfiles?draw=1&columns%5B0%5D%5Bdata%5D=KodeEmiten&columns%5B0%5D%5Bname%5D&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=KodeEmiten&columns%5B1%5D%5Bname%5D&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=false&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=NamaEmiten&columns%5B2%5D%5Bname%5D&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=false&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=TanggalPencatatan&columns%5B3%5D%5Bname%5D&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=false&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&start=0&length=700&search%5Bvalue%5D&search%5Bregex%5D=false&_=155082600847')

data = sxow.json() # parse the JSON response body into a Python dict
df = pd.DataFrame.from_dict(data['data']) # create a DataFrame from the list of records under 'data'


# add: compare file contents and overwrite original if same

cdate = time.strftime ("%Y%m%d") # creating string-variable w/ current date year|month|day
df.to_excel(f"{cdate}StockExchange_Overview.xlsx") # converts DataFrame to Excel file, can't overwrite existing file


for i in range (len(df)) :
    requests.get(f'https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfilesDetail?emitenType=&kodeEmiten={df.loc[i, "KodeEmiten"]}').json()

#This is where I'm completely stuck

【Question discussion】:

    Tags: python json pandas dataframe


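    【Solution 1】:

    You don't need to convert the result to a DataFrame. You can simply iterate over the JSON object and concatenate each code onto the URL to fetch the details for each company's page.

    Use the code below: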

    import requests
    import pandas as pd
    import json
    import time
    
    # gets broad data of main page of the stock exchange
    sxow = requests.get('https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfiles?draw=1&columns%5B0%5D%5Bdata%5D=KodeEmiten&columns%5B0%5D%5Bname%5D&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=KodeEmiten&columns%5B1%5D%5Bname%5D&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=false&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=NamaEmiten&columns%5B2%5D%5Bname%5D&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=false&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=TanggalPencatatan&columns%5B3%5D%5Bname%5D&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=false&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&start=0&length=700&search%5Bvalue%5D&search%5Bregex%5D=false&_=155082600847')
    
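    data = sxow.json() # parse the JSON response body into a Python dict
    
    list_of_json = []
    # data['data'] is a list of records, one per listed company
    for nested_json in data['data']:
        # request the detail endpoint for this company's KodeEmiten and parse the response
        list_of_json.append(requests.get('https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfilesDetail?emitenType=&kodeEmiten='+nested_json['KodeEmiten']).json())
        time.sleep(1) # pause between requests to avoid hammering the server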
    

    list_of_json will contain all the JSON results you requested.

    Here nested_json is the loop variable used to iterate over the JSON array, one entry per KodeEmiten.
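
The question also asked how to pull out only the four tables of interest. The following is a minimal sketch of one way to do that once list_of_json exists. It assumes that each of the four keys, when present, maps to a list of dicts; the sample data and the "Nama" field are made up for illustration, and collect_tables is a hypothetical helper, not part of the original answer. Using .get() with a fallback avoids the key and index errors mentioned in the question when a company lacks one of the tables.

```python
# Keys of interest, taken from the question; everything else in the
# detail response is ignored.
WANTED_KEYS = ["AnakPerusahaan", "Direktur", "Komisaris", "PemegangSaham"]


def collect_tables(list_of_json, codes):
    """Return {key: [row, ...]} with one flat row dict per table entry.

    Assumes each wanted key, when present, maps to a list of dicts.
    Missing, empty, or None values are skipped instead of raising errors.
    """
    tables = {key: [] for key in WANTED_KEYS}
    for code, detail in zip(codes, list_of_json):
        for key in WANTED_KEYS:
            for row in detail.get(key) or []:
                # Tag every row with the company code so it stays traceable.
                tables[key].append({"KodeEmiten": code, **row})
    return tables


# Hypothetical sample responses (the field names inside the rows are made up).
sample_codes = ["AAAA", "BBBB"]
sample_details = [
    {"Direktur": [{"Nama": "Person A"}, {"Nama": "Person B"}], "Komisaris": []},
    {"Direktur": [{"Nama": "Person C"}], "PemegangSaham": None},
]

tables = collect_tables(sample_details, sample_codes)
print(len(tables["Direktur"]))  # 3
```

With the real data, codes could be built as [n['KodeEmiten'] for n in data['data']], and each list of rows can be handed to pd.DataFrame(tables['Direktur']) to get one DataFrame per table.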

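    【Discussion】:

    • I don't quite understand how it works, but it works great, thank you very much! But what exactly is each_json, since it isn't defined anywhere?
    • I have edited the code and added some comments. nested_json is a loop variable. Please accept the answer, since it solved your problem.
    • @bigbounty Could you please help with the following question: stackoverflow.com/questions/54865312/…
    【Solution 2】: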

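    This is a slight improvement on @bigbounty's approach:
    Since the goal is to save the information to a list and then use that list further in the script, a list comprehension is actually a bit faster.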

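    url = 'https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfilesDetail?emitenType=&kodeEmiten='
    list_of_json = [requests.get(url + nested_json["KodeEmiten"]).json() for nested_json in data["data"]]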
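
Neither answer covers the part of the question about creating one JSON file per company. A minimal sketch of that step follows, assuming the parsed responses have been paired with their codes in a dict. The helper save_details, the sample data, and the file naming scheme are hypothetical, and the sketch assumes the codes are safe to use as file names.

```python
import json
import tempfile
from pathlib import Path


def save_details(details_by_code, out_dir):
    """Write one <code>.json file per company and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for code, detail in details_by_code.items():
        path = out_dir / f"{code}.json"
        with path.open("w", encoding="utf-8") as fh:
            # ensure_ascii=False keeps non-ASCII names readable in the file.
            json.dump(detail, fh, ensure_ascii=False, indent=2)
        paths.append(path)
    return paths


# Hypothetical stand-in for the real data, e.g. dict(zip(codes, list_of_json)).
sample = {"AAAA": {"Direktur": [{"Nama": "Person A"}]}, "BBBB": {"Direktur": []}}

with tempfile.TemporaryDirectory() as tmp:
    written = save_details(sample, tmp)
    written_names = [p.name for p in written]
    # Read the first file back to confirm it round-trips.
    reloaded = json.loads(written[0].read_text(encoding="utf-8"))
```

In a real script the temporary directory would be replaced by a permanent output folder.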
    

    【Discussion】:
