【Title】: BeautifulSoup and writing to JSON file
【Posted】: 2018-10-04 18:16:19
【Question】:

I'm scraping some data with BeautifulSoup and want to write that data to a JSON file. I've managed to write a script that saves data to a JSON file, but it only saves the last item on the page instead of iterating over all the results. It does print every result in the terminal, so I'm not sure what I'm missing. Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

otl_url = 'https://open.umn.edu/opentextbooks/SearchResults.aspx?subjectAreaId=99'

#opening up connection and grabbing page
uClient = urlopen(otl_url)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, "html.parser")

#grabs info for each textbook
containers = page_soup.findAll("div",{"class":"twothird"})

data = {}
for container in containers:
   data['title'] = container.h2.text 
   data['author'] = container.p.text
   data['link'] = "https://open.umn.edu/opentextbooks/" + container.h2.a["href"]

   print("title: " + data['title'])
   print("author: " + data['author'])
   print("link: " + data['link'])

with open("textbooks.json", "w") as writeJSON:
   json.dump(data, writeJSON, ensure_ascii=False)

【Comments】:

    标签: python json beautifulsoup


    【Solution 1】:

    You're storing the data in a dict, which can only hold one value per key. If you want to store multiple items, you need a list, for example:

    data = []
    for container in containers:
       data.append({"title": container.h2.text, "author": container.p.text,
                    "link": "https://open.umn.edu/opentextbooks/" + container.h2.a["href"]})
    
    with open("textbooks.json", "w") as writeJSON:
       json.dump(data, writeJSON, ensure_ascii=False)
    
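A quick way to check this fix: the file now contains a JSON array of objects, and reading it back returns every record. A minimal sketch, using hypothetical sample records in place of the scraped containers:

```python
import json

# Hypothetical sample records mimicking the scraped structure
data = [
    {"title": "Book One", "author": "A. Author",
     "link": "https://open.umn.edu/opentextbooks/1"},
    {"title": "Book Two", "author": "B. Writer",
     "link": "https://open.umn.edu/opentextbooks/2"},
]

with open("textbooks.json", "w") as f:
    json.dump(data, f, ensure_ascii=False)

# Reading the file back yields the full list, not just the last item
with open("textbooks.json") as f:
    loaded = json.load(f)

print(len(loaded))         # 2
print(loaded[0]["title"])  # Book One
```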

    【Comments】:

      【Solution 2】:

      These lines in your for loop:

      data['title'] = container.h2.text 
      data['author'] = container.p.text
      data['link'] = "https://open.umn.edu/opentextbooks/" + container.h2.a["href"]
      

      overwrite the dictionary's values on every iteration of the loop. I suggest you make the values lists, like this:

      data['title'] = []
      data['author'] = []
      data['link'] = []
      

      Then, inside your for loop, use:

      data["title"].append(container.h2.text)
      data["author"].append(container.p.text)
      data["link"].append("https://open.umn.edu/opentextbooks/" + container.h2.a["href"])
      

      This will save every container that was found, and you should see all of them in the JSON file.
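Putting those two snippets together, this approach produces one JSON object with three parallel lists. A minimal sketch, with hypothetical plain dicts standing in for the BeautifulSoup containers:

```python
import json

# Hypothetical stand-ins for the scraped BeautifulSoup containers
containers = [
    {"title": "Book One", "author": "A. Author", "href": "BookDetail.aspx?bookId=1"},
    {"title": "Book Two", "author": "B. Writer", "href": "BookDetail.aspx?bookId=2"},
]

# Initialize each key as an empty list once, before the loop
data = {"title": [], "author": [], "link": []}
for c in containers:
    data["title"].append(c["title"])
    data["author"].append(c["author"])
    data["link"].append("https://open.umn.edu/opentextbooks/" + c["href"])

with open("textbooks_columns.json", "w") as f:
    json.dump(data, f, ensure_ascii=False)

print(data["title"])  # ['Book One', 'Book Two']
```

Note that this stores the results column-wise (one list per field), whereas Solution 1 stores them row-wise (one dict per textbook); either shape round-trips through JSON.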

      Hope this helps!

      【Comments】:

        【Solution 3】:

        This happens because you overwrite the same keys of the data dict on every iteration of the loop, so only the last item survives. You probably want something more like this:

        data = [] # create a list to store the items
        for container in containers:
            item = {}
            item['title'] = container.h2.text
            item['author'] = container.p.text
            item['link'] = "https://open.umn.edu/opentextbooks/" + container.h2.a["href"]
            data.append(item) # add the item to the list
        
            print("title: " + item['title'])
            print("author: " + item['author'])
            print("link: " + item['link'])
        
        with open("textbooks.json", "w") as writeJSON:
            json.dump(data, writeJSON, ensure_ascii=False)
        
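The root cause can be shown in a few lines: reassigning the same keys on one dict discards the previous values, while appending a fresh dict per iteration keeps every item. A minimal illustration with made-up titles:

```python
# Reassigning the same key on a single dict overwrites it each iteration
data = {}
for title in ["Book One", "Book Two"]:
    data["title"] = title
print(data)   # {'title': 'Book Two'}  -- only the last item remains

# Creating a fresh dict per iteration and appending preserves every item
items = []
for title in ["Book One", "Book Two"]:
    items.append({"title": title})
print(items)  # [{'title': 'Book One'}, {'title': 'Book Two'}]
```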

        【Comments】:
