Parsing columns with BeautifulSoup and saving as JSON

【Question title】: Parsing columns with BeautifulSoup and saving as JSON
【Posted】: 2016-08-08 10:06:07
【Question】:

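I want to parse the Afk., Aantal and Zetels columns from this website: http://www.nlverkiezingen.com/TK2012.html, so that I can eventually save them as a JSON file.

Before I can save them as a JSON file, I need to parse the elements.

This is what I have: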

from bs4 import BeautifulSoup
import urllib

jaren = [str("2010"), str("2012")]

for Jaargetal in jaren:
    r = urllib.urlopen("http://www.nlverkiezingen.com/TK" + Jaargetal +".html").read()
    soup = BeautifulSoup(r, "html.parser")
    tables = soup.find_all("table")

    for table in tables:
        header = soup.find_all("h1")[0].getText()
        print header

        trs = table.find_all("tr")[0].getText()
        print '\n'
        for tr in table.find_all("tr"): 
              print "|".join([x.get_text().replace('\n','') for x in tr.find_all('td')])

This is what I tried:

from bs4 import BeautifulSoup
import urllib

jaren = [str("2010"), str("2012")]

for Jaargetal in jaren:
    r = urllib.urlopen("http://www.nlverkiezingen.com/TK" + Jaargetal +".html").read()
    soup = BeautifulSoup(r, "html.parser")
    tables = soup.find_all("table")

    for table in tables:
        header = soup.find_all("h1")[0].getText()
        print header

        for tr in  table.find_all("tr"):
            firstTd = tr.find("td")
            if firstTd and firstTd.has_attr("class") and "l" in firstTd['class']:
                tds = tr.find_all("td")

                for tr in table.find_all("tr"): 
                    print "|".join([x.get_text().replace('\n','') for x in tr.find_all('td')])
                    break

What am I doing wrong, or what do I need to do? Am I on the right track?

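【Comments】:

Could you point out what exactly is wrong with the existing code? Thanks.
@alecxe The first snippet prints every column of each row: Partij|Afk.|Aantal|%|+/-|Zetels. I want the code to print only the Afk., Aantal and Zetels columns.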

Tags: python html json beautifulsoup bs4

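【Solution 1】:

One option for extracting only the desired columns is to check each column's index. Define the indexes of the columns you are interested in: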

DESIRED_COLUMNS = {1, 2, 5}  # it is a set

Then use enumerate() together with find_all():

"|".join([x.get_text().replace('\n', '') 
          for index, x in enumerate(tr.find_all('td')) 
          if index in DESIRED_COLUMNS])
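
To show how the index filter fits into the full task (parse, keep three columns, serialize to JSON), here is a minimal Python 3 sketch. It runs on an inline sample table instead of the live page; the markup, the party names and the helper names extract_columns and rows_to_records are assumptions made for illustration, and the real page's structure may differ.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page's structure and values may differ.
SAMPLE_HTML = """
<table>
  <tr><td>Partij</td><td>Afk.</td><td>Aantal</td><td>%</td><td>+/-</td><td>Zetels</td></tr>
  <tr><td>Party One</td><td>P1</td><td>1000</td><td>50.0</td><td>+1</td><td>10</td></tr>
  <tr><td>Party Two</td><td>P2</td><td>500</td><td>25.0</td><td>-1</td><td>5</td></tr>
</table>
"""

DESIRED_COLUMNS = {1, 2, 5}  # indexes of Afk., Aantal and Zetels


def extract_columns(html, wanted=DESIRED_COLUMNS):
    """Return one list per table row, keeping only the wanted cell indexes."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text().replace("\n", "").strip()
                 for index, td in enumerate(tr.find_all("td"))
                 if index in wanted]
        if cells:  # skip rows that contain no matching <td>
            rows.append(cells)
    return rows


def rows_to_records(rows):
    """Treat the first row as the header and map each later row onto it."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


records = rows_to_records(extract_columns(SAMPLE_HTML))
print(json.dumps(records, indent=2))
```

To write a file instead of printing, json.dump(records, f) on a file opened in text mode should work the same way. The values stay strings here; converting Aantal and Zetels to integers is left out because the number formatting on the real page is not known.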

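【Comments】:

Thanks @Alecxe. Is there also an option to define something so that I get, for example, only the first 3 rows of Afk., Aantal and Zetels?
@Danisk You can always slice, for example: tr.find_all('td')[:3]
Thanks for your help! One more short question: this works now; the only problem is that I am scraping both 2012 and 2010. The 2010 page has more rows than the 2012 page, so it also scrapes content I do not want. Is there a way to say "tr.find_all td [:20] from the 2010 website and tr.find_all('tr')[:19] from the 2012 one"?
@Danisk Sorry about the previous comment. What I meant is that you should not try to limit the number of extracted rows; that is the dynamic part.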
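
One possible reading of that last comment is to select rows by their structure rather than by a hard-coded count per year. The Python 3 sketch below reuses the check from the question's second attempt (first cell has class "l"). Whether that class actually marks the data rows on the real pages is an assumption, and the sample markup and the helper name data_rows are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: data rows start with <td class="l">, as the question's
# second attempt assumes; the trailing totals row does not.
SAMPLE_HTML = """
<table>
  <tr><td class="l">Party One</td><td>P1</td><td>1000</td></tr>
  <tr><td class="l">Party Two</td><td>P2</td><td>500</td></tr>
  <tr><td>Totaal</td><td></td><td>1500</td></tr>
</table>
"""


def data_rows(html):
    """Keep only rows whose first cell carries the class "l"."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for tr in soup.find_all("tr"):
        first_td = tr.find("td")
        # Tag.get returns None when the attribute is missing;
        # bs4 exposes "class" as a list of class names.
        if first_td is not None and "l" in (first_td.get("class") or []):
            result.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return result


rows = data_rows(SAMPLE_HTML)
print(rows)
print(rows[:1])  # slicing still works when a fixed count really is wanted
```

Because the filter depends on the markup and not on the row count, the same function could be applied to both the 2010 and the 2012 page without per-year slice limits, provided both pages mark their data rows the same way.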