Extracting and Printing Table Headers and Data with Beautiful Soup with Python 2.7: Answers

【Question title】: Extracting and Printing Table Headers and Data with Beautiful Soup with Python 2.7
【Posted】: 2017-08-31 19:09:09
【Question】:

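I'm trying to scrape data from a table on the Michigan Department of Health and Human Services website using BeautifulSoup 4.0, but I can't figure out how to format it properly.

I wrote the code below to get the information from the website, but I don't know how to format it so that, when I print it or save it as a .txt/.csv file, it looks like the table on the website. I've looked for answers here and on many other sites, but I'm not sure how to proceed. I'm very much a beginner, so any help would be greatly appreciated.

My code just prints one long list containing either table rows or table data: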

import urllib2
import bs4
from bs4 import BeautifulSoup

url = "https://www.mdch.state.mi.us/osr/natality/BirthsTrends.asp"
page = urllib2.urlopen(url)
soup = BeautifulSoup((page), "html.parser")

table = soup.find("table")
rows = table.find_all("tr")

for tr in rows:
    tds = tr.find_all('td')
    print tds

The HTML I'm looking at is below as well:

<table border=0 cellpadding=3 cellspacing=0 width=640  align="center">
  <thead style="display: table-header-group;"> 
  <tr height=18  align="center"> 
     <th height=35 align="left" colspan="2">County</th>

     <th height="35" align="right">
     2005
     </th>

That section shows the years as headers and continues through 2015; the state and county data then follow further down:

   <tr height="40" > 
      <th class="LeftAligned" colspan="2">Michigan</th>
 <td>
 127,518
 </td>

And so on for the other counties. Again, any help is appreciated.

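【Comments】:

All you have to do is create a multidimensional array (rows -> columns) and you're good to go.
Forgive my ignorance, but how would I do that in terms of code?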
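
As a rough illustration of the "rows -> columns" idea from the comment above: a table can be held as a list of rows, where each row is itself a list of cells. The two values below come from the HTML excerpt in the question; the variable names are only placeholders, and the sketch does no scraping at all.

```python
# Each inner list is one table row; each item in it is one cell (column).
table = [
    ["County", "2005"],        # header row
    ["Michigan", "127,518"],   # first data row
]

header, body = table[0], table[1:]

for row in body:
    # zip() pairs every cell with the header of its column.
    print(dict(zip(header, row)))
```

With this layout, table[1][1] is the cell in the second row, second column.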

Tags: python html beautifulsoup html-table

【Solution 1】:

You need to store the table in a list:

import urllib2
from bs4 import BeautifulSoup

url = "https://www.mdch.state.mi.us/osr/natality/BirthsTrends.asp"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")

table = soup.find("table")
rows = table.find_all("tr")

table_contents = []   # store your table here, one list per row
for i, tr in enumerate(rows):
    if i == 0:
        # header row: every cell is a <th>
        row_cells = [th.getText().strip() for th in tr.find_all('th')
                     if th.getText().strip() != '']
    else:
        # data row: an optional <th> label (text left unstripped),
        # followed by the non-empty <td> values
        label = tr.find('th')
        row_cells = ([label.getText()] if label is not None else []) + \
                    [td.getText().strip() for td in tr.find_all('td')
                     if td.getText().strip() != '']
    if len(row_cells) > 1:
        # skip rows that carry no data cells
        table_contents.append(row_cells)

Now table_contents has the same structure and data as the table on the page.
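
Since the question asks about printing the table or saving it as .txt/.csv, here is a minimal sketch of one way to do that once the rows are in a nested list. It assumes Python 3 (the question uses Python 2.7) and rows of equal length. The sample data is a cleaned-up, trimmed version of rows quoted further down in this thread, and the function names to_csv_text and to_aligned_text are made up for the example.

```python
import csv
import io

# Trimmed sample; the real table_contents would come from the scraping loop.
table_contents = [
    ["County", "2005", "2006", "2007"],
    ["Detroit City", "13,156", "13,002", "12,126"],
    ["Wayne ExcludingDetroit", "14,216", "14,309", "14,230"],
]


def to_csv_text(rows):
    # csv.writer quotes any value that contains a comma, e.g. "13,156".
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


def to_aligned_text(rows):
    # Width of each column = length of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    lines = []
    for row in rows:
        # Names are left-aligned, numbers right-aligned.
        cells = [row[0].ljust(widths[0])]
        cells += [cell.rjust(widths[i]) for i, cell in enumerate(row) if i > 0]
        lines.append("  ".join(cells))
    return "\n".join(lines)


print(to_aligned_text(table_contents))
print(to_csv_text(table_contents))
```

To save a file instead of printing, the same csv.writer can be given a file opened with open(path, "w", newline="") in Python 3; in Python 2.7 the csv module expects the file to be opened in binary mode ("wb").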

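【Comments】:

OK, I see how this works. So now I have a bunch of nested lists. I can more or less split off the first list, put the state and county names into position 0 of their respective lists, and then strip all the extra \r\n and \xa0 codes from the output. Does that make sense? That way it would display as: [County, 2005, 2006....2015] and so on.
Yes, that's pretty much it. As I said, table_contents has the same structure and content as the table on the website, and you can process it however you like.
Thanks a lot. I'll explore those options. Right now I have table_headers = table_contents[0] and table_body = table_contents[1:99], which seems to separate them nicely. I also noticed that at the end of the output, "Detroit City" and "Wayne ExcludingDetroit" are already in a list with their table values. Any idea why that is?
For some reason I can't access the website. If you post your output, I can take a look.
The final output looks like this: [u'', u'Detroit City', u'13,156', u'13,002', u'12,126', u'11,791', u'11,180', u'10,941', u'10,338', u'10,081', u'10,123', u'9,818', u'9,891'], [u'', u'Wayne ExcludingDetroit', u'14,216', u'14,309', u'14,230', u'13,833', u'13,467', u'13,228', u'13,388', u'13,028', u'13,489', u'13,548', u'13,581']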
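
The clean-up discussed in these comments (stripping \r\n and \xa0, dropping the empty first cell, splitting headers from body) could be sketched roughly as follows. This assumes Python 3. The helper names clean_cell and clean_row are invented for the example, the whitespace around "Michigan" is a guess at what the raw cell might look like, and the Detroit and Wayne rows are shortened from the output quoted above.

```python
def clean_cell(text):
    # str.split() with no arguments splits on any run of whitespace,
    # including \r, \n and the non-breaking space \xa0, so re-joining
    # with a single space trims and normalises the cell.
    return " ".join(text.split())


def clean_row(row):
    cells = [clean_cell(cell) for cell in row]
    # Drop empty leading cells such as the u'' in front of 'Detroit City'.
    while cells and cells[0] == "":
        cells.pop(0)
    return cells


# Trimmed sample; the Michigan cell's padding is hypothetical.
table_contents = [
    ["County", "2005"],
    ["\r\n Michigan\xa0", "127,518"],
    ["", "Detroit City", "13,156"],
    ["", "Wayne ExcludingDetroit", "14,216"],
]

cleaned = [clean_row(row) for row in table_contents]

# Slicing with [1:] avoids hard-coding the number of rows (e.g. 1:99).
table_headers, table_body = cleaned[0], cleaned[1:]

print(table_headers)
for row in table_body:
    print(row)
```

Only leading empty cells are dropped here, so an empty value in the middle of a row would keep its column position.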