Extracting data from Wikipedia table (titles of episodes): answers

【Question title】: Extracting data from Wikipedia table (titles of episodes)
【Posted】: 2014-11-11 00:14:48
【Question】:

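I am trying to use BeautifulSoup and Python to extract the titles of TV episodes from tables on Wikipedia. To explain what I have done so far, I am working with two tables:

1: http://en.wikipedia.org/wiki/Community_(season_1)

2: http://en.wikipedia.org/wiki/Two_and_a_Half_Men_(season_1)

In both tables, each episode title is contained in a <td class="summary">. In the first table, the <td> also contains an <a>TitleName</a>, and I can extract the data fine with the following code: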

import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/Community_(season_1)"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)

for names in soup.select('td[class="summary"] > a'):
    print names.string

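The problem arises with the second table, Two and a Half Men, where the titles sit directly inside the <td> with no <a>. I use this code to extract them: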

import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/Two_and_a_Half_Men_(season_1)"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
for lel in soup.select('td[class="summary"]'):
    print lel.string

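But the titles come out wrapped in quotation marks (""). I guess removing the quotes would be easy, but what if, within a single table, some <td> cells contain an <a> and some do not? How can I make Python decide whether it should look for an <a> element?

If I remove the > a from the first code block, I get None as output, because both the parent and the child contain strings. If I use names.strings instead, I get: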

<generator object _all_strings at 0x01B1CDA0>

If I use soup.get_text(), I get UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 6818, character maps to <undefined>

Please help :)
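
A note on the UnicodeEncodeError quoted above: the 'charmap' wording suggests that the failure happens when the text is encoded for a console using a legacy single-byte code page, not inside BeautifulSoup itself. The sketch below reproduces that situation in Python 3 under stated assumptions: cp437 is only an assumed stand-in for the console encoding (the thread does not say which one was in use), and `title` and `safe_for_console` are hypothetical names introduced for illustration.

```python
# Hypothetical title containing an en dash (U+2013), the character named in the error.
title = "Pilot \u2013 Part 1"


def safe_for_console(text, encoding="cp437"):
    """Replace characters the target encoding cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


try:
    # Strict encoding raises because cp437 has no en dash.
    title.encode("cp437")
except UnicodeEncodeError as exc:
    print("strict encoding fails:", exc.reason)

print(safe_for_console(title))
```

Replacing unsupported characters loses information, so this is only suitable for display; writing the titles to a UTF-8 file keeps them intact.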

【Comments on the question】:

Another suggestion: instead of parsing the Wikipedia pages, parse the XML from TVRage: services.tvrage.com/feeds/full_show_info.php?sid=22589 for Community, and services.tvrage.com/feeds/full_show_info.php?sid=6454 for 2½ Men
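
Parsing an XML feed like the one suggested here could look roughly like the following sketch, which uses the standard library's `xml.etree.ElementTree`. The thread does not show the feed's structure, so the inline `SAMPLE_XML` and every element and attribute name in it (`Show`, `Episodelist`, `Season`, `no`, `episode`, `title`) are assumptions made for illustration, and `season_titles` is a hypothetical helper. No network request is made.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the element names are assumed, not confirmed by the thread.
SAMPLE_XML = """
<Show>
  <name>Community</name>
  <Episodelist>
    <Season no="1">
      <episode><title>Pilot</title></episode>
      <episode><title>Spanish 101</title></episode>
    </Season>
    <Season no="2">
      <episode><title>Anthropology 101</title></episode>
    </Season>
  </Episodelist>
</Show>
"""


def season_titles(xml_text, season):
    """Return the episode titles of one season, in document order."""
    root = ET.fromstring(xml_text.strip())
    # ElementTree supports this limited XPath form: [@attribute='value'].
    path = ".//Season[@no='{}']/episode/title".format(season)
    return [node.text for node in root.findall(path)]


titles = season_titles(SAMPLE_XML, 1)
print(titles)
```

With structured XML there is no need to strip quotation marks or to care whether a title is wrapped in a link, which is the main appeal of this route over scraping the HTML table.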

Tags: python web-scraping beautifulsoup html-table wikipedia

【Solution 1】:

How about using .text?

import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/Two_and_a_Half_Men_(season_1)"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
for lel in soup.select('td[class="summary"]'):
    # .text joins every string under the tag, so a nested <a> makes no difference
    print lel.text.replace('"','') # remove the quote marks as well

This prints all the names without quotation marks and fixes the None problem.

Pilot
Most Chicks Won't Eat Veal
Big Flappy Bastards
etc...
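
To make the difference between `.string`, `.strings` and `.text` / `get_text()` concrete, here is a minimal Python 3 sketch that runs on an inline fragment instead of the live Wikipedia pages. The `HTML` fragment is a hypothetical imitation of the two cell layouts described in the question, and `episode_title` is a helper name introduced only for this example. It assumes the `bs4` package is installed and uses the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one title wrapped in <a>, one as bare text.
HTML = """
<table>
  <tr><td class="summary">"<a href="/wiki/Pilot">Pilot</a>"</td></tr>
  <tr><td class="summary">"Big Flappy Bastards"</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
cells = soup.select("td.summary")


def episode_title(cell):
    """Return the cell's text with the surrounding quotation marks removed."""
    # get_text() concatenates every string under the tag, with or without <a>.
    return cell.get_text().strip().strip('"')


# .string is None when a tag has several children (quote, <a>, quote).
print([cell.string for cell in cells])
# .strings is a generator; list() shows the individual pieces.
print([list(cell.strings) for cell in cells])

titles = [episode_title(cell) for cell in cells]
print(titles)
```

Unlike `replace('"', '')`, `strip('"')` removes only the outer quotation marks, so a quotation mark that is part of a title would survive. Which behaviour is preferable depends on the data.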

【Comments】:

Wow, that works great! Thanks for the quick response. I didn't know about the text attribute; I guess I need to study BeautifulSoup some more!

【Solution 2】:

Have you thought about trying the tvrage API?

import tvrage.api
community = tvrage.api.Show('Community')
twohalfmen = tvrage.api.Show('Two and a Half Men')
comeps = community.season(1).episode(1)
twoeps = twohalfmen.season(1).episode(2)
>>> comeps
Community 1x01 Pilot
>>> twoeps
Two and a Half Men 1x02 Big Flappy Bastards
>>> community.season(1)
{1: Community 1x01 Pilot, 2: Community 1x02 Spanish 101, 3: Community 1x03 Introduction to Film,
4: Community 1x04 Social Psychology, 5: Community 1x05 Advanced Criminal Law, 6: Community 1x06 Football, Feminism and You,
7: Community 1x07 Introduction to Statistics, 8: Community 1x08 Home Economics, 9: Community 1x09 Debate 109, 10: Community 1x10 Environmental Science,
11: Community 1x11 The Politics of Human Sexuality, 12: Community 1x12 Comparative Religion, 13: Community 1x13 Investigative Journalism, 14: Community 1x14 Interpretive Dance, 15: Community 1x15 Romantic Expressionism, 16: Community 1x16 Communication Studies, 17: Community 1x17 Physical Education, 18: Community 1x18 Basic Genealogy, 19: Community 1x19 Beginner Pottery, 20: Community 1x20 The Science of Illusion, 21: Community 1x21 Contemporary American Poultry, 22: Community 1x22 The Art of Discourse, 23: Community 1x23 Modern Warfare, 24: Community 1x24 English as a Second Language, 25: Community 1x25 Pascal's Triangle Revisited}

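【Comments】:

Your answer is good too, and it is actually a better way to do what I'm doing, but I accepted the other one because I can now use BeautifulSoup for other things as well...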