【Question title】: Extracting data from Wikipedia table (titles of episodes)
【Posted】: 2014-11-11 00:14:48
【Question description】:

I am trying to extract the titles of TV episodes from tables on Wikipedia using BeautifulSoup and Python. To explain what I have done so far, I am working with two tables:

1:http://en.wikipedia.org/wiki/Community_(season_1)

2:http://en.wikipedia.org/wiki/Two_and_a_Half_Men_(season_1)

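In both tables, each episode title is contained in a <td class="summary">. In the first table, the <td> also has an <a>TitleName</a> inside it, and I can extract the data just fine with the following code: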

import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/Community_(season_1)"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)

for names in soup.select('td[class="summary"] > a'):
    print names.string

The problem arises with the second table, Two and a Half Men, where the title sits directly inside the <td>. I use this code to extract them:

import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/Two_and_a_Half_Men_(season_1)"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
for lel in soup.select('td[class="summary"]'):
    print lel.string

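But the titles come with quotation marks, i.e. "". I guess removing the quotes would be easy, but what if, within one table, some <td> cells contain an <a> and some do not? How can I make Python decide whether it should look for an <a> element?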

If I remove `> a` from the first code block, I get None as output, because both the parent and the child contain strings. If I then use names.strings instead, I get:

<generator object _all_strings at 0x01B1CDA0>

If I use soup.get_text(), I get UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 6818, character maps to <undefined>

Please help :)

【Comments】:

Tags: python web-scraping beautifulsoup html-table wikipedia


【Solution 1】:

How about .text?

import urllib2
from bs4 import BeautifulSoup

# Download and parse the season page (Python 2, as in the question)
url = "http://en.wikipedia.org/wiki/Two_and_a_Half_Men_(season_1)"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)

# .text joins every string inside the cell, with or without a nested <a>
for lel in soup.select('td[class="summary"]'):
    print lel.text.replace('"','') # remove the quote marks as well

This prints all the names without the quotation marks and fixes the None problem.

Pilot
Most Chicks Won't Eat Veal
Big Flappy Bastards
etc...
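
To illustrate why .text succeeds where .string returns None, here is a minimal sketch in Python 3 syntax. It runs against a small inline HTML snippet rather than the live Wikipedia pages; the markup in SAMPLE_HTML and the helper name episode_titles are made up for illustration and only imitate the two cell styles described in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup imitating the two cell styles in the question:
# one title wrapped in <a>, one title as bare text. Not copied from Wikipedia.
SAMPLE_HTML = """
<table>
  <tr><td class="summary">"<a href="/wiki/Pilot">Pilot</a>"</td></tr>
  <tr><td class="summary">"Big Flappy Bastards"</td></tr>
</table>
"""

def episode_titles(html):
    """Return the title text of every <td class="summary"> cell."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for cell in soup.select("td.summary"):
        # get_text() concatenates all nested strings, so the same call
        # works whether or not the cell contains an <a> element.
        titles.append(cell.get_text().strip().strip('"'))
    return titles

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
linked_cell, plain_cell = soup.select("td.summary")

# .string is only defined when a tag has exactly one child;
# the linked cell has three children (quote, <a>, quote).
print(linked_cell.string)
print(plain_cell.string)
print(episode_titles(SAMPLE_HTML))
```

Using strip('"') instead of replace('"', '') only removes quotes at the ends, which would preserve a quotation mark that is part of a title; which behaviour is preferable depends on the data.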

【Comments】:

  • Wow, that works great! Thanks for the quick response. I didn't know about the text method; I guess I need to study BeautifulSoup a bit more!
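
The UnicodeEncodeError mentioned in the question is not addressed above. A 'charmap' error of this kind is most likely raised when print tries to write a character such as the en dash (U+2013) to a console whose encoding lacks it; that diagnosis is an assumption, since the question does not show the console setup. A minimal sketch of one workaround, in Python 3 syntax, with console_safe as a hypothetical helper name and cp437 chosen only as an example of an encoding without U+2013:

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Fall back to the console's encoding, then to UTF-8 if it is unknown.
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)

# Made-up string containing an en dash, the character from the error message.
title = "Pilot \u2013 Part 1"

# cp437 is used here purely as an example of an encoding that has no en dash.
print(console_safe(title, "cp437"))
```

In the Python 2 code from the question, the analogous step would be to encode the unicode string explicitly before printing, for example with text.encode('utf-8') or with the 'replace' error handler.
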
【Solution 2】:

Have you thought about trying the tvrage API?

>>> import tvrage.api
>>> community = tvrage.api.Show('Community')
>>> twohalfmen = tvrage.api.Show('Two and a Half Men')
>>> comeps = community.season(1).episode(1)
>>> twoeps = twohalfmen.season(1).episode(2)
>>> comeps
Community 1x01 Pilot
>>> twoeps
Two and a Half Men 1x02 Big Flappy Bastards
>>> community.season(1)
{1: Community 1x01 Pilot, 2: Community 1x02 Spanish 101, 3: Community 1x03 Introduction to Film,
4: Community 1x04 Social Psychology, 5: Community 1x05 Advanced Criminal Law, 6: Community 1x06 Football, Feminism and You,
7: Community 1x07 Introduction to Statistics, 8: Community 1x08 Home Economics, 9: Community 1x09 Debate 109, 10: Community 1x10 Environmental Science,
11: Community 1x11 The Politics of Human Sexuality, 12: Community 1x12 Comparative Religion, 13: Community 1x13 Investigative Journalism, 14: Community 1x14 Interpretive Dance, 15: Community 1x15 Romantic Expressionism, 16: Community 1x16 Communication Studies, 17: Community 1x17 Physical Education, 18: Community 1x18 Basic Genealogy, 19: Community 1x19 Beginner Pottery, 20: Community 1x20 The Science of Illusion, 21: Community 1x21 Contemporary American Poultry, 22: Community 1x22 The Art of Discourse, 23: Community 1x23 Modern Warfare, 24: Community 1x24 English as a Second Language, 25: Community 1x25 Pascal's Triangle Revisited}

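【Comments】:

  • Your answer is great too, and it is actually a better way to do what I'm doing, but I accepted the other one because I can now use BeautifulSoup for other things as well...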