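【Posted】:2015-06-25 20:05:39
【Question】:
I am using Python 2.7, requests, and BeautifulSoup to scrape roughly 50 Wikipedia pages. I created a column in my dataframe that holds the partial URL for each song's name (these URLs were verified beforehand, and every one returned response code 200 when tested).
My code loops over these individual partial URLs and appends each one to the main Wikipedia URL. I have been able to get the page title and other data, but all I really want is the song's length (nothing else is needed). The song length is in the infobox (for example: http://en.wikipedia.org/wiki/No_One_Knows)
My code either pulls in everything on the page or nothing at all. I think the main problem is the part I left blank below (the mt = ... line): I have tried various HTML tags there, but I get either nothing or most of the page.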
xyz = df.lengthlink
#column in a dataframe containing partial strings to append to the main Wikipedia url

def songlength():
    url = ('http://en.wikipedia.org/wiki/' + xyz)
    resp = requests.get(url)
    page = resp.content
    take = BeautifulSoup(page)
    mt = take.find_all(____________)
    sign = mt
    return xyz, sign

for xyz in df.lengthlink:
    print songlength()
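For reference, here is a minimal sketch of what "only the infobox length" could look like, offered as one possible approach rather than a confirmed fix. It assumes the infobox is a `<table>` whose class list includes `infobox` and that the length sits in a row whose header cell reads `Length`; the real Wikipedia markup may differ. `SAMPLE_HTML` and `infobox_length` are hypothetical names, and the sample values are made up so the snippet runs without network access. It is written to run on Python 3 with the `bs4` package.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for a song infobox.
# The values are illustrative only, not taken from a real page.
SAMPLE_HTML = """
<table class="infobox vevent">
  <tr><th>Released</th><td>2002</td></tr>
  <tr><th>Length</th><td>3:45</td></tr>
</table>
<p>Body text that should be ignored.</p>
"""


def infobox_length(html):
    """Return the text of the infobox 'Length' row, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Restrict the search to the infobox table instead of the whole page.
    infobox = soup.find("table", class_="infobox")
    if infobox is None:
        return None
    for row in infobox.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        if header is None or value is None:
            continue
        # Assumption: the header cell text is exactly "Length".
        if header.get_text(strip=True) == "Length":
            return value.get_text(strip=True)
    return None


print(infobox_length(SAMPLE_HTML))
```

Searching inside the infobox first, and only then matching on the row header, is what avoids the "everything or nothing" result: a bare tag name passed to `find_all` matches across the entire page.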
Edited to add: Martijn's suggestion below works for a single URL (i.e. No_One_Knows), but not for my multiple links. It raises the following error.
InvalidSchema Traceback (most recent call last)
<ipython-input-166-b5a10522aa27> in <module>()
2 xyz = df.lengthlink
3 url = 'http://en.wikipedia.org/wiki/' + xyz
----> 4 resp = requests.get(url, params={'action': 'raw'})
5 page = resp.text
6
C:\Python27\lib\site-packages\requests\api.pyc in get(url, **kwargs)
63
64 kwargs.setdefault('allow_redirects', True)
---> 65 return request('get', url, **kwargs)
66
67
C:\Python27\lib\site-packages\requests\api.pyc in request(method, url, **kwargs)
47
48 session = sessions.Session()
---> 49 response = session.request(method=method, url=url, **kwargs)
50 # By explicitly closing the session, we avoid leaving sockets open which
51 # can trigger a ResourceWarning in some cases, and look like a memory leak
C:\Python27\lib\site-packages\requests\sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
459 }
460 send_kwargs.update(settings)
--> 461 resp = self.send(prep, **send_kwargs)
462
463 return resp
C:\Python27\lib\site-packages\requests\sessions.pyc in send(self, request, **kwargs)
565
566 # Get the appropriate adapter to use
--> 567 adapter = self.get_adapter(url=request.url)
568
569 # Start time (approximately) of the request
C:\Python27\lib\site-packages\requests\sessions.pyc in get_adapter(self, url)
644
645 # Nothing matches :-/
--> 646 raise InvalidSchema("No connection adapters were found for '%s'" % url)
647
648 def close(self):
InvalidSchema: No connection adapters were found for '1 http://en.wikipedia.org/wiki/Locked_Out_of_Heaven
2 http://en.wikipedia.org/wiki/No_One_Knows
3 http://en.wikipedia.org/wiki/Given_to_Fly
4 http://en.wikipedia.org/wiki/Nothing_as_It_Seems
Name: lengthlink, Length: 50, dtype: object'
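The message above quotes the whole `lengthlink` column as the URL, which suggests that `'http://en.wikipedia.org/wiki/' + xyz` was evaluated while `xyz` was still the entire column rather than one string. A minimal sketch of building one URL per entry follows; `build_urls` and `BASE_URL` are hypothetical names, and a plain list stands in for `df.lengthlink` so the snippet runs without pandas.

```python
BASE_URL = "http://en.wikipedia.org/wiki/"


def build_urls(partials):
    """Return one full URL string per partial path."""
    # Concatenate element by element, so each result is a plain string
    # rather than a single object holding every URL at once.
    return [BASE_URL + str(partial) for partial in partials]


# Assumption: a plain list stands in for the df.lengthlink column here.
partials = ["Locked_Out_of_Heaven", "No_One_Knows"]
for url in build_urls(partials):
    print(url)
```

Each string produced this way could then be passed to `requests.get` on its own, one request per song.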
【Comments】:
Tags: python python-2.7 beautifulsoup python-requests