[Question title]: Search for a specific word with BeautifulSoup in Python
[Posted]: 2015-09-15 01:43:10
[Question]:

I'm trying to write a Python script that reads a Crunchyroll page and gives me the ssid of the subtitles.

For example: http://www.crunchyroll.com/i-cant-understand-what-my-husband-is-saying/episode-1-wriggling-memories-678035

Looking at the page source for the ssid, I want to extract the number that follows ssid in this element:

 <a href="/i-cant-understand-what-my-husband-is-saying/episode-1-wriggling-memories-678035?ssid=154757" title="English (US)">English (US)</a>

I want to extract "154757", but my script doesn't seem to work.

Here is my current script:

import feedparser
import re
import urllib2
from urllib2 import urlopen
from bs4 import BeautifulSoup


feed = feedparser.parse('http://www.crunchyroll.com/rss/anime')
url1 = feed['entries'][0]['link']
soup = BeautifulSoup(urlopen(url1), 'html.parser')

How can I modify my code to search for and extract that particular number?

[Comments]:

  • You have asked several questions here but haven't accepted any answers. Other users will be more willing to help if you first accept the answers to your earlier questions.
  • @serk .. done... I couldn't find how to accept them... instead, I upvoted them :|
  • Welcome to Stack Overflow! I suggest you take the tour.

Tags: python string python-2.7 beautifulsoup text-extraction


[Solution 1]:

This should get you started extracting the ssid for each entry. Note that some of these links don't have an ssid, so you'll have to handle that with some error catching. The re and urllib2 modules aren't needed here.

import feedparser
import requests
from bs4 import BeautifulSoup


d = feedparser.parse('http://www.crunchyroll.com/rss/anime')
for url in d.entries:
    r = requests.get(url.link)
    # name a parser explicitly to avoid BeautifulSoup's "no parser specified" warning
    soup = BeautifulSoup(r.text, 'html.parser')
    # the subtitle links live inside spans with this class
    subtitles = soup.find_all('span', {'class': 'showmedia-subtitle-text'})
    for ssid in subtitles:
        for a in ssid.find_all('a'):
            print a['href']

Output:

--snip--
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166035
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165817
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165819
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166783
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165839
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165989
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166051
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166011
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165995
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165997
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166033
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165825
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166013
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166009
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166003
/etotama/episode-11-catrat-shuffle-678659?ssid=166007
/etotama/episode-11-catrat-shuffle-678659?ssid=165969
/etotama/episode-11-catrat-shuffle-678659?ssid=166489
/etotama/episode-11-catrat-shuffle-678659?ssid=166023
/etotama/episode-11-catrat-shuffle-678659?ssid=166015
/etotama/episode-11-catrat-shuffle-678659?ssid=166049
/etotama/episode-11-catrat-shuffle-678659?ssid=165993
/etotama/episode-11-catrat-shuffle-678659?ssid=165981
--snip--

There are more, but I omitted them for brevity. From these results you should be able to parse out the ssid easily with a bit of slicing, since it looks like the ssids are all 6 digits long. Something like:

print a['href'][-6:]

will do the trick and give you just the ssid.
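If you'd rather not rely on the ssids always being 6 digits, matching the query string with a regex is a bit more robust. A minimal sketch (the hrefs below are sample values copied from the output above; `extract_ssid` is a helper name I made up):

```python
import re

# sample hrefs shaped like the answer's output; the second one has no ssid
hrefs = [
    "/etotama/episode-11-catrat-shuffle-678659?ssid=166007",
    "/i-cant-understand-what-my-husband-is-saying/episode-1-wriggling-memories-678035",
]

def extract_ssid(href):
    # return the digits after "ssid=", or None when the link has no ssid
    match = re.search(r'ssid=(\d+)', href)
    return match.group(1) if match else None

for href in hrefs:
    print(extract_ssid(href))  # prints 166007, then None
```

This also gives you a natural place to skip links without an ssid instead of slicing garbage off the end of the path.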

[Comments]:

  • getting :- NameError: name 'requests' is not defined — the line r = requests.get(url.link) is the source of this error ....
  • You have to install the requests module.
  • OK.. so it's now giving me the titles from the RSS feed... just need to figure out the ssids.. thanks for getting me started... will try and reply in a few hours..
  • The ssids appear in the results (e.g.: /i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166035). You have to parse them out, which shouldn't be too hard. Does this answer your question?
  • @user2408212 See the update I made to the answer showing how to get the ssid from the results.