【Question title】: urlopen for loop with BeautifulSoup
【Posted】: 2016-08-17 03:00:56
【Question description】:

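New user here. I'm starting to get the hang of Python syntax, but for loops keep throwing me. I understand every scenario I've come across on SO so far (and my own earlier examples), but I can't work out a loop for my current scenario.

As an exercise, I'm using BeautifulSoup to pull features from app store pages.

I created a list of Google Play and iTunes URLs to work with.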

 list = {"https://play.google.com/store/apps/details?id=com.tov.google.ben10Xenodromeplus&hl=en",
"https://play.google.com/store/apps/details?id=com.doraemon.doraemonRepairShopSeasons&hl=en",
"https://play.google.com/store/apps/details?id=com.KnowledgeAdventure.SchoolOfDragons&hl=en",
"https://play.google.com/store/apps/details?id=com.turner.stevenrpg&hl=en",
"https://play.google.com/store/apps/details?id=com.indigokids.mimdoctor&hl=en",
"https://play.google.com/store/apps/details?id=com.rovio.gold&hl=en",
"https://itunes.apple.com/us/app/angry-birds/id343200656?mt=8",
"https://itunes.apple.com/us/app/doodle-jump/id307727765?mt=8",
"https://itunes.apple.com/us/app/tiny-wings/id417817520?mt=8",
"https://itunes.apple.com/us/app/flick-home-run-!/id454086751?mt=8",
"https://itunes.apple.com/us/app/bike-race-pro/id510461370?mt=8"}

To test BeautifulSoup (bs in my code), I used one app from each store:

gptest = bs(urllib.urlopen("https://play.google.com/store/apps/details?id=com.rovio.gold&hl=en"))

ios = bs(urllib.urlopen("https://itunes.apple.com/us/app/doodle-jump/id307727765?mt=8"))

I found the app's category on iTunes:

print ios.find(itemprop="applicationCategory").get_text()

...and on Google Play:

print gptest.find(itemprop="genre").get_text()

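With this newfound confidence, I wanted to iterate over my whole list and print these values, but then I realized how bad I am at for loops...

Here's my attempt: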

def opensite():
for item in list:
    bs(urllib.urlopen())

for item in list:
try:
    if "itunes.apple.com" in row:
        print "Category:", opensite.find(itemprop="applicationCategory").get_text()
    else if "play.google.com" in row:
        print "Category", opensite.find(itemprop="genre").get_text()
except:
    pass

Note: ideally I'd be passing in a csv (called "sample", with a single column "URL"), so I believe my loop would start with

for row in sample.URL:

but I thought showing you a list would be more helpful than dealing with a data frame.

Thanks in advance!

【Question comments】:

    Tags: python for-loop beautifulsoup urlopen


    【Solution 1】:
    from __future__ import print_function   #
    try:                                    #
        from urllib import urlopen          # Support Python 2 and 3
    except ImportError:                     #
        from urllib.request import urlopen  #
    
    from bs4 import BeautifulSoup as bs
    
    for line in open('urls.dat'): # Read urls from file line by line
        doc = bs(urlopen(line.strip()), 'html5lib') # Strip \n from url, open it and parse
        if 'apple.com' in line:
            prop = 'applicationCategory'
        elif 'google.com' in line:
            prop = 'genre'
        else:
            continue
        print(doc.find(itemprop=prop).get_text())
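
The substring checks above also match a URL that merely mentions a store name in its path or query string. As a minimal sketch (Python 3 only), the choice of itemprop can be moved into a small function that looks at the hostname instead. The names `ITEMPROP_BY_HOST`, `itemprop_for`, and `fetch_category` are hypothetical helpers, not part of the answer, and the itemprop values are the ones reported in the question, so they may not match the stores' current markup.

```python
from urllib.parse import urlparse

# Host suffix -> itemprop that held the category on the pages in the question.
# This mapping is an assumption about the page markup and may be out of date.
ITEMPROP_BY_HOST = {
    "apple.com": "applicationCategory",
    "google.com": "genre",
}


def itemprop_for(url):
    """Return the itemprop name for a store URL, or None if the host is unknown."""
    host = urlparse(url.strip()).hostname or ""
    for suffix, prop in ITEMPROP_BY_HOST.items():
        if host == suffix or host.endswith("." + suffix):
            return prop
    return None


def fetch_category(url):
    """Download one page and return its category text, or None if not found."""
    # Imported lazily so itemprop_for() works without bs4 or network access.
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    prop = itemprop_for(url)
    if prop is None:
        return None  # unknown store: skip before making a request
    soup = BeautifulSoup(urlopen(url.strip()), "html.parser")
    tag = soup.find(itemprop=prop)
    return tag.get_text() if tag is not None else None
```

Calling `fetch_category(line)` for each line of the URL file would then replace the body of the loop, and unknown hosts are skipped without a download.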
    

    【Comments】:

      【Solution 2】:

      Try this to read the URLs from a list:

      from __future__ import print_function  # lets print() work on Python 2 as well

      import requests
      from bs4 import BeautifulSoup as bs

      # A set literal, so iteration order (and the output order) is arbitrary.
      # Named urls rather than list to avoid shadowing the built-in.
      urls = {"https://play.google.com/store/apps/details?id=com.tov.google.ben10Xenodromeplus&hl=en",
      "https://play.google.com/store/apps/details?id=com.doraemon.doraemonRepairShopSeasons&hl=en",
      "https://play.google.com/store/apps/details?id=com.KnowledgeAdventure.SchoolOfDragons&hl=en",
      "https://play.google.com/store/apps/details?id=com.turner.stevenrpg&hl=en",
      "https://play.google.com/store/apps/details?id=com.indigokids.mimdoctor&hl=en",
      "https://play.google.com/store/apps/details?id=com.rovio.gold&hl=en",
      "https://itunes.apple.com/us/app/angry-birds/id343200656?mt=8",
      "https://itunes.apple.com/us/app/doodle-jump/id307727765?mt=8",
      "https://itunes.apple.com/us/app/tiny-wings/id417817520?mt=8",
      "https://itunes.apple.com/us/app/flick-home-run-!/id454086751?mt=8",
      "https://itunes.apple.com/us/app/bike-race-pro/id510461370?mt=8"}

      def opensite():
          for item in urls:
              # Fetch each page once and parse it
              soup = bs(requests.get(item).text, "html.parser")

              # Pick the itemprop that holds the category for this store
              if "itunes.apple.com" in item:
                  prop = 'applicationCategory'
              elif "play.google.com" in item:
                  prop = 'genre'
              else:
                  continue

              tag = soup.find('span', {'itemprop': prop})
              if tag is not None:  # skip pages where the tag is missing
                  print(item, "Category:", tag.text)

      opensite()
      

      It prints output like this:

      https://itunes.apple.com/us/app/doodle-jump/id307727765?mt=8 Category: Games
      https://play.google.com/store/apps/details?id=com.KnowledgeAdventure.SchoolOfDragons&hl=en Category: Role Playing
      https://play.google.com/store/apps/details?id=com.tov.google.ben10Xenodromeplus&hl=en Category: Role Playing
      https://itunes.apple.com/us/app/tiny-wings/id417817520?mt=8 Category: Games
      https://play.google.com/store/apps/details?id=com.doraemon.doraemonRepairShopSeasons&hl=en Category: Role Playing
      https://itunes.apple.com/us/app/angry-birds/id343200656?mt=8 Category: Games
      https://play.google.com/store/apps/details?id=com.indigokids.mimdoctor&hl=en Category: Role Playing
      https://itunes.apple.com/us/app/bike-race-pro/id510461370?mt=8 Category: Games
      https://play.google.com/store/apps/details?id=com.rovio.gold&hl=en Category: Role Playing
      https://play.google.com/store/apps/details?id=com.turner.stevenrpg&hl=en Category: Role Playing
      https://itunes.apple.com/us/app/flick-home-run-!/id454086751?mt=8 Category: Games
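
The question mentions that the URLs will eventually come from a CSV with a single `URL` column. As a rough sketch of that step using only the standard library `csv` module (no data frame), the loop could be fed like this. The helper name `urls_from_csv` and the in-memory sample data are hypothetical stand-ins for the real file.

```python
import csv
import io


def urls_from_csv(fileobj, column="URL"):
    """Yield the non-empty values of one column from an open CSV file."""
    for row in csv.DictReader(fileobj):
        url = (row.get(column) or "").strip()
        if url:
            yield url


# Hypothetical sample data standing in for the real CSV file.
sample = io.StringIO(
    "URL\n"
    "https://play.google.com/store/apps/details?id=com.rovio.gold&hl=en\n"
    "\n"
    "https://itunes.apple.com/us/app/doodle-jump/id307727765?mt=8\n"
)

for url in urls_from_csv(sample):
    # Same store test as in the answers; the fetch-and-parse step would go here.
    store = "iTunes" if "itunes.apple.com" in url else "Google Play"
    print(store, url)
```

With a real file, passing `open("sample.csv", newline="")` (the file name is assumed) in place of the `io.StringIO` object should work the same way.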
      

      【Comments】:
