[Title]: Automating a Python Web Crawler - How to prevent raw_input all the time?
[Posted]: 2015-11-01 05:00:16
[Description]:

I've been trying to build a Python web crawler that fetches a web page, reads its list of links, returns the link at a pre-specified position, and repeats this a set number of times (defined by a count variable). My problem is that I haven't found a way to automate the process: I have to keep typing in the links the code finds.

The first URL is http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Brenae.html

count_1 equals 7

position equals 8

Here is my code:

import urllib 
from bs4 import BeautifulSoup

count_1 = raw_input('Enter count: ')
position = raw_input('Enter position: ')
count = int(count_1)

while count > 0:
    list_of_tags = list()
    url = raw_input("Enter URL: ")  # <-- I have to type the found link back in here every pass
    fhand = urllib.urlopen(url).read()
    soup = BeautifulSoup(fhand, "lxml")
    tags = soup("a")
    for tag in tags:
        list_of_tags.append(tag.get("href", None))
    print list_of_tags[int(position)]  # the link at the pre-specified position
    count -= 1

Thanks for any help.
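In other words, the loop needs to feed the link it finds back in as the next url instead of prompting on every pass. A minimal sketch of that change, keeping the Python 2 urllib/BeautifulSoup approach from the question (variable names are illustrative):

import urllib
from bs4 import BeautifulSoup

count = int(raw_input('Enter count: '))
position = int(raw_input('Enter position: '))
url = raw_input('Enter starting URL: ')  # asked for only once

while count > 0:
    fhand = urllib.urlopen(url).read()
    soup = BeautifulSoup(fhand, "lxml")
    links = [tag.get("href", None) for tag in soup("a")]
    url = links[position]  # follow the link at `position` on the next pass
    print url
    count -= 1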

[Question Discussion]:

Tags: python while-loop web-scraping web-crawler


[Solution 1]:

I've prepared some code with comments. Let me know if you have any questions or run into other problems.

Here you go:

    import requests
    from lxml import html
    
    
    def searchRecordInSpecificPosition(url, position):
        ## Making request to the specified URL
        response = requests.get(url)
    
        ## Parsing the DOM to a tree
        tree = html.fromstring(response.content)
    
        ## Creating a dict of links.
        links_dict = dict()
    
        ## Format of the dictionary:
        ##
        ##  {
        ##      1: {
        ##          'href': "http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Medina.html",
        ##          'text': "Medina"
        ##      },
        ##      
        ##      2: {
        ##          'href': "http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Chiara.html",
        ##          'text': "Chiara"
        ##      },
        ##  
        ##      ... and so on...
        ## }
    
        counter = 1
    
        ## For each <a> tag found, extract its text and link (href) and insert it into links_dict
        for link in tree.xpath('//ul/li/a'):
            href = link.xpath('.//@href')[0]
            text = link.xpath('.//text()')[0]
            links_dict[counter] = dict(href=href, text=text)
            counter += 1
    
        return links_dict[position]['text'], links_dict[position]['href']
    
    
    times_to_search = int(raw_input("Enter the amount of times to search: "))
    position = int(raw_input('Enter position: '))
    
    count = 0
    
    print ""
    
    while count < times_to_search:
        if count == 0:
            name, url = searchRecordInSpecificPosition("http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Brenae.html", position)
        else:
            name, url = searchRecordInSpecificPosition(url, position)
        print "[*] Name: {}".format(name)
        print "[*] URL: {}".format(url)
        print ""
        count += 1
    

Sample output:

    ➜  python scraper.py
    Enter the amount of times to search: 4
    Enter position: 1
    
    [*] Name: Medina
    [*] URL: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Medina.html
    
    [*] Name: Darrius
    [*] URL: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Darrius.html
    
    [*] Name: Caydence
    [*] URL: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Caydence.html
    
    [*] Name: Peaches
    [*] URL: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Peaches.html
    
    ➜ 
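Note that this answer is written for Python 2 (raw_input and print statements). A rough Python 3 equivalent of the same loop, assuming requests and lxml are available, might look like this (search_record is a hypothetical helper written for this sketch, not part of either library):

    import requests
    from lxml import html

    def search_record(url, position):
        ## Fetch and parse the page, then return (text, href) of the
        ## link at the requested 1-based position in the <ul> list.
        tree = html.fromstring(requests.get(url).content)
        link = tree.xpath('//ul/li/a')[position - 1]
        return link.text_content(), link.get('href')

    times_to_search = int(input("Enter the amount of times to search: "))
    position = int(input("Enter position: "))

    url = "http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Brenae.html"
    for _ in range(times_to_search):
        name, url = search_record(url, position)
        print("[*] Name: {}".format(name))
        print("[*] URL: {}".format(url))
        print()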
    

[Discussion]:
