Beautifulsoup - Problems for webcrawler (question and answers)

【Question title】: Beautifulsoup - Problems for webcrawler
【Posted】: 2018-03-08 13:53:26
【Question description】:

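How do I correctly output all the links on this news site (as a list)?
Once they are in a list, how can I return results at random (3 to 5 links at a time)?

Note: the code I need starts at around line 739 (it may shift a little, since the page refreshes every day).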

div class="abdominis rlby clearmen"

I need every link inside this kind of element:

<a href="https://tw.news.appledaily.com/life/realtime/20180308/1310910/">

Thanks!! The code is below:

from bs4 import BeautifulSoup
from flask import Flask, request, abort
import requests
import re
import random
import types    
target_url = 'http://www.appledaily.com.tw/realtimenews/section/new/'
print('Start parsing appleNews....')
rs = requests.session()
res = rs.get(target_url, verify=False)
soup = BeautifulSoup(res.text, 'html.parser')

#can output all links but with useless information
contents = soup.select("div[class='abdominis rlby clearmen']")[0].find_all('a')
print(contents)

#can output single link but not in list form
#contents = soup.select("div[class='abdominis rlby clearmen']")[0].find('a').get('href')
#print(contents)

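【Question comments】:

"Once they are in a list, how can I return results at random (3 to 5 links at a time)": can you clarify what you mean? Return to where?
One-liner: [a['href'] for a in soup.select("div[class='abdominis rlby clearmen']")[0].find_all(href=True)]
@johnashu I actually meant output.
Since I'm writing an online chatbot, it should be "return" XD

Tags: python python-3.x python-2.7 beautifulsoup web-crawler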

【Solution 1】:

Here is a solution that appends every link contained in the specified div to a list.

from bs4 import BeautifulSoup
from flask import Flask, request, abort
import requests
import re
import random
import types    
target_url = 'http://www.appledaily.com.tw/realtimenews/section/new/'
print('Start parsing appleNews....')
rs = requests.session()
res = rs.get(target_url, verify=False)
soup = BeautifulSoup(res.text, 'html.parser')

list_links = [] # Create empty list

for a in soup.select("div[class='abdominis rlby clearmen']")[0].findAll(href=True): # find links based on div
    list_links.append(a['href']) #append to the list
    print(a['href']) #Check links

for l in list_links: # print list to screen (2nd check)
    print(l)
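
The extraction step above can also be tried offline. The following is a minimal sketch, assuming the page wraps its story links in the div shown in the question; SAMPLE_HTML, the example.com URL and the helper name extract_links are hypothetical stand-ins, not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<div class="abdominis rlby clearmen">
  <ul>
    <li><a href="https://tw.news.appledaily.com/life/realtime/20180308/1310910/">Story A</a></li>
    <li><a href="https://example.com/story-b">Story B</a></li>
    <li><a>Anchor without href</a></li>
  </ul>
</div>
<div class="other"><a href="https://example.com/ignored">Outside the div</a></div>
"""


def extract_links(html):
    """Return the href of every <a> inside the target div, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    # Chained class selectors match whatever order the classes appear in,
    # and a[href] skips anchors that carry no href attribute.
    anchors = soup.select("div.abdominis.rlby.clearmen a[href]")
    return [a["href"] for a in anchors]


links = extract_links(SAMPLE_HTML)
print(links)
```

Compared with indexing the first match with [0], this selector returns an empty list when the div is missing instead of raising an IndexError, which may be easier to handle on a page that changes daily.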

To pick random links to return:

import random  # import the random module

random_list = []  # collects the randomly chosen links
random.shuffle(list_links)  # shuffle the list in place

for i in range(5):  # number of links to pick (5 in this instance)
    try:
        # randint is inclusive at both ends, so the upper bound is len - 1
        res = list_links.pop(random.randint(0, len(list_links) - 1))
        print(res)  # print to screen
        random_list.append(res)  # or append to random_list
    except (IndexError, ValueError):  # list_links is empty
        pass

Final edit, since you asked for a return value:

Here is a function that returns a list of up to x random links.

def return_random_link(list_, num):
    """ Takes in a list and returns up to num randomly chosen items """
    random.shuffle(list_)

    random_list = []

    for i in range(num):
        try:  # try to append to the list
            # randint is inclusive at both ends, so the upper bound is len - 1
            r = list_.pop(random.randint(0, len(list_) - 1))
            random_list.append(r)
        except (IndexError, ValueError):  # no items left in list_
            return random_list  # return what was collected so far

    return random_list

random_list = return_random_link(list_links, 5)

for i in random_list:
    print(i)
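
Since the question asks for 3 to 5 links at a time, the same idea can be sketched with random.sample, which picks distinct items without modifying the original list. This is only an illustration under stated assumptions: pick_random_links and sample_links are hypothetical names, and the example.com URLs stand in for the scraped list_links.

```python
import random


def pick_random_links(links, low=3, high=5):
    """Return between low and high distinct links without modifying the input."""
    if not links:
        return []
    count = random.randint(low, high)  # inclusive at both ends
    count = min(count, len(links))     # never ask for more links than exist
    return random.sample(links, count)


# Hypothetical data standing in for list_links
sample_links = ["https://example.com/news/{}".format(n) for n in range(10)]
picked = pick_random_links(sample_links)
print(picked)
```

Because the input list is left untouched, the function can be called once per chatbot request without scraping the page again.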

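【Comments】:

The last edit wraps it in a function to make your code tidier.
If this helped, please mark the answer as correct with the green tick :) It will help future users find the answer quickly.
Sorry, the line res = list_links.pop(random.randint(0, len(list_links))) sometimes reports "pop index out of range" and sometimes doesn't. Does that mean it should be randint(0 or randint(1? Or is it because of the link source?
Sorry about that. I added a try/except block so the list is returned once iteration finishes. The function should work now.
May I ask what the problem is with index 0??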

【Solution 2】:

If you want the link tags without their descendants, you can clear them:

for elm in contents:
    elm.clear()

Still, I would guess that extracting just the links is more useful:

contents = [a['href'] for a in contents]

To get the results in random order, try random.shuffle() and take as many elements as you need from the shuffled list each time.
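
One way to read that suggestion is to shuffle once and then hand out the list in fixed-size batches, so no link repeats until the list runs out. A rough sketch follows; shuffled_batches is a hypothetical helper, and the example.com URLs are placeholders for the extracted hrefs.

```python
import random


def shuffled_batches(links, size):
    """Yield successive batches of `size` links from a shuffled copy."""
    pool = list(links)  # copy so the caller's list keeps its order
    random.shuffle(pool)
    for start in range(0, len(pool), size):
        yield pool[start:start + size]


# Hypothetical data standing in for the extracted hrefs
contents = ["https://example.com/news/{}".format(n) for n in range(7)]
batches = list(shuffled_batches(contents, 3))
for batch in batches:
    print(batch)
```

The last batch may be shorter than the others when the number of links is not a multiple of the batch size.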

【Comments】: