Automated web scraping with Python (no class/id): answers

[Question title]: Automated webscraping with Python (no class/id)
[Posted]: 2017-04-30 09:40:13
[Question]:

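I'm currently working on my first Python project. I'm trying to build a simple program that automatically scrapes the website http://socialblade.com. The idea is to create a list of YouTube users (Excel, CSV, ...) and feed it to a Python script. Python would then scrape each user's page and produce a CSV file with the latest daily view or subscriber counts.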

I followed a few tutorials on BS4, requests, ... but I'm stuck. The divs I want to scrape on socialblade don't seem to have a class or id.

For example, this is the code for one of the items I want to collect:

<div style="width: 140px; float: left;">16,518
</div>
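
A div like this can still be targeted through its inline style, even without a class or id. The following is a minimal sketch using only the standard library's html.parser; it assumes the style string is stable enough to match on, which may not hold on the live page. The names StyleDivCollector, parse_count and the SAMPLE markup are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class StyleDivCollector(HTMLParser):
    """Collect the text of every <div> whose inline style equals `style` (hypothetical helper)."""

    def __init__(self, style):
        super().__init__()
        self.style = style
        self.depth = 0      # > 0 while the parser is inside a matching div
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # a nested div inside a matching one
        elif dict(attrs).get("style") == self.style:
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def parse_count(text):
    """Turn a string such as '16,518' into the integer 16518."""
    return int(text.strip().replace(",", ""))


# Made-up markup modelled on the snippet in the question
SAMPLE = """
<div id="wrapper">
  <div style="width: 200px; float: left;">Subscribers</div>
  <div style="width: 140px; float: left;">16,518
  </div>
</div>
"""

collector = StyleDivCollector("width: 140px; float: left;")
collector.feed(SAMPLE)
counts = [parse_count(text) for text in collector.texts]
print(counts)
```

With lxml, the same selection could be written as the XPath //div[@style="width: 140px; float: left;"]/text(). Either way, several divs on a real page may share the same style, so the result is a list that still has to be indexed or filtered.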

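Besides that, I'm also not sure how to feed the links for the different users into the Python program. At the moment we have a file with a list of users (rows). One of the columns is the link to their YouTube account.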

This is a very basic outline of what I want to do:

For user 1 to n:
1) Read the user's link from the Excel file
2) Scrape the "number of views" and "number of subscriptions" from the socialblade page
3) Write this data to a csv/excel file
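
The three steps above can be sketched as a small loop. This is only an outline under assumptions: the input is taken to be a CSV export with a column named "link", the URL is made up, and fake_scrape is a placeholder that returns fixed numbers instead of requesting a page. All of these names are hypothetical.

```python
import csv
import io


def process_users(links_file, output_file, scrape):
    """Read one link per row, scrape it, and write link/views/subscribers rows.

    `scrape` is any callable that takes a URL and returns (views, subscribers).
    """
    reader = csv.DictReader(links_file)
    writer = csv.writer(output_file)
    writer.writerow(["link", "views", "subscribers"])
    for row in reader:
        link = row["link"]                   # step 1: read the user's link
        views, subscribers = scrape(link)    # step 2: scrape the two numbers
        writer.writerow([link, views, subscribers])  # step 3: write the result


def fake_scrape(url):
    """Placeholder: returns fixed numbers instead of doing a real HTTP request."""
    return 16518, 120


# In-memory files keep the sketch runnable without touching the disk
links = io.StringIO("name,link\nuser_a,https://socialblade.com/youtube/user/user_a\n")
output = io.StringIO()
process_users(links, output, fake_scrape)
print(output.getvalue())
```

Passing the scraping function in as a parameter keeps the file handling separate from the page parsing, so the parsing can be swapped out when the site layout changes. With real files, open them with newline="" as the csv module documentation recommends.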

Hope this is somewhat understandable :)

Thanks a lot, and I'm looking forward to improving my Python skills!

Regards, and have a nice weekend!

[Question comments]:

Tags: python excel python-3.x web-scraping beautifulsoup

[Answer 1]:

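Well, if I understand correctly, you first need the list of users in an Excel file. I don't have one, so in my case I used this code to fetch the top 25 and save them to an xlsx file: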

from openpyxl import load_workbook, Workbook
from lxml.html import fromstring
import requests


def get_number_of_views_and_subscriptions(socialblade_url="https://socialblade.com/youtube/"):
    """Return (account name, account url, subscribers, views) tuples
    scraped from the socialblade web-site using requests and xpath"""

    response = requests.get(socialblade_url)
    tree = fromstring(response.content)

    # Absolute, index-based paths: they depend on the exact page layout
    # and have to be updated whenever that layout changes
    account_names = tree.xpath("/html/body/div[9]/div[1]/div/div[3]/a/text()")
    account_urls = ["https://socialblade.com" + href for href in tree.xpath("/html/body/div[9]/div[1]/div/div[3]/a/@href")]
    subscribers = tree.xpath("/html/body/div[9]/div[1]/div/div[5]/text()")
    views = tree.xpath("/html/body/div[9]/div[1]/div/div[6]/text()")

    # Return a list rather than a one-shot zip iterator so the result can be reused
    return list(zip(account_names, account_urls, subscribers, views))


def writing_to_excel(file_path="users_data.xlsx", data=None):
    """Write rows of the form ["account names", "account urls", "number of subscribers", "number of views"]
    to an xlsx file"""

    # Fetch at call time: a default of get_number_of_views_and_subscriptions()
    # would be evaluated once, when the function is defined
    if data is None:
        data = get_number_of_views_and_subscriptions()

    workbook = Workbook()
    worksheet = workbook.create_sheet("Socialblade", 0)
    worksheet.append(["account names", "account urls", "number of subscribers", "number of views"])

    for item in data:
        worksheet.append(item)

    workbook.save(file_path)
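
One Python detail is worth knowing whenever a function call appears as a default argument value: the call runs once, when the def statement is executed, not each time the function is called. For a scraper that means the HTTP request would fire as soon as the file is loaded. A minimal sketch of the difference, where fetch is a hypothetical stand-in for an expensive call:

```python
import itertools

_counter = itertools.count(1)


def fetch():
    """Hypothetical stand-in for an expensive call such as an HTTP request."""
    return next(_counter)


def eager(data=fetch()):   # fetch() runs once, right here, at definition time
    return data


def lazy(data=None):       # fetch() runs on every call that omits `data`
    if data is None:
        data = fetch()
    return data


print(eager(), eager())  # 1 1
print(lazy(), lazy())    # 2 3
```

The None-default form is the usual idiom when the default should be computed fresh on each call.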

The next step is to read the links and scrape the information. I would use the following code:

def get_excel_user_links(file_path="users_data.xlsx"):
    """Return all values of the second column ("account urls") of the Excel file, skipping the header"""

    workbook = load_workbook(filename=file_path)
    worksheet = workbook.active  # or workbook["Socialblade"] to select the sheet by name

    values = [row[1].value for row in worksheet.iter_rows() if row[1].value != "account urls"]
    return values

def scrape_and_save_to_excel(file_path="scraped_data.xlsx", user_links=None):
    """Scrape each user's data and save it to xlsx"""

    # Read the links at call time, so users_data.xlsx only has to exist
    # when this function actually runs
    if user_links is None:
        user_links = get_excel_user_links()

    data = [["user link", "number of views", "number of subscribers"]]

    for user_link in user_links:
        response = requests.get(user_link)
        tree = fromstring(response.content)

        # [0] raises IndexError if the page layout changes and nothing matches
        number_of_views = tree.xpath('//*[@id="YouTubeUserTopInfoBlock"]/div[4]/span[2]/text()')[0]
        number_of_subscribers = tree.xpath('//*[@id="YouTubeUserTopInfoBlock"]/div[3]/span[2]/text()')[0]

        data.append([user_link, number_of_views, number_of_subscribers])

    workbook = Workbook()
    worksheet = workbook.create_sheet("Socialblade", 0)
    for item in data:
        worksheet.append(item)

    workbook.save(file_path)
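
Note that the two snippets above only define functions; saving them to a .py file and running it produces no Excel file until the functions are actually called. A minimal sketch of an entry point follows. The two functions here are empty stand-ins with the same names as the ones above, so that the example runs on its own, and the calls list exists only to show the order of the steps.

```python
calls = []  # records the order in which the steps run, for illustration only


def writing_to_excel():
    """Stand-in for the function above that creates users_data.xlsx."""
    calls.append("writing_to_excel")


def scrape_and_save_to_excel():
    """Stand-in for the function above that reads users_data.xlsx and writes scraped_data.xlsx."""
    calls.append("scrape_and_save_to_excel")


def main():
    writing_to_excel()          # step 1: build the list of users
    scrape_and_save_to_excel()  # step 2: read that list and scrape each user


if __name__ == "__main__":
    main()
    print(calls)
```

In the real script the stand-ins would be dropped and the same main() and guard placed at the bottom of the file, which is then run with python from the command line.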

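[Comments]:

Thanks a lot for your reply! While trying to work through the code, I was wondering whether the Excel file should be in the same directory? Thanks!
Just to make sure: I have to save this code as a .py file and then run it in cmd, right? :) This doesn't seem to generate an Excel file. Am I missing something?
Well, these are functions; to run them you have to add a line like "get_number_of_views_and_subscriptions()" at the end of the code. And yes, you have to put it in a .py file and run it with python. But it's better to install something like IDLE to work with python.
Another question :) How did you determine which "div"s to scrape? Thanks a lot!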
If a div has no id you can use an index; for example, the xpath "//div[5]" matches a div that is the fifth div among its siblings, and so on. Try learning xpath.
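
Index-based selection has one subtlety: a positional predicate such as [3] counts among the children of each parent, not across the whole document. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, on a small well-formed sample; the markup is made up for illustration. lxml's tree.xpath accepts the same predicates on real pages and additionally supports the form (//div)[3] for "third div in document order".

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup: two sections with two divs each
SAMPLE = """
<body>
  <section><div>a1</div><div>a2</div></section>
  <section><div>b1</div><div>b2</div></section>
</body>
"""

root = ET.fromstring(SAMPLE)

# Divs that are the THIRD div child of their parent: no section has three, so nothing matches
third_children = [div.text for div in root.findall(".//div[3]")]

# The third div in document order, counted across the whole tree
third_in_document = list(root.iter("div"))[2].text

# Usually more robust: pick a parent first, then index inside it
anchored = [div.text for div in root.findall(".//section[2]/div[1]")]

print(third_children, third_in_document, anchored)
```

Anchoring on the nearest element that does have an id, as the answer does with YouTubeUserTopInfoBlock, and indexing from there tends to survive layout changes better than a long absolute path.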