Python Data Scraping with Beautiful Soup - getting data from within an href

【Question title】: Python Data Scraping with Beautiful Soup - getting data from within an href
【Posted】: 2020-08-04 15:58:16
【Question description】:

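I'm quite new to Python and have just started learning Beautiful Soup. Here is my problem: I need to get data from an event company, specifically contact data. They have a main table with the names of all participants and their locations. But to get the contact data (phone, email), you have to click each company name in the table, which opens a new window with all the additional information. I'm looking for a way to get that information from the href and combine it with the data in the main table.

So far I can get the table and all the hrefs: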

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen


test_url = "https://standconstruction.messe-duesseldorf.de/vis/v1/en/hallindex/1.09?oid=2656&lang=2"
test_data = urlopen(test_url)
test_html = test_data.read()
test_data.close()

page_soup = soup(test_html, "html.parser")

test_table = page_soup.findAll("div", {"class": "exh-table-col"})
print(test_table)

As a result I get all the tables, with this kind of information (one row as an example), including the name, address and href:

<a class="flush" href="/vis/v1/en/exhibitors/aluminium2020.2661781?oid=2656&amp;lang=2">
<h2 class="exh-table-item__name" itemprop="name">Aerospace Engineering Equipment (Suzhou) Co LTD</h2>
</a>

</div>, <div class="exh-table-col exh-table-col--address">
<span class=""><i class="fa fa-map-marker"></i>  <span class="link-fix--text">Hall 9 / G57</span></span>

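This is where my problem begins: I have no idea how to get the additional data from the href and combine it with the main data.

I would be very thankful for any possible solution, or at least a hint on where I can find one.

Updated question: I need a table with the following columns: 1. Name; 2. Hall; 3. PDF; 4. Phone; 5. Email.

If you collect the data by hand, you have to click the corresponding link to reveal the phone and email. I would like to know whether there is a way to export the phone and email from these links and add them to the first 3 columns using Python.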

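【Question discussion】:

It's not clear what you mean by "how to get the additional data from the href and combine it with the main data". Please edit your question with an example of the desired output.
@JackFleeting His goal is shown below.
@αԋɱҽԃαмєяιcαη Thanks! That is exactly what I was looking for.
@JackFleeting Thanks for the hint. I updated the question. I hope it is clearer now, in case someone runs into the same problem.

Tags: python web-scraping beautifulsoup href

【Solution 1】: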

import requests
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep

params = {
    "oid": "2656",
    "lang": "2"
}


def main(url):
    with requests.Session() as req:
        r = req.get(url, params=params)
        soup = BeautifulSoup(r.content, 'html.parser')

        target = soup.select("div.exh-table-item")

        names = [name.h2.text for name in target]
        hall = [hall.span.text.strip() for hall in target]
        pdf = [pdf.select_one("a.color--darkest")['href'] for pdf in target]
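        # url[:46] is the site origin ("https://standconstruction.messe-duesseldorf.de");
        # the hrefs in the table are site-relative, so prepend it to get full URLs.
        links = [f"{url[:46]}{link.a['href']}" for link in target]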

        phones = []
        emails = []

        for num, link in enumerate(links, start=1):
            print(f"Extracting {num} of {len(links)}")

            r = req.get(link)

            soup = BeautifulSoup(r.content, 'html.parser')
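            goal = soup.select_one("div[class^=push--bottom]")

            # select_one returns None when nothing matches, so a missing
            # container or field raises AttributeError; record "N/A" instead.
            try:
                phone = goal.select_one("span[itemprop=telephone]").text
            except AttributeError:
                phone = "N/A"

            try:
                email = goal.select_one("a[itemprop=email]").text
            except AttributeError:
                email = "N/A"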

            emails.append(email)
            phones.append(phone)
            sleep(1)

        df = pd.DataFrame(list(zip(names, hall, pdf, phones, emails)), columns=[
                          "Name", "Hall", "PDF", "Phone", "Email"])
        print(df)
        df.to_csv("data.csv", index=False)


main("https://standconstruction.messe-duesseldorf.de/vis/v1/en/hallindex/1.09")
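The answer builds each detail-page URL by slicing the first 46 characters off the listing URL. As a hedged alternative (not what the answer itself does), the standard library's urljoin can resolve a site-relative href without a hard-coded length. A minimal sketch, using the listing URL from the answer and the href shown in the question; the variable names are only illustrative:

```python
from urllib.parse import urljoin

# Listing page URL from the answer; the href is the one shown in the
# question's HTML snippet (with &amp; already decoded to &).
page_url = "https://standconstruction.messe-duesseldorf.de/vis/v1/en/hallindex/1.09"
href = "/vis/v1/en/exhibitors/aluminium2020.2661781?oid=2656&lang=2"

# urljoin resolves a site-relative href against the page it was found on,
# keeping the scheme and host and replacing the path and query.
detail_url = urljoin(page_url, href)
print(detail_url)
```

For this site the result is the same string that the slicing approach produces, but it keeps working if the host name changes length.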

Output: the resulting table is printed and saved to data.csv.
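The answer combines the main-table columns with the phone and email lists by zipping parallel lists, which works as long as every list stays in the same order. A hedged sketch of the same combination step done per row with the standard csv module instead of pandas is below. The helper names (combine, to_csv_text), the record layout, and the PDF, phone and email values are all assumptions made up for illustration; only the first company name and hall come from the question.

```python
import csv
import io

COLUMNS = ["Name", "Hall", "PDF", "Phone", "Email"]


def combine(rows, details, missing="N/A"):
    """Merge main-table rows with per-link contact details (hypothetical layout)."""
    merged = []
    for row in rows:
        # Look up the details scraped from this row's link; empty if none.
        extra = details.get(row["link"], {})
        merged.append({
            "Name": row["name"],
            "Hall": row["hall"],
            "PDF": row["pdf"],
            "Phone": extra.get("phone", missing),
            "Email": extra.get("email", missing),
        })
    return merged


def to_csv_text(records):
    """Render the merged records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Placeholder data: the pdf, phone and email values are invented.
rows = [
    {"name": "Aerospace Engineering Equipment (Suzhou) Co LTD",
     "hall": "Hall 9 / G57", "pdf": "/example-a.pdf", "link": "/exhibitors/a"},
    {"name": "Example Exhibitor B",
     "hall": "Hall 9 / A01", "pdf": "/example-b.pdf", "link": "/exhibitors/b"},
]
details = {"/exhibitors/a": {"phone": "+00 000 0000", "email": "info@example.com"}}

records = combine(rows, details)
csv_text = to_csv_text(records)
print(csv_text)
```

Keying the details by link means a detail page that fails to load only leaves "N/A" in that one row rather than shifting the columns out of alignment.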

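【Discussion】:

Just one question for my understanding: these parameters, how do you choose the right numbers? params = { "oid": "2656", "lang": "2" }
@Se_D Those are the parameters you used in your original link!
True... completely overlooked that. Thanks!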
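
To make the point about the parameters concrete: the values in params are simply the query string of the URL from the question. A minimal sketch with the standard library that splits that URL into the base address and a params dict (split_url is a hypothetical helper name, not part of the answer):

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit


def split_url(full_url):
    """Return (url_without_query, params_dict) for a URL with a query string."""
    parts = urlsplit(full_url)
    # parse_qsl turns "oid=2656&lang=2" into [("oid", "2656"), ("lang", "2")].
    params = dict(parse_qsl(parts.query))
    # Rebuild the URL with an empty query and fragment.
    base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return base, params


base, params = split_url(
    "https://standconstruction.messe-duesseldorf.de/vis/v1/en/hallindex/1.09?oid=2656&lang=2"
)
print(base)
print(params)
```

The two results correspond to the URL passed to main and the params dict defined at the top of the answer.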