Writing a Loop: Taking a List of URLs and Only Getting the Title Text and Meta Description - BeautifulSoup/Python

【Question title】: Writing A Loop: Taking a List of URLs And Only Getting The Title Text and Meta Description - BeautifulSoup/Python
【Posted】: 2020-03-31 15:29:30
【Question】:

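I'm a fairly new data worker in the public health field. Any help is appreciated.

Basically, our goal is to create a quick way to extract the title and meta description from a list of URLs. We are using Python. We do not need anything else from the web pages.

I have a list called "urlList". I have written it out (using Beautiful Soup):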

urlList = [
    "https://www.freeclinics.com/cit/ca-los_angeles?sa=X&ved=2ahUKEwjew7SbgMXoAhUJZc0KHYHUB-oQ9QF6BAgIEAI",
    "https://www.freeclinics.com/cit/ca-los_angeles",
    "https://www.freeclinics.com/co/ca-los_angeles",
    "http://cretscmhd.psych.ucla.edu/healthfair/HF%20Services/LinkingPeopletoServices_CLinics_List_bySPA.pdf",
]

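I was then able to extract the title and description for one of the URLs (see the code below). I would like to loop this over the whole list. I'm open to any form of data export, i.e. it can be a data table, a .csv, or a .txt file.

I know my current print output shows the title and description as strings, with the description output inside [ ]. That is fine. My main concern in this post is looping through the entire urlList.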

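import requests
from bs4 import BeautifulSoup

# A single URL taken from urlList, used to test the extraction
url = "https://www.freeclinics.com/cit/ca-los_angeles?sa=X&ved=2ahUKEwjew7SbgMXoAhUJZc0KHYHUB-oQ9QF6BAgIEAI"

response = requests.get(url)
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(response.text, "html.parser")
metas = soup.find_all('meta')

# Print the page title, then the content of every <meta name="description"> tag
print((soup.title.string),[ meta.attrs['content'] for meta in metas if 'name' in meta.attrs and meta.attrs['name'] == 'description' ])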

>> Output: Free and Income Based Clinics Los Angeles CA ['Search below and find all of the free and income based health clinics in Los Angeles CA. We have listed out all of the Free Clinics listings in Los Angeles, CA below']

P.S. urlList has at most 10-20 links. All of the pages are very similar in structure.

【Comments】:

Tags: python loops web-scraping beautifulsoup

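【Solution 1】:

You can define a function that takes urlList as its argument and returns a list of lists, where each sub-list in the main list holds a title and its corresponding description.

Try this: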

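import requests
from bs4 import BeautifulSoup

def extract_info(url_list):
    """Return a list of [title, description] pairs, one per URL."""
    info = []
    for url in url_list:
        with requests.get(url) as response:
            soup = BeautifulSoup(response.text, "lxml")  # requires the lxml package
        # Either tag may be missing (for example on the PDF link), so fall back to None
        title_tag = soup.find('title')
        meta_tag = soup.find('meta', {"name": "description"})
        title = title_tag.text if title_tag else None
        description = meta_tag["content"] if meta_tag else None
        info.append([title, description])
    return info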

Output:

[['Free and Income Based Clinics Los Angeles CA',
  'Search below and find all of the free and income based health clinics in '
  'Los Angeles CA. We have listed out all of the Free Clinics listings in Los '
  'Angeles, CA below'],
 ...
]
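
Since the question mentions that a .csv export would be acceptable, here is a minimal sketch of one way to save the result. It is not part of the original answer: the helper name write_info_csv and the sample rows are assumptions made for illustration, and the rows are only shaped like the list that extract_info returns (no pages are fetched here).

```python
import csv
import io


def write_info_csv(info, file_obj):
    """Write [title, description] rows to an open text file as CSV.

    write_info_csv is a hypothetical helper, not part of the original answer.
    """
    writer = csv.writer(file_obj)
    writer.writerow(["title", "description"])
    for title, description in info:
        # None means the tag was not found; store it as an empty cell
        writer.writerow([title or "", description or ""])


# Sample rows shaped like the return value of extract_info (illustrative only)
sample_info = [
    [
        "Free and Income Based Clinics Los Angeles CA",
        "Search below and find all of the free and income based health clinics in Los Angeles CA.",
    ],
    [None, None],  # e.g. a PDF link with no <title> or meta description
]

# Write to an in-memory buffer so the sketch runs without touching the disk
buffer = io.StringIO()
write_info_csv(sample_info, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, one could open it with `open("clinics.csv", "w", newline="", encoding="utf-8")` (the file name is just an example) and pass that file object together with `extract_info(urlList)`.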

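【Comments】:

Hello dear Shubham - many thanks for this approach. It looks interesting and I think I can learn from it. Cf. another thread titled: Writing a loop: beautifulsoup-and-lxml-for-getting... posted on March 31 - I would love to get some ideas from you. Above all, great job here. Many thanks!!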