Python Web Scraping Error without Any Warning (Answers)

【Question title】: Python Web Scraping Error without Any Warning
【Posted】: 2021-08-26 05:39:56
【Question】:

I am trying to scrape some text from web pages and save it to text files with the following code (the links are read from a text file named links.txt):

import requests
import csv
import random
import string
import re

from bs4 import BeautifulSoup

#Create random string of specific length
def randStr(chars = string.ascii_uppercase + string.digits, N=10):
    return ''.join(random.choice(chars) for _ in range(N))
    
with open("links.txt", "r") as a_file:
  for line in a_file:
    stripped_line = line.strip()
    endpoint = stripped_line
    response = requests.get(endpoint)
    data = response.text
    soup = BeautifulSoup(data, "html.parser")
    for pictags in soup.find_all('col-md-2'):
        lastfilename = randStr()
        file = open(lastfilename + ".txt", "w")
        file.write(pictags.txt)
        file.close()
        print(stripped_line)

The web page contains the following element:

<div class="col-md-2">

The problem is that nothing happens after I run the code, and I do not get any error.

【Comments】:

What do you want to scrape from that page? Can you explain?

Tags: python web-scraping beautifulsoup python-requests

【Solution 1】:

To write all the keyword text from the page to a file, you can do this:

import requests
from bs4 import BeautifulSoup

url = "http://www.mykeyworder.com/keywords?tags=dog&exclude=&language=en"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

with open("data.txt", "w") as f_out:
    for inp in soup.select('input[type="checkbox"]'):
        print(inp["value"], file=f_out)

This creates data.txt with the following contents:

dog
animal
canine
pet
cute
puppy
happy
young
adorable

...and so on.
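
The selector step in this answer can be tried offline on a small inline document. The following is only a sketch: the HTML string is hypothetical and stands in for the real page, whose markup may differ. It also uses inp.get("value") rather than inp["value"], on the assumption that some checkboxes might lack a value attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<form>
  <input type="checkbox" value="dog">
  <input type="checkbox" value="animal">
  <input type="checkbox">
  <input type="text" value="ignored">
</form>
"""
soup = BeautifulSoup(html, "html.parser")

# The CSS attribute selector keeps only <input> elements of type "checkbox".
# inp["value"] would raise KeyError for the checkbox without a value,
# while inp.get("value") returns None, which is filtered out here.
keywords = [
    inp.get("value")
    for inp in soup.select('input[type="checkbox"]')
    if inp.get("value") is not None
]

print(keywords)
```

Writing the list to a file then works the same way as in the answer, with one print(..., file=f_out) call per keyword.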

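【Comments】:

【Solution 2】:

According to the BeautifulSoup documentation, your line for pictags in soup.find_all('col-md-2') searches for elements whose tag name is "col-md-2", not for elements with that class name. In other words, your code looks for elements like <col-md-2></col-md-2>. Since the page presumably contains no such elements, find_all returns an empty list, the loop body never runs, and no error is raised.

To fix your code, search by class instead and try again: for pictags in soup.find_all(class_='col-md-2')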
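
The difference between the two calls can be checked on a small inline document. This is a minimal sketch; the HTML string and the variable names are made up for illustration and are not taken from the asker's page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<div class="col-md-2">first</div>
<div class="col-md-2 extra">second</div>
<div class="other">third</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A bare string argument is treated as a tag name,
# so this looks for <col-md-2> elements and finds none.
by_tag_name = soup.find_all("col-md-2")

# class_ filters on the CSS class; an element with several classes
# matches if any one of them equals "col-md-2".
by_class = soup.find_all(class_="col-md-2")

print(len(by_tag_name))
print([tag.text for tag in by_class])
```

Because the first result is empty, a for loop over it simply does nothing, which matches the "no output, no error" behavior described in the question.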

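【Comments】:

Thanks, I tried your suggestion and got this error: "file.write(pictags.txt) TypeError: write() argument must be str, not Tag". Sorry to bother you; any suggestion is much appreciated.
@KatherineElizabethKath: If you want the text content of the retrieved HTML tag, you can try file.write(pictags.text)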
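
To illustrate why .txt and .text behave differently, here is a small sketch on made-up markup. As far as I know, BeautifulSoup treats an attribute name it does not define (such as txt) as a search for a child tag of that name, so it never yields the element's text. The save_texts helper and its block_N.txt file naming are hypothetical and only show one way the corrected loop could look.

```python
import os
from bs4 import BeautifulSoup

html = '<div class="col-md-2">Hello <b>world</b></div>'  # hypothetical markup
soup = BeautifulSoup(html, "html.parser")
tag = soup.find(class_="col-md-2")

# tag.txt behaves like tag.find("txt"): a lookup for a <txt> child element.
child_lookup = tag.txt   # None here, because the div has no <txt> child

# .text returns the element's text content as a str, which write() accepts.
text = tag.text


def save_texts(html, out_dir):
    """Write the text of each class="col-md-2" element to its own file."""
    soup = BeautifulSoup(html, "html.parser")
    paths = []
    for i, element in enumerate(soup.find_all(class_="col-md-2")):
        path = os.path.join(out_dir, f"block_{i}.txt")  # hypothetical naming scheme
        # The with block closes the file even if write() fails.
        with open(path, "w", encoding="utf-8") as f:
            f.write(element.text)
        paths.append(path)
    return paths
```

Calling save_texts(response.text, ".") inside the loop over links.txt would replace the manual open/write/close sequence from the question.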

【Solution 3】:

You can also match elements by their attributes. Pass a dictionary of the attributes you are looking for to the attrs parameter of find_all.

pictags = soup.find_all(attrs={'class':'col-md-2'})

This finds all elements that have the class 'col-md-2'.
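
As a quick check that the attrs form selects the same elements as the class_ keyword, the sketch below runs both on made-up markup. The HTML string and variable names are assumptions for illustration; combining a tag name with attrs is an extra variation that the answer does not mention.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<div class="col-md-2">a</div>
<span class="col-md-2 highlight">b</span>
<div class="col-md-4">c</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Both calls filter on the class attribute and return the same elements.
via_attrs = soup.find_all(attrs={"class": "col-md-2"})
via_keyword = soup.find_all(class_="col-md-2")

# Adding a tag name narrows the match to <div> elements only.
divs_only = soup.find_all("div", attrs={"class": "col-md-2"})

print([tag.text for tag in via_attrs])
print([tag.text for tag in divs_only])
```

The attrs dictionary is mainly useful for attributes whose names are not valid Python keyword arguments, such as data-* attributes.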

【Comments】: