[Question Title]: Getting AttributeError: 'NoneType' object has no attribute 'text' (web-scraping)
[Posted]: 2020-10-29 00:57:40
[Question Description]:

This is my web-scraping case study. In the final block of code I ran into 'NoneType' object has no attribute 'text', so I tried using the getattr function to fix it, but it didn't work.

'''

import requests
from bs4 import BeautifulSoup

url = 'https://www.birdsnest.com.au/womens/dresses'

source = requests.get(url)
soup = BeautifulSoup(source.content, 'lxml')

'''

'''
productlist = soup.find_all('div', id='items')
'''

'''

productlinks = []
for item in productlist:
    for link in item.find_all('a', href=True):
        productlinks.append(url + link['href'])
print(len(productlinks))

'''

'''
productlinks = []
for x in range(1, 28):
    source = requests.get(f'https://www.birdsnest.com.au/womens/dresses?_lh=1&page={x}')
    soup = BeautifulSoup(source.content, 'lxml')
    for item in productlist:
        for link in item.find_all('a', href=True):
            productlinks.append(url + link['href'])
print(productlinks)
'''

'''

for link in productlinks:
    source = requests.get(link)
    soup = BeautifulSoup(source.content, 'lxml')

    name = soup.find('h1',class_='item-heading__name').text.strip()
    price = soup.find('p',class_='item-heading__price').text.strip()
    feature = soup.find('div',class_='tab-accordion__content active').text.strip()

    sum = {
        'name': name,
        'price': price,
        'feature': feature,
    }
    print(sum)

'''

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-d4d46558690d> in <module>()
      3     soup = BeautifulSoup(source.content, 'lxml')
      4 
----> 5     name = soup.find('h1',class_='item-heading__name').text.strip()
      6     price = soup.find('p',class_='item-heading__price').text.strip()
      7     feature = soup.find('div',class_='tab-accordion__content active').text.strip()

AttributeError: 'NoneType' object has no attribute 'text'

---------------------------------------------------------------------------

So I tried to fix it this way, but it didn't work:

for link in productlinks:
    source = requests.get(link)
    soup = BeautifulSoup(source.content, 'lxml')

    name = getattr(soup.find('h1', class_='item-heading__name'), 'text', None)
    price = getattr(soup.find('p', class_='item-heading__price'), 'text', None)
    feature = getattr(soup.find('div', class_='tab-accordion__content active'), 'text', None)

    sum = {
        'name': name,
        'price': price,
        'feature': feature,
    }
    print(sum)

This is the output. It only shows None:

{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
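The getattr fallback behaves exactly as designed here; the problem is upstream, because find() never matches anything on the downloaded page. A minimal stdlib-only sketch of what is happening (tag stands in for the None that soup.find() returns on a miss):

```python
# find() returns None when nothing matches, and None has no .text attribute.
tag = None  # stand-in for what soup.find(...) returns on a miss

try:
    tag.text
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'text'

# getattr suppresses the error, but it cannot recover data that was never
# in the downloaded HTML -- it just substitutes the default value:
print(getattr(tag, "text", None))  # None
```

In other words, getattr only hides the symptom; the fix is to target selectors that actually exist in the HTML the server sends back.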

[Discussion]:

Tags: python selenium web-scraping beautifulsoup google-colaboratory


[Solution 1]:

First off, always turn JS off for the page you're scraping. Then you'll notice that the tag classes change, and those are the ones you want to target.

Also, when looping over the pages, don't forget that the stop value of Python's range() is not inclusive. That means range(1, 28) will stop at page 27.
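The off-by-one is easy to verify in isolation:

```python
# range() stops one short of its stop value, so page 28 is never requested.
pages = list(range(1, 28))
print(pages[0], pages[-1], len(pages))  # 1 27 27
```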

Here's how I'd go about it:

import json

import requests
from bs4 import BeautifulSoup


cookies = {
    "ServerID": "1033",
    "__zlcmid": "10tjXhWpDJVkUQL",
}

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
}


def extract_info(bs: BeautifulSoup, tag: str, attr_value: str) -> list:
    return [i.text.strip() for i in bs.find_all(tag, {"itemprop": attr_value})]


all_pages = []
for page in range(1, 29):
    print(f"Scraping data from page {page}...")

    current_page = f"https://www.birdsnest.com.au/womens/dresses?page={page}"
    source = requests.get(current_page, headers=headers, cookies=cookies)
    soup = BeautifulSoup(source.content, 'html.parser')

    brand = extract_info(soup, tag="strong", attr_value="brand")
    name = extract_info(soup, tag="h2", attr_value="name")
    price = extract_info(soup, tag="span", attr_value="price")

    all_pages.extend(
        [
            {
                "brand": b,
                "name": n,
                "price": p,
            } for b, n, p in zip(brand, name, price)
        ]
    )

print(f"{all_pages}\nFound: {len(all_pages)} dresses.")

with open("all_the_dresses2.json", "w") as jf:
    json.dump(all_pages, jf, indent=4)

This will give you a JSON file with all the dresses.

    {
        "brand": "boho bird",
        "name": "Prissy Dress",
        "price": "$189.95"
    },
    {
        "brand": "boho bird",
        "name": "Dandelion Dress",
        "price": "$139.95"
    },
    {
        "brand": "Lula Soul",
        "name": "Dandelion Dress",
        "price": "$179.95"
    },
    {
        "brand": "Honeysuckle Beach",
        "name": "Cotton V-Neck A-Line Splice Dress",
        "price": "$149.95"
    },
    {
        "brand": "Honeysuckle Beach",
        "name": "Lenny Pinafore",
        "price": "$139.95"
    },
and so on for the next 28 pages ...
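The zip() step in the answer assumes the brand, name, and price lists come back from extract_info() in matching order. A minimal sketch of that pairing, using placeholder lists taken from the sample output above:

```python
# Placeholder lists standing in for what extract_info() returns per page.
brand = ["boho bird", "Lula Soul"]
name = ["Prissy Dress", "Dandelion Dress"]
price = ["$189.95", "$179.95"]

# zip() pairs the i-th entry of each list; if one list comes up short
# (e.g. a product missing its price tag), the extra entries are silently dropped.
items = [{"brand": b, "name": n, "price": p} for b, n, p in zip(brand, name, price)]
print(items[0])  # {'brand': 'boho bird', 'name': 'Prissy Dress', 'price': '$189.95'}
```

This silent truncation is worth keeping in mind: if the three itemprop queries ever return lists of different lengths, entries get misaligned or lost without an error.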

[Comments]:

  • This is my output when I use all of your code: {Scraping data from page 1... Scraping data from page 2... Scraping data from page 3... Scraping data from page 4... Scraping data from page 5... Scraping data from page 6... Scraping data from page 7...}
  • Hmm... once the script finishes, the output will be in the JSON file. This is the line that saves the output to a file: open("all_the_dresses2.json", "w")
  • Oh I see, you're using Google Colab. I know nothing about it, but you can add this line print(f"{all_pages}\nFound: {len(all_pages)} dresses.") to print out the results.
  • OMG! It works. Thanks a million for your help. I tried to fix it so many times. Thank you.
  • If you found my answer useful, please consider upvoting and/or accepting it - stackoverflow.com/help/someone-answers