【Question title】: Can't scrape a GraphQL page using requests
【Posted】: 2021-07-21 10:56:31
【Question】:

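I'm trying to scrape company names and their corresponding links from a webpage using the requests module.

Although the content is dynamic, I noticed that it is available inside the curly braces next to window.props.

So I wanted to extract that portion and process it with json, but I see \u0022 sequences instead of double quotes ("). This is what I mean: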

{\u0022firms\u0022: [{\u0022index\u0022: 1, \u0022slug\u0022: \u0022zjjz\u002Datelier\u0022, \u0022name\u0022:

I've tried:

import re
import json
import requests
from bs4 import BeautifulSoup

link = 'https://architizer.com/firms/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    r = s.get(link)
    items = re.findall(r'window.props[^"]+(.*?);',r.text)[0].strip('"').replace('\u0022', '\'')
    print(items)

How can I use requests to get the names and links of the different companies from that webpage, traversing multiple pages?
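
A side note on the \u0022 sequences: in a regular Python string literal, '\u0022' is already the double-quote character itself, so the .replace call above never matches the backslash-escaped text that the page actually contains. Below is a minimal sketch of decoding such text before handing it to json. It assumes the escapes are plain \uXXXX sequences; the helper name `decode_unicode_escapes` is hypothetical, and `raw` is a shortened, closed-off version of the fragment shown above rather than real page output.

```python
import json
import re

# Shortened version of the fragment from the question, closed off so it parses.
raw = r'{\u0022firms\u0022: [{\u0022index\u0022: 1, \u0022slug\u0022: \u0022zjjz\u002Datelier\u0022}]}'


def decode_unicode_escapes(text: str) -> str:
    """Replace each literal \\uXXXX sequence with the character it encodes."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )


decoded = decode_unicode_escapes(raw)
data = json.loads(decoded)
print(data["firms"][0]["slug"])
```

This only covers the first page embedded in the HTML; later pages still have to come from the site's API.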

【Question discussion】:

    Tags: python python-3.x web-scraping beautifulsoup graphql


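    【Solution 1】:

    Well, that was fun.

    You're dealing with a page powered by GraphQL, so you have to emulate the request correctly.

    Also, they want you to send a Referer header along with the CSRF token. The token can easily be extracted from the initial HTML and reused in subsequent requests.

    Here's my take on it: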

    import time
    
    import requests
    from bs4 import BeautifulSoup
    
    link = 'https://architizer.com/firms/'
    query = """{ allFirmsWithProjects( first: 6, after: "6", firmType: "Architecture / Design Firm", firmName: "All Firm Names", projectType: "All Project Types", projectLocation: "All Project Locations", firmLocation: "All Firm Locations", orderBy: "recently-featured", affiliationSlug: "", ) { firms: edges { cursor node { index id: firmId slug: firmSlug name: firmName projectsCount: firmProjectsCount lastProjectDate: firmLastProjectDate media: firmLogoUrl projects { edges { node { slug: slug media: heroUrl mediaId: heroId isHiddenFromListings } } } } } pageInfo { hasNextPage endCursor } totalCount } }"""
    
    
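    def query_graphql(page_number: int = 6) -> dict:
        # Swap the cursor hard-coded in the query template (after: "6") for the requested one.
        q = query.replace('after: "6"', f'after: "{page_number}"')
        response = s.post(
            "https://architizer.com/api/v3.0/graphql",
            json={"query": q},
        )
        response.raise_for_status()  # fail on HTTP errors instead of parsing an error page
        return response.json()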
    
    
    def has_next_page(graphql_response: dict) -> bool:
        return graphql_response["data"]["allFirmsWithProjects"]["pageInfo"]["hasNextPage"]
    
    
    def get_next_page(graphql_response: dict) -> int:
        return graphql_response["data"]["allFirmsWithProjects"]["pageInfo"]["endCursor"]
    
    
    def get_firms_data(graphql_response: dict) -> list:
        return graphql_response["data"]["allFirmsWithProjects"]["firms"]
    
    
    def parse_firms_data(firms: list) -> str:
        return "\n".join(firm["node"]["name"] for firm in firms)
    
    
    def wait_a_bit(wait_for: float = 1.5):
        time.sleep(wait_for)
    
    
    with requests.Session() as s:
        s.headers["user-agent"] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
        s.headers["referer"] = "https://architizer.com/firms/"
    
        csrf_token = BeautifulSoup(
            s.get(link).text, "html.parser"
        ).find("input", {"name": "csrfmiddlewaretoken"})["value"]
    
        s.headers.update({"x-csrftoken": csrf_token})
    
        response = query_graphql()
        while True:
            # Print the current page before checking for more, so the last page is not dropped.
            print(parse_firms_data(get_firms_data(response)))
            if not has_next_page(response):
                break
            wait_a_bit()
            response = query_graphql(get_next_page(response))
    

    For the sake of example, this should output the firm names:

    Brooks + Scarpa Architects
    Studio Saxe
    NiMa Design
    Best Practice Architecture
    Gensler
    Inca Hernandez
    kaa studio
    Taller Sintesis
    Coryn Kempster and Julia Jamrozik
    Franklin Azzi Architecture
    Wittman Estes
    Masfernandez Arquitectos
    MATIAS LOPEZ LLOVET
    SRG Partnership, Inc.
    GANA Arquitectura
    Meyer & Associates Architects, Urban Designers
    Steyn Studio
    BGLA architecture | urban design
    
    and so on ...
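
The question also asks for links, while the loop above prints names only. Each node in the query already carries a slug, so one possible sketch is to build the link from it. This assumes, without confirmation from the site, that a firm's page lives at https://architizer.com/firms/<slug>/. The function name `extract_names_and_links` is hypothetical, and the sample response uses made-up slugs, trimmed to the fields needed here.

```python
from urllib.parse import urljoin

# Assumption: firm pages live under this base URL, followed by the slug.
BASE_URL = "https://architizer.com/firms/"


def extract_names_and_links(graphql_response: dict) -> list:
    """Return (name, url) pairs for every firm in one GraphQL response."""
    firms = graphql_response["data"]["allFirmsWithProjects"]["firms"]
    return [
        (firm["node"]["name"], urljoin(BASE_URL, firm["node"]["slug"] + "/"))
        for firm in firms
    ]


# Hypothetical response; the slugs are illustrative, not taken from the site.
sample_response = {
    "data": {
        "allFirmsWithProjects": {
            "firms": [
                {"node": {"name": "Studio Saxe", "slug": "studio-saxe"}},
                {"node": {"name": "Gensler", "slug": "gensler"}},
            ]
        }
    }
}

for name, url in extract_names_and_links(sample_response):
    print(name, url)
```

In the script above, this function could stand in for parse_firms_data inside the loop, provided the URL pattern holds.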
    

    【Discussion】:
