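【Question title】: Web scraping dynamic data from a website without an API
【Posted】: 2021-09-25 03:59:02
【Question description】:

I am trying to find the API this site gets its data from, but I cannot find it.

Website link: https://govservices.dcra.dc.gov/contractorratingsystem/BuildingProfessionals/BuildingProfessional?profType=General%20Contractor&profName=

In the Network tab I can see the data in the XHR responses, and it changes every time I select another page. I need to extract this data but don't know how, and I can't tell where the site is getting it from. I am completely new to this. Could you guide me on how to get the data or scrape this site? I looked for related examples but did not find a suitable one. Thanks in advance.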

【Question comments】:

    Tags: python web-scraping python-requests


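    【Solution 1】:

    The code below will give you the data you are looking for.

    Use the recordCount field to set the range you need to loop over.

    How it works

    The website makes an API call to fetch the data in JSON format. It uses pagination: the page index and page size are sent to the server, so the server knows the page offset and which records to return. The code below simulates this by incrementing the page index in a loop so that we can walk through the data.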

    import requests
    import time
    
    headers = {
        "accept": "application/json, text/javascript, */*; q=0.01",
        "accept-language": "en-US,en;q=0.9,el;q=0.8,he;q=0.7,de;q=0.6,fr;q=0.5,it;q=0.4,es;q=0.3",
        "cache-control": "no-cache",
        "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
        "pragma": "no-cache",
        "sec-ch-ua": "\"Google Chrome\";v=\"93\", \" Not;A Brand\";v=\"99\", \"Chromium\";v=\"93\"",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "\"macOS\"",
        "sec-fetch-dest": "empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "same-origin",
        "x-requested-with": "XMLHttpRequest"
    }
    body = {'professionalType': 'General Contractor',
            'Name': '',
            'sortName': 'OverallScore',
            'sortDirection': 'desc',
            'pageIndex': 0,
            'pageSize': 10}
    url = 'https://govservices.dcra.dc.gov/contractorratingsystem/BuildingProfessionals/LoadProfessionalSearchResultsWithFilters'
    for i in range(1, 3):  # TODO: derive the real range from 'recordCount' (in the response) and 'pageSize'
        body['pageIndex'] = i  # ask the server for page i
        r = requests.post(url, headers=headers, data=body, timeout=30)
        if r.status_code == 200:
            print(f'{i} --> {r.json()}')
        else:
            print(f'status code is {r.status_code}')
        time.sleep(1)  # pause between requests so the server is not hammered
    

    Output

    1 --> {'buildingProfessionals': [{'buildingProfessional': 'REVOLUTION SOLAR LLC.', 'buildingProfessionalType': 'General-Contractor', 'permitType': None, 'businessName': None, 'contactNumber': '410518000062', 'projectCount': 822, 'planReviewScore': 96.1732900783996, 'applicationIntakeScore': 96.2433090024331, 'inspectionScore': 100, 'overAllProjectScore': 100, 'stopWorkOrders': 0, 'planReviewScoreRating': 4.80866450391998, 'applicationIntakeScoreRating': 4.812165450121655, 'inspectionScoreRating': 5, 'overAllProjectScoreRating': 5, 'useCategory': None, 'businessEmail': 'mattyoungarl@gmail.com', 'imageName': 'noimage.png', 'imageUrl': 'https://govservices.dcra.dc.gov/ProfessionalImages/noimage.png', 'businessAddress': '10746 JUDY LANE COLUMBIA MD 21044', 'businessPhone': '4438655039', 'flag': '', 'professionalDisplayName': 'General Contractor', 'webAddress': 'N/A', 'bbb': 'NOT ACCREDITED'}, {'buildingProfessional': 'AMERICAN AUTOMATIC SPRINKLER CO', 'buildingProfessionalType': 'General-Contractor', 'permitType': None, 'businessName': None, 'contactNumber': '410514000016', 'projectCount': 471, 'planReviewScore': 0.29603315571344, 'applicationIntakeScore': 0.245115452930728, 'inspectionScore': 99.7122042886194, 'overAllProjectScore': 100, 'stopWorkOrders': 12, 'planReviewScoreRating': 0.014801657785672, 'applicationIntakeScoreRating': 0.0122557726465364, 'inspectionScoreRating': 4.98561021443097, 'overAllProjectScoreRating': 5, 'useCategory': None, 'businessEmail': 'aasco@aasc-fp.com', 'imageName': 'noimage.png', 'imageUrl': 'https://govservices.dcra.dc.gov/ProfessionalImages/noimage.png', 'businessAddress': '3149 DRAPER DRIVE FAIRFAX VA 22031', 'businessPhone': '7038498180', 'flag': '', 'professionalDisplayName': 'General Contractor', 'webAddress': 'N/A', 'bbb': 'NOT ACCREDITED'}, {'buildingProfessional': 'FIRE & LIFE SAFETY AMERICA INC.', 'buildingProfessionalType': 'General-Contractor', 'permitType': None, 'businessName': None, 'contactNumber': '410516000410', 
'projectCount': 348, 'planReviewScore': 0.218818380743982, 'applicationIntakeScore': 0.218818380743982, 'inspectionScore': 99.781181619256, 'overAllProjectScore': 100, 'stopWorkOrders': 2, 'planReviewScoreRating': 0.0109409190371991, 'applicationIntakeScoreRating': 0.0109409190371991, 'inspectionScoreRating': 4.9890590809628, 'overAllProjectScoreRating': 5, 'useCategory': None, 'businessEmail': 'bdrinkard@flsamerica.com', 'imageName': 'noimage.png', 'imageUrl': 'https://govservices.dcra.dc.gov/ProfessionalImages/noimage.png', 'businessAddress': '3017 VERNON ROAD RICHMOND VA 23228', 'businessPhone': '8042221381', 'flag': '', 'professionalDisplayName': 'General Contractor', 'webAddress': 'N/A', 'bbb': 'NOT ACCREDITED'}, {'buildingProfessional': 'NORTHERN FIRE PROTECTION, INC.', 'buildingProfessionalType': 'General-Contractor', 'permitType': None, 'businessName': None, 'contactNumber': '410516000183', 'projectCount': 250, 'planReviewScore': 0, 'applicationIntakeScore': 0, 'inspectionScore': 98.7889273356401, 'overAllProjectScore': 100, 'stopWorkOrders': 2, 'planReviewScoreRating': 0, 'applicationIntakeScoreRating': 0, 'inspectionScoreRating': 4.939446366782005, 'overAllProjectScoreRating': 5, 'useCategory': None, 'businessEmail': 'nstarcarol@aol.com', 'imageName': 'noimage.png', 'imageUrl': 'https://govservices.dcra.dc.gov/ProfessionalImages/noimage.png', 'businessAddress': '21530 BLACKWOOD COURT SUITE #150 STERLING VA 20166', 'businessPhone': '7034069811', 'flag': '', 'professionalDisplayName': 'General Contractor', 'webAddress': 'N/A', 'bbb': 'NOT ACCREDITED'}, {'buildingProfessional': 'PHOENIX FIRE PROTECTION INC.', 'buildingProfessionalType': 'General-Contractor', 'permitType': None, 'businessName': None, 'contactNumber': '410518000155', 'projectCount': 174, 'planReviewScore': 0, 'applicationIntakeScore': 0, 'inspectionScore': 100, 'overAllProjectScore': 100, 'stopWorkOrders': 4, 'planReviewScoreRating': 0, 'applicationIntakeScoreRating': 0, 
'inspectionScoreRating': 5, 'overAllProjectScoreRating': 5, 'useCategory': None, 'businessEmail': '', 'imageName': 'noimage.png', 'imageUrl': 'https://govservices.dcra.dc.gov/ProfessionalImages/noimage.png', 'businessAddress': '7901 PENN RANDALL PLACE UPPER MARLBORO MD 20772', 'businessPhone': '3016697066', 'flag': '', 'professionalDisplayName': 'General Contractor', 'webAddress': 'N/A', 'bbb': 'NOT ACCREDITED'}, {'buildingProfessional': 'JENSON FIRE PROTECTION INC', 'buildingProfessionalType': 'General-Contractor', 'permitType': None, 'businessName': None, 'contactNumber': '410517000309', 'projectCount': 146, 'planReviewScore': 0.632911392405063, 'applicationIntakeScore': 0.632911392405063, 'inspectionScore': 100, 'overAllProjectScore': 100, 'stopWorkOrders': 0, 'planReviewScoreRating': 0.03164556962025315, 'applicationIntakeScoreRating': 0.03164556962025315, 'inspectionScoreRating': 5, 'overAllProjectScoreRating': 5, 'useCategory': None, 'businessEmail': 'sung@jensonfireprotection.com', 'imageName': 'noimage.png', 'imageUrl': 'https://govservices.dcra.dc.gov/ProfessionalImages/noimage.png', 'businessAddress': '8740 CHERRY LANE UNIT 13 LAUREL MD 20707', 'businessPhone': '', 'flag': '', 'professionalDisplayName': 'General Contractor', 'webAddress': 'N/A', 'bbb': 'NOT ACCREDITED'}, {'buildingProfessional': 'LIVINGSTON FIRE PROTECTION INC', 'buildingProfessionalType': 'General-Contractor', 'permitType': None, 'businessName': None, 'contactNumber': '410516000203', 'projectCount': 145, 'planReviewScore': 0, 'applicationIntakeScore': 0, 'inspectionScore': 98.3734939759036, 'overAllProjectScore': 100, 'stopWorkOrders': 5, 'planReviewScoreRating': 0, 'applicationIntakeScoreRating': 0, 'inspectionScoreRating': 4.91867469879518, 'overAllProjectScoreRating': 5, 'useCategory': None, 'businessEmail': 'info@livfire.com', 'imageName': 'noimage.png', 'imageUrl': 'https://govservices.dcra.dc.gov/ProfessionalImages/noimage.png', 'businessAddress': '5150 LAWRENCE PLACE HYATTSVILLE 
MD 20781', 'businessPhone': '3017794466', 'flag': '', 'professionalDisplayName': 'General Contractor', 'webAddress': 'N/A', 'bbb': 'NOT ACCREDITED'}, {'buildingProfessional': 'RIDGEWAY CORPORATION PROFESSIONAL CORPORATION', 'buildingProfessionalType': 'General-Contractor', 'permitType': None, 'businessName': None, 'contactNumber': '410518000087', 'projectCount': 145, 'planReviewScore': 0.798403193612774, 'applicationIntakeScore': 1.4251497005988, 'inspectionScore': 95.688622754491, 'overAllProjectScore': 100, 'stopWorkOrders': 4, 'planReviewScoreRating': 0.0399201596806387, 'applicationIntakeScoreRating': 0.07125748502994, 'inspectionScoreRating': 4.78443113772455, 'overAllProjectScoreRating': 5, 'useCategory': None, 'businessEmail': 'ridgecorpusa@outlook.com', 'imageName': 'noimage.png', 'imageUrl': 'https://govservices.dcra.dc.gov/ProfessionalImages/noimage.png', 'businessAddress': '12514 KENSINGTON LANE BOWIE MD 20715', 'businessPhone': '3014642003', 'flag': '', 'professionalDisplayName': 'General Contractor', 'webAddress': 'N/A', 'bbb': 'NOT ACCREDITED'}, {'buildingProfessional': 'FORTRESS PROTECTION GROUP', 'buildingProfessionalType': 'General-Contractor', 'permitType': None, 'businessName': None, 'contactNumber': '410518000115', 'projectCount': 124, 'planReviewScore': 0, 'applicationIntakeScore': 0, 'inspectionScore': 99.4932432432432, 'overAllProjectScore': 100, 'stopWorkOrders': 5, 'planReviewScoreRating': 0, 'applicationIntakeScoreRating': 0, 'inspectionScoreRating': 4.97466216216216, 'overAllProjectScoreRating': 5, 'useCategory': None, 'businessEmail': 'todd.patterson@fortresspg.com', 'imageName': 'noimage.png', 'imageUrl': 'https://govservices.dcra.dc.gov/ProfessionalImages/noimage.png', 'businessAddress': '18618 BROKEN OAK RD BOYDS MD 20841', 'businessPhone': '', 'flag': '', 'professionalDisplayName': 'General Contractor', 'webAddress': 'N/A', 'bbb': 'NOT ACCREDITED'}, {'buildingProfessional': 'PRIME FIRE PROTECTION LLC', 'buildingProfessionalType': 
'General-Contractor', 'permitType': None, 'businessName': None, 'contactNumber': '410517000488', 'projectCount': 120, 'planReviewScore': 0, 'applicationIntakeScore': 0, 'inspectionScore': 94.4320987654321, 'overAllProjectScore': 100, 'stopWorkOrders': 20, 'planReviewScoreRating': 0, 'applicationIntakeScoreRating': 0, 'inspectionScoreRating': 4.721604938271605, 'overAllProjectScoreRating': 5, 'useCategory': None, 'businessEmail': 'vmalca@primefireprotection.com', 'imageName': 'noimage.png', 'imageUrl': 'https://govservices.dcra.dc.gov/ProfessionalImages/noimage.png', 'businessAddress': '13549 JAMIESON PL GERMANTOWN MD 20874', 'businessPhone': '3104736189', 'flag': '', 'professionalDisplayName': 'General Contractor', 'webAddress': 'N/A', 'bbb': 'NOT ACCREDITED'}], 'pageIndex': 1, 'pageSize': 10, 'recordCount': 1113}
    ...
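
The TODO in the loop above can be filled in roughly as follows. This is a minimal sketch, not part of the original answer: `page_count`, `iter_records` and the `fetch` callable are hypothetical names, and the default `first_page=1` is an assumption based only on the loop above starting at 1. The response keys `buildingProfessionals` and `recordCount` are taken from the output shown.

```python
import math
import time
from typing import Callable, Iterator


def page_count(record_count: int, page_size: int) -> int:
    """Number of pages needed to cover record_count records."""
    return math.ceil(record_count / page_size)


def iter_records(fetch: Callable[[int, int], dict],
                 page_size: int = 10,
                 first_page: int = 1,   # assumption: page numbering starts at 1
                 delay: float = 1.0) -> Iterator[dict]:
    """Yield every record, reading the total from the first response.

    fetch(page_index, page_size) is a caller-supplied (hypothetical) function
    that returns the decoded JSON of one page.
    """
    first = fetch(first_page, page_size)
    yield from first['buildingProfessionals']

    # 'recordCount' tells us how many pages exist in total.
    total_pages = page_count(first['recordCount'], page_size)
    for index in range(first_page + 1, first_page + total_pages):
        time.sleep(delay)  # stay polite between requests
        yield from fetch(index, page_size)['buildingProfessionals']
```

With the values in the output above (`recordCount` 1113, `pageSize` 10), `page_count` gives 112 pages. To use `iter_records` against the site, `fetch` would wrap the `requests.post(url, headers=headers, data=body)` call from the answer, set `body['pageIndex']` and `body['pageSize']` from its arguments, and return `r.json()`.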
    

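    【Comments】:

    • Thanks @balderman. Could I get an explanation of the whole code, so that I understand how it works and why?
    • Sure, I will add it to the answer. Have you tested it?
    • Yes sir, I have tested it and it works.
    • Sir, could you give me some more examples or websites for reference so that I can learn this?
    • You need to read up on web scraping. Please accept the answer.
    【Solution 2】:

    Could you use requests to hit the URL that generates the response, and then use Beautiful Soup to parse it?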

    【Comments】:
