【Question Title】: Unable to extract div tags from the webpage using Beautiful Soup?
【Posted】: 2020-04-07 12:52:32
【Question】:

I am trying to extract some information from the following link using beautifulsoup: https://aiesec.org/opportunity/1212595 What I need is the project name and the start date. However, I cannot extract the name; it always returns None.

 title = soup.find(lambda tag: tag.name == 'div' and tag['class'] == ['opportunity-tile', ''])

On further analysis I found that it is not even finding the div tags, since the following returns nothing:

print(soup.find_all("div"))

Where am I going wrong?
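Independently of why the page has no divs at all, the filter lambda itself has two pitfalls worth noting. A minimal sketch on hypothetical static HTML (the class name `opportunity-tile` is taken from the question; the markup here is invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the rendered page.
html = '<div class="opportunity-tile">Project name</div><p>no class here</p>'
soup = BeautifulSoup(html, 'html.parser')

# Pitfall 1: tag['class'] raises KeyError on tags that have no class
# attribute, so use tag.get('class') inside a filter function.
# Pitfall 2: BeautifulSoup parses class="opportunity-tile" into the list
# ['opportunity-tile'], so compare against that, not ['opportunity-tile', ''].
title = soup.find(lambda tag: tag.name == 'div'
                  and tag.get('class') == ['opportunity-tile'])
print(title.text)  # Project name
```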

【Comments】:

    Tags: python-3.x selenium web-scraping beautifulsoup


    【Solution 1】:
    import requests
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
        'Authorization': 'e316ebe109dd84ed16734e5161a2d236d0a7e6daf499941f7c110078e3c75493'}
    data = {"operationName": "OpportunityQuery", "variables": {"id": "1212595", "cdn_region": "Global"}, "query": "query OpportunityQuery($id: ID, $cdn_region: String) {\n  getOpportunity(id: $id) {\n    application_processing_time\n    applied_to\n    applied_to_with\n    applications_close_date\n    available_openings\n    backgrounds {\n      constant_id\n      constant_name\n      option\n      __typename\n    }\n    branch {\n      id\n      address_detail {\n        id\n        city\n        country\n        __typename\n      }\n      company {\n        id\n        name\n        profile_photo(cdn_region: $cdn_region)\n        __typename\n      }\n      __typename\n    }\n    cover_photo(cdn_region: $cdn_region)\n    description\n    duration\n    project_duration\n    earliest_start_date\n    google_place_id\n    home_lc {\n      id\n      email\n      full_name\n      parent {\n        id\n        name\n        __typename\n      }\n      __typename\n    }\n    id\n    is_favourited\n    is_gep\n    languages {\n      constant_id\n      constant_name\n      option\n      __typename\n    }\n    lat\n    latest_end_date\n    lng\n    legal_info {\n      health_insurance_info\n      visa_duration\n      visa_link\n      visa_type\n      __typename\n    }\n    location\n    logistics_info {\n      accommodation_covered\n      accommodation_provided\n      food_covered\n      __typename\n    }\n    nationalities {\n      constant_id\n      constant_name\n      option\n      __typename\n    }\n    office_footfall_for_exchange\n    openings\n    opportunity_cost\n    opportunity_questions {\n      edges {\n        node {\n          id\n          __typename\n        }\n        __typename\n      }\n      __typename\n    }\n    organisation {\n      id\n      name\n      __typename\n    }\n    percentage_of_fulfillment\n    programme {\n      id\n      short_name_display\n      __typename\n    }\n    remark\n    reviews\n    role_info {\n      selection_process\n      
learning_points_list\n      __typename\n    }\n    sdg_info {\n      id\n      sdg_target {\n        description\n        goal_index\n        id\n        parent {\n          id\n          __typename\n        }\n        target\n        __typename\n      }\n      __typename\n    }\n    selection_processes(first: 50) {\n      edges {\n        cursor\n        node {\n          id\n          title\n          no_of_days\n          __typename\n        }\n        __typename\n      }\n      __typename\n    }\n    skills {\n      constant_id\n      constant_name\n      option\n      __typename\n    }\n    specifics_info {\n      computer\n      expected_work_schedule\n      ef_test_required\n      salary\n      salary_currency {\n        id\n        alphabetic_code\n        __typename\n      }\n      salary_periodicity\n      saturday_work\n      __typename\n    }\n    status\n    study_levels {\n      id\n      name\n      __typename\n    }\n    title\n    transparent_fee_details {\n      covers_accomodation\n      covers_administrative_costs\n      covers_leadership_spaces\n      covers_pickup\n      sponsored_by\n      __typename\n    }\n    __typename\n  }\n}\n"}
    
    r = requests.post('https://gis-api.aiesec.org/graphql',
                      json=data, headers=headers).json()
    
    print(r['data']['getOpportunity']['title'])
    print(r['data']['getOpportunity']['earliest_start_date'])
    print(r['data']['getOpportunity']['applications_close_date'])
    print(r['data']['getOpportunity']['latest_end_date'])
    

    Output:

    [Teaching] Impact India - English Language Teacher  
    2019-11-20T00:00:00Z
    2019-10-30T00:00:00Z
    2020-11-20T00:00:00Z
    

    【Discussion】:

    • Thank you very much, this helps. But could you tell me where I went wrong?
    • You're welcome. I used the API because the site actually uses JavaScript, which has to render before the data loads. You could use selenium for that; in any case, I went through the API to keep things simple for you.