【Title】: Unable to iterate over multiple pages while web scraping
【Posted】: 2020-11-08 17:28:44
【Question】:

I am trying to scrape

https://www.maybank.co.id/others/locate-us?Keyword=&LocType=branch&LocSubType=all 

to get the branch name and address of every bank branch. There are 44 pages I need to scrape, but the URL does not change between pages, so I am unable to iterate over them.

import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.maybank.co.id/others/locate-us?Keyword=&LocType=branch&LocSubType=all'
Branch_list, Address_list = [], []

for page_no in range(1, 45):
    payload = 'page=' + str(page_no) + '&PageSize=9&id=%7B5066AC98-FE40-407A-B4FE-03C814BED5F5%7D&keyword=&LocType=branch&LocSubType=all'
    response = requests.post(url, data=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    print('Page', page_no)
    for i in soup.find_all('div', class_="col-md-4 col-sm-6 col-xs-12 property-item"):
        Branch = i.find_all('h3')
        Address = i.find_all('p')
        for j in Address:
            j = re.sub(r'<(.*?)>', '', str(j))
            Address_list.append(j.strip())
        for k in Branch:
            k = re.sub(r'<(.*?)>', '', str(k))
            Branch_list.append(k)

Can anyone suggest what should be done here?
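(Editorial note: the hand-concatenated form body above is easy to get wrong. A minimal sketch of generating it per page with the standard library's `urlencode` instead, with the field names and `id` value copied from the question:)

```python
from urllib.parse import urlencode

# Form fields that stay constant between requests;
# only "page" changes per iteration.
base = {
    "PageSize": "9",
    "id": "{5066AC98-FE40-407A-B4FE-03C814BED5F5}",
    "keyword": "",
    "LocType": "branch",
    "LocSubType": "all",
}

def form_data(page_no):
    """Return the URL-encoded POST body for a given page number."""
    # urlencode percent-escapes the braces in "id", reproducing
    # the %7B...%7D seen in the original payload string.
    return urlencode({"page": page_no, **base})

print(form_data(1))
```

Note that `requests.post(url, data=...)` also accepts the dict directly, in which case `requests` performs this encoding itself.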

【Discussion】:

    Tags: for-loop url web-scraping beautifulsoup request


    【Solution 1】:

    You should use the API to get what you need.

    Try this:

    from urllib.parse import urlencode
    
    import requests
    from bs4 import BeautifulSoup
    
    
    api_url = "https://www.maybank.co.id/api/sitecore/MapsLocation/MapsLocationListPaging?"
    
    payload = {
        "page": "1",
        "PageSize": "9",
        "id": "{5066AC98-FE40-407A-B4FE-03C814BED5F5}",
        "keyword": "",
        "LocType": "branch",
        "LocSubType": "all",
    }
    
    headers = {
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36",
        "x-requested-with": "XMLHttpRequest",
    }
    
    
    for page_no in range(1, 45):
        payload["page"] = page_no
        html = requests.get(f"{api_url}{urlencode(payload)}", headers=headers).text
        soup = BeautifulSoup(html, "html.parser").find("div", {"class": "col-md-4 col-sm-6 col-xs-12 property-item"})
        branch_data = [
            soup.find("h3").getText(strip=True),
            [p.getText(strip=True) for p in soup.find_all("p")],
            soup.find("a")["href"],
        ]
        print(branch_data)
    

    Output:

    ['KC MANADO', ['Jl. Kawasan Mega Mas Jl. Pierre Tendean Boulevard Blok I C1 No. 24,25,26 dan Blok I C2 No. 27,28,29 Manado', 'Closed until 03.30 PM0431 - 860543'], '/others/locate-us/locate-us-detail?id=337&loctype=Branch&locsubtype=']
    ['KC SUNSET ROAD, DPS', ['Jl. Sunset Road No 811, Kuta  - Badung, Bali', 'Closed until 03.30 PM0361 - 3003811'], '/others/locate-us/locate-us-detail?id=294&loctype=Branch&locsubtype=']
    ['KCP BSB CITY', ['Ruko Taman Niaga Bukit Semarang Baru (BSB) Blok E No. 3A, Semarang', 'Closed until 03.30 PM(024) 76670611'], '/others/locate-us/locate-us-detail?id=217&loctype=Branch&locsubtype=']
    ['KCP GRAHA IRAMA', ['Jl. HR Rasuna Said Kav. 1-2 Ground Floor Blok B Jakarta Selatan', 'Closed until 03.30 PM021-5261330-4'], '/others/locate-us/locate-us-detail?id=111&loctype=Branch&locsubtype=']
    ['KCP KLP. GADING BULEVARD II', ['Jl. Raya Boulevard I-3 no. 4, Jakarta', 'Closed until 03.30 PM021 - 4515253'], '/others/locate-us/locate-us-detail?id=199&loctype=Branch&locsubtype=']
    ['KCP PALM SPRING BATAM CENTER', ['Komplek Palm Spring BTC Blok D1 No. 10, Batam Centre', 'Closed until 03.30 PM0778 - 6053070'], '/others/locate-us/locate-us-detail?id=26&loctype=Branch&locsubtype=']
    and so on...
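    (Editorial note: in the output above, the second `<p>` of each card concatenates the opening hours and the phone number, e.g. `'Closed until 03.30 PM0431 - 860543'`. If the two are needed separately, a hedged helper, assuming the hours text always ends in `AM` or `PM`:)

    ```python
    import re

    def split_hours_phone(text):
        """Split the combined hours/phone string at the trailing AM/PM."""
        m = re.match(r"(.*?(?:AM|PM))\s*(.*)", text)
        return (m.group(1), m.group(2)) if m else (text, "")

    print(split_hours_phone("Closed until 03.30 PM0431 - 860543"))
    # → ('Closed until 03.30 PM', '0431 - 860543')
    ```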
    

    【Comments】:

    • Hi, how did you find the API URL?
    • Go to Developer Tools -> Network -> XHR. You can find the request URL there.