[Question Title]: How to scrape a webpage that uses javascript?
[Posted]: 2021-06-25 13:46:24
[Question]:

I am using requests and BeautifulSoup to scrape data from a real-estate website. It has several numbered "pages", each showing a few dozen apartments. I wrote a loop that runs through all of those pages and collects the apartment data, but unfortunately the site uses javascript, so the code only ever returns the apartments from the first page. I also tried selenium, but ran into the same problem.

Any suggestions would be much appreciated!

The code is below:

# Required imports (not shown in the original snippet)
import requests
from bs4 import BeautifulSoup
import time
from random import randint

# Create empty lists to append data scraped from URL
# Number of lists depends on the number of features you want to extract

lista_preco = []
lista_endereco = []
lista_tamanho = []
lista_quartos = []
lista_banheiros = []
lista_vagas = []
lista_condominio = []
lista_amenidades = []
lista_fotos = []
lista_sites = []

n_pages = 0

for page in range(1, 15):
    n_pages += 1
    url = "https://www.vivareal.com.br/venda/bahia/salvador/apartamento_residencial/"+'?pagina='+str(page)
    url = requests.get(url)
    soup = BeautifulSoup(url.content, 'html.parser')
    house_containers = soup.find_all('div', {'class' :'js-card-selector'})
    if house_containers != []:
        for container in house_containers:
            
            # Price
            price = container.find_all('section', class_='property-card__values')[0].text
            try:
                price = int(price[:price.find('C')].replace('R$', '').replace('.','').strip())
            except:
                price = 0
            lista_preco.append(price)

            # Zone
            location = container.find_all('span', class_='property-card__address')[0].text
            location = location.strip()
            lista_endereco.append(location)

            # Size
            size = container.find_all('span', class_='property-card__detail-value js-property-card-value property-card__detail-area js-property-card-detail-area')[0].text
            if '-' not in size:
                size = int(size[:size.find('m')].replace(',','').strip())
            else:
                size = int(size[:size.find('-')].replace(',','').strip())
            lista_tamanho.append(size)

            # Rooms
            quartos = container.find_all('li', class_='property-card__detail-item property-card__detail-room js-property-detail-rooms')[0].text
            quartos = quartos[:quartos.find('Q')].strip()
            if '-' in quartos:
                quartos = quartos[:quartos.find('-')].strip()
            lista_quartos.append(int(quartos))
            
            # Bathrooms
            banheiros = container.find_all('li', class_='property-card__detail-item property-card__detail-bathroom js-property-detail-bathroom')[0].text
            banheiros = banheiros[:banheiros.find('B')].strip()
            if '-' in banheiros:
                banheiros = banheiros[:banheiros.find('-')].strip()
            lista_banheiros.append(int(banheiros))
            
            # Garage
            vagas = container.find_all('li', class_='property-card__detail-item property-card__detail-garage js-property-detail-garages')[0].text
            vagas = vagas[:vagas.find('V')].strip()
            if '--' in vagas:
                vagas = '0'
            lista_vagas.append(int(vagas))

            # Condomínio
            condominio = container.find_all('section', class_='property-card__values')[0].text
            try:
                condominio = int(condominio[condominio.rfind('R$'):].replace('R$','').replace('.','').strip())
            except:
                condominio = 0
            lista_condominio.append(condominio)

            # Amenidades
            try:
                amenidades = container.find_all('ul', class_='property-card__amenities')[0].text
                amenidades = amenidades.split()
            except:
                amenidades = 'Zero'
            lista_amenidades.append(amenidades)

            # url
            link = 'https://www.vivareal.com.br/' + container.find_all('a')[0].get('href')[1:-1]
            lista_sites.append(link)

            # image
            #p = str(container.find_all('img')[0])
            #p

            #2x size thumbnail

            #imgurl = p[p.find('https'):p.rfind('data-src')]
            #imgurl.replace('"', '').strip()
            #lista_fotos.append(imgurl)
    else:
        break
    
    time.sleep(randint(1,2))
    
print('You scraped {} pages containing {} properties.'.format(n_pages, len(lista_preco)))

[Comments]:

    Tags: javascript python beautifulsoup python-requests


    [Solution 1]:

    You do have options. Rather than using Selenium, you can get the data through the site's api.

    The site imposes a limit that only lets you paginate through at most 10,000 listings. The response returns far more data than you are after, so have a look at the json and see whether there is anything else worth adding:

    Code:

    import pandas as pd
    import requests
    import math
    import time
    import random
    
    url = 'https://glue-api.vivareal.com/v2/listings'
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
               'x-domain': 'www.vivareal.com.br'}
    payload = {
    'addressCity': 'Salvador',
    'addressLocationId': 'BR>Bahia>NULL>Salvador',
    'addressNeighborhood': '',
    'addressState': 'Bahia',
    'addressCountry': 'Brasil',
    'addressStreet': '',
    'addressZone': '',
    'addressPointLat': '-12.977738',
    'addressPointLon': '-38.501636',
    'business': 'SALE',
    'facets': 'amenities',
    'unitTypes': 'APARTMENT',
    'unitSubTypes': 'UnitSubType_NONE,DUPLEX,LOFT,STUDIO,TRIPLEX',
    'unitTypesV3': 'APARTMENT',
    'usageTypes': 'RESIDENTIAL',
    'listingType': 'USED',
    'parentId': 'null',
    'categoryPage': 'RESULT',
    'size': '350',
    'from': '0',
    'q': '',
    'developmentsSize': '5',
    '__vt': '',
    'levels': 'CITY,UNIT_TYPE',
    'ref': '/venda/bahia/salvador/apartamento_residencial/',
    'pointRadius':''}
    
    
    def get_num_of_listings(priceMin, priceMax, payload, url, previous_priceMax, jsonData, previous_jsonData):
        randInt = random.uniform(5.1, 7.9)
        payload.update({'from':'0'})
        #time.sleep(randInt)
        if priceMax > 2500000:
            priceMax = 100000000
        payload.update({'priceMin':'%s' %priceMin,'priceMax':'%s' %priceMax})
        jsonData = requests.get(url, headers=headers, params=payload).json()
        listings_count = jsonData['search']['totalCount']
        
        if listings_count < 10000:
            if priceMax < 100000000:
                print ('Price range %s - %s returns %s listings.' %(priceMin, priceMax, listings_count))
                previous_jsonData = jsonData
                previous_priceMax = priceMax
                priceMax += 25000
                listings_count, priceMin, priceMax, previous_priceMax, jsonData, previous_jsonData = get_num_of_listings(priceMin, priceMax, payload, url, previous_priceMax, jsonData, previous_jsonData)
            else:
                previous_jsonData = jsonData
                previous_priceMax = 100000000
            
        priceMin = previous_priceMax + 1
        priceMax = priceMin + 250000 - 1
        return listings_count, priceMin, priceMax, previous_priceMax, jsonData, previous_jsonData
        
    
    rows = []
    priceMin = 1
    priceMax = 250000
    finished = False
    aquired = []
    while finished == False:
        randInt = random.uniform(5.1, 7.9)
        listings_count, priceMin, priceMax, previous_priceMax, jsonData, previous_jsonData = get_num_of_listings(priceMin, priceMax, payload, url, None, None, None)
        total_pages = math.ceil(previous_jsonData['search']['totalCount'] / 350)
            
        for page in range(1, total_pages+1):
            if page == 1:
                idx=0
                jsonData = previous_jsonData
            else:
                idx = 350*(page - 1)  # offset of this page's first record (page 2 starts at 350, not 700)
                payload.update({'from':'%s' %idx})
                if idx == 9800:
                    payload.update({'size':200})
                else:
                    payload.update({'size':350})
                 
                if idx > 9800:
                    continue
                #time.sleep(randInt)
                jsonData = requests.get(url, headers=headers, params=payload).json()
            
            listings = jsonData['search']['result']['listings']
            for listing in listings:
                listingId = listing['listing']['id']
                if listingId in aquired:
                    continue
                zone = listing['listing']['address']['zone']
                size = listing['listing']['usableAreas'][0]
                bedrooms = listing['listing']['bedrooms'][0]
                bathrooms = listing['listing']['bathrooms'][0]
                if listing['listing']['parkingSpaces'] != []:
                    parking = listing['listing']['parkingSpaces'][0]
                else:
                    parking = None
                price = listing['listing']['pricingInfos'][0]['price']
                try:
                    condoFee =  listing['listing']['pricingInfos'][0]['monthlyCondoFee']
                except:
                    condoFee =  None
                amenities = listing['listing']['amenities']
                listingUrl = 'https://www.vivareal.com.br' + listing['link']['href']
                    
                row = {
                'Id':listingId,
                'Zone' : zone,
                'Size' : size,
                'Bedrooms' : bedrooms,
                'Bathrooms': bathrooms,
                'Garage' : parking,
                'Price': price,
                'Condominio' : condoFee,
                'Amenidades' : amenities,
                'url' : listingUrl}
                
                aquired.append(listingId)
    
                rows.append(row)
            print('Page %s of %s' %(page, total_pages))
        if priceMax > 100000000:
            print('Done')
            finished = True
        
    df = pd.DataFrame(rows)
    

    Output:

    IPdb [3]: print(df)
                   Id  ...                                                url
    0      2511396476  ...  https://www.vivareal.com.br/imovel/apartamento...
    1      2494354474  ...  https://www.vivareal.com.br/imovel/apartamento...
    2      2504461896  ...  https://www.vivareal.com.br/imovel/apartamento...
    3      2508574459  ...  https://www.vivareal.com.br/imovel/apartamento...
    4      2511489082  ...  https://www.vivareal.com.br/imovel/apartamento...
              ...  ...                                                ...
    26244    94618731  ...  https://www.vivareal.com.br/imovel/apartamento...
    26245    93437597  ...  https://www.vivareal.com.br/imovel/apartamento...
    26246    79341843  ...  https://www.vivareal.com.br/imovel/apartamento...
    26247  2455978575  ...  https://www.vivareal.com.br/imovel/apartamento...
    26248  2509913182  ...  https://www.vivareal.com.br/imovel/apartamento...
    
    [26249 rows x 10 columns]
    

    [Comments]:

    • Thank you so much, man! This works and solves my problem, even though I'm having a hard time understanding it. Where did you get that API from? (glue-api.vivareal.com/v2/listings) Thanks again, great work!
    • If you go to the dev tools (ctrl+shift+I, or right-click and choose "Inspect"), the dev tools panel should open. Under the Network -> XHR tab you can see where the requests are made (you may need to refresh the browser to see them pop up). Just fish around in there and you'll see the data coming back as json.
    • How could I adjust this code to scrape the remaining listings that can't be requested because of the site's limit? Sorry if this is a bit of a silly question, I'm somewhat new to python. I tried changing this for loop: for page in range(1, total_pages+1), but couldn't come up with a solution.
    • Not a silly question at all. You'd have to poke around and see which parameters/filters get you a smaller subset, then possibly iterate over those. There are some parameters I didn't include in the payload. I'm done for today, but I can take a peek tomorrow and let you know if I find an easy way.
    • Thank you so much @chitown88! I still haven't figured out how to pull everything beyond the 10k limit, but I'm working on it. Thanks a lot!
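    As the comment above describes, the endpoint shows up under the Network -> XHR tab in dev tools. Before writing a full scraping loop, one way to sanity-check what you found is to build the request without sending it and inspect the final URL requests would produce. A minimal sketch, reusing the endpoint and 'x-domain' header from Solution 1; the params here are a small illustrative subset of the full payload, not the whole thing:

```python
import requests

# Build (but do not send) the API request discovered in dev tools.
# Endpoint and 'x-domain' header come from Solution 1; the params
# below are an illustrative subset of the full payload.
req = requests.Request(
    'GET',
    'https://glue-api.vivareal.com/v2/listings',
    headers={'x-domain': 'www.vivareal.com.br'},
    params={'addressCity': 'Salvador', 'business': 'SALE',
            'size': '350', 'from': '0'},
).prepare()

print(req.url)  # shows exactly how the query string gets encoded
# Actually sending it would be: requests.Session().send(req).json()
```

    Bumping 'from' in steps of 'size' walks through the result pages, which is exactly what the loop in Solution 1 automates.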
    [Solution 2]:

    Unfortunately, I believe you have no other choice. The reason is that with modern front-end techniques the html is rendered asynchronously, so a "real" environment is needed for the javascript to run and load the page. With Ajax, for example, you need a real browser (Chrome, Firefox) for it to work. So my suggestion is to keep digging into Selenium and simulate click events to visit each page (clicking the page numbers 1..2..3 until the last one), wait until the data has loaded, then read the html and extract the data you need. Regards.
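    For completeness, a minimal sketch of the Selenium route described above, assuming Chrome and the selenium package are installed. Instead of clicking the numbered links, it navigates directly via the ?pagina= URL pattern the question already uses (same effect, simpler to drive), and waits until the javascript-rendered cards exist before parsing; the 'js-card-selector' class comes from the question's own code:

```python
BASE = 'https://www.vivareal.com.br/venda/bahia/salvador/apartamento_residencial/'

def page_url(page):
    """Build the URL of a numbered results page (same pattern as the question)."""
    return BASE + '?pagina=' + str(page)

def scrape_pages(n_pages=3):
    # selenium/bs4 are imported lazily so the pure helper above
    # stays usable without them installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from bs4 import BeautifulSoup

    driver = webdriver.Chrome()  # assumes chromedriver is on PATH
    cards = []
    try:
        for page in range(1, n_pages + 1):
            driver.get(page_url(page))
            # Block until the javascript has rendered at least one card
            WebDriverWait(driver, 15).until(
                EC.presence_of_element_located((By.CLASS_NAME, 'js-card-selector')))
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            cards.extend(soup.find_all('div', {'class': 'js-card-selector'}))
    finally:
        driver.quit()
    return cards
```

    The same per-card extraction code from the question can then run over the returned cards.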

    [Comments]:

    • Thank you so much! Can Selenium click javascript links?
    • What about scrapy? Does it work on javascript pages?
    • The idea behind selenium, or any automation framework, is to mimic human behaviour by driving the browser. So Selenium and the others (Puppeteer, Cucumber, etc.) can repeat anything you could do manually. You should wait until the links are fully rendered, click, grab the html, click again, grab again, and so on.
    • @abbehusen if you found my idea useful, please mark it as the answer, thanks!