【Question Title】: Scraping an additional link and appending it to the list
【Posted】: 2021-11-16 15:28:44
【Question Description】:

I've run into a problem and I'm not sure how to approach it.

I have already scraped the company name, location, and province across multiple pages, along with a link to further information on another page. Each link I collected leads to the three additional pieces of information I need.

I need to visit each link, pull out the address, the phone number (if there is one), and the CNAE code, and append them to the data I already have.

The working script I currently have for the first scrape is as follows:

import requests
from bs4 import BeautifulSoup
baseurl = ["https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/index.html"]
urls = [f'https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/{i}.html' for i in range(2, 65)]

allurls = baseurl + urls
print(allurls)

for url in allurls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    lists = soup.select("div#simulacion_tabla ul")

    #scrape the pages
    for lis in lists:
        title = lis.find('li', class_="col1").text
        location = lis.find('li', class_="col2").text
        province = lis.find('li', class_="col3").text
        link = lis.select("li.col1 a")[0]['href']
        info = [title, location, province, link]
        print(info)

On the second page, the data sits in a table whose rows have id names like the ones below. This is the code I think I need, but it doesn't work, and I'm going round in circles trying to figure out why:

section = soup.select("section#datos_empresa")
lslinks = link

for ls in lslinks:
    location = ls.find('tr', id_="tamano_empresa").text
    cnae = ls.find('tr', id_="cnae_codigo_empresa").text
    phone = ls.find('tr', id_="telefono_empresa").text
    addinfo = [location, cnae, phone]
info.append(addinfo)

Here is an example of one of the links.

The ideal output would be:
['AGRICOLA CALLEJA SL', 'CARPIO', 'VALLADOLID', 'https://www.expansion.com/directorio-empresas/agricola-calleja-sl_1480101_A02_47.html', C/ LA TORRE, 2., 150, 983863247 ]

I will write this to a text file so that it can be imported into Excel.
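As a side note on the "import into Excel" step: a CSV file is usually the easiest text format for Excel to read. Below is a minimal sketch using Python's standard `csv` module with sample rows shaped like the ideal output above; the filename `companies.csv` and the header names are my own assumptions, not from the question.

```python
import csv

# Sample rows in the shape of the ideal output described above (assumption).
rows = [
    ['AGRICOLA CALLEJA SL', 'CARPIO', 'VALLADOLID',
     'https://www.expansion.com/directorio-empresas/agricola-calleja-sl_1480101_A02_47.html',
     'C/ LA TORRE, 2.', '150', '983863247'],
]

# 'utf-8-sig' writes a BOM so Excel correctly detects accented characters.
with open('companies.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Location', 'Province', 'Link', 'Address', 'CNAE', 'Phone'])
    writer.writerows(rows)
```

Fields containing commas (like the address) are quoted automatically by `csv.writer`, so they survive the round trip into Excel.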

Any help would be greatly appreciated!

Cheers!

【Question Discussion】:

  • So what is your expected output for that page? (Please edit your question to include it - or at least a starting example of it)
  • Will do! I'm just getting started with Python and Stack, so I'm a bit rough around the edges!

Tags: python loops web-scraping beautifulsoup


【Solution 1】:

This is a minimal working solution so far.

Code:

import requests
from bs4 import BeautifulSoup
baseurl = ["https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/index.html"]
urls = [f'https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/{i}.html' for i in range(2, 5)]  # range(2, 65) for all pages

allurls = baseurl + urls
#print(allurls)
data = []
for url in allurls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    lists = soup.select("div#simulacion_tabla ul")

    #scrape the pages
    for lis in lists:
        title = lis.find('li', class_="col1").text
        location = lis.find('li', class_="col2").text
        province = lis.find('li', class_="col3").text
        link = lis.select_one("li.col1 a")['href']
        #info = [title, location, province, link]
        #print(info)

        sub_page = requests.get(link)
        soup2 = BeautifulSoup(sub_page.content, "html.parser")
        direction = soup2.select_one('#direccion_empresa').text
        cnae = soup2.select_one('#cnae_codigo_empresa').text
        phone = soup2.select_one('#telefono_empresa')
        telephone = phone.text if phone else None
        print([title, location, province, link, direction, cnae, telephone])
        #data.append([title, location, province, link, direction, cnae, telephone])


#cols = ["title", "location", "province", "link", "direction", "cnae", "telephone"]

#df = pd.DataFrame(data, columns=cols)
#print(df)
#df.to_csv('info.csv',index = False)

Output:

['A CORTIÑA DOS ACIVROS SL', 'LUGO', 'LUGO', 'https://www.expansion.com/directorio-empresas/a-cortina-dos-acivros-sl_9163006_A02_27.html', 'CRTA. A CORUÑA, 16.', '150', '']
['A CORTIÑA DOS ACIVROS SL', 'LUGO', 'LUGO', 'https://www.expansion.com/directorio-empresas/a-cortina-dos-acivros-sl_9163006_A02_27.html', 'CRTA. A CORUÑA, 16.', '150', '']
['A P V 19 32 SL', 'VALENCIA', 'VALENCIA', 'https://www.expansion.com/directorio-empresas/a-p-v-19-32-sl_672893_A02_46.html', 'CALLE SALVA, 8 1 2B.', '150', '']
['ABADIA DE JABUGO SL', 'CARTAYA', 'HUELVA', 'https://www.expansion.com/directorio-empresas/abadia-de-jabugo-sl_5442689_A02_21.html', 'URB. MARINA EL ROMPIDO, 31 VILLA M-31. CRTA. EL RO.', '150', '']
['ABALOS REAL SLL', 'CARBONERAS DE GUADAZAON', 'CUENCA', 'https://www.expansion.com/directorio-empresas/abalos-real-sll_1239004_A02_16.html', 'C/ DON CRUZ, 23.', '150', '969142092']

...and so on
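The commented-out lines at the end of the code above hint at exporting via pandas instead of printing. A sketch of what that would look like, using a sample row rather than live scraped data (and assuming pandas is installed, plus an `import pandas as pd` that the original snippet omits):

```python
import pandas as pd

# One sample row in the shape the loop above produces (assumption).
data = [
    ['A CORTIÑA DOS ACIVROS SL', 'LUGO', 'LUGO',
     'https://www.expansion.com/directorio-empresas/a-cortina-dos-acivros-sl_9163006_A02_27.html',
     'CRTA. A CORUÑA, 16.', '150', None],
]
cols = ["title", "location", "province", "link", "direction", "cnae", "telephone"]

# Build the DataFrame and write it out without the index column.
df = pd.DataFrame(data, columns=cols)
df.to_csv('info.csv', index=False)
```

Inside the loop you would call `data.append([...])` per company, then run these last two lines once after the loop finishes.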

【Discussion】:

  • It works perfectly, just needed to edit a couple of things (the final print statement, and looping over the pages)! Thanks Fazlul
【Solution 2】:

In your sub-page you tried to select the section by ID rather than by its class, so no entries could be matched. You also want `td` rather than `tr`.

Your sub-page logic needs to be combined with your main page. Try the following:

import requests
from bs4 import BeautifulSoup
import csv

with open('output.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(["Title", "Location", "Province", "Link", "Location", "cnae", "Phone"])
    
    urls = ["https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/index.html"]
    urls.extend(f'https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/{i}.html' for i in range(2, 65))

    for url in urls:
        print(url)
        
        r_main = requests.get(url)
        soup_main = BeautifulSoup(r_main.content, "html.parser")

        for lis in soup_main.select("div#simulacion_tabla ul"):
            title = lis.find('li', class_="col1").text
            location = lis.find('li', class_="col2").text
            province = lis.find('li', class_="col3").text
            link = lis.select("li.col1 a")[0]['href']
            
            print(' ', link)
            r_sub = requests.get(link)
            soup_sub = BeautifulSoup(r_sub.content, "html.parser")
            
            section = soup_sub.select_one("section.datos_empresa")
            location = section.find('td', id="tamano_empresa").text
            cnae = section.find('td', id="cnae_codigo_empresa").text
            phone = section.find('td', id="telefono_empresa").text

            csv_output.writerow([title, location, province, link, location, cnae, phone])

This creates a CSV output file which begins:

Title,Location,Province,Link,Location,cnae,Phone
A CORTIÑA DOS ACIVROS SL,DESCONOCIDO,LUGO,https://www.expansion.com/directorio-empresas/a-cortina-dos-acivros-sl_9163006_A02_27.html,DESCONOCIDO,150,
A CORTIÑA DOS ACIVROS SL,DESCONOCIDO,LUGO,https://www.expansion.com/directorio-empresas/a-cortina-dos-acivros-sl_9163006_A02_27.html,DESCONOCIDO,150,
A P V 19 32 SL,MICROEMPRESA,VALENCIA,https://www.expansion.com/directorio-empresas/a-p-v-19-32-sl_672893_A02_46.html,MICROEMPRESA,150,
ABADIA DE JABUGO SL,DESCONOCIDO,HUELVA,https://www.expansion.com/directorio-empresas/abadia-de-jabugo-sl_5442689_A02_21.html,DESCONOCIDO,150,
ABALOS REAL SLL,MICROEMPRESA,CUENCA,https://www.expansion.com/directorio-empresas/abalos-real-sll_1239004_A02_16.html,MICROEMPRESA,150,969142092
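One practical refinement, which is my own suggestion rather than part of either answer: the script above opens a fresh connection for every one of the hundreds of sub-page requests. Reusing a single `requests.Session` keeps connections alive, and a short pause between requests is gentler on the server. The `User-Agent` string and delay value below are arbitrary placeholders:

```python
import time
import requests

# One shared session reuses TCP connections across all requests.
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (scraper demo)'})

def fetch(url, delay=0.5):
    """Fetch a page through the shared session, pausing `delay` seconds first."""
    time.sleep(delay)
    r = session.get(url, timeout=10)
    r.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return r.content
```

In the loops above you would then replace each `requests.get(url)` / `requests.get(link)` call with `fetch(url)` / `fetch(link)`.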

【Discussion】:
