【Question Title】: How do I web-scrape a JSP with Python, Selenium and BeautifulSoup?
【Posted】: 2020-01-07 08:40:39
【Question Description】:

I am an absolute beginner at web scraping with Python. I am trying to extract the locations of ATMs from this URL:

https://www.visa.com/atmlocator/mobile/index.jsp#(page:results,params:(query:'Tokyo,%20Japan'))

using the following code.

#Script to scrape locations and addresses from VISA's ATM locator


# import the necessary libraries (to be installed if not available):

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd


#ChromeDriver
#(see https://chromedriver.chromium.org/getting-started as reference)

driver = webdriver.Chrome("C:/Users/DefaultUser/Local Settings/Application Data/Google/Chrome/Application/chromedriver.exe")

offices = []    #list of branch/ATM names
addresses = []  #list of branch/ATM addresses
driver.get("https://www.visa.com/atmlocator/mobile/index.jsp#(page:results,params:(query:'Tokyo,%20Japan'))") 


content = driver.page_source
soup = BeautifulSoup(content, features = "lxml")


#the following code extracts all the content inside the tags displaying the information requested

for a in soup.findAll('li',attrs={'class':'visaATMResultListItem'}): 
    name=a.find('li', attrs={'class':'data-label'}) 
    address=a.find('li', attrs={'class':'data-label'}) 
    offices.append(name.text)
    addresses.append(address.text)


#next row defines the dataframe with the results of the extraction

df = pd.DataFrame({'Office':offices,'Address':addresses})


#next row displays dataframe content

print(df)


#export data to .CSV file named 'branches.csv'
with open('branches.csv', 'a') as f:
    df.to_csv(f, header=True)
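
A side note on the export step: because the file is opened in append mode (`'a'`) with `header=True`, re-running the script appends a second header row to `branches.csv`. For a one-off export it is simpler to pass the path directly to `to_csv`, which overwrites by default. A minimal sketch with made-up rows standing in for the scraped results:

```python
import pandas as pd

# Made-up rows standing in for scraped results (illustration only)
df = pd.DataFrame({'Office': ['ATM A', 'ATM B'],
                   'Address': ['1-1 Example-cho, Tokyo', '2-2 Sample-dori, Tokyo']})

# Passing the path directly uses mode='w' by default, so the file is
# overwritten and the header row appears exactly once per export.
df.to_csv('branches.csv', index=False)

print(pd.read_csv('branches.csv'))
```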

The script seems to work at first, since ChromeDriver launches and the expected results are displayed in the browser, but nothing is returned:

Empty DataFrame
Columns: [Office, Address]
Index: []
Process finished with exit code 0

Maybe I made a mistake in choosing the selectors?

Thank you very much for your help.

【Question Discussion】:

    Tags: python pandas selenium web-scraping beautifulsoup


    【Solution 1】:

    The problem is with the locators; use

    for a in soup.findAll('li',attrs={'class':'visaATMResultListItem'}): 
        name = a.find('p', attrs={'class':'visaATMPlaceName'})
        address = a.find('p', attrs={'class':'visaATMAddress'}) 
        offices.append(name.text)
        addresses.append(address.text)
    

    【Discussion】:

    • I tried it, but still: Empty DataFrame Columns: [Office, Address] Index: [] Process finished with exit code 0
    【Solution 2】:
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    import time
    from bs4 import BeautifulSoup
    import csv
    
    options = Options()
    options.add_argument('--headless')
    
    driver = webdriver.Firefox(options=options)
    driver.get("https://www.visa.com/atmlocator/mobile/index.jsp#(page:results,params:(query:'Tokyo,%20JAPAN'))")
    time.sleep(2)
    
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    
    na = []
    addr = []
    for name in soup.findAll("a", {'class': 'visaATMPlaceLink'}):
        na.append(name.text)
    for add in soup.findAll("p", {'class': 'visaATMAddress'}):
        addr.append(add.get_text(strip=True, separator=" "))
    
    with open('out.csv', 'w', newline="") as f:
        writer = csv.writer(f)
        writer.writerow(['Name', 'Address'])
        for _na, _addr in zip(na, addr):
            writer.writerow([_na, _addr])
    
    driver.quit()
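
    One caveat with the writing loop above: `zip()` pairs the two lists positionally and stops at the shorter one, so if a result is missing one of the two fields, a row is silently dropped and later pairs can fall out of alignment. A runnable sketch with made-up lists showing the truncation:

```python
import csv

# Hypothetical scraped lists (illustration only); they can end up different
# lengths if a result is missing one of the two fields.
na = ['ATM A', 'ATM B', 'ATM C']
addr = ['Addr 1', 'Addr 2']  # one address missing

with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Address'])
    # zip() stops at the shorter list, so 'ATM C' is silently dropped --
    # rows stay aligned only if the two selectors always match pairwise.
    for _na, _addr in zip(na, addr):
        writer.writerow([_na, _addr])

with open('out.csv', newline='') as f:
    rows = list(csv.reader(f))
print(rows)  # [['Name', 'Address'], ['ATM A', 'Addr 1'], ['ATM B', 'Addr 2']]
```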
    

    Output: Click-Here
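
    Two details make this answer work where the original attempt failed: the Visa page renders its results with JavaScript after the initial load, so `time.sleep(2)` gives the results time to appear before `page_source` is read; and `get_text(strip=True, separator=" ")` joins an address's child elements with spaces instead of concatenating them. A minimal sketch of the latter on hypothetical markup (the class names come from the answer; the text content is invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking one result entry on the Visa page
html = """
<li class="visaATMResultListItem">
  <a class="visaATMPlaceLink">Sample Bank ATM</a>
  <p class="visaATMAddress">
    <span>1-2-3 Example-cho</span>
    <span>Tokyo, Japan</span>
  </p>
</li>
"""
soup = BeautifulSoup(html, "html.parser")

name = soup.find("a", {"class": "visaATMPlaceLink"}).text

# separator=" " joins the child <span> texts with a space; strip=True trims
# whitespace-only strings that would otherwise appear between them.
address = soup.find("p", {"class": "visaATMAddress"}).get_text(strip=True, separator=" ")

print(name)     # Sample Bank ATM
print(address)  # 1-2-3 Example-cho Tokyo, Japan
```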

    【Discussion】:

    • This worked for me after replacing driver = webdriver.Firefox(options=options) with webdriver.Firefox(executable_path="C:/Users/DefaultUser/AppData/geckodriver.exe") and placing geckodriver in that folder. Thank you very much
    • @Edo That seems to mean you did not include geckodriver in your Python folder. Or you installed it as user
    • Accepted and upvoted (but: "Thanks for the feedback! Votes cast by those with less than 15 reputation are recorded, but do not change the publicly displayed post score.")