【Question Title】: Constructing a table from multiple links in one
【Posted】: 2019-03-25 16:07:48
【Question Description】:

I need to extract data from a website. I have already extracted the list of URLs that host the data, and I can pull the data out, but I cannot get it into tabular form.

I have tried several pieces of code: I extracted the href links and appended them to a list. I am using the requests and Beautiful Soup libraries to extract the data.

import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.flinders.edu.au/directory/index.cfm/search/results?page=1&lastnamesearch=A&firstnamesearch=&ousearch='

# df_link holds the previously extracted links; browser is a Selenium WebDriver
for rows in df_link['Name']:
    url = rows
    browser.get(url)
    html = browser.page_source
    soup = BeautifulSoup(html, 'lxml')

for table in soup.find_all('table', {'summary': 'Staff list that match search criteria'}):
    column_names = [th.get_text() for th in table.select('th')]

    rows = table.select('tr')[1:]  # skip the header row
    n_rows = len(rows)

    df = pd.DataFrame(columns=column_names, index=range(n_rows))

    r_index = 0
    for row in rows:
        c_index = 0
        for cell in row.select('td'):
            anchor = cell.select_one('a')
            # keep the link target when the cell has one, otherwise the text
            df.iat[r_index, c_index] = anchor.get('href') if anchor else cell.get_text()
            c_index += 1
        r_index += 1

    print(df)

urls = []
link = 'https://www.flinders.edu.au'  # assumed base URL for the relative hrefs
for row in df['Name\xa0⬆']:
    urls.append(link + row)

results = {}
for url in urls:
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    for name in soup.find_all('span', {'class': 'directory directory-entity'}):
        results['Name'] = name.text
    p = []
    for row in soup.find_all('tr'):
        position = row.find_all('td')
        p.append(position[0].text)
    # assign only after all rows are collected, so p is long enough to index
    results['Position'] = p[1]
    results['Phone'] = p[4]
    results['Email'] = p[9].replace('\n', '')
    print(results)

I expect the results to come out in tabular form. Any help is much appreciated.

【Question Discussion】:

    Tags: python web-scraping beautifulsoup python-requests


    【Solution 1】:

    You can do the following with pandas and BeautifulSoup 4.7.1:

    import requests
    from bs4 import BeautifulSoup as bs
    import pandas as pd
    
    baseUrl = 'https://www.flinders.edu.au'
    
    emails = []
    positions = []
    
    with requests.Session() as s:
        # results page: names and relative profile links are in the first
        # column, phone numbers in the second
        r = s.get('https://www.flinders.edu.au/directory/index.cfm/search/results?page=1&lastnamesearch=A&firstnamesearch=&ousearch=')
        soup = bs(r.content, 'lxml')
        names, urls = zip(*[ (item.text, baseUrl + item['href']) for item in soup.select('td:first-child a')])
        tels = [item.text for item in soup.select('td:nth-of-type(2) a')]
    
        # visit each profile page for the position and email address
        for url in urls:
            r = s.get(url)
            soup = bs(r.content, 'lxml')
            positions.append(soup.select_one('.staffInfo + td').text)
            emails.append(soup.select_one('[href^=mailto]').text)
    
    final = list(zip(names, tels, positions, emails))
    df = pd.DataFrame(final, columns = ['name', 'tel', 'position', 'email'])
    print(df.head())
    df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig', index=False)
    

    Sample output:


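    The final zip/DataFrame step above can be illustrated in isolation with toy stand-in data (the names and numbers below are made up, not real scrape results):

    ```python
    import pandas as pd

    # Hypothetical stand-ins for the four scraped lists.
    names = ['A Smith', 'B Jones']
    tels = ['08 8201 0000', '08 8201 0001']
    positions = ['Lecturer', 'Professor']
    emails = ['a.smith@example.edu', 'b.jones@example.edu']

    # zip pairs the i-th element of each list into one row tuple,
    # so every person's fields line up in a single DataFrame row.
    final = list(zip(names, tels, positions, emails))
    df = pd.DataFrame(final, columns=['name', 'tel', 'position', 'email'])
    print(df.shape)  # (2, 4)
    ```

    This is why the four lists must stay in the same order: zip matches purely by position.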
    If you run into problems with the names and phone numbers, you can also do the following:

    with requests.Session() as s:
        r = s.get('https://www.flinders.edu.au/directory/index.cfm/search/results?page=1&lastnamesearch=A&firstnamesearch=&ousearch=')
        soup = bs(r.content, 'lxml')
        data =  [item.text for item in soup.select('.directory-person')]
        names = data[0::2]
        tels = data[1::2]
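    The `data[0::2]` / `data[1::2]` step relies on the `.directory-person` elements alternating name, phone, name, phone down the page; extended slicing then de-interleaves them. A minimal sketch with made-up values:

    ```python
    # Flattened alternating list, as .directory-person yields
    # each person's name immediately followed by their phone.
    data = ['A Smith', '08 8201 0000', 'B Jones', '08 8201 0001']

    names = data[0::2]  # every second element starting at index 0 -> names
    tels = data[1::2]   # every second element starting at index 1 -> phones

    print(names)  # ['A Smith', 'B Jones']
    print(tels)   # ['08 8201 0000', '08 8201 0001']
    ```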
    

    【Discussion】:

    • If I use the code above, I get the error "Only the following pseudo-classes are implemented: nth-of-type." And if I use nth-of-type(1), I only get one output. Can you help?
    • Does it happen on soup.select('td:first-child a')? That can be rewritten as td:nth-of-type(1) a
    • By the way, I am using the latest bs4 version, 4.7.1
    • When I write td:nth-of-type(1), it only outputs the names. Also, I tried updating bs4 via pip, but it says the requirement is already satisfied at version 4.6.3
    • Yes, that worked, thanks for your help
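    For reference, when upgrading bs4 is not an option, the first-column extraction can be done without CSS pseudo-classes at all (and so works on 4.6.x too). This is a sketch over a made-up snippet standing in for the real staff table, whose exact markup is an assumption:

    ```python
    from bs4 import BeautifulSoup

    # Minimal stand-in for the directory results table (structure assumed).
    html = """
    <table summary="Staff list that match search criteria">
      <tr><th>Name</th><th>Telephone</th></tr>
      <tr><td><a href="/people/asmith">A Smith</a></td><td>08 8201 0000</td></tr>
      <tr><td><a href="/people/bjones">B Jones</a></td><td>08 8201 0001</td></tr>
    </table>
    """
    soup = BeautifulSoup(html, 'html.parser')

    names, urls, tels = [], [], []
    for row in soup.find_all('tr'):
        cells = row.find_all('td')
        if not cells:  # the header row has only <th> cells, skip it
            continue
        link = cells[0].find('a')
        names.append(link.get_text())
        urls.append('https://www.flinders.edu.au' + link['href'])
        tels.append(cells[1].get_text())

    print(names)  # ['A Smith', 'B Jones']
    ```

    Plain `find_all` indexing like this trades selector brevity for compatibility with older bs4 releases.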