【Title】:How to use lxml to scrape a table and grab the href links?
【Posted】:2019-06-05 20:48:38
【Question】:

In Python 3, I have this program that uses lxml to extract a table from a site and then build a DataFrame (based on Syed Sadat Nazrul's article - https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059):

import requests
import lxml.html as lh
import pandas as pd

# Sample site where the table is
response = requests.get('https://especiais.gazetadopovo.com.br/futebol/tabela-campeonato-brasileiro-2018')

#Store the contents of the website under doc
doc = lh.fromstring(response.content)

#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

col=[]
i=0

#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    col.append((name,[]))

#Since our first row is the header, the data starts on the second row
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]

    #If row is not of size 10, the //tr data is not from our table 
    if len(T)!=10:
        break

    #i is the index of our column
    i=0

    #Iterate through each element of the row
    for t in T.iterchildren():

        data=t.text_content() 


        #Skip the first column (the team name); convert numeric cells to int
        if i>0:
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

# Creates the dataframe
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)

But the table has an href in one of the columns - the first one, which holds the team name:

<td class="campeao times link-time"><a href="https://especiais.gazetadopovo.com.br/futebol/times/palmeiras/">Palmeiras</a></td>

So I want to extract the href from each row and put it into a column of the DataFrame:

                P   J   V   E   D   GP  GC  SG  Link
0   Palmeiras   80  38  23  11  4   64  26  38  https://especiais.gazetadopovo.com.br/futebol/times/palmeiras/
1   Flamengo    72  38  21  9   8   59  29  30  https://especiais.gazetadopovo.com.br/futebol/times/flamengo/

...

The iteration over "iterchildren" only takes the text via "text_content". Is there a way to get the embedded href link at the same time?
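For reference, each cell yielded by iterchildren() is itself an lxml element, so a relative xpath can pull the embedded link in the same pass as text_content(). A minimal sketch (the HTML fragment stands in for one row of the real table; the class names are copied from the snippet above):

```python
import lxml.html as lh

# The fragment below stands in for one row of the real table; the class
# names match the <td> snippet shown above.
doc = lh.fromstring(
    '<table><tr>'
    '<td class="campeao times link-time">'
    '<a href="https://especiais.gazetadopovo.com.br/futebol/times/palmeiras/">'
    'Palmeiras</a></td>'
    '<td>80</td>'
    '</tr></table>'
)
row = doc.xpath('//tr')[0]

for t in row.iterchildren():
    data = t.text_content()
    hrefs = t.xpath('.//a/@href')  # [] when the cell holds no link
    print(data, hrefs)
```

A cell without a link simply yields an empty list, so the team-name column can be handled alongside the numeric ones in the same loop.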

【Comments】:

    Tags: python pandas python-requests lxml


    【Solution 1】:

    You can get the links this way:

    import re
    import requests
    import pandas as pd
    import lxml.html as lh
    
    response = requests.get('https://especiais.gazetadopovo.com.br/futebol/tabela-campeonato-brasileiro-2018')
    # raw string, and [^"]* instead of a greedy .* so the match cannot run
    # past the closing quote of the href attribute
    links = re.findall(r'times link-time"><a href="(https:[^"]*)"', response.text)
    doc = lh.fromstring(response.content)
    tr_elements = doc.xpath('//tr')
    col = []
    i = 0
    for t in tr_elements[0]:
        i += 1
        name = t.text_content()
        col.append((name, []))
    
    for j in range(1, len(tr_elements)):
        T = tr_elements[j]
        if len(T) != 10:
            break
        i = 0
        for t in T.iterchildren():
            data = t.text_content()
            if i > 0:
                try:
                    data = int(data)
            except ValueError:
                    pass
            col[i][1].append(data)
            i += 1
    
    Dict = {title: column for (title, column) in col}
    Dict['Link'] = links
    df = pd.DataFrame(Dict)
    

    This is what I get at the end:
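As a more robust alternative to the regex, the same links can be read with an xpath over the document that is already parsed for the table, so the result does not depend on the exact formatting of the raw HTML. A minimal sketch (the fragment stands in for the real page; the class names match the question's snippet):

```python
import lxml.html as lh

# The fragment below stands in for the real page; the class names are the
# ones shown in the question.
html = '''
<table>
  <tr><td class="campeao times link-time">
    <a href="https://especiais.gazetadopovo.com.br/futebol/times/palmeiras/">Palmeiras</a>
  </td><td>80</td></tr>
  <tr><td class="times link-time">
    <a href="https://especiais.gazetadopovo.com.br/futebol/times/flamengo/">Flamengo</a>
  </td><td>72</td></tr>
</table>
'''
doc = lh.fromstring(html)

# One href per team row, in document order, so the list lines up with the
# rows appended to the columns above before Dict['Link'] = links.
links = doc.xpath('//td[contains(@class, "link-time")]/a/@href')
print(links)
```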

    【Comments】:
