【Title】:How to use lxml to scrape a table and grab the href links?
【Posted】:2019-06-05 20:48:38
【Question】:

In Python 3, I have this program that uses lxml to extract a table from a site and then build a DataFrame (based on Syed Sadat Nazrul's article - https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059):

import requests
import lxml.html as lh
import pandas as pd

# Sample site where the table is
response = requests.get('https://especiais.gazetadopovo.com.br/futebol/tabela-campeonato-brasileiro-2018')

#Store the contents of the website under doc
doc = lh.fromstring(response.content)

#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

col=[]
i=0

#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    col.append((name,[]))

#Since our first row is the header, the data starts on the second row
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]

    #If row is not of size 10, the //tr data is not from our table 
    if len(T)!=10:
        break

    #i is the index of our column
    i=0

    #Iterate through each element of the row
    for t in T.iterchildren():

        data=t.text_content() 


        #Skip the first column (the team name); convert numeric cells to int
        if i>0:
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

# Creates the dataframe
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)

But the table has an href in one of the columns - the first one, which holds the team name:

<td class="campeao times link-time"><a href="https://especiais.gazetadopovo.com.br/futebol/times/palmeiras/">Palmeiras</a></td>

So I want to extract the href from each row and put it into a column of the DataFrame:

                P   J   V   E   D   GP  GC  SG  Link
0   Palmeiras   80  38  23  11  4   64  26  38  https://especiais.gazetadopovo.com.br/futebol/times/palmeiras/
1   Flamengo    72  38  21  9   8   59  29  30  https://especiais.gazetadopovo.com.br/futebol/times/flamengo/

...

The iteration over "iterchildren" only takes the text via "text_content". Is there a way to get the embedded href link at the same time?
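For reference, each cell yielded by iterchildren() is itself an lxml element, so a relative xpath can pull the embedded link in the same pass as text_content(). A minimal sketch (the HTML fragment stands in for one row of the real table; the class names are copied from the snippet above):

```python
import lxml.html as lh

# The fragment below stands in for one row of the real table; the class
# names match the <td> snippet shown above.
doc = lh.fromstring(
    '<table><tr>'
    '<td class="campeao times link-time">'
    '<a href="https://especiais.gazetadopovo.com.br/futebol/times/palmeiras/">'
    'Palmeiras</a></td>'
    '<td>80</td>'
    '</tr></table>'
)
row = doc.xpath('//tr')[0]

for t in row.iterchildren():
    data = t.text_content()
    hrefs = t.xpath('.//a/@href')  # [] when the cell holds no link
    print(data, hrefs)
```

A cell without a link simply yields an empty list, so the team-name column can be handled alongside the numeric ones in the same loop.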

【Comments】:

    Tags: python pandas python-requests lxml


    【Solution 1】:

    You can get the links this way:

    import re
    import requests
    import pandas as pd
    import lxml.html as lh
    
    response = requests.get('https://especiais.gazetadopovo.com.br/futebol/tabela-campeonato-brasileiro-2018')
    # raw string, and [^"]* instead of a greedy .* so the match cannot run
    # past the closing quote of the href attribute
    links = re.findall(r'times link-time"><a href="(https:[^"]*)"', response.text)
    doc = lh.fromstring(response.content)
    tr_elements = doc.xpath('//tr')
    col = []
    i = 0
    for t in tr_elements[0]:
        i += 1
        name = t.text_content()
        col.append((name, []))
    
    for j in range(1, len(tr_elements)):
        T = tr_elements[j]
        if len(T) != 10:
            break
        i = 0
        for t in T.iterchildren():
            data = t.text_content()
            if i > 0:
                try:
                    data = int(data)
            except ValueError:
                    pass
            col[i][1].append(data)
            i += 1
    
    Dict = {title: column for (title, column) in col}
    Dict['Link'] = links
    df = pd.DataFrame(Dict)
    

    This is what I get at the end:
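As a more robust alternative to the regex, the same links can be read with an xpath over the document that is already parsed for the table, so the result does not depend on the exact formatting of the raw HTML. A minimal sketch (the fragment stands in for the real page; the class names match the question's snippet):

```python
import lxml.html as lh

# The fragment below stands in for the real page; the class names are the
# ones shown in the question.
html = '''
<table>
  <tr><td class="campeao times link-time">
    <a href="https://especiais.gazetadopovo.com.br/futebol/times/palmeiras/">Palmeiras</a>
  </td><td>80</td></tr>
  <tr><td class="times link-time">
    <a href="https://especiais.gazetadopovo.com.br/futebol/times/flamengo/">Flamengo</a>
  </td><td>72</td></tr>
</table>
'''
doc = lh.fromstring(html)

# One href per team row, in document order, so the list lines up with the
# rows appended to the columns above before Dict['Link'] = links.
links = doc.xpath('//td[contains(@class, "link-time")]/a/@href')
print(links)
```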

    【Comments】:
