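【Posted】:2019-07-16 01:50:08
【Question】:
I have been trying to get some information out of a table on the stock exchange website (https://www.idx.co.id/en-us/listed-companies/company-profiles/)
using Python (lxml, requests, and pandas). This is the reference I used:
https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059
As I am an absolute beginner in Python and programming, maybe someone knows how to apply .xpath only to the row elements inside the table body and then extract their content? I have also looked into using bs4/BeautifulSoup, but could not get that to work either. Any help or suggestions are greatly appreciated! Thank you for your time.
My code: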
from lxml import html as lh
import requests
import pandas as pd
# create a page handle for the contents of the website
page = requests.get('http://www.idx.co.id/en-us/listed-companies/company-profiles/')
# store the parsed contents under doc
doc = lh.fromstring(page.content)
# meant to collect the data stored between <tr>..</tr> of the html
tr_elements = doc.xpath('//*[@id="companyTable"]/tbody')
# create empty list
col = []
i = 0
for j in range(0, len(tr_elements)):
    # T is our j'th row
    T = tr_elements[j]
    # If row is not of size 4, the //tr data is not from our table
    if len(T) != 4:
        break
    # i is column index
    i = 0
    # Iterate through each element of the row
    for t in T.iterchildren():
        data = t.text_content()
        # Append the data to the empty list of the i'th column
        col[i][1].append(data)
        # Increment i for the next column
        i += 1
[len(C) for (title, C) in col]  # checking no of values in all columns
Dict = {title: column for (title, column) in col}
df = pd.DataFrame(Dict)
print(df)
Output of print(df):
Empty DataFrame
Columns: []
Index: []
Expected output:
Columns: [No, Code, Name, Listing Date]
Index: [1, AALI, Astra Agro Lestari Tbk, 09 Dec 1997]
【Comments】:
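One possible reading of the empty result, offered as a sketch rather than a confirmed diagnosis: the XPath in the question stops at `tbody`, so the loop sees the table body instead of its rows, and `col` is never seeded with the column titles. The example below selects the `tr` children of `tbody` directly and reads each `td`. The `SAMPLE_HTML` markup, its second row, and the `table_to_dataframe` helper are all made up for illustration; only the `companyTable` id, the column titles, and the first data row come from the question. Whether the live page serves its rows in the initial HTML at all, rather than filling them in with JavaScript, is an assumption that would need checking.

```python
from lxml import html as lh
import pandas as pd

# Hypothetical stand-in for the real page, so the example runs offline.
# The second row is invented placeholder data.
SAMPLE_HTML = """
<html><body>
<table id="companyTable">
  <thead>
    <tr><th>No</th><th>Code</th><th>Name</th><th>Listing Date</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>AALI</td><td>Astra Agro Lestari Tbk</td><td>09 Dec 1997</td></tr>
    <tr><td>2</td><td>XXXX</td><td>Example Company Tbk</td><td>01 Jan 2000</td></tr>
  </tbody>
</table>
</body></html>
"""


def table_to_dataframe(html_text, table_id="companyTable"):
    """Build a DataFrame from the <tbody> rows of the table with the given id."""
    doc = lh.fromstring(html_text)
    table_path = f'//table[@id="{table_id}"]'

    # Column titles come from the header cells.
    headers = [th.text_content().strip()
               for th in doc.xpath(table_path + '/thead/tr/th')]

    rows = []
    # Note the trailing /tr: this yields one element per row, not the tbody itself.
    for tr in doc.xpath(table_path + '/tbody/tr'):
        cells = [td.text_content().strip() for td in tr.xpath('./td')]
        # Skip rows whose cell count does not match the header.
        if len(cells) == len(headers):
            rows.append(cells)

    return pd.DataFrame(rows, columns=headers)


df = table_to_dataframe(SAMPLE_HTML)
print(df)
```

If the same function returns an empty DataFrame when given `page.content` from the real site, the rows are probably not present in the HTML that `requests` receives, and no XPath will find them there.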