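【Posted】: 2021-02-21 18:04:52
【Question】:
I have been trying to extract a table, but only the table's headers are retrieved. This is my first approach to retrieving the table.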
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = r"https://www.sec.gov/edgar/search/#/q=Women&dateRange=custom&entityName=Infosys&startdt=2010-03-01&enddt=2020-03-01"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find_all("table")[1]
# Extract the column headings of the table.
rows = table.find_all('tr')
columns = []
headings = rows[0].find_all('th')
for col in headings:
    columns.append(col.text.strip())
print(columns)
# Extract all table data row by row.
all_data = []
for row in rows[1:]:
    data = row.find_all('td')
    lst = []
    for d in data:
        lst.append(d.text.strip())
    all_data.append(lst)
# Build the DataFrame from the extracted data.
ds = pd.DataFrame(all_data, columns=columns)
ds
Second approach:
ds1 = pd.read_html(url)[0]
ds1
When I search for the table, I get all the column headers in the thead tag, but the tbody comes back empty.
table = soup.find_all("table", class_='table')
table
Output:
[<table class="table table-hover entity-hints" id="asdf"></table>,
<table class="table">
<thead>
<tr>
<th class="filetype" id="filetype">Form & File</th>
<th class="filed">Filed</th>
<th class="enddate">Reporting for</th>
<th class="entity-name">Filing entity/person</th>
<th class="cik">CIK</th>
<th class="located">Located</th>
<th class="incorporated">Incorporated</th>
<th class="file-num">File number</th>
<th class="film-num">Film number</th>
</tr>
</thead>
<tbody>
</tbody>
</table>]
Why is the tbody tag empty?
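One thing worth checking here is what the request actually asks the server for. Everything after `#` in a URL is a fragment, and HTTP clients keep the fragment on the client side instead of sending it. The sketch below uses only the standard library to split the URL from the question; the variable names `parts` and `requested` are illustrative, not from the original code.

```python
from urllib.parse import urlsplit

url = ("https://www.sec.gov/edgar/search/#/q=Women&dateRange=custom"
       "&entityName=Infosys&startdt=2010-03-01&enddt=2020-03-01")

parts = urlsplit(url)

# The search terms all sit in the fragment, not in the query string.
print("path:    ", parts.path)
print("query:   ", repr(parts.query))
print("fragment:", parts.fragment)

# Dropping the fragment gives the URL that is effectively requested.
requested = parts._replace(fragment="").geturl()
print("requested:", requested)
```

If that reading is right, the server never sees `q=Women` or the date range, so it can only return the generic search page with an empty results table; the rows would have to be filled in later by JavaScript running in a browser.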
【Comments】:
- Does this answer your question? How to scrape dynamic webpages by Python
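The comment points toward a dynamically rendered page. A quick way to test that idea without a browser is to count header cells and data cells in the HTML the server returned. The sketch below is a minimal check using the standard library's `html.parser`; `served_html` is an abbreviated copy of the output shown in the question, and `CellCounter` is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Abbreviated sample of the HTML from the question's output (assumed data).
served_html = """
<table class="table">
<thead><tr>
<th class="filetype" id="filetype">Form &amp; File</th>
<th class="filed">Filed</th>
<th class="cik">CIK</th>
</tr></thead>
<tbody>
</tbody>
</table>
"""


class CellCounter(HTMLParser):
    """Counts <th> and <td> start tags in the fed document."""

    def __init__(self):
        super().__init__()
        self.counts = {"th": 0, "td": 0}

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1


counter = CellCounter()
counter.feed(served_html)
print(counter.counts)
```

Header cells with zero data cells in the raw response suggests the rows arrive through a separate request made by the page's scripts. The browser's network tab is one place to look for that request.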
Tags: python beautifulsoup