【Posted】: 2020-08-16 11:50:54
【Question】:
I tried extracting the links and the dates separately, but I ran into mismatched DataFrame row counts and string issues when trying to merge the two lists. So I decided to pull the links and dates at the same time, but now I get no results at all.
My DataFrame should contain only the link and the year-month of the report.
Here is a sample of the HTML:
<tr>
<td headers="view-dlf-1-title-table-column--G7-URXF07Ms" class="views-field views-field-dlf-1-title">
<a href="/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report-Items/Contract-Summary-2013-03">Contract Summary</a> </td>
<td headers="view-dlf-2-report-period-table-column--G7Rqagd92Ho" class="views-field views-field-dlf-2-report-period">2013-03 </td>
</tr>
Here is my current code:
import pandas as pd
from datetime import datetime
from lxml import html
import requests

def http_request_get(url, session=None, payload=None, parse=True):
    """ Sends a GET HTTP request to a website and returns its HTML content and full url address. """
    if payload is None:
        payload = {}
    if session:
        content = session.get(url, params=payload, verify=False, headers={"content-type": "text"})
    else:
        content = requests.get(url, params=payload, verify=False, headers={"content-type": "text"})
    content.raise_for_status()  # Raise HTTPError for bad requests (4xx or 5xx)
    if parse:
        return html.fromstring(content.text), content.url
    else:
        return content.text, content.url

def get_html(link):
    """
    Returns a html.
    """
    page_parsed, _ = http_request_get(url=link, payload={'t': ''}, parse=True)
    return page_parsed

cmslinks = [
    'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Enrollment-by-Contract?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=0',
    'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Enrollment-by-Contract?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=1']

for cmslink in cmslinks:
    content, _ = http_request_get(url=cmslink, payload={'t': ''}, parse=True)
    table = content.cssselect('table[class="views-table views-view-table cols-2"]')[0]
    links = content.cssselect('td[headers="view-dlf-1-title-table-column"]')
    urls = [row.get('href') for row in links]
    date = [dict(zip('ReportTime', row.xpath('td//text()'))) for row in table[0:]]
    df1 = pd.DataFrame(urls)
    df2 = pd.DataFrame(date)
    mergedDf = df2.merge(df1, left_index=True, right_index=True)
【Comments】:
-
Can you clarify what the problem is? See How to Ask, help center.
-
My DataFrame comes back blank. I'm clearly not writing the xpath correctly. I don't get an error, but I don't get any data either.
-
Ah, have you debugged it?
-
Yes, the links variable gets populated, but the urls variable comes back as None for every row.
-
This line: urls = [row.get('href') for row in links]
Tags: python pandas web-scraping lxml