【Question Title】: Create a dataframe from HTML tags
【Posted】: 2020-08-16 11:50:54
【Question Body】:

I tried doing this separately, first for the links and then for the dates, but I ran into the dataframe counts not matching and couldn't figure out how to merge the two lists. So I decided to extract the links and dates at the same time, but now I can't get any results at all.

My dataframe should contain only the link and the year-month of the report.

Here is a sample of the HTML:

<tr>
 <td headers="view-dlf-1-title-table-column--G7-URXF07Ms" class="views-field views-field-dlf-1-title">
 <a href="/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report-Items/Contract-Summary-2013-03">Contract Summary</a>          </td>
 <td headers="view-dlf-2-report-period-table-column--G7Rqagd92Ho" class="views-field views-field-dlf-2-report-period">2013-03          </td>
 </tr>	
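For reference, a row like the one above can be parsed in isolation with lxml. This is a minimal, self-contained sketch (the column names `Link` and `ReportPeriod` are illustrative, and the fragment is wrapped in a `<table>` so the lenient HTML parser does not drop the bare `<tr>`/`<td>`):

```python
import pandas as pd
from lxml import html

row_html = """
<tr>
  <td headers="view-dlf-1-title-table-column--G7-URXF07Ms" class="views-field views-field-dlf-1-title">
    <a href="/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report-Items/Contract-Summary-2013-03">Contract Summary</a></td>
  <td headers="view-dlf-2-report-period-table-column--G7Rqagd92Ho" class="views-field views-field-dlf-2-report-period">2013-03</td>
</tr>
"""

# Wrap the bare <tr> in a <table> so the HTML parser keeps the row structure.
tree = html.fromstring('<table>%s</table>' % row_html)

rows = []
for tr in tree.xpath('//tr'):
    # The href lives on the nested <a>, not on the <td> itself.
    href = tr.xpath('.//td[contains(@headers, "title-table-column")]//a/@href')
    period = tr.xpath('.//td[contains(@headers, "report-period-table-column")]/text()')
    if href and period:
        rows.append({'Link': href[0], 'ReportPeriod': period[0].strip()})

# Building one DataFrame from a list of dicts avoids merging two parallel lists.
df = pd.DataFrame(rows)
print(df)
```

Extracting both values per `<tr>` into one dict keeps the link and date paired, which sidesteps the count-mismatch problem entirely.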

Here is my current code:

import pandas as pd
from datetime import datetime
from lxml import html
import requests

def http_request_get(url, session=None, payload=None, parse=True):
    """ Sends a GET HTTP request to a website and returns its HTML content and full url address. """

    if payload is None:
        payload = {}

    if session:
        content = session.get(url, params=payload, verify=False, headers={"content-type": "text"})
    else:
        content = requests.get(url, params=payload, verify=False, headers={"content-type": "text"})

    content.raise_for_status()  # Raise HTTPError for bad requests (4xx or 5xx)

    if parse:
        return html.fromstring(content.text), content.url
    else:
        return content.text, content.url

def get_html(link):
    """
    Returns a html.
    """
    page_parsed, _ = http_request_get(url=link, payload={'t': ''}, parse=True)
    return page_parsed


cmslinks=[
'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Enrollment-by-Contract?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=0',
'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Enrollment-by-Contract?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=1']

for cmslink in cmslinks:
   content, _ = http_request_get(url=cmslink,payload={'t':''},parse=True)
   table = content.cssselect('table[class="views-table views-view-table cols-2"]')[0]
   links = content.cssselect('td[headers="view-dlf-1-title-table-column"]')
   urls = [row.get('href') for row in links]         
   date = [dict(zip('ReportTime', row.xpath('td//text()'))) for row in table[0:]]
   df1 = pd.DataFrame(urls) 
   df2 = pd.DataFrame(date) 
   mergedDf = df2.merge(df1, left_index=True, right_index=True)

【Question Comments】:

  • Can you clarify what the problem is? See How to Ask / the help center.
  • My dataframe comes back empty. I'm obviously not writing the xpath correctly. I don't get an error, I just get no data.
  • Ah, have you debugged it?
  • Yes, the links variable is populated, but the urls variable comes back as None for every row.
  • This line: urls = [row.get('href') for row in links]
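The None values in the comment above follow from the selector matching the `<td>` elements: `td` has no `href` attribute, which lives on the nested `<a>`. A minimal sketch of the difference, using a hypothetical fragment with shortened attribute values (shown with XPath; the equivalent cssselect selector would be `'td.views-field-dlf-1-title a'`):

```python
from lxml import html

# Hypothetical fragment mimicking the page markup (attribute values shortened).
fragment = """
<table>
  <tr>
    <td headers="view-dlf-1-title-table-column--x" class="views-field views-field-dlf-1-title">
      <a href="/some/report/Contract-Summary-2013-03">Contract Summary</a>
    </td>
  </tr>
</table>
"""
tree = html.fromstring(fragment)

# Reading href off the <td> itself returns None -- <td> has no href attribute.
tds = tree.xpath('//td[contains(@class, "views-field-dlf-1-title")]')
print([td.get('href') for td in tds])   # [None]

# Descend into the nested <a> instead.
hrefs = tree.xpath('//td[contains(@class, "views-field-dlf-1-title")]//a/@href')
print(hrefs)                            # ['/some/report/Contract-Summary-2013-03']
```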

Tags: python pandas web-scraping lxml


【Solution 1】:

Try this:

import pandas as pd
from datetime import datetime
from lxml import html
import requests

def http_request_get(url, session=None, payload=None, parse=True):
    """ Sends a GET HTTP request to a website and returns its HTML content and full url address. """

    if payload is None:
        payload = {}

    if session:
        content = session.get(url, params=payload, verify=False, headers={"content-type": "text"})
    else:
        content = requests.get(url, params=payload, verify=False, headers={"content-type": "text"})

    content.raise_for_status()  # Raise HTTPError for bad requests (4xx or 5xx)

    if parse:
        return html.fromstring(content.text), content.url
    else:
        return content.text, content.url

def get_html(link):
    """
    Returns a html.
    """
    page_parsed, _ = http_request_get(url=link, payload={'t': ''}, parse=True)
    return page_parsed


cmslinks=[
'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Enrollment-by-Contract?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=0',
'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Enrollment-by-Contract?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=1'
]

for cmslink in cmslinks:
   content, _ = http_request_get(url=cmslink,payload={'t':''},parse=True)
   table = content.cssselect('table[class="views-table views-view-table cols-2"]')
   links = content.cssselect('td[headers="view-dlf-1-title-table-column"]')
   urls = [row.xpath("//a[contains(text(),'Enrollment by Contract')]/@href") for row in links]
   date = [dict(zip('ReportTime', row.xpath("//td[@class='views-field views-field-dlf-2-report-period']"))) for row in table[0:]]
   df1 = pd.DataFrame(urls)
   df2 = pd.DataFrame(date)
   mergedDf = df2.merge(df1, left_index=True, right_index=True)

full_table=pd.DataFrame()
for cmslink in cmslinks:
   content, _ = http_request_get(url=cmslink, payload={'t': ''}, parse=True)
   table=pd.read_html(cmslink)[0]
   links = content.cssselect('td[headers="view-dlf-1-title-table-column"]')
   urls = links[0].xpath("//td/a[contains(text(),'')]/@href")
   table['Title']=urls
   full_table=full_table.append(table)

print(full_table)

Output: 166 rows x 2 columns

【Discussion】:

  • Thanks 0buz for the reply. I'm getting an error; here it is:
  • TypeError Traceback (most recent call last) c:\Users\ltorres\Documents\DataScience\CMS\CMSDataPull.py — 166 content, _ = http_request_get(url=cmslink, payload={'t': ''}, parse=True) / 167 table=pd.read_html(cmslink)[0] / ---> 168 links = content.cssselect('td[headers="view-dlf-1-title-table-column"]')[0] / 169 urls = links[0].xpath("//td/a[contains(text(),'')]/@href") / 170 table['Title']=urls — TypeError: 'NoneType' object is not callable
  • Hmm, the only thing I can think of is formatting; I changed it slightly. Could you copy/paste the for loop from my code again and run it?
  • Repeatedly appending to a DataFrame in a loop is a bad idea. It's better to accumulate into some intermediate data structure and create the whole DataFrame at once.
  • [[row.get('href'), row.find_next('td').text.strip()] for row in links if 'Enrollment-by-Contract' in row.get('href')]
【Solution 2】:

I would go with BeautifulSoup. It makes parsing the HTML very easy. Just grab the `<a>` tags that have an href (specifically the "Enrollment-by-Contract" links), then from those elements get the next `<td>` tag for the text in the adjacent table cell.

import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
from lxml import html
import requests

def http_request_get(url, session=None, payload=None, parse=True):
    """ Sends a GET HTTP request to a website and returns its HTML content and full url address. """

    if payload is None:
        payload = {}

    if session:
        content = session.get(url, params=payload, verify=False, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36', "content-type": "text"})
    else:
        content = requests.get(url, params=payload, verify=False, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36', "content-type": "text"})

    content.raise_for_status()  # Raise HTTPError for bad requests (4xx or 5xx)

    if parse:
        return BeautifulSoup(content.text, 'html.parser'), content.url
    else:
        return content.text, content.url

def get_html(link):
    """
    Returns a html.
    """
    page_parsed, _ = http_request_get(url=link, payload={'t': ''}, parse=True)
    return page_parsed


cmslinks=[
'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Enrollment-by-Contract?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=0',
'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Enrollment-by-Contract?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=1']

df = pd.DataFrame()
for cmslink in cmslinks:
    content, _ = http_request_get(url=cmslink, payload={'t': ''}, parse=True)
    table = content.find('table')
    links = table.find_all('a', href=True)
    urls = [[row.get('href'), row.find_next('td').text.strip()] for row in links if 'Enrollment-by-Contract' in row.get('href')]
    df = df.append(pd.DataFrame(urls), sort=False).reset_index(drop=True)

Output:

print (df)
                                                     0        1
0    /Research-Statistics-Data-and-Systems/Statisti...  2019-10
1    /Research-Statistics-Data-and-Systems/Statisti...  2019-09
2    /Research-Statistics-Data-and-Systems/Statisti...  2019-08
3    /Research-Statistics-Data-and-Systems/Statisti...  2019-07
4    /Research-Statistics-Data-and-Systems/Statisti...  2019-06
5    /Research-Statistics-Data-and-Systems/Statisti...  2019-05
6    /Research-Statistics-Data-and-Systems/Statisti...  2019-04
7    /Research-Statistics-Data-and-Systems/Statisti...  2019-03
8    /Research-Statistics-Data-and-Systems/Statisti...  2019-02
9    /Research-Statistics-Data-and-Systems/Statisti...  2019-01
10   /Research-Statistics-Data-and-Systems/Statisti...  2018-12
11   /Research-Statistics-Data-and-Systems/Statisti...  2018-11
12   /Research-Statistics-Data-and-Systems/Statisti...  2018-10
13   /Research-Statistics-Data-and-Systems/Statisti...  2018-09
14   /Research-Statistics-Data-and-Systems/Statisti...  2018-08
15   /Research-Statistics-Data-and-Systems/Statisti...  2018-07
16   /Research-Statistics-Data-and-Systems/Statisti...  2018-06
17   /Research-Statistics-Data-and-Systems/Statisti...  2018-05
18   /Research-Statistics-Data-and-Systems/Statisti...  2018-04
19   /Research-Statistics-Data-and-Systems/Statisti...  2018-03
20   /Research-Statistics-Data-and-Systems/Statisti...  2018-02
21   /Research-Statistics-Data-and-Systems/Statisti...  2018-01
22   /Research-Statistics-Data-and-Systems/Statisti...  2017-12
23   /Research-Statistics-Data-and-Systems/Statisti...  2017-11
24   /Research-Statistics-Data-and-Systems/Statisti...  2017-10
25   /Research-Statistics-Data-and-Systems/Statisti...  2017-09
26   /Research-Statistics-Data-and-Systems/Statisti...  2017-08
27   /Research-Statistics-Data-and-Systems/Statisti...  2017-07
28   /Research-Statistics-Data-and-Systems/Statisti...  2017-06
29   /Research-Statistics-Data-and-Systems/Statisti...  2017-05
..                                                 ...      ...
129  /Research-Statistics-Data-and-Systems/Statisti...  2008-12
130  /Research-Statistics-Data-and-Systems/Statisti...  2008-11
131  /Research-Statistics-Data-and-Systems/Statisti...  2008-10
132  /Research-Statistics-Data-and-Systems/Statisti...  2008-09
133  /Research-Statistics-Data-and-Systems/Statisti...  2008-08
134  /Research-Statistics-Data-and-Systems/Statisti...  2008-07
135  /Research-Statistics-Data-and-Systems/Statisti...  2008-06
136  /Research-Statistics-Data-and-Systems/Statisti...  2008-05
137  /Research-Statistics-Data-and-Systems/Statisti...  2008-04
138  /Research-Statistics-Data-and-Systems/Statisti...  2008-03
139  /Research-Statistics-Data-and-Systems/Statisti...  2008-02
140  /Research-Statistics-Data-and-Systems/Statisti...  2008-01
141  /Research-Statistics-Data-and-Systems/Statisti...  2007-12
142  /Research-Statistics-Data-and-Systems/Statisti...  2007-11
143  /Research-Statistics-Data-and-Systems/Statisti...  2007-10
144  /Research-Statistics-Data-and-Systems/Statisti...  2007-09
145  /Research-Statistics-Data-and-Systems/Statisti...  2007-08
146  /Research-Statistics-Data-and-Systems/Statisti...  2007-07
147  /Research-Statistics-Data-and-Systems/Statisti...  2007-06
148  /Research-Statistics-Data-and-Systems/Statisti...  2007-05
149  /Research-Statistics-Data-and-Systems/Statisti...  2007-04
150  /Research-Statistics-Data-and-Systems/Statisti...  2007-03
151  /Research-Statistics-Data-and-Systems/Statisti...  2007-02
152  /Research-Statistics-Data-and-Systems/Statisti...  2007-01
153  /Research-Statistics-Data-and-Systems/Statisti...  2006-12
154  /Research-Statistics-Data-and-Systems/Statisti...  2006-11
155  /Research-Statistics-Data-and-Systems/Statisti...  2006-10
156  /Research-Statistics-Data-and-Systems/Statisti...  2006-09
157  /Research-Statistics-Data-and-Systems/Statisti...  2006-08
158  /Research-Statistics-Data-and-Systems/Statisti...  2012-11

[159 rows x 2 columns]

【Discussion】:

  • Thanks ChiTow88 --- the only remaining issue is why the months 2019-11 through 2020-04 are missing. That's the only problem I see. The "Contract Enrollment" text is the same for both.
  • By the way, the total row count should be 166. Maybe 165; for some reason I had trouble with the April 2016 link and can't figure out what's different about it. Worst case I'll add that one manually.
  • Repeatedly appending to a DataFrame in a loop is a bad idea. It's better to accumulate into some intermediate data structure and create the whole DataFrame at once.
  • @AMC, fair enough. It's an easy enough change to make, so I'll go back and change it. Can you elaborate on why?
  • @chitown88 There's some info on the topic here, for example. That answer is a bit old now, so I'll try running some benchmarks myself, though I doubt much has changed.
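A note for readers on current pandas: `DataFrame.append`, used in the loop above, was deprecated in pandas 1.4 and removed in 2.0. The pattern the comments recommend — accumulate in a plain list inside the loop, build the frame once — might be sketched like this (the per-page rows and column names are made up for illustration):

```python
import pandas as pd

# Made-up per-page results standing in for the scraped [href, period] pairs.
pages = [
    [['/link-a', '2019-10'], ['/link-b', '2019-09']],
    [['/link-c', '2019-08']],
]

rows = []
for page in pages:
    rows.extend(page)  # cheap list growth inside the loop

# One DataFrame construction at the end, instead of repeated append calls.
df = pd.DataFrame(rows, columns=['Link', 'ReportPeriod'])
print(df)
```

Each `append` call copied the entire existing frame, so the loop was quadratic in the number of pages; the list-then-construct version copies the data once.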