【Question Title】: Parsing and extracting data into a pandas dataframe : BeautifulSoup and XML
【Posted】: 2017-01-27 12:31:51
【Question】:

I hope you can help me with this. I need to create a function that parses text and extracts the data into a pandas DataFrame:

"""
Function
--------
rcp_poll_data

Extract poll information from an XML string, and convert to a DataFrame

Parameters
----------
xml : str
    A string, containing the XML data from a page like 
    get_poll_xml(1044)

Returns
-------
A pandas DataFrame with the following columns:
    date: The date for each entry
    title_n: The data value for the gid=n graph (take the column name from the `title` tag)

This DataFrame should be sorted by date

Example
-------
Consider the following simple xml page:

<chart>
<series>
<value xid="0">1/27/2009</value>
<value xid="1">1/28/2009</value>
</series>
<graphs>
<graph gid="1" color="#000000" balloon_color="#000000" title="Approve">
<value xid="0">63.3</value>
<value xid="1">63.3</value>
</graph>
<graph gid="2" color="#FF0000" balloon_color="#FF0000" title="Disapprove">
<value xid="0">20.0</value>
<value xid="1">20.0</value>
</graph>
</graphs>
</chart>

Given this string, rcp_poll_data should return
result = pd.DataFrame({'date': pd.to_datetime(['1/27/2009', '1/28/2009']), 
                       'Approve': [63.3, 63.3], 'Disapprove': [20.0, 20.0]})

My code

def rcp_poll_data(xml):
    soup = BeautifulSoup(xml,'xml')
    dates=soup.find("series")
    datesval=soup.findChildren(string=True)
    del datesval[-7:]
    obama=soup.find("graph",gid="1")
    obamaval={"title":obama["title"],"color":obama["color"]}
    romney=soup.find("graph",gid="2")
    romneyval={"title":romney["title"],"color":romney["color"]}
    result = pd.DataFrame({'date': pd.to_datetime(datesval,errors="ignore"), 'GID1':obamaval, 'GID2':romneyval})
    return result

"""
But whenever I run the program I keep getting this error: Mixing dicts with non-Series may lead to ambiguous ordering.

Please help! PS: the get_poll_xml function looks like this:

def get_poll_xml(poll_id):
    url="http://charts.realclearpolitics.com/charts/"+str(poll_id)+".xml"
    return requests.get(url).content

Take poll_id=1044 as an example.
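For context, the ValueError quoted in the question can be reproduced without any XML. The sketch below assumes the cause is that a plain dict (like `obamaval`) is passed as a column value next to a list-like of dates: pandas reads a dict as label-to-value pairs, while the date array carries no labels, so the two cannot be aligned. The column values used here are illustrative only.

```python
import pandas as pd

dates = pd.to_datetime(["1/27/2009", "1/28/2009"])

error_message = None
try:
    # A dict column is interpreted as {index label: value}; mixing it with an
    # unlabeled list-like column is what pandas refuses to do.
    pd.DataFrame({"date": dates,
                  "GID1": {"title": "Approve", "color": "#000000"}})
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# Passing equal-length list-like columns avoids the problem.
ok = pd.DataFrame({"date": dates, "Approve": [63.3, 63.3]})
print(ok)
```

In other words, each column should be a sequence of the per-date values rather than a dict of the graph's attributes.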

【Comments】:

    Tags: python parsing pandas dataframe beautifulsoup


    【Solution 1】:

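    Consider using the built-in xml.etree.ElementTree instead of BeautifulSoup (which is better suited to HTML web scraping) to parse the XML content. It provides methods such as iterfind, findall, and find that walk child nodes using a limited subset of XPath, including predicates such as @gid='1'. Because the <value> elements under the <series> and <graph> parent tags have the same length, you can loop over them together with zip():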

    import requests
    import pandas as pd
    import xml.etree.ElementTree as et
    
    def get_poll_xml(poll_id):
        url="http://charts.realclearpolitics.com/charts/{}.xml".format(poll_id)
        return requests.get(url).content
    
    def rcp_poll_data(xml):
        # Parse the XML string (or bytes) into an element tree
        tree = et.fromstring(xml)
    
        dates, graphlist1, graphlist2 = [], [], []
    
        # Column names come from the title attribute of each <graph>
        g1title = tree.find("./graphs/graph[@gid='1']").get('title')
        g2title = tree.find("./graphs/graph[@gid='2']").get('title')
    
        # The three <value> sequences have the same length, so walk them in parallel
        for s, g1, g2 in zip(tree.iterfind("./series/value"),
                             tree.iterfind("./graphs/graph[@gid='1']/value"),
                             tree.iterfind("./graphs/graph[@gid='2']/value")):
            dates.append(s.text)
            graphlist1.append(g1.text)
            graphlist2.append(g2.text)
    
        return pd.DataFrame({'Date':pd.to_datetime(dates, errors="ignore"),
                             g1title: graphlist1,
                             g2title: graphlist2})
    
    poll_id = 1044
    xml_str = get_poll_xml(poll_id)
    df = rcp_poll_data(xml_str)
    

    Output

    print(df.head(20))
    
    #    Approve       Date Disapprove
    # 0     63.3 2009-01-27       20.0
    # 1     63.3 2009-01-28       20.0
    # 2     63.5 2009-01-29       19.3
    # 3     63.5 2009-01-30       19.3
    # 4     61.8 2009-01-31       19.4
    # 5     61.8 2009-02-01       19.4
    # 6     61.8 2009-02-02       19.4
    # 7     61.8 2009-02-03       19.4
    # 8     61.8 2009-02-04       19.4
    # 9     61.8 2009-02-05       19.4
    # 10    61.6 2009-02-06       21.4
    # 11    61.6 2009-02-07       21.4
    # 12    61.6 2009-02-08       21.4
    # 13    65.4 2009-02-09       22.6
    # 14    65.4 2009-02-10       22.6
    # 15    64.2 2009-02-11       23.3
    # 16    64.2 2009-02-12       23.3
    # 17    64.2 2009-02-13       23.3
    # 18    64.8 2009-02-14       25.4
    # 19    65.5 2009-02-15       25.5
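The function above hard-codes two graphs and leaves the poll values as strings. As a possible extension, here is a minimal sketch that loops over every `<graph>` element, converts the values to numbers, names the date column `date`, and sorts by date, as the docstring in the question asks. The name `rcp_poll_data_general` and the `SAMPLE_XML` constant are hypothetical. The sketch assumes every graph has as many `<value>` entries as `<series>` and that graph titles are unique. It also drops `errors="ignore"`, which recent pandas releases deprecate, so unparseable dates raise instead.

```python
import xml.etree.ElementTree as et

import pandas as pd

# Sample document copied from the docstring in the question.
SAMPLE_XML = """<chart>
<series>
<value xid="0">1/27/2009</value>
<value xid="1">1/28/2009</value>
</series>
<graphs>
<graph gid="1" color="#000000" balloon_color="#000000" title="Approve">
<value xid="0">63.3</value>
<value xid="1">63.3</value>
</graph>
<graph gid="2" color="#FF0000" balloon_color="#FF0000" title="Disapprove">
<value xid="0">20.0</value>
<value xid="1">20.0</value>
</graph>
</graphs>
</chart>"""


def rcp_poll_data_general(xml):
    """Hypothetical variant of rcp_poll_data that handles any number of graphs."""
    root = et.fromstring(xml)

    # One list per column: the dates first, then one column per <graph>,
    # named after its title attribute.
    columns = {"date": [v.text for v in root.iterfind("./series/value")]}
    for graph in root.iterfind("./graphs/graph"):
        columns[graph.get("title")] = [v.text for v in graph.iterfind("./value")]

    df = pd.DataFrame(columns)
    df["date"] = pd.to_datetime(df["date"])

    # Empty or malformed values become NaN instead of raising.
    for col in df.columns.drop("date"):
        df[col] = pd.to_numeric(df[col], errors="coerce")

    return df.sort_values("date").reset_index(drop=True)


df = rcp_poll_data_general(SAMPLE_XML)
print(df)
```

With the real feed, the same function could be called as `rcp_poll_data_general(get_poll_xml(1044))`, since `et.fromstring` accepts the bytes returned by `requests`.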
    

    【Comments】:

    • Wow, thank you so much. I didn't know about xml.etree.ElementTree; thanks for pointing it out!