【Question title】: Scraping XML element attributes with BeautifulSoup
【Posted】: 2023-03-09 15:17:01
【Question】:

I have the following code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://api.stlouisfed.org/fred/...")
bsObj = BeautifulSoup(html.read(), "lxml")

print(bsObj)

It returns something like this:

<?xml version="1.0" encoding="utf-8" ?><html><body><observations count="276" file_type="xml" limit="100000" observation_end="9999-12-31" observation_start="1776-07-04" offset="0" order_by="observation_date" output_type="1" realtime_end="2016-06-22" realtime_start="2016-06-22" sort_order="asc" units="lin">
<observation date="1947-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"></observation>
<observation date="1947-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"></observation>
<observation date="1947-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6.4"></observation>
<observation date="1948-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6"></observation>
<observation date="1948-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6.7"></observation>
<observation date="1948-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="2.3"></observation>
<observation date="1948-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="0.4"></observation>
<observation date="1949-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-5.4"></observation>
<observation date="1949-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-1.3"></observation>
<observation date="1949-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="4.5"></observation>
<observation date="1949-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-3.5"></observation>
<observation date="1950-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="16.9"></observation>
<observation date="1950-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="12.7"></observation>
<observation date="1950-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="16.3"></observation>
</observations>
</body></html>

I only want to extract the "date" and "value" attributes, so that I end up with something like this:

1947-04-01 -0.4
1947-07-01 -0.4
1947-10-01 6.4
1948-01-01 6
and so on...

So far I have used replace to strip the text down, and the csv module to write the csv file:

string = str(bsObj)

string = string.replace("realtime_start=","")
string = string.replace("realtime_end=","")
string = string.replace("observation","")
string = string.replace("date=","")
string = string.replace('"2016-06-22"',"")
string = string.replace("value=","")
string = string.replace("<","")
string = string.replace(">","")
string = string.replace("/","")
string = string.replace('"',"")
print(string)

import csv
with open('test.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    data = string
    a.writerows(data)

This is pretty much a disaster. The text does get written to the csv, but every symbol ends up on its own line.

I would like to know whether there is a more elegant way to extract what I need. For example:

for line in f:
   extract "date" and "value"

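or something similar. What is the most appropriate way to write the result to a .csv file? The .csv file is overwritten every time this script is called. Fields must be separated by "," and rows by "\n".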

【Comments】:

    Tags: python python-3.x csv beautifulsoup elementtree


    【Solution 1】:

    The other answer works well. If you want to read from a URL instead of local data, the code looks like this:

    from bs4 import BeautifulSoup
    from urllib.request import urlopen
    html = urlopen("https://api.stlouisfed.org/fred/series/.......")
    bsObj = BeautifulSoup(html.read(), "lxml")
    
    for ob in bsObj.find_all("observation"):
        print(ob["date"])
        print(ob["value"])
    

    For the .csv:

    import csv
    with open("out.csv", "w", newline="") as f:
        csv.writer(f).writerows((ob["date"], ob["value"])
                                for ob in bsObj.find_all("observation"))
    
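For comparison with the attempt in the question: writerows expects an iterable of rows, and a plain string is an iterable of single characters, which is most likely why every symbol landed on its own line. The sketch below is only an illustration; the helper name rows_written and the in-memory buffer are assumptions made for the example, not part of the original code.

```python
import csv
import io


def rows_written(data):
    """Write data with csv.writer into an in-memory buffer and return the lines."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(data)
    return buf.getvalue().splitlines()


# A plain string is iterated character by character, so each character is a row.
as_string = rows_written("1947-04-01 -0.4")

# A list of (date, value) tuples gives one row per observation.
as_pairs = rows_written([("1947-04-01", "-0.4"), ("1947-07-01", "-0.4")])

print(as_string[:4])
print(as_pairs)
```

Passing (date, value) tuples, as the generator expression above does, avoids the problem because each tuple is treated as one row with two fields.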

    【Comments】:

      【Solution 2】:

      Find all the "observation" tags, then extract the attributes you want:

      x = """<?xml version="1.0" encoding="utf-8" ?><html><body><observations count="276" file_type="xml" limit="100000" observation_end="9999-12-31" observation_start="1776-07-04" offset="0" order_by="observation_date" output_type="1" realtime_end="2016-06-22" realtime_start="2016-06-22" sort_order="asc" units="lin">
      <observation date="1947-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"></observation>
      <observation date="1947-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"></observation>
      <observation date="1947-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6.4"></observation>
      <observation date="1948-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6"></observation>
      <observation date="1948-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6.7"></observation>
      <observation date="1948-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="2.3"></observation>
      <observation date="1948-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="0.4"></observation>
      <observation date="1949-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-5.4"></observation>
      <observation date="1949-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-1.3"></observation>
      <observation date="1949-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="4.5"></observation>
      <observation date="1949-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-3.5"></observation>
      <observation date="1950-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="16.9"></observation>
      <observation date="1950-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="12.7"></observation>
      <observation date="1950-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="16.3"></observation>
      </observations>
      </body></html>"""
      
      from bs4 import BeautifulSoup
      
      soup = BeautifulSoup(x, "lxml")
      
      for ob in soup.find_all("observation"):
          print(ob["date"])
          print(ob["value"])
      

      This gives you:

      1947-04-01
      -0.4
      1947-07-01
      -0.4
      1947-10-01
      6.4
      1948-01-01
      6
      1948-04-01
      6.7
      1948-07-01
      2.3
      1948-10-01
      0.4
      1949-01-01
      -5.4
      1949-04-01
      -1.3
      1949-07-01
      4.5
      1949-10-01
      -3.5
      1950-01-01
      16.9
      1950-04-01
      12.7
      1950-07-01
      16.3
      
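Since the question is also tagged elementtree, the same extraction can be sketched with the standard library alone. This assumes the raw response has observations as its root element (the html and body wrapper in the printed output comes from parsing with the "lxml" HTML parser); the variable raw_xml below is a hypothetical three-row excerpt built from the values in the question, not the real API response.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of the raw XML, without the <html>/<body> wrapper.
# Attribute values are copied from the question.
raw_xml = """<observations>
<observation date="1947-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"/>
<observation date="1947-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"/>
<observation date="1947-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6.4"/>
</observations>"""

root = ET.fromstring(raw_xml)

# iter() walks the whole tree, so this works wherever <observation> is nested.
pairs = [(ob.get("date"), ob.get("value")) for ob in root.iter("observation")]

# Print one "date value" pair per line, the layout asked for in the question.
for date, value in pairs:
    print(date, value)
```

The resulting list of tuples can be handed to csv.writer(...).writerows in the same way as the BeautifulSoup version.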

      To write to a csv:

      from bs4 import BeautifulSoup
      import csv
      
      soup = BeautifulSoup(x, "lxml")
      with open("out.csv", "w", newline="") as f:
          csv.writer(f).writerows((ob["date"], ob["value"])
                                  for ob in soup.find_all("observation"))
      

      This gives you a csv file containing:

      1947-04-01,-0.4
      1947-07-01,-0.4
      1947-10-01,6.4
      1948-01-01,6
      1948-04-01,6.7
      1948-07-01,2.3
      1948-10-01,0.4
      1949-01-01,-5.4
      1949-04-01,-1.3
      1949-07-01,4.5
      1949-10-01,-3.5
      1950-01-01,16.9
      1950-04-01,12.7
      1950-07-01,16.3
      
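The question asks for rows separated by "\n". By default csv.writer ends each row with "\r\n", so one way to meet that requirement is to pass lineterminator="\n". The following is a minimal sketch under that assumption; the function name write_observations, the sample pairs, and the temporary file path are made up for the example.

```python
import csv
import io
import os
import tempfile


def write_observations(fp, pairs):
    """Write (date, value) pairs as comma-separated rows ending in a line feed."""
    writer = csv.writer(fp, delimiter=",", lineterminator="\n")
    writer.writerows(pairs)


pairs = [("1947-04-01", "-0.4"), ("1947-07-01", "-0.4")]

# In-memory buffer, so the exact characters written can be inspected.
buf = io.StringIO()
write_observations(buf, pairs)
text = buf.getvalue()
print(repr(text))

# With a real file: mode "w" overwrites the file on every run, and newline=""
# stops Python from translating the line endings the writer produces.
path = os.path.join(tempfile.mkdtemp(), "out.csv")
with open(path, "w", newline="") as f:
    write_observations(f, pairs)

with open(path, newline="") as f:
    content = f.read()
```

Opening the file in "w" mode also covers the requirement that the .csv is rewritten each time the script runs.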

      【Comments】:
