【Question title】: Extracting text from specific paragraphs of the website with Python 2
【Posted】: 2017-07-17 04:07:33
【Question】:

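I want to extract the paragraphs that list the industries reporting growth and contraction, what respondents are saying, and so on (these appear in several places on the page). The paragraphs usually sit just above a table. How can I parse the page with Requests, lxml and BeautifulSoup and select only the paragraphs I need?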

https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655

I tried lxml with XPath, but the site changes slightly every month when a new report is published, and the code stops working.

【Question comments】:

    Tags: python parsing beautifulsoup python-requests lxml


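    【Solution 1】:

    A third solution is to use PyQuery. It is fast and uses the same selector syntax as jQuery. You can find the selectors easily with the SelectorGadget extension for Chrome.

    After that, all that is left is to use it.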

    from pyquery import PyQuery as pq
    import requests
    
    url = "https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1"
    content = requests.get(url).content
    doc = pq(content)
    
    respondent = doc(".formatted_content ul").text()
    
    print(respondent)
    

    Output:

    “Demand very steady to start the year.” (Chemical Products) “January revenue target slightly lower following a big December shipment month.” (Computer & Electronic Products) “Strong start to the new year. Production is increasing and we are adding capacity.” (Plastics & Rubber Products) “Business looks stronger moving into the first quarter of 2017.” (Primary Metals) “Economic outlook remains stable and no current effects of geopolitical changes appear to be penetrating market conditions.” (Food, Beverage & Tobacco Products) “Sales bookings are exceeding expectations. We are starting to see supply shortages in hot rolled steel due to the curtailment of imports.” (Machinery) “Year starting on pace with Q4 2016.” (Transportation Equipment) “Business conditions are good, demand is generally increasing.” (Miscellaneous Manufacturing) “Conditions and outlook remain positive. Raw material prices are stable resulting in stable margins. Asset utilization remains high.” (Petroleum & Coal Products) “Steady demand from automotive.” (Fabricated Metal Products)
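
The output above is one long string (`.text()` returns an ordinary Python string), so the usual string and `re` tools apply to it. A minimal sketch of splitting it into separate entries, assuming Python 3 and assuming every entry keeps the layout seen above (a comment in curly quotes followed by the industry in parentheses). The names `PAIR` and `split_comments` are illustrative only and are not part of pyquery:

```python
import re

# Sample text copied from the output above (shortened to three entries).
text = (
    "“Demand very steady to start the year.” (Chemical Products) "
    "“Year starting on pace with Q4 2016.” (Transportation Equipment) "
    "“Steady demand from automotive.” (Fabricated Metal Products)"
)

# Assumed layout: a curly-quoted comment, then the industry in parentheses.
PAIR = re.compile(r"“(.+?)”\s*\(([^)]+)\)")


def split_comments(joined):
    """Return a list of (quote, industry) tuples found in the joined text."""
    return PAIR.findall(joined)


pairs = split_comments(text)
for quote, industry in pairs:
    print(industry, "->", quote)
```

If the page ever uses straight quotes or omits the industry, the pattern will silently skip that entry, so it is worth checking the number of pairs against the number of list items.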
    

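    【Comments】:

    • Is it possible to get the same result as a string rather than as text? Suppose I need to extract substrings and so on?
    【Solution 2】:

    How close does this code come to what you want?

    It uses a regular expression to identify the paragraphs, and the line that precedes the list of things respondents are saying. Then it simply displays the results.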

    >>> import requests
    >>> URL = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
    >>> r = requests.get(URL)
    >>> page = r.text
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(page, 'lxml')
    >>> import re
    >>> paras = soup.find_all('p', string=re.compile(r'(?:growth|contraction).*? are:'))
    >>> saying = soup.find_all('strong', string=re.compile('WHAT RESPONDENTS ARE SAYING'))[0]
    >>> for i, para in enumerate(paras):
    ...     'Paragraph ', i
    ...     para
    ...     
    ('Paragraph ', 0)
    <p>Of the 18 manufacturing industries, 12 reported growth in January in the following order: Plastics &amp; Rubber Products; Miscellaneous Manufacturing; Apparel, Leather &amp; Allied Products; Paper Products; Chemical Products; Transportation Equipment; Food, Beverage &amp; Tobacco Products; Machinery; Petroleum &amp; Coal Products; Primary Metals; Fabricated Metal Products; and Computer &amp; Electronic Products. The five industries reporting contraction in January are: Nonmetallic Mineral Products; Wood Products; Furniture &amp; Related Products; Electrical Equipment, Appliances &amp; Components; and Printing &amp; Related Support Activities.</p>
    ('Paragraph ', 1)
    <p>The 12 industries reporting growth in new orders in January — listed in order — are: Plastics &amp; Rubber Products; Apparel, Leather &amp; Allied Products; Miscellaneous Manufacturing; Chemical Products; Paper Products; Transportation Equipment; Electrical Equipment, Appliances &amp; Components; Petroleum &amp; Coal Products; Primary Metals; Machinery; Fabricated Metal Products; and Food, Beverage &amp; Tobacco Products. The five industries reporting a decrease in new orders during January are: Nonmetallic Mineral Products; Wood Products; Textile Mills; Computer &amp; Electronic Products; and Furniture &amp; Related Products.</p>
    ('Paragraph ', 2)
    <p>The 10 industries reporting growth in production during the month of January — listed in order — are: Miscellaneous Manufacturing; Apparel, Leather &amp; Allied Products; Paper Products; Petroleum &amp; Coal Products; Plastics &amp; Rubber Products; Transportation Equipment; Chemical Products; Machinery; Food, Beverage &amp; Tobacco Products; and Computer &amp; Electronic Products. The five industries reporting a decrease in production during January are: Wood Products; Textile Mills; Nonmetallic Mineral Products; Electrical Equipment, Appliances &amp; Components; and Furniture &amp; Related Products.</p>
    ('Paragraph ', 3)
    <p>Of the 18 manufacturing industries, the 10 reporting employment growth in January — listed in order — are: Textile Mills; Paper Products; Food, Beverage &amp; Tobacco Products; Machinery; Electrical Equipment, Appliances &amp; Components; Chemical Products; Miscellaneous Manufacturing; Transportation Equipment; Computer &amp; Electronic Products; and Nonmetallic Mineral Products. The five industries reporting a decrease in employment in January are: Plastics &amp; Rubber Products; Petroleum &amp; Coal Products; Primary Metals; Fabricated Metal Products; and Printing &amp; Related Support Activities. </p>
    ('Paragraph ', 4)
    <p>The seven industries reporting growth in order backlogs in January — listed in order — are: Wood Products; Plastics &amp; Rubber Products; Electrical Equipment, Appliances &amp; Components; Primary Metals; Fabricated Metal Products; Miscellaneous Manufacturing; and Chemical Products. The seven industries reporting a decrease in order backlogs during January — listed in order — are: Nonmetallic Mineral Products; Textile Mills; Paper Products; Computer &amp; Electronic Products; Food, Beverage &amp; Tobacco Products; Transportation Equipment; and Furniture &amp; Related Products.</p>
    ('Paragraph ', 5)
    <p>The eight industries reporting growth in new export orders in January — listed in order — are: Wood Products; Paper Products; Petroleum &amp; Coal Products; Chemical Products; Fabricated Metal Products; Transportation Equipment; Miscellaneous Manufacturing; and Food, Beverage &amp; Tobacco Products. The four industries reporting a decrease in new export orders during January are: Textile Mills; Nonmetallic Mineral Products; Plastics &amp; Rubber Products; and Machinery. Six industries reported no change in new export orders in January compared to December.</p>
    ('Paragraph ', 6)
    <p>The four industries reporting growth in imports during the month of January are: Furniture &amp; Related Products; Apparel, Leather &amp; Allied Products; Fabricated Metal Products; and Food, Beverage &amp; Tobacco Products. The five industries reporting a decrease in imports during January are: Plastics &amp; Rubber Products; Primary Metals; Nonmetallic Mineral Products; Transportation Equipment; and Computer &amp; Electronic Products. Eight industries reported no change in imports in January compared to December.</p>
    >>> saying.findNextSibling()
    <ul style="list-style-type: square;">
    <li>“Demand very steady to start the year.” (Chemical Products)</li>
    <li>“January revenue target slightly lower following a big December shipment month.” (Computer &amp; Electronic Products)</li>
    <li>“Strong start to the new year. Production is increasing and we are adding capacity.” (Plastics &amp; Rubber Products)</li>
    <li>“Business looks stronger moving into the first quarter of 2017.” (Primary Metals)</li>
    <li>“Economic outlook remains stable and no current effects of geopolitical changes appear to be penetrating market conditions.” (Food, Beverage &amp; Tobacco Products)</li>
    <li>“Sales bookings are exceeding expectations. We are starting to see supply shortages in hot rolled steel due to the curtailment of imports.” (Machinery)</li>
    <li>“Year starting on pace with Q4 2016.” (Transportation Equipment)</li>
    <li>“Business conditions are good, demand is generally increasing.” (Miscellaneous Manufacturing)</li>
    <li>“Conditions and outlook remain positive. Raw material prices are stable resulting in stable margins. Asset utilization remains high.” (Petroleum &amp; Coal Products)</li>
    <li>“Steady demand from automotive.” (Fabricated Metal Products)</li>
    </ul>
    >>> 
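
A note on alternation in patterns of this kind: `|` binds more loosely than everything around it, so where the group is placed changes what is matched. The sketch below compares two spellings on two sentences. The second sentence is made up for illustration, and the names `loose` and `strict` are hypothetical:

```python
import re

# Ungrouped: '|' splits the WHOLE pattern, so this means
# "growth"  OR  "contraction ... are:".
loose = re.compile(r"(?:growth)|(?:contraction).*? are:")

# Grouped: either keyword must later be followed by " are:".
strict = re.compile(r"(?:growth|contraction).*? are:")

samples = [
    "The five industries reporting contraction in January are: Wood Products.",
    "Economic growth continued for another month.",  # made-up sentence
]

for s in samples:
    print(bool(loose.search(s)), bool(strict.search(s)), s)
```

The ungrouped form accepts any paragraph that merely mentions "growth", while the grouped form also requires the " are:" that introduces a list of industries.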
    

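    【Comments】:

    • Is it a good idea to use regular expressions to parse and extract from a website that is updated periodically? I have read several posts that strongly criticize using regex for this purpose: stackoverflow.com/a/1732454/4399016 I used lxml, XPath and urllib2. It broke as soon as the latest report was published the following month. Anyway, thanks for your effort.
    • We can only answer the questions that people like you ask. You implied that "growth" and "contraction" are the key words in these pages. I think you will find that the issue is not the use of regex but regularity.
    • I see your point. I need to capture different paragraphs on a page whose content changes periodically. I actually waited a month for the new report just to see whether the code was robust.
    • Regex cannot parse HTML: quite right. But there is a difference between parsing HTML and extracting, from text, certain paragraphs that contain certain words. For that, regex is a fine tool. As Bill said, the issue is regularity. If the paragraphs you want always contain some exact words, then regex is appropriate. But if they are always the first HTML list above the main table, then a CSS or XPath selector will be more robust. A few sample texts are needed to identify the pattern and then choose the best approach.
    • One last point about XPath: if you use Chrome's developer tools, they will tell you that the XPath of the list you want is //*[@id="home_feature_container"]/div/div[2]/div/ul. That is its current location, but it may not be next month. That is probably why your scraper worked only once.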
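
The point made in the last two comments (anchor on something stable, such as the heading text, rather than on an absolute position like `div[2]`) can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard library's `html.parser` instead of lxml or BeautifulSoup so that it runs without extra packages, the names `ListAfterMarker` and `first_list_after` are hypothetical, and the HTML is a made-up sample rather than the real page. It does not handle nested lists.

```python
from html.parser import HTMLParser


class ListAfterMarker(HTMLParser):
    """Collect the <li> texts of the first <ul> that follows a marker text."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.seen_marker = False  # True once the marker text has been read
        self.in_list = False      # True while inside the target <ul>
        self.done = False         # True after the target <ul> has closed
        self.items = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "ul" and self.seen_marker:
            self.in_list = True
        elif tag == "li" and self.in_list:
            self.items.append("")  # start collecting a new list item

    def handle_endtag(self, tag):
        if tag == "ul" and self.in_list:
            self.in_list = False
            self.done = True

    def handle_data(self, data):
        if self.done:
            return
        if self.in_list and self.items:
            self.items[-1] += data
        elif self.marker in data:
            self.seen_marker = True


def first_list_after(html, marker):
    """Return the stripped item texts of the first <ul> after `marker`."""
    parser = ListAfterMarker(marker)
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.items]


# Made-up page: the wrapper divs can move around without affecting the result.
SAMPLE = """
<div><ul><li>Unrelated navigation item</li></ul></div>
<div class="moved-wrapper">
  <p><strong>WHAT RESPONDENTS ARE SAYING</strong></p>
  <ul>
    <li>“Demand very steady to start the year.” (Chemical Products)</li>
    <li>“Steady demand from automotive.” (Fabricated Metal Products)</li>
  </ul>
</div>
<ul><li>Footer link</li></ul>
"""

quotes = first_list_after(SAMPLE, "WHAT RESPONDENTS ARE SAYING")
print(quotes)
```

The lookup depends only on the heading text and on the list coming after it, so it keeps working if the surrounding containers are renumbered or renamed. It still breaks if the heading wording changes.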