Info extraction - htmls (answers)

【Question title】: Info extraction - htmls
【Posted】: 2020-01-23 13:53:30
【Question】:

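I have a bunch of HTML files that all share the same layout. From these locally stored HTML files I want to extract the fields marked in yellow (example). Below is the text of the only div I am interested in; the full HTML can be found on Dropbox: https://www.dropbox.com/s/uka24w7o5006ole/transcript-86-855.html?dl=0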

<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P> 
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
<P><STRONG>Analysts</STRONG></P>

I don't know much Python, but I think this should be doable with Beautiful Soup. However, I'm stuck. What I have so far is:

import textwrap
import os
from bs4 import BeautifulSoup

directory ='C:/Research syntheses - Meta analysis/SeekingAlpha/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory,filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(),'html.parser')

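My output should be a csv file containing: name of executive / function of executive / ticker symbol / period

【Comments】:

Can you share the input HTML file as text rather than as an image?
@Alderven Done.

Tags: python html pandas csv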

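【Solution 1】:

The following code extracts the text from the yellow-marked positions.

I think the easiest way is to use XPath. As far as I know, bs4 does not support XPath, so the code uses lxml instead. I hope that difference is acceptable to you. The output file is named eggs.csv.

To make it work for you, change the directory variable.

*This works on Windows. On other platforms you will have to change the form of the "directory" variable.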
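One way to sidestep the platform-specific form of the "directory" variable is pathlib from the standard library, which accepts forward slashes on Windows as well. The following is a minimal sketch, not part of the original answer; the helper name html_files is hypothetical, and the path in the usage line is the asker's own directory from the question.

```python
from pathlib import Path


def html_files(directory):
    """Return the .html files directly inside `directory`, sorted by name."""
    folder = Path(directory)
    # Same filter as filename.endswith('.html'), but with path objects
    return sorted(p for p in folder.iterdir()
                  if p.is_file() and p.suffix == ".html")


if __name__ == "__main__":
    # Forward slashes are fine here, on Windows too
    for path in html_files("C:/Research syntheses - Meta analysis/SeekingAlpha/out"):
        print(path.name)
```

Each returned Path can be passed straight to open(), so the rest of the loop would not need to change.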

import textwrap
import os
from lxml import html
import csv

directory=r"C:\Users\Anita Pania\Desktop"
# Open the CSV once so that every HTML file contributes its own rows
with open('eggs.csv', 'w', newline='') as csvfile:
    filewriter = csv.writer(csvfile, delimiter=',',quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for filename in os.listdir(directory):
        if filename.endswith('.html'):
            fname = os.path.join(directory,filename)
            with open(fname, 'r') as f:
                page=f.read()
            tree = html.fromstring(page)
            # xpath() returns a list of matching text nodes;
            # y2 keeps only the first text node of the second <p>
            y1=(tree.xpath("/html/body/div/p[1]/a/text()"))
            y2=(tree.xpath("/html/body/div/p[2]/text()"))[0]
            y3=(tree.xpath("/html/body/div/p[5]/text()"))
            y4=(tree.xpath("/html/body/div/p[6]/text()"))
            y5=(tree.xpath("/html/body/div/p[7]/a/text()"))
            # One labelled row per extracted field, for this file
            filewriter.writerow(['Name of executive', y3])
            filewriter.writerow(['Function of executive', y4])
            filewriter.writerow(['Symbol ticker', y1])
            filewriter.writerow(['Period', y2])
            filewriter.writerow(['Other', y5])
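The question asks for one row per executive (name / function / ticker / period) rather than labelled lists. A minimal sketch of that reshaping follows. It assumes the paragraph texts of the article_participants div are already available as plain strings, and that every file follows the sample layout: the ticker in parentheses in the first paragraph, a period of the form "Q4 2014", and the executives listed between the "Executives" and "Analysts" headings. The helpers participant_rows and write_rows are hypothetical names and are not part of lxml or of the answer's code.

```python
import csv
import re
import sys


def participant_rows(paragraphs):
    """Return (name, function, ticker, period) tuples from the <P> texts."""
    texts = [t.strip() for t in paragraphs if t.strip()]

    # Ticker: the part after the exchange name, e.g. "(NASDAQ:RDHL)" -> "RDHL"
    ticker = ""
    if texts:
        match = re.search(r"\(\s*[A-Za-z]+\s*:\s*([A-Z.]+)\s*\)", texts[0])
        if match:
            ticker = match.group(1)

    # Period: the first "Q<1-4> <year>" found anywhere in the block
    period = ""
    for text in texts:
        match = re.search(r"Q[1-4]\s+\d{4}", text)
        if match:
            period = match.group(0)
            break

    # Executives: the lines between the "Executives" and "Analysts" headings
    rows = []
    in_executives = False
    for text in texts:
        if text == "Executives":
            in_executives = True
        elif text == "Analysts":
            in_executives = False
        elif in_executives:
            # Split on the first " - " only, so functions may contain commas
            name, _, function = text.partition(" - ")
            rows.append((name.strip(), function.strip(), ticker, period))
    return rows


def write_rows(rows, csvfile):
    """Write a header line plus one CSV line per executive."""
    writer = csv.writer(csvfile)
    writer.writerow(["Name of executive", "Function of executive", "Ticker", "Period"])
    writer.writerows(rows)


if __name__ == "__main__":
    sample = [
        "Redhill Biopharma Ltd. (NASDAQ:RDHL)",
        "Q4 2014 Earnings Conference Call",
        "February 26, 2015 9:00 AM ET",
        "Executives",
        "Dror Ben Asher - CEO",
        "Ori Shilo - Deputy CEO, Finance and Operations",
        "Guy Goldberg - Chief Business Officer",
        "Analysts",
    ]
    write_rows(participant_rows(sample), sys.stdout)
```

With lxml, the input list could probably be built as [p.text_content() for p in tree.xpath('//div[@id="article_participants"]/p')]. That has only been reasoned through against the snippet in the question, not run against the full transcript files.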

【Comments】:

I have Python 3.7 on Windows, so your code should work as is? @Phineas
I get the following error: File "C:\Research syntheses - Meta analysis\SeekingAlpha\retrieve text.py", line 6 directory=r"C:/Research syntheses - Meta analysis/SeekingAlpha/ out" ^ SyntaxError: invalid syntax [Finished in 0.1s with exit code 1]. My source file is here: dropbox.com/s/uka24w7o5006ole/transcript-86-855.html?dl=0
Try replacing / with \ . You may also need to specify .html?dl=0 instead of .html, but I'm not sure.
What is the output after replacing / with \ ?
File "C:\Research syntheses - Meta analysis\SeekingAlpha\retrieve text.py", line 6 directory=r "C:/Research syntheses - Meta analysis/SeekingAlpha/out" ^ SyntaxError: invalid syntax [Finished in 0.2s with exit code 1]
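The last traceback as quoted shows a space between the r prefix and the opening quote (directory=r "C:/..."). If that space is really in the script, it would explain the SyntaxError on its own, independent of the slash direction: a string prefix must touch its quote, otherwise r is parsed as a variable name followed by a separate string. The small check below is a sketch using the built-in compile(); the helper name compiles is hypothetical.

```python
def compiles(source):
    """Return True if `source` is syntactically valid Python."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


# Line 6 as it appears in the last traceback (space after the r prefix)
with_space = 'directory=r "C:/Research syntheses - Meta analysis/SeekingAlpha/out"'
# The same line with the prefix directly attached to the quote
without_space = 'directory=r"C:/Research syntheses - Meta analysis/SeekingAlpha/out"'

print(compiles(with_space))     # False
print(compiles(without_space))  # True
```

Forward slashes in the path are accepted by Python on Windows, so changing them to backslashes should not be necessary for the syntax to be valid.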