【Question Title】: How to get the text of all pages from PDF page links using Selenium IDE and Python
【Posted】: 2015-03-13 13:26:31
【Question】:

I am trying to extract the text of PDF pages: I click the PDF page links one by one using XPath with Selenium IDE and Python, but it gives me empty data. Sometimes it returns the content of a single PDF page, but not in any particular format.

How can I get the text of all pages from a PDF link?

Here is my code:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC 

url = "http://www.incredibleindia.org"
driver = webdriver.Firefox()
driver.get(url) 
# wait for the menu to be loaded
WebDriverWait(driver,10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.menu li > a")))

#article under media tab 
article_link = [a.get_attribute('href') for a in driver.find_elements_by_xpath("html/body/div[3]/div/div[1]/div[2]/ul/li[3]/ul/li[6]/a")]
#all important news links under trade tab 
for link in article_link:
    print link
    driver.get(link) 
    #check article sublinks css available on article link page
    try:
         WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.article-full-div")))
    except TimeoutException:
         print driver.title, "No news links under media tab"
    #article sub links under the article tab
    article_sub_links = [a.get_attribute('href') for a in driver.find_elements_by_xpath(".//*[@id='article-content']/div/div[2]/ul/li/a")]

    print "article sub links"
    for link in article_sub_links:
        print link

        driver.get(link)  
        try:
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.textLayer")))
        except TimeoutException:
            print driver.title, "No news links under media tab"

        content = [a.text for a in driver.find_elements_by_xpath(".//*[contains(@id,'pageContainer')]")] 
        print content 
        for data in content:
            print data

Output:

http://www.incredibleindia.org/en/media-black-2/articles
article sub links
http://www.incredibleindia.org/images/articles/Ajanta.pdf
[u'', u'', u'']



http://www.incredibleindia.org/images/articles/Bedhaghat.pdf
404 - Error: 404 No news links under media tab
[]
http://www.incredibleindia.org/images/articles/Bellur.pdf
[u'', u'', u'']



http://www.incredibleindia.org/images/articles/Bidar.pdf
[u'', u'', u'']



http://www.incredibleindia.org/images/articles/Braj.pdf
[u'', u'', u'', u'']




http://www.incredibleindia.org/images/articles/Carnival.pdf
[u'', u'', u'']

【Question Comments】:

    Tags: python selenium xpath


    【Solution 1】:

    I think you need to go inside the "textLayer" (the div element with class="textLayer" inside each page container). You also need to continue in the exception-handling block:

    for link in article_sub_links:
        driver.get(link)
    
        try:
            WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.textLayer")))
        except TimeoutException:
            print driver.title, "Empty content"
            continue
    
        content = [a.text for a in driver.find_elements_by_css_selector("div[id^=pageContainer] div.textLayer")]
        for data in content:
            print data
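
    As a possible refinement (not part of the original answer): PDF.js fills each div.textLayer asynchronously, so waiting on the presence or visibility of the first layer can succeed while the layers are still empty, which would explain the intermittent `[u'', u'', u'']` results. Below is a minimal sketch of a custom wait condition that only succeeds once every text layer actually contains text. `text_layers_have_text` is a hypothetical helper, and `FakeDriver`/`FakeElement` are stand-ins used only to demonstrate the callable without a live browser:

```python
def text_layers_have_text(driver):
    """Return the textLayer elements once every layer contains text, else False.

    Shaped like a Selenium expected condition: a callable taking the
    driver, suitable for WebDriverWait(driver, timeout).until(...).
    """
    layers = driver.find_elements_by_css_selector(
        "div[id^=pageContainer] div.textLayer")
    if layers and all(layer.text.strip() for layer in layers):
        return layers
    return False


# Stand-ins for demonstration only; a real run would use the Selenium driver.
class FakeElement(object):
    def __init__(self, text):
        self.text = text


class FakeDriver(object):
    def __init__(self, texts):
        self._texts = texts

    def find_elements_by_css_selector(self, selector):
        return [FakeElement(t) for t in self._texts]


# Still rendering: one layer is empty, so the condition reports False
# and WebDriverWait would keep polling.
print(text_layers_have_text(FakeDriver(["Ajanta caves ...", ""])))
# Fully rendered: the condition returns the elements themselves.
print(text_layers_have_text(FakeDriver(["page 1 text", "page 2 text"])))
```

    With a real driver the call would be `content = WebDriverWait(driver, 10).until(text_layers_have_text)`, after which `content` holds the non-empty textLayer elements.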
    

    【Discussion】:

    • It is working for some PDF links but not for all of them. It prints the PDF links, and the content for some of them, but not the content of every PDF link.
    • @user3902208 Thanks, can you provide a sample link where this does not work?
    • Output: http://www.incredibleindia.org/en/media-black-2/articles article sub links http://www.incredibleindia.org/images/articles/Ajanta.pdf http://www.incredibleindia.org/images/articles/Bedhaghat.pdf 404 - Error: 404 Empty content http://www.incredibleindia.org/images/articles/Bellur.pdf http://www.incredibleindia.org/images/articles/Gir.pdf http://www.incredibleindia.org/images/articles/Hampi.pdf http://www.incredibleindia.org/images/articles/Orchha.pdf **It did not show content except this link**
    • @user3902208 Not sure if it helps, but try changing presence_of_element_located to visibility_of_element_located.