【Question Title】: How to write regex for xpath in scrapy?
【Posted】: 2018-09-10 12:04:06
【Question】:

I am new to Scrapy and am using it to extract questions and answers from a web page. I started with the following page:

I tried using selectors after inspecting the elements' XPaths:

sel = Selector(text=response.body)
spanList = sel.xpath('//a/span').extract()

But doing this gives me some duplicated content; the output looks like this:

"<span>How do I access my account online at Citibank Online?</span>",
"<span>What are the guidelines for creating an internet password?</span>",
"<span>I forgot my User ID for accessing my account online. How do I access my account online now?</span>",
"<span>How do I transfer funds to another bank account in India?</span>",
"<span>How do I transfer funds to my Rupee Checking Account from overseas?</span>",
"<span>How do I transfer funds from my Rupee Checking Account to my local bank account overseas?</span>",
"<span>How do I update my contact information?</span>",
"<span>I have not operated my Rupee Checking Account for a long time and I plan to visit India. Can I transact on my account when I visit India?</span>",
"<span>My Term Deposits with Citibank are due to mature soon. What do I need to do?</span>",
"<span>I would like to terminate my Term Deposits before maturity? Will I lose any money?</span>",
"<span>Why do I need to provide \"Customer Profile Update\" forms so often?</span>",
"<span>How do I access my account online at Citibank Online?</span>",
"<span>What are the guidelines for creating an internet password?</span>",
"<span>I forgot my User ID for accessing my account online. How do I access my account online now?</span>",
..................

If you look at the output I posted, the first and third spans are repeated.

Is there a way to write a good regex or selector to get the content without duplicates?

Sample XPaths of the questions on the page I mentioned are:

/html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/div[3]/div[1]/div[3]/div[1]/a/span

/html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/div[3]/div[1]/div[5]/div[5]/div[1]/a/span

/html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/div[3]/div[1]/div[5]/div[1]/div[1]/a/span

【Comments】:

    Tags: python-3.x web-scraping scrapy scrapy-spider


    【Solution 1】:

    Take a look at this:

    points = response.xpath('//*[@class="ClsInnerDrop"]//span/text()').extract()
    points = set(points)  # removes duplicates, but note that a set does not preserve order
    

    【Discussion】:

    • It does not preserve the order of the output. How can I preserve the order while extracting?
    • Scrapy is asynchronous; it will not work in order. You add some kind of salt while scraping that needs to be removed while processing.
    • @Satyaaditya `set` is probably what loses the order; you could write your own algorithm that works exactly like a set without losing the order of the elements.
    • Yes, `set` was the culprit in my algorithm, and I have changed it. Is there any way in Scrapy to return nodes instead of text? I tried extracting the text with its tags and it returns strings.
    • @Satyaaditya I am not getting what you want.
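    As the discussion notes, `set()` throws away the order. A common order-preserving alternative in plain Python (shown here with sample data standing in for the extracted span texts) is `dict.fromkeys`, since dict keys keep insertion order in Python 3.7+:

```python
# Sample data standing in for the extracted span texts.
points = [
    "How do I access my account online at Citibank Online?",
    "What are the guidelines for creating an internet password?",
    "How do I access my account online at Citibank Online?",  # duplicate
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this deduplicates without reshuffling the results.
unique_points = list(dict.fromkeys(points))
print(unique_points)
```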