【Question title】: Read URLs from another file and scrape the data - Bash
【Posted】: 2020-06-07 10:58:37
【Question description】:

I want to take the URLs from URL.txt and append each of them to the end of the base URL https://www.mcdelivery.com.pk/pk/browse/menu.html, which is set in another file, menu.sh.

The URL.txt file contains:

?daypartId=1&catId=1
?daypartId=1&catId=2
?daypartId=1&catId=11
?daypartId=1&catId=10
?daypartId=1&catId=6
?daypartId=1&catId=4
?daypartId=1&catId=14
?daypartId=1&catId=5
?daypartId=1&catId=3
?daypartId=1&catId=8

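I want the resulting URLs to look like https://www.mcdelivery.com.pk/pk/browse/menu.html?daypartId=1&catId=11, that is, the base URL followed by a line from URL.txt.

I came up with the code below, but the problem is that I only get the prices from the first page, and the values from that same page keep repeating until the loop ends.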

ARRAY=()
while read -r LINE
do
ARRAY+=("$LINE")
done < URL.txt
for LINE in "${ARRAY[@]}"
do   
echo $LINE
curl https://www.mcdelivery.com.pk/pk/browse/menu.html$LINE | grep -o '<span class="starting-price">.*</span>' | sed 's/<[^>]\+>//g' >> price.txt 
done

The output I get:

Rs 398
Rs 487
Rs 841
Rs 752
Rs 398
Rs 398
Rs 487
Rs 841
Rs 752
....

I want to get the prices from every page and store them in price.txt.

【Question comments】:

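  • Probably you just need to quote the URL.
  • Can you tell me how to do that?
  • Quoting = writing "$LINE" instead of $LINE; see also stackoverflow.com/q/29378566/6770384. However, I don't think that is what causes the problem you describe, »I only get the price from the first page«.
  • I cannot reproduce your problem. First, all URLs return the same page for me, no matter which catId I choose. Second, grepping those pages always returns things like McArabia with Drink, but never anything like Rs398487.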
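
To make the quoting advice from the comments concrete, here is a minimal sketch of what the shell does with an unquoted expansion. count_args and the value in SPACED are made up for the demonstration. Note that the & inside a variable is never treated as the background operator; the risks of leaving $LINE unquoted are word splitting and filename globbing (the ? is a glob character).

```bash
#!/bin/bash
# count_args is a hypothetical helper that reports how many arguments it got.
count_args() { echo "$#"; }

BASE='https://www.mcdelivery.com.pk/pk/browse/menu.html'
LINE='?daypartId=1&catId=1'

# Quoted: the URL reaches the command as exactly one argument, unchanged.
quoted=$(count_args "$BASE$LINE")

# Unquoted: the shell applies word splitting and filename globbing first.
# A made-up value containing a space shows the splitting.
SPACED='?q=big mac'
unquoted=$(count_args $BASE$SPACED)

echo "quoted: $quoted argument(s), unquoted: $unquoted argument(s)"
```

With the lines from URL.txt as shown in the question (no spaces), the unquoted form usually happens to work, which fits the commenter's doubt that quoting alone explains the repeated prices.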

Tags: arrays bash web-scraping readfile web-scraping-language


【Solution 1】:

Please don't use regular expressions to parse HTML. Use a real HTML parser / web scraper instead, such as xidel.
In fact, there is no need for a Bash script at all; xidel can do the whole job.

Parse the HTML of the "★What's New★" menu item and string-join price + product name:
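
As a small illustration of why the regex approach is fragile, the sketch below runs the asker's grep pattern on a made-up line of HTML that holds two prices. The sample markup and the strip_tags helper are assumptions for the demo, not taken from the real page.

```bash
#!/bin/bash
# Made-up sample: two prices on a single line of HTML.
html='<div><span class="starting-price">Rs 398</span><h5>Big Mac</h5><span class="starting-price">Rs 487</span></div>'

# Hypothetical helper: remove anything that looks like a tag.
strip_tags() { sed 's/<[^>]*>//g'; }

# Greedy ".*" runs from the first opening span to the last closing span,
# so the product name between the two prices is swallowed as well.
greedy=$(printf '%s\n' "$html" | grep -o '<span class="starting-price">.*</span>' | strip_tags)

# "[^<]*" stops at the next tag, so each price is matched on its own line.
per_span=$(printf '%s\n' "$html" | grep -o '<span class="starting-price">[^<]*</span>' | strip_tags)

printf 'greedy:\n%s\n' "$greedy"
printf 'per span:\n%s\n' "$per_span"
```

The tighter pattern only helps as long as the span contains no nested tags, which is exactly the kind of assumption a real HTML parser does not need.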

$ xidel -s "https://www.mcdelivery.com.pk/pk/browse/menu.html?daypartId=1&catId=1" -e '
  //div[ends-with(@class,"panel-product")]/join(
    (.//span[@class="starting-price"],.//h5),
    " - "
  )
'
Rs 288 - Cappuccino with Milk Chocolate Cookie
Rs 288 - Cappuccino with Double Chocolate Cookie
Rs 288 - Latte with Milk Chocolate Cookie
[...]
Rs 239 - Salted Caramel Shake

List all menu items and string-join URL + title:

$ xidel -s https://www.mcdelivery.com.pk/pk/browse/menu.html -e '
  //ul[@class="secondary-menu"]//a/join((resolve-uri(@href),span)," - ")
'
https://www.mcdelivery.com.pk/pk/browse/menu.html?daypartId=1&catId=12 - Deals
https://www.mcdelivery.com.pk/pk/browse/menu.html?daypartId=1&catId=1 - ★What's New★
https://www.mcdelivery.com.pk/pk/browse/menu.html?daypartId=1&catId=2 - Ala carte & Value Meals
[...]
https://www.mcdelivery.com.pk/pk/browse/menu.html?daypartId=1&catId=8 - Snack Time

For each menu item, string-join URL + title, then open the URL, parse the HTML and string-join price + product name:

$ xidel -s https://www.mcdelivery.com.pk/pk/browse/menu.html -e '
  //ul[@class="secondary-menu"]//a/(
    join((resolve-uri(@href),span)," - "),
    doc(@href)//div[ends-with(@class,"panel-product")]/join(
      (.//span[@class="starting-price"],.//h5),
      " - "
    )
  )
'
https://www.mcdelivery.com.pk/pk/browse/menu.html?daypartId=1&catId=12 - Deals
Rs 487 - Grand Chicken Spicy with Drink
Rs 398 - Big Mac + Regular Drink
https://www.mcdelivery.com.pk/pk/browse/menu.html?daypartId=1&catId=1 - ★What's New★
Rs 288 - Cappuccino with Milk Chocolate Cookie
Rs 288 - Cappuccino with Double Chocolate Cookie
Rs 288 - Latte with Milk Chocolate Cookie
Rs 288 - Latte with Double Chocolate Cookie
Rs 159 - McFizz Guava
Rs 195 - Date Pie
Rs 416 - Spicy McCrispy Deluxe - Regular Meal
Rs 416 - McChicken - Regular Meal
Rs 239 - Curly Fries
Rs 239 - Salted Caramel Shake
https://www.mcdelivery.com.pk/pk/browse/menu.html?daypartId=1&catId=2 - Ala carte & Value Meals
Rs 257 - Chicken Burger with Cheese
Rs 265 - Value McArabia Chicken
Rs 265 - Mini McRoyale
[...]
https://www.mcdelivery.com.pk/pk/browse/menu.html?daypartId=1&catId=8 - Snack Time
Rs 301 - Spicy Chicken Burger
Rs 301 - 4pcs McNuggets
Rs 301 - Fries & Drink
Rs 150 - Apple Pie with Tea


【Solution 2】:
    #!/bin/bash
    BASE=https://www.mcdelivery.com.pk/pk/browse/menu.html

    # Collect the query string of every menu category into URL.txt, one per line.
    curl -sL "$BASE" | grep -o '<li class="secondary-menu-item ">.*</li>' | sed 's/href=/\nhref=/g' | \
    grep 'href="' | \
    sed 's/.*href="//;s/".*//' > URL.txt
    sed -i 's/&amp;/\&/g' URL.txt   # decode the HTML-escaped ampersands

    # Fetch each category page once, then extract product names and prices from it.
    while read -r LINE
    do
        echo "$LINE"
        PAGE=$(curl -sL "$BASE$LINE")
        printf '%s\n' "$PAGE" | grep -o '<h5 class="product-title">.*</h5>' | sed 's/<[^>]\+>//g' >> name.txt
        printf '%s\n' "$PAGE" | grep -o '<span class="starting-price">.*</span>' | sed 's/<[^>]\+>//g' >> price.txt
    done < URL.txt

This is the answer to my question. Thanks, everyone, for the help!
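
The loop that reads URL.txt can be made a little more defensive. The sketch below is one way to do it, under two assumptions the thread does not confirm: that URL.txt might lack a trailing newline, and that it might have been saved with Windows line endings. build_urls is a hypothetical helper name.

```bash
#!/bin/bash
# Sketch: turn the query strings in a file into full URLs, one per line.
# build_urls is a hypothetical helper, not part of the answers above.
BASE_URL='https://www.mcdelivery.com.pk/pk/browse/menu.html'

build_urls() {
    local file=$1 line
    # "|| [[ -n $line ]]" keeps a final line that has no trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        line=${line%$'\r'}            # drop a trailing carriage return, if any
        [[ -z $line ]] && continue    # skip blank lines
        printf '%s%s\n' "$BASE_URL" "$line"
    done < "$file"
}
```

Each line printed by `build_urls URL.txt` can then be passed, quoted, to curl or xidel. A stray carriage return at the end of a query string is easy to miss because it is invisible in most editors.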
