【Question Title】: Why Does This Scrape Stop After 1st Iteration?
【Posted】: 2021-10-12 14:44:11
【Question Description】:

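My code visits a page where each row may or may not have a drop-down containing more information.

I use a try/except statement to check for this.

It works for row 1, but not for row 2. Why?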

import requests
from bs4 import BeautifulSoup as bs
import re
import pandas as pd

gg=[]
r = requests.get('https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page=2')
soup = bs(r.text, 'lxml')
sessions = soup.select('#accordin > ul > li')

for session in sessions:
    jj=(session.select_one('h4').text)
    print(jj)
    sub_session = session.select('.sub_accordin_presentation')
    try:
        if sub_session:
            kk=([re.sub(r'[\n\s]+', ' ', i.text) for i in sub_session])
            print(kk)
    except:
        kk=' '
    dict={"Title":jj,"Sub":kk}
    gg.append(dict)

df=pd.DataFrame(gg)
df.to_csv('test2.csv')

【Question Discussion】:

    Tags: web-scraping beautifulsoup request css-selectors re


    【Solution 1】:

    To get all sessions and their sub-sessions, try:

    import requests
    from bs4 import BeautifulSoup as bs
    import pandas as pd
    
    r = requests.get(
        "https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page=2"
    )
    soup = bs(r.text, "lxml")
    sessions = soup.select("#accordin > ul > li")
    
    gg = []
    for session in sessions:
        jj = session.h4.get_text(strip=True, separator=" ")
        sub_sessions = session.select(".sub_accordin_presentation")
    
        if sub_sessions:
            for sub_session in sub_sessions:
                gg.append(
                    {
                        "Title": jj,
                        "Sub": sub_session.h4.get_text(strip=True, separator=" "),
                    }
                )
        else:
            gg.append(
                {
                    "Title": jj,
                    "Sub": "None",
                }
            )
    
    
    df = pd.DataFrame(gg)
    df.to_csv("data.csv", index=False)
    print(df)
    

    Prints:

                                                                                                                                                                                                        Title                                                                                                                                                      Sub
    0                                                                                            IS05 - Industry Symposium Sponsored by Amgen: Advancing Lung Cancer Treatment with Novel Therapeutic Targets                                                                                                                                                     None
    1                                 IS06 - Industry Symposium Sponsored by Jazz Pharmaceuticals: Exploring a Treatment Option for Patients with Previously Treated Metastatic Small Cell Lung Cancer (SCLC)                                                                                                                                                     None
    2                                                                                      IS07 - Satellite CME Symposium by Sanofi Genzyme: On the Frontline: Immunotherapeutic Approaches in Advanced NSCLC                                                                                                                                                     None
    3                                                                                             PL02A - Plenary 2: Presidential Symposium (Rebroadcast) (Japanese, Mandarin, Spanish Translation Available)                          PL02A.01 - Durvalumab ± Tremelimumab + Chemotherapy as First-line Treatment for mNSCLC: Results from the Phase 3 POSEIDON Study
    4                                                                                             PL02A - Plenary 2: Presidential Symposium (Rebroadcast) (Japanese, Mandarin, Spanish Translation Available)                                                                                                                                    PL02A.02 - Discussant
    5                                                                                             PL02A - Plenary 2: Presidential Symposium (Rebroadcast) (Japanese, Mandarin, Spanish Translation Available)                              PL02A.03 - Lurbinectedin/doxorubicin versus CAV or Topotecan in Relapsed SCLC Patients: Phase III Randomized ATLANTIS Trial
    
    ...
    

    It also creates data.csv (the original answer showed a LibreOffice screenshot of the file).
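
As for why the loop in the question stops: the following is an inference from the posted code, not something the answer states. `session.select(...)` returns an empty list when a row has no drop-down. An empty list is falsy but raises no exception, so the `except` branch never runs and `kk` is never assigned for that row. If the first row has no sub-sessions, building the dictionary then fails with a `NameError`; on later rows `kk` would silently keep the previous row's value. A minimal sketch of the fix using plain Python lists as stand-ins for the parsed tags (the function name `collect_rows` and the sample data are hypothetical):

```python
def collect_rows(sessions):
    """Build one row per session, assigning 'Sub' on every iteration."""
    rows = []
    for title, sub_sessions in sessions:
        # An empty list is falsy but raises nothing, so wrapping this
        # check in try/except would never reach the except branch.
        if sub_sessions:
            # Collapse runs of whitespace, similar in spirit to the
            # re.sub call in the question (this also trims the ends).
            sub = [" ".join(text.split()) for text in sub_sessions]
        else:
            sub = " "  # explicit fallback, set in the else branch
        rows.append({"Title": title, "Sub": sub})
    return rows


# Hypothetical stand-in data: (title, list of sub-session texts)
sample = [
    ("IS05", []),
    ("PL02A", ["PL02A.01 -\n  Durvalumab", "PL02A.02 - Discussant"]),
    ("IS06", []),
]

rows = collect_rows(sample)
for row in rows:
    print(row)
```

Using `if`/`else` instead of `try`/`except` guarantees the fallback is applied on every row that has no sub-sessions, which is the same structure the answer's code uses.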

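    【Discussion】:

    • Lovely! Very neatly done, too!
    • By the way, is it possible to get the sub-session authors/times, each in their own column?
    • @VoidS Yes, just add another key to the dictionary. If the sub-session doesn't exist, don't forget to add the new key with a "None" value.
    • Great, got it. In your example you used "Sub": sub_session.h4.get_text, where h4 is the tag name. My question is whether it always has to be a tag name. I ask because there could be multiple tags with the same name. Would a class name work too?
    • @VoidS That is just one way of selecting a tag. You can also use .find() or .select_one(). It's up to you.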
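
The two suggestions in the comments above (extra dictionary keys, and selecting by class instead of tag name) could be combined roughly as follows. This is only a sketch on made-up markup: the class names `presenter` and `time` and the helper `text_or_none` are assumptions for illustration, so the real page's HTML needs to be inspected for the actual selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup. Only "#accordin" and ".sub_accordin_presentation"
# come from the thread; "presenter" and "time" are invented class names.
html = """
<div id="accordin"><ul>
  <li><h4>S1 - Session without sub-sessions</h4></li>
  <li><h4>S2 - Session with sub-sessions</h4>
    <div class="sub_accordin_presentation">
      <h4>S2.01 - First talk</h4>
      <span class="presenter">A. Author</span>
      <span class="time">10:00</span>
    </div>
    <div class="sub_accordin_presentation">
      <h4>S2.02 - Second talk</h4>
    </div>
  </li>
</ul></div>
"""


def text_or_none(parent, css):
    """Return the text of the first match for a CSS selector, or "None"."""
    tag = parent.select_one(css)
    return tag.get_text(strip=True, separator=" ") if tag else "None"


soup = BeautifulSoup(html, "html.parser")
rows = []
for session in soup.select("#accordin > ul > li"):
    title = session.h4.get_text(strip=True, separator=" ")
    subs = session.select(".sub_accordin_presentation")
    if subs:
        for sub in subs:
            rows.append({
                "Title": title,
                "Sub": text_or_none(sub, "h4"),            # select by tag name
                "Author": text_or_none(sub, ".presenter"),  # select by class
                "Time": text_or_none(sub, ".time"),
            })
    else:
        # Every row needs the same keys, so fill the new ones with "None".
        rows.append({"Title": title, "Sub": "None",
                     "Author": "None", "Time": "None"})

for row in rows:
    print(row)
```

`sub.h4`, `sub.find("h4")` and `sub.select_one("h4")` all return the first matching `<h4>`; a class selector such as `.presenter` is useful when several tags share the same name.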