【Question title】: String text to CSV
【Posted】: 2023-04-02 21:00:01
【Question description】:

I want to format a string as CSV. I use BeautifulSoup to scrape data from a website and get the full text back as a single string.

Scraped result:

Business Objective\n
464 Wholesale of household goods\n
Main Business Activities\n
46493 Wholesale of stationery, books, magazines and newspapers\n

I tried several approaches:

  1. result = re.findall(r'(?==Business Objective=)(.*)(?=Main Business Activities=)', string)

  2. Using join

  3. Using string replace

Code:

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import requests
import  time
import re
import numpy
import csv
companyName = "MONUMENT BOOKS CO  LTD"
SourceAppCode = "-- Any register --"
browser = webdriver.Chrome(r"D:\KHIHORT_PROJECTS\YUON_LOTO\chromedriver_win32\chromedriver")
browser.get('https://www.businessregistration.moc.gov.kh/cambodia-master/relay.html?url=https%3A%2F%2Fwww.businessregistration.moc.gov.kh%2Fcambodia-master%2Fservice%2Fcreate.html%3FtargetAppCode%3Dcambodia-master%26targetRegisterAppCode%3Dcambodia-br-companies%26service%3DregisterItemSearch&target=cambodia-master')
browser.find_elements_by_xpath("//input[@name='QueryString']")[0].send_keys(companyName)
time.sleep(0.5)
browser.find_elements_by_xpath("//select[@name='SourceAppCode']")[0].send_keys(SourceAppCode)
time.sleep(0.5)
browser.find_elements_by_xpath("/html[1]/body[1]/div[1]/div[1]/div[5]/div[1]/div[1]/div[1]/div[1]/form[1]/div[1]/div[1]/div[1]/div[1]/div[2]/div[2]/div[1]/div[1]/div[1]/div[2]/div[1]/a[3]")[0].click()
time.sleep(0.5)
browser.find_elements_by_xpath("//a[@class='registerItemSearch-results-page-line-ItemBox-resultLeft-viewMenu appMenu appMenuItem appMenuDepth0 noSave appItemSearchResult viewInstanceUpdateStackPush appReadOnly appIndex0']")[0].click()
time.sleep(0.5)
ww=browser.find_elements_by_xpath("/html[1]/body[1]/div[1]/div[1]/div[5]/div[1]/div[1]/div[1]/div[1]/form[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[5]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[7]/div[1]/div[1]/div[1]/div[2]/div[1]/div[1]")
time.sleep(0.5) 

My expected result is:

Business Objective,Main Business Activities
464 Wholesale of household goods,"46493 Wholesale of stationery, books, magazines and newspapers"
"581 Publishing of books, periodicals and other publishing activities","58110 Publishing of books, brochures and other publications(2)"

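【Discussion】:

    Tags: python-3.x selenium beautifulsoup


    【Solution 1】:

    It is better to use Selenium's wait functionality than sleep. That said, you can pull out those rows, put them into a DataFrame, and write it to CSV: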

    from selenium import webdriver
    from bs4 import BeautifulSoup
    import pandas as pd
    import requests
    import  time
    import re
    import numpy
    import csv
    companyName = "MONUMENT BOOKS CO  LTD"
    SourceAppCode = "-- Any register --"
    browser = webdriver.Chrome("C:/chromedriver_win32/chromedriver.exe")
    browser.get('https://www.businessregistration.moc.gov.kh/cambodia-master/relay.html?url=https%3A%2F%2Fwww.businessregistration.moc.gov.kh%2Fcambodia-master%2Fservice%2Fcreate.html%3FtargetAppCode%3Dcambodia-master%26targetRegisterAppCode%3Dcambodia-br-companies%26service%3DregisterItemSearch&target=cambodia-master')
    browser.find_elements_by_xpath("//input[@name='QueryString']")[0].send_keys(companyName)
    time.sleep(0.5)
    browser.find_elements_by_xpath("//select[@name='SourceAppCode']")[0].send_keys(SourceAppCode)
    time.sleep(0.5)
    browser.find_elements_by_xpath("/html[1]/body[1]/div[1]/div[1]/div[5]/div[1]/div[1]/div[1]/div[1]/form[1]/div[1]/div[1]/div[1]/div[1]/div[2]/div[2]/div[1]/div[1]/div[1]/div[2]/div[1]/a[3]")[0].click()
    time.sleep(0.5)
    browser.find_elements_by_xpath("//a[@class='registerItemSearch-results-page-line-ItemBox-resultLeft-viewMenu appMenu appMenuItem appMenuDepth0 noSave appItemSearchResult viewInstanceUpdateStackPush appReadOnly appIndex0']")[0].click()
    time.sleep(0.5)
    ww=browser.find_elements_by_xpath("/html[1]/body[1]/div[1]/div[1]/div[5]/div[1]/div[1]/div[1]/div[1]/form[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[5]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[7]/div[1]/div[1]/div[1]/div[2]/div[1]/div[1]")
    time.sleep(0.5) 
    
    
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    ba = soup.find_all('div',{'class':'appRepeaterContent'})[1]
    
    rows = ba.find_all('div',{'class':'appRecordChildren appBlockChildren'})
    
    
    
    # Collect one [objective, activity] pair per row and build the DataFrame once
    # (DataFrame.append was removed in pandas 2.0).
    records = []
    for row in rows:
        bo = row.find('div', {'class': 'appAttrValue'})
        mba = bo.find_next('div', {'class': 'appAttrValue'})
        records.append([bo.text, mba.text])

    results = pd.DataFrame(records, columns=['Business Objective', 'Main Business Activities'])
    results.to_csv('file.csv', index=False)
    

    Output:

    print(results)
                                       Business Objective                           Main Business Activities
    0                    464 Wholesale of household goods  46493 Wholesale of stationery, books, magazine...
    1   581 Publishing of books, periodicals and other...  58110 Publishing of books, brochures and other...
    2   581 Publishing of books, periodicals and other...  58120 Publishing of mailing lists, telephone b...
    3   581 Publishing of books, periodicals and other...  58130 Publishing of newspapers, journals, maga...
    4   581 Publishing of books, periodicals and other...  58190 Publishing of catalogs, photos, engravin...
    5                 469 Non-specialized wholesale trade  46900 Wholesale of a variety of goods without ...
    6                    464 Wholesale of household goods  46431 Wholesale of pharmaceutical and medical ...
    7                         521 Warehousing and storage             52100 Warehousing and storage services
    8              421 Construction of roads and railways  42101 Construction of streets, roads, bridges ...
    9   681 Real estate activities with own or leased ...  68101 Buying, selling, renting and operating o...
    10                                854 Other education                     85499 Other education n.e.c(6)
    11                                    731 Advertising                               73100 Advertising(1)
    12            551 Short term accommodation activities                     55101 Hotels and resort hotels
    13  561 Restaurants and mobile food service activi...   56101 Restaurants and restaurant cum night clubs
    14     791 Travel agency and tour operator activities                  79110 Travel agency activities(1)
    

    【Discussion】:

    • Wow, that's so cool. You made my whole day. Thank you very much for your answer. This is exactly the answer I needed.
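If only the scraped text from the question is available, rather than the parsed page, the standard-library `csv` module may be enough. The sketch below assumes the text strictly alternates a label line with a value line and that the `\n` markers in the question are real newlines; `text_to_csv` is a hypothetical name used for illustration.

```python
import csv
import io


def text_to_csv(text):
    """Turn alternating label/value lines into CSV text, one column per label."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]

    # Group values under their label, keeping labels in first-seen order.
    columns = {}
    for label, value in zip(lines[0::2], lines[1::2]):
        columns.setdefault(label, []).append(value)

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns.keys())
    # zip(*...) pairs the n-th value of every column; it stops at the shortest column.
    writer.writerows(zip(*columns.values()))
    return buffer.getvalue()


scraped = (
    "Business Objective\n"
    "464 Wholesale of household goods\n"
    "Main Business Activities\n"
    "46493 Wholesale of stationery, books, magazines and newspapers\n"
)
print(text_to_csv(scraped))
```

The default quoting mode of `csv.writer` wraps only the fields that contain commas in double quotes, which is the quoting shown in the expected result in the question.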