How do I parse through string looking for specific word/digits and display them if found: answers

【Question title】: How do I parse through string looking for specific word/digits and display them if found
【Posted】: 2019-06-30 19:51:03
【Question description】:

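I'm sure I have written some fairly questionable code, but it seems to do the job. The problem is that it prints the data to a spreadsheet, and in the column where I expect the vehicle's year, if the first word of the ad is not the year, it shows the first word instead, which may be the manufacturer.

Essentially I want to set up the if statements so that if the vehicle's year is not the first word but appears elsewhere in the string, it is still found and written to my .csv.

Also, I have been struggling to parse multiple pages and was hoping someone here could help with that too. The url has page=2 etc. in it, but I can't get it to parse all the URLs and collect the data from every page. Everything I have tried so far only does the first page. As you may have guessed, I'm fairly new to Python.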

import csv ; import requests

from bs4 import BeautifulSoup

outfile = open('carandclassic-new.csv','w', newline='', encoding='utf-8')
writer = csv.writer(outfile)
writer.writerow(["Link", "Title", "Year", "Make", "Model", "Variant", "Image"])

url = 'https://www.carandclassic.co.uk/cat/3/?page=2'

get_url = requests.get(url)

get_text = get_url.text

soup = BeautifulSoup(get_text, 'html.parser')


car_link = soup.find_all('div', 'titleAndText', 'image')


for div in car_link:
    links = div.findAll('a')
    for a in links:
        link = ("https://www.carandclassic.co.uk" + a['href'])
        title = (a.text.strip())
        year = (title.split(' ', 1)[0])
        make = (title.split(' ', 2)[1])
        model = (title.split(' ', 3)[2])
        date = "\d"
        for line in title:
        yom = title.split()
        if yom[0] == "\d":
            yom[0] = (title.split(' ', 1)[0])
        else:
            yom = title.date

        writer.writerow([link, title, year, make, model])
        print(link, title, year, make, model)



outfile.close()

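Could somebody please help me with this? I realise the if statement at the bottom may be way off.

The code successfully takes the first word from the string; unfortunately, because of the way the data is structured, that word is not always the vehicle's year of manufacture (yom).

【Question discussion】:

This is a broader problem. soup.find_all('div', 'titleAndText', 'image') in your code picks up inconsistent kinds of data.

Tags: python python-3.x beautifulsoup screen-scraping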

【Solution 1】:

Comment: "1978 Full restored Datsun 280Z" becomes '1978' '1978' '280Z'
instead of '1978' 'Datsun' '280z'.

To improve the year validation, change it to use the re module:

import re

if not (len(year) == 4 and year.isdigit()):
    match = re.findall(r'\d{4}', title)
    if match:
        for item in match:
            if int(item) in range(1900,2010):
                # Assume year
                year = item
                break

The output becomes:

'1978 Full restored Datsun 280Z', '1978', 'Full', '280Z'
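As a small variation on the snippet above, the year lookup can be wrapped in a function. This is only a sketch: the name find_year and the constants YEAR_MIN and YEAR_MAX are made up for illustration, and the range simply mirrors range(1900, 2010) from the answer. The one deliberate change is the \b word boundary, so that four digits inside a longer number (for example a part number such as 12345) are not mistaken for a year.

```python
import re

# Assumption: plausible years span 1900-2009, mirroring range(1900, 2010) above.
YEAR_MIN, YEAR_MAX = 1900, 2009


def find_year(title):
    """Return the first standalone 4-digit number in the plausible range, or None."""
    # \b stops \d{4} from matching inside longer digit runs such as '12345'.
    for candidate in re.findall(r'\b\d{4}\b', title):
        if YEAR_MIN <= int(candidate) <= YEAR_MAX:
            return candidate
    return None


print(find_year('Mercedes Benz 280SL 1980 Cabriolet'))
print(find_year('1978 Full restored Datsun 280Z'))
print(find_year('Datsun 280Z'))
```

Returning None when nothing matches leaves it to the caller to decide what to write in the Year column.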

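Regarding the false result make='Full', you have two options.

Stop-word list
Build a stop-word list with terms like ['full', 'restored', etc.] and loop over title_items to find the first item that is not in the stop-word list.
Maker list
Build a maker list like ['Mercedes', 'Datsun', etc.] and loop over title_items to find the first item that matches.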
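Both options could look roughly like this. It is a minimal sketch, not a complete solution: STOP_WORDS, KNOWN_MAKERS and the two function names are hypothetical, and the word lists are tiny placeholders that a real scraper would have to extend.

```python
# Hypothetical word lists for illustration only; real lists would be much longer.
STOP_WORDS = {'full', 'fully', 'restored', 'original'}
KNOWN_MAKERS = {'mercedes', 'datsun', 'ford'}


def make_by_stop_words(title_items, year):
    """Option 1: first item that is neither the year nor a stop word."""
    for item in title_items:
        if item != year and item.lower() not in STOP_WORDS:
            return item
    return None


def make_by_maker_list(title_items):
    """Option 2: first item that appears in the list of known makers."""
    for item in title_items:
        if item.lower() in KNOWN_MAKERS:
            return item
    return None


items = '1978 Full restored Datsun 280Z'.split()
print(make_by_stop_words(items, '1978'))
print(make_by_maker_list(items))
```

The two approaches fail in different ways. The stop-word list lets any unlisted filler word through as the make, while the maker list returns nothing for a maker it does not know, and neither handles makers that span two words.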

Question: If the first word in the ad is not the year, find the vehicle's year

Using built-ins and modules:

Example titles used:

# Simulating html Element
class Element():
    def __init__(self, text): self.text = text

for a in [Element('Mercedes Benz 280SL 1980 Cabriolet in beautiful condition'),
          Element('1964 Mercedes Benz 220SEb Saloon Manual RHD')]:

Get the title from the <a> Element and split it on blanks.

    title = a.text.strip()
    title_items = title.split()

The defaults are the title_items at index 0, 1, 2.

    # Default
    year = title_items[0]
    make = title_items[1]
    model = title_items[2]

Verify that year meets the condition of being 4 digits.

    # Verify 'year'
    if not (len(year) == 4 and year.isdigit()):

Loop over all items in title_items and break once the condition is met.

        # Test all items
        for item in title_items:
            if len(item) == 4 and item.isdigit():
                # Assume year
                year = item
                break

Change the assumption: the title_items at index 0, 1 are now make and model.

        make = title_items[0]
        model = title_items[1]

Check whether model starts with a digit.

Note: This will fail if the model does not meet this condition!

    # Condition: Model have to start with digit
    if not model[0].isdigit():
        for item in title_items:
            if item[0].isdigit() and not item == year:
                model = item

    print('{}'.format([title, year, make, model]))

Output:

['Mercedes Benz 280SL 1980 Cabriolet in beautiful condition', '1980', 'Mercedes', '280SL']
['1964 Mercedes Benz 220SEb Saloon Manual RHD', '1964', 'Mercedes', '220SEb']
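For convenience, the steps above can be collected into a single function that is called once per title inside the for a in links: loop. This is a sketch under stated assumptions: the names parse_title and is_year are made up here, and returning None for titles with fewer than three words is an addition that the walkthrough itself does not cover.

```python
def parse_title(title):
    """Split an ad title into (year, make, model) using the heuristics above."""
    title_items = title.strip().split()
    # Assumption: titles shorter than three words are skipped (not in the walkthrough).
    if len(title_items) < 3:
        return None

    def is_year(item):
        return len(item) == 4 and item.isdigit()

    # Default: year, make and model are the first three words.
    year, make, model = title_items[0], title_items[1], title_items[2]

    if not is_year(year):
        # Year is not the first word: take the first 4-digit item, if there is one.
        year = next((item for item in title_items if is_year(item)), year)
        make, model = title_items[0], title_items[1]

    # Condition from the walkthrough: the model has to start with a digit.
    if not model[0].isdigit():
        for item in title_items:
            if item[0].isdigit() and item != year:
                model = item  # no break, so the last matching item wins, as above

    return year, make, model


for text in ['Mercedes Benz 280SL 1980 Cabriolet in beautiful condition',
             '1964 Mercedes Benz 220SEb Saloon Manual RHD']:
    print(parse_title(text))
```

As in the walkthrough, if no 4-digit item is found the first word is kept as the year, so the caller may still want to validate it before writing the row.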

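Tested with Python: 3.4.2

【Discussion】:

Hi Stovfl, thanks very much for this. It seems to make a lot of sense, but I can't seem to connect it with my code to get it working. Could you advise where your code should be added to mine to make it work?
@BenWillis: Read What should I do when someone answers my question?
@BenWillis: Replace everything inside the for a in links: block, except the link = and writer.writerow(... lines.
Hi @stovfl, thanks, I have managed to get it working. The only issue now is that some "makes" and "models" use digits: 1978 Full restored Datsun 280Z becomes '1978' '1978' '280Z' instead of '1978' 'Datsun' '280z'.
@BenWillis: Updated my answer.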