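Some may frown upon Python as the solution language, but the aim here is to provide some pointers (the high-level process). I haven't written R for a long time, so this was quicker in Python.
Edit: R script now added.
General outline: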
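The first dropdown's options can be taken from the value attribute of each node returned by the css selector #param571 option. This uses an id selector (#) to target the parent dropdown select element, followed by an option type selector in a descendant combination, to pick out the option tag elements inside it. The html to apply this selector combination to can be retrieved with an xhr request to the url you originally provided. You want a nodeList returned to iterate over, akin to applying the selector with js document.querySelectorAll.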
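To make the selector concrete, here is a minimal stdlib-only sketch of what `#param571 option` picks out. It is not the code used in the answer (that relies on BeautifulSoup's `select`); the `OptionValueCollector` class and the sample markup, including the placeholder option text, are made up for illustration.

```python
from html.parser import HTMLParser


class OptionValueCollector(HTMLParser):
    """Collect value attributes of <option> tags nested inside <select id=...>."""

    def __init__(self, select_id):
        super().__init__()
        self.select_id = select_id
        self.inside_target = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("id") == self.select_id:
            self.inside_target = True
        elif tag == "option" and self.inside_target:
            value = attrs.get("value")
            if value:  # skip the empty placeholder option
                self.values.append(value)

    def handle_endtag(self, tag):
        if tag == "select":
            self.inside_target = False


# Made-up markup standing in for the real page
SAMPLE_HTML = """
<select id="param571">
  <option value="">Marka pojazdu</option>
  <option value="alfa-romeo">Alfa Romeo</option>
  <option value="audi">Audi</option>
</select>
<select id="other">
  <option value="ignored">Ignored</option>
</select>
"""

collector = OptionValueCollector("param571")
collector.feed(SAMPLE_HTML)
makes = collector.values
print(makes)  # ['alfa-romeo', 'audi']
```

Only options under the select with the matching id are kept, which is the effect of the descendant combination.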
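The page uses an ajax POST request to update the second dropdown based on your first dropdown choice. Your first dropdown choice determines the value of the parameter search[filter_enum_make], which is sent in the POST request to the server. The subsequent response contains a list of the available options (it includes some case alternates, which can be trimmed out).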
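As a rough illustration of what that POST body looks like on the wire, the sketch below form-encodes the three parameters used in the demonstration further down. The helper name `build_post_body` is hypothetical; the field names and the values '5' and '29' are taken from the script in this answer.

```python
from urllib.parse import parse_qs, urlencode


def build_post_body(make):
    """Form-encode the parameters sent when a make is chosen."""
    fields = {
        "search[filter_enum_make]": make,  # value of the first dropdown
        "search[dist]": "5",
        "search[category_id]": "29",
    }
    return urlencode(fields)


body = build_post_body("alfa-romeo")
print(body)

# Decoding the body gives back the original field names and values
decoded = parse_qs(body)
print(decoded["search[filter_enum_make]"])
```

The square brackets in the field names are percent-encoded in the body, which is what you would expect to see in a captured request.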
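I captured the POST request using Fiddler. This showed me the request headers and the params in the request body. A sample screenshot is shown at the end.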
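IMO, the easiest way to extract the options from the response text is to regex out the appropriate string (I wouldn't normally recommend regex for working with html, but in this case it serves us nicely). If you don't want to use regex, you can grab the relevant info from the data-facets attribute of the element with id body-container. For the non-regex version you need to handle the unquoted nulls and retrieve the inner dictionary whose key is filter_enum_model. At the end I show a rewrite of the function that handles this.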
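A small offline sketch of the regex step, run against a shortened, hand-trimmed version of the response sample shown at the end of this answer (so the counts and keys here are illustrative only):

```python
import re

# Shortened stand-in for the ajax response text
SAMPLE_RESPONSE = (
    "<section id=\"body-container\" data-facets='"
    '{"categoriesParent":[],'
    '"filter_enum_model":{"145":1,"Brera":34,"brera":34},'
    '"new_used":{"new":371,"used":1827,"all":2198},"sellout":null}'
    "'"
)

# Capture everything between "filter_enum_model": and ,"new_used"
match = re.search(r'"filter_enum_model":(.*),"new_used"', SAMPLE_RESPONSE, flags=re.DOTALL)
models_text = match.group(1) if match else None
print(models_text)  # {"145":1,"Brera":34,"brera":34}
```

Note that `.*` is greedy, so this relies on `,"new_used"` appearing only once after the models; if the response layout changed, the pattern would need revisiting.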
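The retrieved string is a string representation of a dictionary. It needs converting to an actual dictionary object, from which you can then extract the option values. Edit: as R doesn't have a dictionary object, a similar structure needs to be found. I will look at this when converting.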
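The sketch below shows the conversion on a made-up, shortened facets string. It also illustrates why the unquoted nulls matter: `ast.literal_eval` rejects a bare `null`, whereas `json.loads` (a swapped-in alternative, not what the answer's code uses) maps it to `None`.

```python
import ast
import json

# Shortened, made-up facets string that still contains an unquoted null
facets_text = '{"filter_enum_model":{"145":1,"Brera":34},"sellout":null}'

# literal_eval only understands Python literals, so a bare null is rejected
try:
    ast.literal_eval(facets_text)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False

# json.loads parses the same text directly
facets = json.loads(facets_text)
models = facets["filter_enum_model"]
print(literal_eval_ok, models, facets["sellout"])
```

Either route ends with an ordinary dictionary whose keys are the model names.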
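I create a user-defined function, getOptions(), to return the options for each make. Each car make value comes from the list of possible items in the first dropdown. I loop over those possible values, use the function to return a list of options for that make, and add those lists as values to a dictionary, results, whose keys are the car makes. Again, for R an object with similar functionality to a python dictionary needs to be found.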
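The looping step can be sketched without any network access by passing in a stand-in for getOptions(). The names `build_results`, `fake_get_options` and the data in `FAKE_OPTIONS` are all hypothetical.

```python
def build_results(makes, get_options):
    """Map each make to the list of options returned for it."""
    return {make: get_options(make) for make in makes}


# Stand-in for the real network-backed function (hypothetical data)
FAKE_OPTIONS = {"alfa-romeo": ["147", "Giulia"], "audi": ["A3", "A4", "Q5"]}


def fake_get_options(make):
    return FAKE_OPTIONS.get(make, [])  # empty list when nothing is known


results = build_results(["alfa-romeo", "audi", "unknown-make"], fake_get_options)
print(results)
```

A make with no associated options simply ends up with an empty list, which mirrors how the real function falls back to `[]`.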
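That dictionary of lists needs converting to a dataframe, including a transpose operation, to give a tidy output in which the headers are the car makes and the column under each header contains the associated models.
The whole thing can be written to csv at the end.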
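For readers who want to see the reshaping without pandas, here is a sketch that pads the uneven lists with `itertools.zip_longest` and writes them with the `csv` module. This is a swapped-in technique, not the `DataFrame.from_dict(...).transpose()` call used in the demonstration; the `results_to_csv` helper and the sample data are made up.

```python
import csv
import io
from itertools import zip_longest

# Made-up dictionary of lists, one list of models per make
results = {"alfa-romeo": ["147", "Giulia"], "audi": ["A3", "A4", "Q5"]}


def results_to_csv(results):
    """Write makes as headers and their models as columns of uneven length."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(results.keys())  # header row: one column per make
    # zip_longest turns the columns into rows, padding short columns with ''
    for row in zip_longest(*results.values(), fillvalue=""):
        writer.writerow(row)
    return buffer.getvalue()


csv_text = results_to_csv(results)
print(csv_text)
```

Swapping the `StringIO` buffer for a file opened with `newline=''` would write the same rows to disk.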
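So, hopefully that gives you an idea of one way to achieve what you want. Perhaps someone else can use this to help write you a solution.
Python demonstration below: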
import ast
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}


def getOptions(make):  # function to return options based on make
    data = {
        'search[filter_enum_make]': make,
        'search[dist]': '5',
        'search[category_id]': '29'
    }
    r = requests.post('https://www.otomoto.pl/ajax/search/list/', data=data, headers=headers)
    try:
        # verify the regex here: https://regex101.com/r/emvqXs/1
        # regex to extract the string containing the models associated with the car make filter
        data = re.search(r'"filter_enum_model":(.*),"new_used"', r.text, flags=re.DOTALL).group(1)
        aDict = ast.literal_eval(data)  # convert string representation of dictionary to python dictionary
        d = len({k.lower(): v for k, v in aDict.items()}.keys())  # find length of unique keys when accounting for case
        dirtyList = list(aDict)[:d]  # trim to unique values
        cleanedList = [item for item in dirtyList if item != 'other']  # remove 'other' as doesn't appear in dropdown
    except (AttributeError, ValueError, SyntaxError):
        # no regex match, or the matched text is not a dictionary:
        # sometimes there are no associated values in 2nd dropdown
        cleanedList = []
    return cleanedList


r = requests.get('https://www.otomoto.pl/osobowe/')
soup = bs(r.content, 'lxml')
# values of the first dropdown, skipping the empty placeholder option
values = [item['value'] for item in soup.select('#param571 option') if item['value'] != '']

# build a dictionary of lists to hold options for each make
results = {}
for value in values:
    results[value] = getOptions(value)  # function call to return options based on make

# turn into a dataframe and transpose so each column header is the make and the options are listed below
df = pd.DataFrame.from_dict(results, orient='index').transpose()

# write to csv
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig', index=False)
Sample of csv output: [image]
Sample json for alfa-romeo: [image]
Example of the regex match for alfa-romeo:
{"145":1,"146":1,"147":218,"155":1,"156":118,"159":559,"164":2,"166":39,"33":1,"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":89,"GTV":7,"Giulia":251,"Giulietta":378,"Mito":224,"Spider":24,"Sportwagon":2,"Stelvio":242,"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":378,"gt":89,"gtv":7,"mito":224,"spider":24,"sportwagon":2,"stelvio":242}
Example of the filter options list returned from a function call with the make parameter value alfa-romeo:
['145', '146', '147', '155', '156', '159', '164', '166', '33', 'Alfasud', 'Brera', 'Crosswagon', 'GT', 'GTV', 'Giulia', 'Giulietta', 'Mito', 'Spider', 'Sportwagon', 'Stelvio']
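The trimming from the regex match to this list can be reproduced offline. The sketch below applies the same count-based trim as getOptions(), plus a hypothetical order-independent variant (`first_seen`) that keeps the first spelling of each case-folded key; the count-based trim assumes the lower-case alternates always come after the originals, as they do in this sample.

```python
import json

# The alfa-romeo regex match shown above
MATCH_TEXT = (
    '{"145":1,"146":1,"147":218,"155":1,"156":118,"159":559,"164":2,"166":39,"33":1,'
    '"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":89,"GTV":7,"Giulia":251,"Giulietta":378,'
    '"Mito":224,"Spider":24,"Sportwagon":2,"Stelvio":242,'
    '"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":378,"gt":89,"gtv":7,'
    '"mito":224,"spider":24,"sportwagon":2,"stelvio":242}'
)
models = json.loads(MATCH_TEXT)


def trim_by_count(models):
    """Same idea as getOptions(): keep the first N keys, N = case-insensitive unique count."""
    unique = len({key.lower() for key in models})
    return list(models)[:unique]


def first_seen(models):
    """Order-independent variant: keep the first spelling seen for each folded key."""
    seen = set()
    kept = []
    for key in models:
        folded = key.lower()
        if folded not in seen:
            seen.add(folded)
            kept.append(key)
    return kept


trimmed = trim_by_count(models)
deduped = first_seen(models)
print(trimmed == deduped, len(trimmed))  # True 20
```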
Sample of Fiddler request: [screenshot]
Sample of the ajax response html containing the options:
<section id="body-container" class="om-offers-list"
data-facets='{"offer_seek":{"offer":2198},"private_business":{"business":1326,"private":872,"all":2198},"categories":{"29":2198,"161":953,"163":953},"categoriesParent":[],"filter_enum_model":{"145":1,"146":1,"147":219,"155":1,"156":116,"159":561,"164":2,"166":37,"33":1,"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":88,"GTV":7,"Giulia":251,"Giulietta":380,"Mito":226,"Spider":25,"Sportwagon":2,"Stelvio":242,"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":380,"gt":88,"gtv":7,"mito":226,"spider":25,"sportwagon":2,"stelvio":242},"new_used":{"new":371,"used":1827,"all":2198},"sellout":null}'
data-showfacets=""
data-pagetitle="Alfa Romeo samochody osobowe - otomoto.pl"
data-ajaxurl="https://www.otomoto.pl/osobowe/alfa-romeo/?search%5Bbrand_program_id%5D%5B0%5D=&search%5Bcountry%5D="
data-searchid=""
data-keys=''
data-vars=""
Alternative version of the function without regex:
import ast

import requests
from bs4 import BeautifulSoup as bs

# headers is the User-Agent dictionary defined in the script above


def getOptions(make):  # function to return options based on make
    data = {
        'search[filter_enum_make]': make,
        'search[dist]': '5',
        'search[category_id]': '29'
    }
    r = requests.post('https://www.otomoto.pl/ajax/search/list/', data=data, headers=headers)
    soup = bs(r.content, 'lxml')
    # quote the bare nulls so that ast.literal_eval can parse the attribute value
    data = soup.select_one('#body-container')['data-facets'].replace('null', '"null"')
    aDict = ast.literal_eval(data)['filter_enum_model']  # convert string representation of dictionary to python dictionary
    d = len({k.lower(): v for k, v in aDict.items()}.keys())  # find length of unique keys when accounting for case
    dirtyList = list(aDict)[:d]  # trim to unique values
    cleanedList = [item for item in dirtyList if item != 'other']  # remove 'other' as doesn't appear in dropdown
    return cleanedList


print(getOptions('alfa-romeo'))
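R conversion and improved Python:
While converting to R, I found a better way of extracting the parameters: from a js file on the server. If you open dev tools, you can see the file listed in the Sources tab.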
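The idea is to cut the JSON object out of the `var searchConditions = ...;` assignment and parse it. Below is an offline sketch on a made-up js string; the nesting under `values`, `573` and `571` follows the scripts in this answer, while the shape of each make's entries (a list of objects with a `value` key) and the `extract_makes` helper are assumptions for illustration.

```python
import json

# Made-up stand-in for the js file content
SAMPLE_JS = (
    'var other = 1;'
    'var searchConditions = {"values":{"573":{"571":{'
    '"alfa-romeo":[{"value":"147"},{"value":"giulia"}],'
    '"audi":[{"value":"a3"}]}}}}'
    ';var searchConditionsExtra = {};'
)


def extract_makes(js_text):
    """Return {make: [model values]} from the searchConditions assignment."""
    payload = js_text.split("var searchConditions = ")[1]  # text after the assignment
    payload = payload.split(";var searchCondition")[0]     # cut at the next declaration
    source = json.loads(payload)["values"]["573"]["571"]
    return {make: [entry["value"] for entry in entries] for make, entries in source.items()}


makes_to_models = extract_makes(SAMPLE_JS)
print(makes_to_models)
```

This gets all makes and their models from a single request, instead of one POST per make.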
R (to be improved):
library(httr)
library(jsonlite)
url <- 'https://www.otomoto.pl/ajax/jsdata/params/'
r <- GET(url)
contents <- content(r, "text")
data <- strsplit(contents, "var searchConditions = ")[[1]][2]
data <- strsplit(as.character(data), ";var searchCondition")[[1]][1]
source <- fromJSON(data)$values$'573'$'571'
makes <- names(source)
for(make in makes){
  print(make)
  print(source[make][[1]]$value)
  #break
}
Python:
import json

import pandas as pd
import requests

r = requests.get('https://www.otomoto.pl/ajax/jsdata/params/')
# cut the JSON object out of the "var searchConditions = ...;" assignment
data = r.text.split('var searchConditions = ')[1]
data = data.split(';var searchCondition')[0]
items = json.loads(data)
source = items['values']['573']['571']
makes = [item for item in source]

# build a dictionary of lists to hold options for each make
results = {}
for make in makes:
    df = pd.DataFrame(source[make])
    results[make] = list(df['value'])

# turn into a dataframe and transpose so each column header is the make and the options are listed below
dfFinal = pd.DataFrame.from_dict(results, orient='index').transpose()

# tidy up None values to empty strings https://stackoverflow.com/a/31295814/6241235
# (applymap is deprecated from pandas 2.1, where DataFrame.map replaces it)
mask = dfFinal.applymap(lambda x: x is None)
cols = dfFinal.columns[(mask).any()]
for col in dfFinal[cols]:
    dfFinal.loc[mask[col], col] = ''

print(dfFinal)