1. Goal
URL: https://www.kroger.com/search?query=pasture%20raised%20eggs&searchType=default_search

Scrape the first-page product listings for a given search keyword. In addition, the zipCode must be dynamically configurable, e.g. 43004.
2. Endpoint identification
Keyword: Vital Farms® Pasture-Raised Large Brown Eggs
Endpoint: https://www.kroger.com/atlas/v1/product/v2/products
3. Trimming the request

Parameters to reverse: the x-laf-object header and the filter.gtin13s query parameter.
4. Parameter reversal
4.1 x-laf-object
4.1.1 Endpoint identification
Searching the captured traffic locates the endpoint https://www.kroger.com/atlas/v1/modality/preferences
4.1.2 Trimming the request
Parameter to reverse: json_data (the request body)
4.1.3 Parameter reversal
4.1.3.1 Endpoint identification
Searching locates two endpoints.
Relationship: endpoint 1 returns an initialized address object, and endpoint 2 returns the address info for the specified zipCode, which is merged into the initialized object. Sending the merged object as json_data yields the target value in the response, at JMESPath path: data.modalityPreferences.lafObject
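The merge described above can be sketched as plain dict manipulation. The sample payloads below are hypothetical stand-ins containing only the fields the merge touches; the real responses carry many more:

```python
import json

# Hypothetical stand-in for GET /atlas/v1/modality/preferences (the initialized address)
raw_preferences = {
    "modalities": [{"id": "default", "type": "PICKUP"}],
    "primaryModality": {"SHIP": "default"},
}
# Hypothetical stand-in for POST /atlas/v1/modality/options for the target zipCode
ship_option = {"id": "ship-43004", "type": "SHIP", "fallbackFulfillment": "01234567"}

def merge_location(raw_preferences, ship_option):
    """Append the zipCode-specific SHIP modality and make it the primary one;
    the merged object is what gets PUT back to /atlas/v1/modality/preferences."""
    raw_preferences["modalities"].append(ship_option)
    raw_preferences["primaryModality"]["SHIP"] = ship_option["id"]
    return raw_preferences

merged = merge_location(raw_preferences, ship_option)
json_data = json.dumps(merged)  # request body for the PUT
```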
4.1.3.2 Trimming the request
/atlas/v1/modality/preferences: no encrypted parameters. When trimming, keep the UA and Referer and test the remaining headers one by one, since each site validates different things; here, testing showed the Origin header is mandatory.

/atlas/v1/modality/options: nothing specially encrypted either; this endpoint takes the specified zipCode and returns the corresponding address info.
4.1.3.3 Request test
The required data, i.e. x-laf-object, is obtained successfully.
The script's output also matches the browser's, verified with an online JSON diff tool.
4.2 filter.gtin13s
4.2.1 Endpoint identification
Searching locates the endpoint https://www.kroger.com/products/api/products/details-basic
4.2.2 Trimming the request

Parameters to reverse: the x-laf-object header and the upc field in json_data.
4.2.3 Parameter reversal
4.2.3.1 x-laf-object in headers
Source: searching shows it comes from the response of v1/modality/preferences (this endpoint was already analyzed; only the extracted field differs). JMESPath path: data.modalityPreferences.modalities
4.2.3.2 upc in json_data
4.2.3.2.1 Source
Searching locates the endpoint https://www.kroger.com/atlas/v1/search/v1/products-search
4.2.3.2.2 Trimming the request

Parameter to reverse: filter.locationId
4.2.3.2.3 Parameter reversal
filter.locationId: searching shows it comes from the response of /atlas/v1/modality/options (already analyzed). JMESPath path: fallbackFulfillment
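Pulling filter.locationId out of the options response and wiring it into the search request can be sketched as follows. The sample response is a hypothetical, trimmed stand-in; only the field names mirror what was observed above:

```python
# Hypothetical trimmed-down response from POST /atlas/v1/modality/options
options_response = {
    "data": {
        "modalityOptions": {
            "SHIP": {"id": "ship-43004", "fallbackFulfillment": "01234567"}
        }
    }
}

ship = options_response["data"]["modalityOptions"]["SHIP"]
location_id = ship["fallbackFulfillment"]  # this value becomes filter.locationId

# Query parameters for GET /atlas/v1/search/v1/products-search
params = {
    "filter.locationId": location_id,
    "filter.query": "pasture raised eggs",
    "page.offset": "0",
    "page.size": "24",
}
```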
5. Request test

The test passes.
6. Summary
Kroger is not particularly difficult: no JS reverse engineering is required. Careful packet-capture analysis to locate these six APIs is essentially the whole job, iterating through the following five steps:
identify the goal > locate the endpoint > trim the request > reverse the parameters > test the request
A few points worth noting:
- The site checks the TLS fingerprint: with the requests library, requests hang indefinitely with no response. The author uses the curl_cffi library instead, which can impersonate around ten different browser fingerprints.
- Proxy quality matters a lot; with poor proxies curl_cffi reports that www.kroger.com cannot be resolved.
- A complete run goes through all six APIs, but three of them are address-related and their results can be cached in Redis and reused, improving crawl efficiency.
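The caching idea from the last point can be sketched with a small TTL cache. A plain dict stands in for Redis here; for a multi-process crawler you would swap it for redis-py's setex/get:

```python
import time

class LocationCache:
    """Per-zipCode cache for the three address-related API results.
    A plain dict stands in for Redis; swap in redis-py for shared, multi-process use."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}  # zipCode -> (expiry_timestamp, payload)

    def get(self, zip_code):
        entry = self.store.get(zip_code)
        if entry and entry[0] > time.time():
            return entry[1]
        return None  # missing or expired: re-run the three address APIs

    def set(self, zip_code, payload):
        self.store[zip_code] = (time.time() + self.ttl, payload)

cache = LocationCache(ttl_seconds=3600)
cache.set("43004", {"locationId": "01234567"})
hit = cache.get("43004")   # reused without hitting the address APIs again
miss = cache.get("90210")  # not cached yet
```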
7. Source code
import time
import json

from curl_cffi import requests  # drop-in for requests; plain requests hangs on the site's TLS-fingerprint check (see summary)
from jmespath import search
from loguru import logger

# Browser fingerprint for curl_cffi to impersonate
IMPERSONATE = 'chrome110'


class Spider:
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def location_1(self, zipCode):
        '''Fetch the address info for the given zipCode'''
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
            'Origin': 'https://www.kroger.com',
            'Referer': 'https://www.kroger.com/',
            'Content-Type': 'application/json',
            'accept-language': 'en-US,en;q=0.9'
        }
        data = '{"address":{"postalCode":"%s"}}' % zipCode
        url = 'https://www.kroger.com/atlas/v1/modality/options'
        logger.info(url)
        response = requests.post(url, headers=headers, data=data, impersonate=IMPERSONATE)
        logger.info(response.status_code)
        new_location = search('data.modalityOptions.SHIP', response.json())
        return new_location

    def location_2(self):
        '''Fetch the initialized address'''
        headers = {
            'origin': 'https://www.kroger.com',  # required: the server rejects requests without it
            'referer': 'https://www.kroger.com/',
            'accept-language': 'en-US,en;q=0.9',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
        }
        url = 'https://www.kroger.com/atlas/v1/modality/preferences'
        logger.info(url)
        response = requests.post(url, headers=headers, impersonate=IMPERSONATE)
        logger.info(response.status_code)
        raw_location = search('data.modalityPreferences', response.json())
        return raw_location

    def location_3(self, new_location, raw_location):
        '''Merge the zipCode-specific SHIP modality into the initialized address and PUT it back'''
        new_location_id = search('id', new_location)
        raw_location['modalities'].append(new_location)
        locationId = search('fallbackFulfillment', new_location)
        raw_location['primaryModality']['SHIP'] = new_location_id
        headers = {
            'content-type': 'application/json',
            'accept-language': 'en-US,en;q=0.9',
            'referer': 'https://www.kroger.com/',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
        }
        json_data = raw_location
        url = 'https://www.kroger.com/atlas/v1/modality/preferences'
        logger.info(url)
        response = requests.put(url, headers=headers, json=json_data, impersonate=IMPERSONATE)
        logger.info(response.status_code)
        lafObject = search('data.modalityPreferences.lafObject', response.json())
        basic_lafObject = search('data.modalityPreferences.modalities', response.json())
        basic_lafObject.pop(0)  # drop the default modality, keep the zipCode-specific one
        return lafObject, basic_lafObject, locationId

    def location_final(self, zipCode):
        new_location = self.location_1(zipCode)
        raw_location = self.location_2()
        lafObject, basic_lafObject, locationId = self.location_3(new_location, raw_location)  # feeds the search requests
        data = {
            'basic_x_laf_object': json.dumps(basic_lafObject),
            'v1or2_x_laf_object': json.dumps(lafObject),
            'locationId': locationId
        }
        return data

    def search_1(self, task, locationId, x_laf_object):
        url = 'https://www.kroger.com/atlas/v1/search/v1/products-search'
        headers = {
            'referer': f'https://www.kroger.com/search?query={task["keywordText"]}&searchType=default_search',
            'x-laf-object': x_laf_object,
            'accept-language': 'en-US,en;q=0.9',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
        }
        params = {
            'option.facets': ['TAXONOMY', 'BRAND', 'NUTRITION', 'MORE_OPTIONS', 'PRICE', 'SAVINGS'],
            'option.groupBy': 'PRODUCT_VARIANT',
            'filter.locationId': locationId,
            'filter.query': task['keywordText'],
            'filter.fulfillmentMethods': ['IN_STORE', 'PICKUP', 'DELIVERY', 'SHIP'],
            'page.offset': '0',
            'page.size': '24',
            'option.personalization': 'PURCHASE_HISTORY',
        }
        logger.info(url)
        response = requests.get(url, headers=headers, params=params, impersonate=IMPERSONATE)
        logger.info(response.status_code)
        # Extract the upc ids; sponsored items carry a placementId
        upc_json = json.loads(response.text)
        featured_goods_list = search('data.productsSearch[?placementId].[upc,placementId]', upc_json)
        featured_goods_list = dict(featured_goods_list) if featured_goods_list else {}
        upc_ids = search('data.productsSearch[].upc', upc_json)
        cookie = response.cookies.get_dict()
        return upc_ids, featured_goods_list, cookie

    def search_2(self, upcid_list, x_laf_object, task, cookie):
        headers = {
            'content-type': 'application/json',
            'origin': 'https://www.kroger.com',
            'referer': f'https://www.kroger.com/search?query={task["keywordText"]}&searchType=default_search',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
            'x-laf-object': x_laf_object,
            'accept-language': 'en-US,en;q=0.9'
        }
        data = {
            'upcs': upcid_list
        }
        url = 'https://www.kroger.com/products/api/products/details-basic'
        logger.info(url)
        response = requests.post(url=url, headers=headers, json=data, cookies=cookie, impersonate=IMPERSONATE)
        logger.info(response.status_code)
        data_json = json.loads(response.text)
        upcid2_list = search('products[].upc', data_json)
        rank = search('products[].[upc,primaryIndex]', data_json)
        rank = dict(rank) if rank else {}  # upc -> display index
        cookie.update(response.cookies.get_dict())
        return upcid2_list, rank, cookie

    def search_3(self, upcid2_list, x_laf_object, cookie, task):
        headers = {
            'referer': f'https://www.kroger.com/search?query={task["keywordText"]}&searchType=default_search',
            'x-laf-object': x_laf_object,
            'accept-language': 'en-US,en;q=0.9',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
        }
        params = (
            ('filter.gtin13s', upcid2_list),
            ('filter.verified', 'true'),
            ('projections', 'items.full,variantGroupings.compact,offers.compact'),
        )
        url = 'https://www.kroger.com/atlas/v1/product/v2/products'
        logger.info(url)
        response = requests.get(url=url, headers=headers, params=params, cookies=cookie, impersonate=IMPERSONATE)
        logger.info(response.status_code)
        return response

    def data_extract(self, task, response, rank_dic, sponsor_item):
        '''Parse the final product payload'''
        data_for_pool = search('data.products[]', json.loads(response.text))
        post_data = {
            'keyword': task,
            'brands': None,
            'products': []
        }
        for i in data_for_pool:
            Id = search('id', i)
            if not Id:
                logger.error('invalid id -> null')
                continue
            title = search('item.description', i)
            brand = search('item.brand.name', i)
            imgUrl = search('item.images[3].url', i)
            customerFacingSize = search('item.customerFacingSize', i)
            rank = int(rank_dic.get(Id, 0))  # default 0 when the upc is missing from the rank map
            featured = '1' if sponsor_item.get(Id) else ''  # '1' marks a sponsored placement
            # Prefer the store's regular price, then the national one, then promo prices
            price = (search('price.storePrices.regular.defaultDescription', i)
                     or search('price.nationalPrices.regular.defaultDescription', i)
                     or search('price.storePrices.promo.defaultDescription', i)
                     or search('price.nationalPrices.promo.defaultDescription', i))
            addtime = time.strftime('%Y-%m-%d %H:%M:%S')
            products = {
                'index': rank,
                'itemId': Id,
                'title': title,
                'imageUrl': imgUrl,
                'price': price,
                'reviews': 0,
                'rating': 0,
                'brand': brand,
                'itemType': featured,
                'specification': customerFacingSize,
                'CreateTime': addtime
            }
            post_data['products'].append(products)
        post_data['products'].sort(key=lambda x: x['index'])
        return post_data

    def main(self):
        task = {
            'zipCode': '43004',
            'keywordText': 'coffee',
        }
        zipCode = task['zipCode']
        locationSess = self.location_final(zipCode)
        basic_x_laf_object = locationSess.get('basic_x_laf_object')
        v1or2_x_laf_object = locationSess.get('v1or2_x_laf_object')
        locationId = locationSess.get('locationId')
        upcids, featured_goods_list, cookie = self.search_1(task, locationId, v1or2_x_laf_object)
        upcids2, rank_dic, cookie = self.search_2(upcids, basic_x_laf_object, task, cookie)
        res_data = self.search_3(upcids2, v1or2_x_laf_object, cookie, task)
        post_data = self.data_extract(task, res_data, rank_dic, featured_goods_list)
        logger.info(f'keyword: {task["keywordText"]}, zipCode: {task["zipCode"]}, len: {len(post_data["products"])}, data: {post_data}')


if __name__ == '__main__':
    spider = Spider()
    spider.main()
Please credit the source when reposting. Verification of the citations in this article is welcome, as are corrections of any errors or unclear wording; for questions, email 2454612285@qq.com.