Scraping Kroger, an Overseas E-commerce Platform

1. Objective

URL: https://www.kroger.com/search?query=pasture%20raised%20eggs&searchType=default_search

Scrape the first page of product results for a given search keyword. In addition, the zipCode must be dynamically configurable, e.g. 43004.

2. Endpoint Identification

Keyword used to locate the endpoint: Vital Farms® Pasture-Raised Large Brown Eggs

Endpoint: https://www.kroger.com/atlas/v1/product/v2/products

3. Trimming the Request

Parameters to reverse: x-laf-object in the headers, and filter.gtin13s in the query params

4. Reversing the Parameters

4.1 x-laf-object

4.1.1 Endpoint Identification

Searching the captured traffic locates the endpoint: https://www.kroger.com/atlas/v1/modality/preferences

4.1.2 Trimming the Request

Parameter to reverse: json_data (the request body)

4.1.3 Reversing the Parameter

4.1.3.1 Endpoint Identification

Searching locates two endpoints:

https://www.kroger.com/atlas/v1/modality/preferences

https://www.kroger.com/atlas/v1/modality/options

Relationship: endpoint 1 returns an initial (default) location object; endpoint 2 returns the location info for the specified zipCode, which gets merged into that initial object. Sending the merged object back yields json_data. JSON extraction path: data.modalityPreferences.lafObject
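The merge-and-extract flow above can be sketched as follows. The payload shapes here are minimal stand-ins inferred from the extraction paths in this write-up, not the live API's full responses:

```python
# Minimal stand-in for the /atlas/v1/modality/preferences response (endpoint 1).
raw_preferences = {
    "data": {"modalityPreferences": {"modalities": [], "primaryModality": {}}}
}
# Minimal stand-in for the SHIP entry returned by /atlas/v1/modality/options (endpoint 2).
ship_option = {"id": "loc-123", "fallbackFulfillment": "01234567"}

# Merge the zipCode-specific location into the initial preferences object.
prefs = raw_preferences["data"]["modalityPreferences"]
prefs["modalities"].append(ship_option)
prefs["primaryModality"]["SHIP"] = ship_option["id"]

# PUTting the merged object back returns the target; extraction path:
# data.modalityPreferences.lafObject
put_response = {"data": {"modalityPreferences": {"lafObject": {"sessionId": "abc"}}}}
laf_object = put_response["data"]["modalityPreferences"]["lafObject"]
```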

4.1.3.2 Trimming the Request

/atlas/v1/modality/preferences: no encrypted parameters. When trimming, keep the UA and referer and test the remaining headers one by one, since each site validates different things; here, testing showed the origin header must be present.

/atlas/v1/modality/options: no special encryption either; pass in the target zipCode to get the corresponding location info.
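A minimal sketch of the options request body, matching the shape used in the source code at the end of this post:

```python
import json

def build_options_body(zip_code: str) -> str:
    """Build the JSON body for POST /atlas/v1/modality/options."""
    return json.dumps({"address": {"postalCode": zip_code}})

body = build_options_body("43004")
```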

4.1.3.3 Request Test

The required data, i.e. x-laf-object, was obtained successfully.

The script's output also matches the browser's, verified with an online JSON diff tool.

4.2 filter.gtin13s

4.2.1 Endpoint Identification

Searching locates the endpoint: https://www.kroger.com/products/api/products/details-basic

4.2.2 Trimming the Request

Parameters to reverse: x-laf-object in the headers, and upc in json_data

4.2.3 Reversing the Parameters

4.2.3.1 x-laf-object in the headers

Source: searching shows it comes from the response of v1/modality/preferences (already analyzed above; only the extraction path differs here). JSON extraction path: data.modalityPreferences.modalities

4.2.3.2 upc in json_data

4.2.3.2.1 Source Identification

Searching locates the endpoint: https://www.kroger.com/atlas/v1/search/v1/products-search

4.2.3.2.2 Trimming the Request

Parameter to reverse: filter.locationId

4.2.3.2.3 Reversing the Parameter

filter.locationId: searching shows it comes from the response of /atlas/v1/modality/options (already analyzed above). JSON extraction path: fallbackFulfillment
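Extraction sketch for filter.locationId, using a stand-in payload shaped after the options response (only the fallbackFulfillment field and the data.modalityOptions.SHIP path come from this write-up; the other values are illustrative):

```python
# Stand-in for the /atlas/v1/modality/options response; the real response
# carries the SHIP option under data.modalityOptions.SHIP.
options_response = {
    "data": {
        "modalityOptions": {
            "SHIP": {"id": "loc-123", "fallbackFulfillment": "01234567"}
        }
    }
}
ship = options_response["data"]["modalityOptions"]["SHIP"]
location_id = ship["fallbackFulfillment"]  # becomes filter.locationId
```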

5. Request Test

All requests passed.

6. Summary

Kroger is not especially difficult: no JS reverse engineering is needed. Careful packet capture and analysis to find these 6 APIs is essentially the whole job, following these 5 steps in a loop:

Define the target > locate the endpoint > trim the request > reverse the parameters > test the request

A few points worth noting:

  1. The site checks TLS fingerprints; with the requests library this shows up as requests hanging with no response for a long time. The author used the curl_cffi library instead, which can impersonate 10 different browser fingerprints.
  2. Proxy quality matters a lot; with curl_cffi, a poor proxy shows up as a "www.kroger.com could not be resolved" error.
  3. A complete run takes 6 API calls, but 3 of them are location-related and can be cached in Redis and reused, improving crawl efficiency.
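A sketch of point 3: caching the location bundle per zipCode with a TTL. This stand-in uses an in-process dict in place of Redis; the key shape and TTL value are assumptions, not taken from the original setup:

```python
import time

class LocationCache:
    """In-process stand-in for the Redis cache: stores the location-related
    values (the x-laf-object variants and locationId) per zipCode with a TTL."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, zip_code):
        entry = self._store.get(zip_code)
        if entry and time.time() - entry["ts"] < self.ttl:
            return entry["data"]
        return None  # missing or expired: re-run the 3 location requests

    def set(self, zip_code, data):
        self._store[zip_code] = {"ts": time.time(), "data": data}


cache = LocationCache()
cache.set("43004", {"locationId": "01234567"})
```

Before calling location_final, check the cache first and only hit the three location endpoints on a miss.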

7. Source Code

import time
import requests
from jmespath import search
import json
from loguru import logger


class Spider:

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def location_1(self, zipCode):
        '''Get the SHIP location for the given zipCode.'''
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
            'Origin': 'https://www.kroger.com',
            'Referer': 'https://www.kroger.com/',
            'Content-Type': 'application/json',
            'accept-language': 'en-US,en;q=0.9'
        }

        data = '{"address":{"postalCode":"%s"}}' % zipCode
        url = 'https://www.kroger.com/atlas/v1/modality/options'
        logger.info(url)
        response = requests.post(url, headers=headers, data=data)
        logger.info(response.status_code)
        new_location = search('data.modalityOptions.SHIP', response.json())
        return new_location

    def location_2(self):
        '''Get the initial (default) location preferences.'''
        headers = {
            'origin': 'https://www.kroger.com',  # required, per trimming tests
            'referer': 'https://www.kroger.com/',
            'accept-language': 'en-US,en;q=0.9',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
        }
        url = 'https://www.kroger.com/atlas/v1/modality/preferences'
        logger.info(url)
        response = requests.post(url, headers=headers)
        logger.info(response.status_code)
        raw_location = search('data.modalityPreferences', response.json())
        return raw_location

    def location_3(self, new_location, raw_location):
        new_location_id = search('id', new_location)
        raw_location['modalities'].append(new_location)
        locationId = search('fallbackFulfillment', new_location)
        raw_location['primaryModality']['SHIP'] = new_location_id
        headers = {
            'content-type': 'application/json',
            'accept-language': 'en-US,en;q=0.9',
            'referer': 'https://www.kroger.com/',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
        }

        json_data = raw_location
        url = 'https://www.kroger.com/atlas/v1/modality/preferences'
        logger.info(url)
        response = requests.put(url, headers=headers, json=json_data)
        logger.info(response.status_code)
        lafObject = search('data.modalityPreferences.lafObject', response.json())
        basic_lafObject = search('data.modalityPreferences.modalities', response.json())
        basic_lafObject.pop(0)
        return lafObject, basic_lafObject, locationId

    def location_final(self, zipCode):
        new_location = self.location_1(zipCode)
        raw_location = self.location_2()
        lafObject, basic_lafObject, locationId = self.location_3(new_location, raw_location)  # feeds the third request
        data = {
            'basic_x_laf_object': json.dumps(basic_lafObject),
            'v1or2_x_laf_object': json.dumps(lafObject),
            'locationId': locationId
        }
        return data

    def search_1(self, task, locationId, x_laf_object):
        url = 'https://www.kroger.com/atlas/v1/search/v1/products-search'
        headers = {
            'referer': f'https://www.kroger.com/search?query={task["keywordText"]}&searchType=default_search',
            'x-laf-object': x_laf_object,
            'accept-language': 'en-US,en;q=0.9',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
        }
        params = {
            'option.facets': ['TAXONOMY', 'BRAND', 'NUTRITION', 'MORE_OPTIONS', 'PRICE', 'SAVINGS', ],
            'option.groupBy': 'PRODUCT_VARIANT',
            'filter.locationId': locationId,
            'filter.query': task['keywordText'],
            'filter.fulfillmentMethods': ['IN_STORE', 'PICKUP', 'DELIVERY', 'SHIP', ],
            'page.offset': '0',
            'page.size': '24',
            'option.personalization': 'PURCHASE_HISTORY',
        }
        logger.info(url)
        response = requests.get(url, headers=headers, params=params)
        logger.info(response.status_code)

        # extract the upc ids
        upc_json = json.loads(response.text)
        featured_goods_list = search('data.productsSearch[?placementId].[upc,placementId]', upc_json)
        if featured_goods_list:
            featured_goods_list = dict(featured_goods_list)
        else:
            featured_goods_list = {}
        upc_ids = search('data.productsSearch[].upc', upc_json)
        cookie = response.cookies.get_dict()
        return upc_ids, featured_goods_list, cookie

    def search_2(self, upcid_list, x_laf_object, task, cookie):
        headers = {
            'content-type': 'application/json',
            'origin': 'https://www.kroger.com',
            'referer': f'https://www.kroger.com/search?query={task["keywordText"]}&searchType=default_search',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
            'x-laf-object': x_laf_object,
            'accept-language': 'en-US,en;q=0.9'
        }

        data = {
            "upcs": upcid_list
        }

        url = 'https://www.kroger.com/products/api/products/details-basic'
        logger.info(url)
        response = requests.post(url=url, headers=headers, json=data, cookies=cookie)
        logger.info(response.status_code)
        data_json = json.loads(response.text)
        upcid2_list = search('products[].upc', data_json)
        rank = search('products[].[upc,primaryIndex]', data_json)
        if rank:
            rank = dict(rank)
        cookie.update(response.cookies.get_dict())
        return upcid2_list, rank, cookie

    def search_3(self, upcid2_list, x_laf_object, cookie, task):
        headers = {
            'referer': f'https://www.kroger.com/search?query={task["keywordText"]}&searchType=default_search',
            'x-laf-object': x_laf_object,
            'accept-language': 'en-US,en;q=0.9',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
        }

        params = (
            ('filter.gtin13s', upcid2_list), ('filter.verified', 'true'),
            ('projections', 'items.full,variantGroupings.compact,offers.compact'),)
        url = 'https://www.kroger.com/atlas/v1/product/v2/products'
        logger.info(url)
        response = requests.get(url=url, headers=headers, params=params, cookies=cookie)
        logger.info(response.status_code)
        return response

    def data_extract(self, task, response, rank_dic, sponsor_item):
        # parse the response data
        data_for_pool = search('data.products[]', json.loads(response.text))
        post_data = {
            'keyword': task,
            'brands': None,
            'products': []
        }
        for i in data_for_pool:
            Id = search('id', i)
            if not Id:
                logger.error('invalid id -> null')
                continue
            title = search('item.description', i)
            brand = search('item.brand.name', i)
            imgUrl = search('item.images[3].url', i)
            customerFacingSize = search('item.customerFacingSize', i)
            rank = int(rank_dic.get(Id, 0))
            if sponsor_item.get(Id):
                featured = '1'  # '1' marks a sponsored (featured) placement
            else:
                featured = ''
            price = search('price.storePrices.regular.defaultDescription', i) if search(
                'price.storePrices.regular.defaultDescription',
                i) else search('price.nationalPrices.regular.defaultDescription', i)
            if not price:
                price = search('price.storePrices.promo.defaultDescription', i) if search(
                    'price.storePrices.promo.defaultDescription',
                    i) else search('price.nationalPrices.promo.defaultDescription', i)

            addtime = time.strftime('%Y-%m-%d %H:%M:%S')
            products = {
                'index': rank,
                'itemId': Id,
                'title': title,
                'imageUrl': imgUrl,
                'price': price,
                'reviews': 0,
                'rating': 0,
                'brand': brand,
                'itemType': featured,
                'specification': customerFacingSize,
                'CreateTime': addtime
            }
            post_data['products'].append(products)
        post_data['products'].sort(key=lambda x: x['index'])
        return post_data

    def main(self):
        task = {
            'zipCode': '43004',
            'keywordText': 'coffee',
        }
        zipCode = task['zipCode']
        locationSess = self.location_final(zipCode)
        basic_x_laf_object = locationSess.get('basic_x_laf_object')
        v1or2_x_laf_object = locationSess.get('v1or2_x_laf_object')
        locationId = locationSess.get('locationId')
        upcids, featured_goods_list, cookie = self.search_1(task, locationId, v1or2_x_laf_object)
        upcids2, rank_dic, cookie = self.search_2(upcids, basic_x_laf_object, task, cookie)
        res_data = self.search_3(upcids2, v1or2_x_laf_object, cookie, task)
        post_data = self.data_extract(task, res_data, rank_dic, featured_goods_list)
        logger.info(f'result: keyword:{task["keywordText"]}, zipCode:{task["zipCode"]}, len:{len(post_data["products"])}, data:{post_data}')


if __name__ == '__main__':
    spider = Spider()
    spider.main()

Please credit the source when reposting. Verification of the sources cited in this article is welcome, as are corrections of anything wrong or unclear; for questions, email 2454612285@qq.com.