Bypassing cookie-based anti-scraping with a Chrome extension: an overseas e-commerce site

crawler, javascript, web crawler, python, Chrome extension, flask

Published: 2023-06-18 15:54

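Goal

url: https://www.smythstoys.com/de/de-de. Search for any keyword and scrape the search results.
Fields: title, image, price, detail-page URL

Preliminary work

Finding the endpoint

Search for toy in the search bar, copy a distinctive string from the results, and search for it again in the browser's developer tools. This leads to the key endpoint: https://www.smythstoys.com/de/de-de/search/?text=toy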

Locating the endpoint

Simplifying the request

Testing shows that the request depends on a single encrypted parameter: reese84

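Decrypting the reese84 parameter

Locating the entry point

Searching the response headers and response bodies for the name reese84 with the Proxyman capture tool found nothing. Searching for its value instead showed that it comes back in a response body, only under the name token. The endpoint is: https://www.smythstoys.com/mbit-And-Dirers-him-Face-and-sure-such-Parry-qui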

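Tracing it to the source

Replaying the mbit-And-Dirers-him-Face-and-sure-such-Parry-qui request returns a CAPTCHA page with status code 400.

A CAPTCHA has to be passed
If the task does not need high concurrency, is there a simpler way? There is. Watching this endpoint's requests and responses closely reveals a pattern: the backend returns this parameter automatically at regular intervals, roughly every 10 minutes.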

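The endpoint's regular pattern

In that case, a crawler that lives inside the browser would get its cookie parameter refreshed automatically. One way to embed it is the Content Scripts feature of Chrome extensions. Before building the extension, first test whether a fetch request gets through.

The fetch request gets through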

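Building the Chrome extension

Official Chrome extension development documentation

Design outline

Requests use fetch. It needs no hand-written cookie parameter, because for same-origin requests it automatically sends the cookies of the current domain.
The host environment is the site's home page (any other page works too, as long as the request stays same-origin).
There are two options for receiving the data: forwarding over RPC (WebSocket protocol), or a local server built with flask. This post uses flask (the flask_pymongo extension makes saving the data straightforward).
The control interface is a popup window with two controls: a keyword input and a start button.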
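
To make the data flow in the outline concrete, here is a minimal Python sketch of the JSON message that travels from the page to the local server. It assumes the payload shape used later in this post (a keyword plus a res object holding status and text). The helper names build_payload and read_payload are hypothetical and exist only for illustration.

```python
import json
from typing import Tuple


def build_payload(keyword: str, status: int, text: str) -> str:
    """Serialize one crawl result in the shape the content script is assumed to POST."""
    return json.dumps({"keyword": keyword, "res": {"status": status, "text": text}})


def read_payload(raw: str) -> Tuple[str, int, str]:
    """Parse the JSON body on the receiving side and return (keyword, status, html)."""
    body = json.loads(raw)
    return body["keyword"], body["res"]["status"], body["res"]["text"]
```

Keeping the HTTP status next to the HTML lets the receiving side tell a real result page from an error page before it tries to parse anything.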

Popup window

Project directory structure

manifest.json

manifest.json is the manifest that declares the extension's features and files.

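{
  "name": "smythstoys Spider", // extension name
  "version": "1.0", // extension version
  "manifest_version": 3, // version of the manifest.json format used by this extension; 3 is currently the latest
  "description": "crawl", // extension description
  "action": {
    "default_title": "smythstoys Spider", // title shown for the extension's toolbar button
    "default_popup": "popup.html" // HTML file the popup loads
  },
  "content_scripts": [
    {
      "js": [
        "content_scripts.js" // script to inject into the page
      ],
      "matches": [
        "*://www.smythstoys.com/*" // URLs to inject into
      ],
      "run_at": "document_start" // injection timing: before the rest of the DOM is built and before any page script runs
    }
  ],
  "permissions": [
    "tabs" // API permission; some APIs can only be used after their permission is declared
  ],
  "host_permissions": [
    "<all_urls>" // host permissions: allow-list of hosts that background.js or the popup page may send requests to; unrestricted here, so any host is allowed
  ]
}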
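
To show which pages the matches entry covers, here is a rough Python approximation of the match-pattern check. It only handles patterns shaped like the one in this manifest (a scheme, an exact host, and a path glob) and is not Chrome's actual implementation. The function name matches is hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

MATCH_PATTERN = "*://www.smythstoys.com/*"


def matches(url: str, pattern: str = MATCH_PATTERN) -> bool:
    """Approximate the match-pattern check for patterns of the form scheme://host/path."""
    scheme_pat, rest = pattern.split("://", 1)
    host_pat, _, path_pat = rest.partition("/")
    parts = urlsplit(url)
    # A "*" scheme is treated as http or https, the schemes relevant to page loads.
    if scheme_pat == "*":
        scheme_ok = parts.scheme in ("http", "https")
    else:
        scheme_ok = parts.scheme == scheme_pat
    # Simplification: only an exact host is supported, no "*." subdomain wildcards.
    host_ok = parts.hostname == host_pat
    # An empty path is treated as "/"; the query string is matched along with the path.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return scheme_ok and host_ok and fnmatchcase(path, "/" + path_pat)
```

Under these assumptions the search page and the home page both match, while the bare smythstoys.com host (without www) does not.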

popup.html is the content of the popup window that appears when the user clicks the extension icon.

<html>
<head>
    <meta charset="UTF-8">
    <style>
        .inline {
            display: inline-block
        }
    </style>
</head>
<body>
<form>
    <fieldset>
        <legend>spider</legend>
        <p>
            <input class='inline' id="keyword" placeholder="Search term"/>
            <button class='inline' id="start" type="button">Start</button>
        </p>
    </fieldset>
</form>
<script src="popup.js"></script>
</body>
</html>

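inline-block is a value of the CSS display property. An element with it flows inline next to its neighbours while still accepting a width and height like a block. Here it keeps the search box and the Start button on one line. input and button elements already render this way by default, so the rule is a safeguard rather than a requirement.

popup.js is the external script loaded by popup.html and handles the user interaction.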

(function () {
    document.getElementById('start').addEventListener('click', function () {
        chrome.tabs.query({active: true, currentWindow: true}, function (tabs) { // message passing between extension components: first find the active tab in the current window
            chrome.tabs.sendMessage(tabs[0].id, {'operation': 'start', 'data': document.getElementById('keyword').value}); // send the start command together with the keyword to crawl
        });
    });
})();

content_scripts.js

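Content Scripts run JavaScript in the context of a web page, where they can read, modify, and observe the page's DOM elements and events. They run in an isolated world, so the page's own scripts do not see their variables. Content Scripts can exchange messages with the other parts of the extension, and they can call a limited subset of the extension APIs directly (chrome.runtime messaging, for example).

// Written this way, the next step receives both status and text; a plain .then(res => res.text()) passes on only the text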
function handleResponse(response) {
    return response.text().then(text => {
        return {
            'status': response.status,
            'text': text,
        }
    })
}

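function crawl(keyword) {
    // Encode the keyword so spaces and special characters survive in the query string
    let url = `https://www.smythstoys.com/de/de-de/search/?text=${encodeURIComponent(keyword)}`
    console.log(`request: ${url}`)
    fetch(url).then(handleResponse)
        .then((res) => {
            console.log(`download: ${res.status}`);
            return fetch(`http://127.0.0.1:8081/result`, { // endpoint of the local Flask backend
                headers: {
                    "content-type": "application/json" // needed so that Flask parses the body as JSON (request.json)
                },
                method: 'POST',
                body: JSON.stringify({
                    'keyword': keyword,
                    'res': res
                })
            })
        })
        .then(res => res.text())
        .then((text) => {
            console.log(`save: ${text}`);
        })
        .catch((err) => console.error(`crawl failed: ${err}`)) // surface network or backend errors
}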

chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
    if (message['operation'] === 'start') { // when the start command arrives
        console.log('crawl is going to start...')
        crawl(message['data']) // start crawling the given keyword
    }
})
console.log('content js has injected!')
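
A keyword that contains spaces, & or umlauts has to be percent-encoded before it goes into the text= query parameter. Here is a minimal Python sketch of that URL construction; the helper name build_search_url is hypothetical.

```python
from urllib.parse import quote

SEARCH_URL = "https://www.smythstoys.com/de/de-de/search/?text="


def build_search_url(keyword: str) -> str:
    """Return the search URL with the keyword percent-encoded as UTF-8."""
    # safe="" encodes every reserved character, including "/" and "&".
    return SEARCH_URL + quote(keyword, safe="")
```

Without the encoding, a keyword such as lego & duplo would be split at the & and the site would only see lego as the search text.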

flask_server.py

Flask is a lightweight web application framework written in Python, built on the Werkzeug WSGI toolkit and the Jinja2 template engine. It makes it easy to set up a small web application.

from flask_cors import CORS # allow cross-origin requests
from flask import Flask, request
from parsel import Selector # HTML parsing library
from loguru import logger
from flask_pymongo import PyMongo # PyMongo extension for Flask

app = Flask(__name__)
app.config["MONGO_URI"] = "mongodb://localhost:27017/spider"
mongo = PyMongo(app)
CORS(app)


@app.route('/result', methods=['POST']) # one route receives, parses, and saves the data; it pairs flexibly with the Chrome extension
def result():
    keyword = request.json['keyword']
    logger.debug(f'request: https://www.smythstoys.com/de/de-de/search/?text={keyword}')
    status = request.json['res']['status']
    logger.debug(f'download: {status}')
    data = request.json['res']['text']
    s = Selector(text=data)
    items = s.css('article.st-layout-item')
    data2save = {'keyword': keyword, 'products': []}
    for item in items:
        title = item.xpath(".//h2[contains(@class,'prodName trackProduct')]/text()").get()
        price = item.xpath(".//div[@itemprop='price']/@content").get()
        img = item.xpath(".//picture/img[@class='lazy']/@data-src").get()
        href = item.xpath(".//a[@class='trackProduct']/@href").get()
        if href:
            href = 'https://www.smythstoys.com' + href
        data2save['products'].append({
            'title': title,
            'price': price,
            'img': img,
            'href': href
        })
    logger.debug(f'parse: {data2save}')
    mongo.db['smythstoys'].insert_one(data2save)  # save to MongoDB
    logger.debug('save: success')

    return 'Success'


if __name__ == '__main__':
    app.run(host='127.0.0.1', port=8081)
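
The route stores whatever HTML it receives, whatever the forwarded status code is. Two small helpers could tighten that, sketched here with the standard library only. The names should_parse and absolute_url are hypothetical, and treating only 200 as a usable result page is an assumption, not something the site documents.

```python
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "https://www.smythstoys.com"


def should_parse(status: int) -> bool:
    """Assumption: only a 200 response carries the product listing."""
    return status == 200


def absolute_url(href: Optional[str]) -> Optional[str]:
    """Resolve a scraped href against the site root; a missing href stays None."""
    if not href:
        return None
    return urljoin(BASE_URL, href)
```

Calling should_parse(status) before building the Selector would keep a CAPTCHA or error page from being saved as an empty product list. urljoin also leaves an href that is already absolute unchanged, which plain string concatenation would not.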

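Results

Browser view

Flask view

MongoDB view

Summary

This task shows another way to build a crawler: pair a Chrome extension with flask and sidestep complicated parameter encryption altogether. The approach has its own limits, for example:
1) It does not suit crawling jobs that need high concurrency
2) A browser has to stay open the whole time
But when concurrency requirements are low and cracking the encrypted parameter is not worth the effort, it is an elegant option.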
