
Using mitmproxy to cache a 360-degree panoramic web page offline

2022-07-06 23:11:00 Xiaoming - code entity

Blog home page: https://blog.csdn.net/as604049322

Likes, bookmarks, and comments are welcome. Feel free to discuss!

This article is an original work by Xiaoming (code entity), first published on CSDN.

A question came up yesterday: some web pages load their resources dynamically, so the browser's built-in "save page" feature cannot capture all of them.

Saving the files one by one by hand is also impractical: there are far too many of them, nested layer after layer in folders.

To cache the target page offline, a good approach is to route traffic through a proxy that can be scripted in Python, saving the body of every response to a local file based on its URL.

Installing MitmProxy

I recommend MitmProxy; to install it, run the following on the command line:

pip install mitmproxy

MitmProxy provides three commands: mitmproxy, mitmdump, and mitmweb. Among them, mitmdump can process each request with a specified Python script (passed via the -s parameter).
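For example, a minimal sketch of loading a script (the file name here is a placeholder):

>mitmdump -s my_script.py

By contrast, mitmproxy opens an interactive console UI, while mitmweb serves a browser-based UI.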

After installation, we also need to install MitmProxy's CA certificate. Visit: http://mitm.it/

Accessing it directly (without the proxy) just shows: "If you can see this, traffic is not passing through mitmproxy."

So here we first start a proxy server with mitmweb.

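Starting it is a single command; by default the proxy listens on port 8080, and the web UI opens at http://127.0.0.1:8081:

>mitmweb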

Next we point the browser we use at the proxy server's address (by default 127.0.0.1, port 8080), taking 360 Secure Browser as an example.

After configuring the browser to use the proxy server provided by MitmProxy, visit http://mitm.it/ again and you can download and install the certificate.

After downloading, open the certificate file and click through the wizard (Next) to complete the installation.

If you visit Baidu at this point, you can see MitmProxy's certificate in the connection details.

Writing the script for mitmdump

The template for a script consumed by mitmdump looks like this:

# Every request packet sent by the client is handled by this hook
def request(flow):
    # Get the request object
    request = flow.request


# Every response packet from the server is handled by this hook
def response(flow):
    # Get the response object
    response = flow.response

The request and response objects here behave much like their counterparts in the requests library.
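For instance, inside the hooks the usual attributes are available (a short sketch using standard mitmproxy fields):

def response(flow):
    print(flow.request.url)            # full URL of the request
    print(flow.request.method)         # e.g. "GET"
    print(flow.response.status_code)   # e.g. 200
    print(flow.response.headers)       # the response headers
    body = flow.response.content       # raw body bytes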

Our requirement is to save files keyed by URL, so we only need to handle responses. Let's first try caching Baidu's home page:

import os
import re

dest_url = "https://www.baidu.com/"


def response(flow):
    url = flow.request.url
    response = flow.response
    # Only cache successful (200) responses under the target site
    if response.status_code != 200 or not url.startswith(dest_url):
        return
    # Drop the query string, if any
    r_pos = url.rfind("?")
    url = url if r_pos == -1 else url[:r_pos]
    # Map a trailing slash to index.html
    url = url if url[-1] != "/" else url + "index.html"
    # Derive a local directory name from the target URL:
    # strip the scheme, then replace characters that are unsafe in paths
    path = re.sub(r'[/\\:*?<>|"\s]', "_", dest_url.strip("htps:/"))
    file = path + "/" + url.replace(dest_url, "").strip("/")
    # Split off the file name and make sure its directory exists
    r_pos = file.rfind("/")
    if r_pos != -1:
        path, file_name = file[:r_pos], file[r_pos + 1:]
    os.makedirs(path, exist_ok=True)

    with open(file, "wb") as f:
        f.write(response.content)

Save the script above as dump.py, then start the proxy with the following command (close the previously started mitmweb first):

>mitmdump -s dump.py
Loading script dump.py
Proxy server listening at http://*:8080

After refreshing the page, Baidu's home page has been cached successfully.

Testing with Python's built-in HTTP server and visiting the page, you can see the local copy of Baidu load successfully.
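A sketch of that test, assuming the cache landed in the www.baidu.com directory created by the script:

>cd www.baidu.com
>python -m http.server 8000

Then open http://localhost:8000/index.html in the browser.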

Caching the 360-degree panoramic page offline

Change dest_url to the following address and save:

dest_url = "https://img360wcs.soufunimg.com/2022/03/25/gz/720/3943919a3a7b46769db6f2db1f4250e5/html"

Then revisit: https://img360wcs.soufunimg.com/2022/03/25/gz/720/3943919a3a7b46769db6f2db1f4250e5/html/index.html

If you find that the saved files are incomplete, open the developer tools, tick "Disable cache" on the Network tab, and refresh the page again.

At this point the main files have been cached.

Now just look around in every direction on the original page, and zoom in and out, so that as many of the high-definition detail images as possible get cached.

Starting the local server again to test, the page is accessed successfully.

However, the script above only caches ordinary files with response code 200, while this site also returns music files with response code 206 (Partial Content). Caching those is a little more complicated, so let's look at how to cache the music files next.

Caching 206 partial-content files
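A 206 response carries just one byte range of a file, described by its Content-Range header. A quick sketch with the requests library shows the shape (the URL is a placeholder):

import requests

# Ask the server for only the first kilobyte of a (hypothetical) audio file
r = requests.get("https://example.com/bgm.mp3",
                 headers={"Range": "bytes=0-1023"})
print(r.status_code)                   # 206 if the server honors the range
print(r.headers.get("Content-Range"))  # e.g. "bytes 0-1023/146515"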

After some research, I modified the code above into the following form:

import os
import re

dest_url = "https://img360wcs.soufunimg.com/2022/03/25/gz/720/3943919a3a7b46769db6f2db1f4250e5/html"


def response(flow):
    url = flow.request.url
    response = flow.response
    # Cache both complete (200) and partial (206) responses under the target site
    if response.status_code not in (200, 206) or not url.startswith(dest_url):
        return
    # Drop the query string, if any
    r_pos = url.rfind("?")
    url = url if r_pos == -1 else url[:r_pos]
    # Map a trailing slash to index.html
    url = url if url[-1] != "/" else url + "index.html"
    # Derive a local directory name from the target URL
    path = re.sub(r'[/\\:*?<>|"\s]', "_", dest_url.strip("htps:/"))
    file = path + "/" + url.replace(dest_url, "").strip("/")
    r_pos = file.rfind("/")
    if r_pos != -1:
        path, file_name = file[:r_pos], file[r_pos + 1:]
    os.makedirs(path, exist_ok=True)

    if response.status_code == 206:
        # Content-Range looks like "bytes <start>-<end>/<total>"
        s, e, length = map(int, re.fullmatch(
            r"bytes (\d+)-(\d+)/(\d+)", response.headers['Content-Range']).groups())
        # Create an empty file the first time we see this URL
        if not os.path.exists(file):
            with open(file, "wb"):
                pass
        # Write this chunk at its byte offset within the file
        with open(file, "rb+") as f:
            f.seek(s)
            f.write(response.content)
    elif response.status_code == 200:
        with open(file, "wb") as f:
            f.write(response.content)
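One note on the 206 branch: if a chunk's start offset lies beyond the current end of the file, seek() followed by write() leaves a zero-filled gap, so the byte ranges may arrive in any order and the file is complete once every range has been fetched.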

Save the modified script; mitmdump reloads it automatically.

After clearing the cache and visiting the page again, the music files are downloaded successfully.

Summary

With mitmdump we have successfully implemented caching for a designated website. To cache another website locally in the future, all that needs to change is the dest_url address.
