
Crawler case 1: Using JS reverse engineering to fetch HD wallpapers from Minimalist Wallpapers

2022-06-26 08:03:00 【Live Firestone】

Contents

Preface
I. Anti-scraping measures on Minimalist Wallpapers
II. The scraping process
Summary

Preface

This article covers two main techniques:

POST requests with the requests module
A little JS reverse engineering

I. Anti-scraping measures on Minimalist Wallpapers

F12 cannot be used to open the packet capture tool
JS-based anti-scraping
As a matter of courtesy: this is a conscientious domestic wallpaper site, so please be a considerate crawler

II. The scraping process

1. Open the packet capture tool

The packet capture shows that the image behind the captured URL has a different size from the proportions displayed on the page. The URL in the a tag also suggests that this image is only a thumbnail, used purely for display.

2. Find the image address

The image cannot be found in the page source, and more pictures keep appearing as we scroll, so the content is loaded dynamically. We can therefore look for it in the Network panel.

Dynamically loaded data usually shows up under the XHR filter. As we scroll the mouse wheel, the number of getJson requests keeps growing, so the image URLs must be in there.

But when we open the response, this is what we find.

At that point I was completely baffled (what on earth is this?) and gave up...
Of course not. Since this is what the dynamic loading returns, it must have something to do with the image addresses. So I clicked on a picture.

An extra URL appeared in the packet capture tool. Opening that URL shows the picture itself, so this is the image address.

3. Analyze the image address

Let's collect a few more image addresses and see how they differ:
https://w.wallhaven.cc/full/5w/wallhaven-5wo3j8.jpg
https://w.wallhaven.cc/full/ym/wallhaven-ym3veg.jpg
https://w.wallhaven.cc/full/83/wallhaven-836rgy.jpg

The addresses differ only in their final part. That reminded me of the getJson response, so I searched it for 5wo3j8.

The ID appears there, so this is the variable part of the image address. All we need is the content of the getJson response.
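Judging from the three addresses above, the folder after full/ appears to be the first two characters of the ID. The following is a minimal sketch of building an address from an ID under that assumption; build_image_url is a hypothetical helper name, and the jpg default only reflects the three samples.

```python
def build_image_url(image_id: str, ext: str = "jpg") -> str:
    """Build a full-size image address from an ID such as '5wo3j8'."""
    # Assumption: the folder name is the first two characters of the ID,
    # as in full/5w/wallhaven-5wo3j8.jpg
    return "https://w.wallhaven.cc/full/{}/wallhaven-{}.{}".format(
        image_id[:2], image_id, ext
    )


# Rebuild the three sample addresses from their IDs
for image_id in ["5wo3j8", "ym3veg", "836rgy"]:
    print(build_image_url(image_id))
```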

4. Download the pictures

Looking at the getJson request, we find that it is a POST request (I hadn't used POST in a long time, so I opened my notes and reviewed how POST requests work).
First, write the request headers

# Request headers as captured from one getJson request
headers = {
    "accept-encoding": "gzip,deflate,br",
    "accept-language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,zh-TW;q=0.6",
    "access": "918c3eb8f5d471ffdc7a34365152b220b07d8bdc5c991a47961564a62b84834",
    "content-length": "30",
    "content-type": "application/json",
    "location": "bz.zzzmh.cn",
    "origin": "https://bz.zzzmh.cn",
    "referer": "https://bz.zzzmh.cn/",
    "sign": "273a3b6b44a285e367af744c37eb30f6",
    "timestamp": "1614348146725",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36",
}
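The content-type header here is application/json, so the POST body has to be a JSON document rather than form-encoded key-value pairs. The sketch below shows the difference between the two encodings. It uses only the standard library and sends nothing over the network; the payload fields are the ones this article's script sends, and the variable names are illustrative.

```python
import json
from urllib.parse import urlencode

payload = {"pageNum": "1", "target": "index"}

# content-type: application/json -> the body is a JSON document
json_body = json.dumps(payload)

# content-type: application/x-www-form-urlencoded -> the body is key=value pairs
form_body = urlencode(payload)

print(json_body)  # {"pageNum": "1", "target": "index"}
print(form_body)  # pageNum=1&target=index
```

With requests, the JSON form corresponds to passing data=json.dumps(payload) together with a JSON content-type header, while passing the dict directly as data= produces the form-encoded body.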

Now compare the request headers of several getJson requests.

The access and timestamp values differ from one getJson request to the next.
The access value is not encrypted but hashed, using an algorithm from the SHA (Secure Hash Algorithm) family.
To see exactly how it is computed, we need to look at how access is built:

import time
import hashlib

# Millisecond timestamp, the same length as the timestamp in the captured request
timestamp = str(int(time.time() * 1000))
# access = SHA-256 of content-type + location + sign + timestamp
access = "application/json" + "bz.zzzmh.cn" + "273a3b6b44a285e367af744c37eb30f6" + timestamp
access = hashlib.sha256(access.encode("utf-8")).hexdigest()
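The same computation can be wrapped in a small helper to make the dependence on the timestamp explicit. This is only a sketch: make_access is a hypothetical name, the default values are copied from the captured headers, and the concatenation order mirrors the snippet above. Whether the site still builds access this way is not guaranteed.

```python
import hashlib


def make_access(timestamp: str,
                sign: str = "273a3b6b44a285e367af744c37eb30f6",
                content_type: str = "application/json",
                location: str = "bz.zzzmh.cn") -> str:
    """Return the SHA-256 hex digest of content-type + location + sign + timestamp."""
    raw = content_type + location + sign + timestamp
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


a = make_access("1614348146725")
b = make_access("1614348146726")
print(len(a))   # 64 hex characters for SHA-256
print(a != b)   # True: a different timestamp gives a different access value
```

Because the timestamp is part of the hashed string, every request carries a different access value, which matches what the captured headers show.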

This gives us access and timestamp.
Now we can write the full script:

import hashlib
import json
import os
import time

import requests

"""
A POST request needs to carry parameters.
"content-type": "application/json"
When content-type is JSON, the POST body must be JSON.
When content-type is key-value (form) format, the POST body must be key-value pairs.

These values differ for every request:
timestamp: millisecond timestamp
access: hash from the SHA (Secure Hash Algorithm) family
"""

# Millisecond timestamp, the same length as the timestamp in the captured request
timestamp = str(int(time.time() * 1000))
access = "application/json" + "bz.zzzmh.cn" + "273a3b6b44a285e367af744c37eb30f6" + timestamp
access = hashlib.sha256(access.encode("utf-8")).hexdigest()


def main():
    url = "https://api.zzzmh.cn/bz/getJson"
    # content-length is omitted: requests computes it from the body
    headers = {
        "accept-encoding": "gzip,deflate,br",
        "accept-language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,zh-TW;q=0.6",
        "access": access,
        "content-type": "application/json",
        "location": "bz.zzzmh.cn",
        "origin": "https://bz.zzzmh.cn",
        "referer": "https://bz.zzzmh.cn/",
        "sign": "273a3b6b44a285e367af744c37eb30f6",
        "timestamp": timestamp,
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36",
    }
    data = {
        "pageNum": "1",
        "target": "index"
    }
    # Raw string avoids invalid escape sequences; change this to your own folder
    save_dir = r"D:\Code\spider\douban\python Reptile advanced\picture"
    os.makedirs(save_dir, exist_ok=True)

    response = requests.post(url, headers=headers, data=json.dumps(data)).json()
    imageList = response.get("result").get("records")
    extByType = {"j": "jpg", "p": "png"}  # type code -> file extension
    for image in imageList:
        imageType = image.get("t")
        imageNum = image.get("i")
        ext = extByType.get(imageType)
        if ext is None:
            continue  # unknown type code: skip instead of requesting a malformed URL
        newurl = "https://w.wallhaven.cc/full/{}/wallhaven-{}.{}".format(imageNum[:2], imageNum, ext)

        print("Downloading: " + newurl)
        content = requests.get(newurl).content
        # Save with the real extension instead of always using .jpg
        path = os.path.join(save_dir, imageNum + "." + ext)
        with open(path, 'wb') as f:
            f.write(content)


if __name__ == '__main__':
    main()