Go crawler framework colly in practice (III): scraping and downloading cartoon pictures from Quanjing
2022-06-25 00:17:00 【You're like an ironclad treasure】
Original link: Hzy Blog
Today I tried using colly to scrape a photo website and download its pictures, which turned out to be rather fun.
Let's go straight to the code.
The complete code is on my GitHub, where I keep adding small problems and examples from learning Go.
Notes:
- Cookies must be added, otherwise access is denied.
- A single request seems to return at most 200 pictures, and adjusting the parameters does not change that, so the requests have to be sent in a loop.
- Logic: one collector fetches the result pages, and a second collector downloads the pictures.
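Since one request returns at most 200 pictures, a larger download has to be split into pages. The following is a minimal sketch of that pagination, not the author's code: `buildSearchURL` and `pageURLs` are hypothetical helpers, only a subset of the query parameters is included (whether the site accepts the reduced set is an assumption), and page numbers start at 0 as in the loop used in this article.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// searchEndpoint is the JSON handler named in the article.
const searchEndpoint = "https://www.quanjing.com/Handler/SearchUrl.ashx"

// buildSearchURL returns the search URL for one page of results.
// Assumption: only a subset of the article's query parameters is sent.
func buildSearchURL(keyword string, pageSize, pageNum int) string {
	q := url.Values{}
	q.Set("callback", "searchresult")
	q.Set("q", keyword) // Encode percent-encodes the keyword
	q.Set("pagesize", strconv.Itoa(pageSize))
	q.Set("pagenum", strconv.Itoa(pageNum))
	return searchEndpoint + "?" + q.Encode()
}

// pageURLs returns one URL per page needed to cover total pictures
// when each request yields at most perPage pictures.
func pageURLs(keyword string, total, perPage int) []string {
	if total <= 0 || perPage <= 0 {
		return nil
	}
	pages := (total + perPage - 1) / perPage // round up
	urls := make([]string, 0, pages)
	for page := 0; page < pages; page++ {
		urls = append(urls, buildSearchURL(keyword, perPage, page))
	}
	return urls
}

func main() {
	// 450 pictures at 200 per request need three requests.
	for _, u := range pageURLs("卡通", 450, 200) {
		fmt.Println(u)
	}
}
```

Building the query with `url.Values` avoids hand-escaping the keyword; each returned URL could then be passed to the page collector's `Visit`.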
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/url"
	"os"
	"strings"
	"time"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/extensions"
)
// Goal: use colly to crawl the cartoon pictures on https://www.quanjing.com
/*
1. Press F12 and inspect this page: https://www.quanjing.com/search.aspx?q=%E5%8D%A1%E9%80%9A#%E5%8D%A1%E9%80%9A||1|1000|3|2||||||
2. The page actually loads its data as JSON from this URL: https://www.quanjing.com/Handler/SearchUrl.ashx
3. Check which parameters the page sends and send the same ones.
*/
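func main() {
	t := time.Now()
	// c fetches the search result pages; Async lets its requests run concurrently.
	c := colly.NewCollector(func(collector *colly.Collector) {
		collector.Async = true
		extensions.RandomUserAgent(collector)
	})
	// imageC copies c's configuration and downloads the pictures.
	imageC := c.Clone()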
	// Request headers: the cookie is required, otherwise access is denied.
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("Cookie", "BIGipServerPools_Web_ssl=2135533760.47873.0000; Hm_lvt_c01558ab05fd344e898880e9fc1b65c4=1577432018; qimo_seosource_578c8dc0-6fab-11e8-ab7a-fda8d0606763=%E7%BB%94%E6%AC%8F%E5%94%B4; qimo_seokeywords_578c8dc0-6fab-11e8-ab7a-fda8d0606763=; accessId=578c8dc0-6fab-11e8-ab7a-fda8d0606763; pageViewNum=3; Hm_lpvt_c01558ab05fd344e898880e9fc1b65c4=1577432866")
		r.Headers.Add("referer", "https://www.quanjing.com/search.aspx?q=%E5%8D%A1%E9%80%9A")
		r.Headers.Add("sec-fetch-mode", "cors")
		r.Headers.Add("sec-fetch-site", "same-origin")
		r.Headers.Add("accept", "text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01")
		r.Headers.Add("accept-encoding", "gzip, deflate, br")
		r.Headers.Add("accept-language", "en,zh-CN;q=0.9,zh;q=0.8")
		r.Headers.Add("X-Requested-With", "XMLHttpRequest")
	})
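	// Build each picture URL and hand it to the imageC collector for download.
	c.OnResponse(func(r *colly.Response) {
		// The body is JSONP: strip the 13-byte prefix "searchresult(" and the trailing ")".
		if len(r.Body) < 14 {
			fmt.Println("unexpected response body:", string(r.Body))
			return
		}
		var f interface{}
		if err := json.Unmarshal(r.Body[13:len(r.Body)-1], &f); err != nil {
			fmt.Println("decode search result:", err)
			return
		}
		// Checked type assertions: a missing or malformed field is skipped instead of panicking.
		result, _ := f.(map[string]interface{})
		imgList, _ := result["imglist"].([]interface{})
		for k, item := range imgList {
			img, ok := item.(map[string]interface{})
			if !ok {
				continue
			}
			imgURL, _ := img["imgurl"].(string)
			caption, _ := img["caption"].(string)
			// Carry the caption in the URL fragment so the download handler can name the file.
			imgURL = imgURL + "#" + caption
			fmt.Printf("find -->%d:%s\n", k, imgURL)
			if err := imageC.Visit(imgURL); err != nil {
				fmt.Println(err)
			}
		}
	})
	c.OnError(func(response *colly.Response, err error) {
		fmt.Println(err)
	})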
	// Download each picture from its URL.
	imageC.OnResponse(func(r *colly.Response) {
		fileName := "unknown.jpg"
		// The caption is the part of the URL after "#".
		parts := strings.Split(r.Request.URL.String(), "#")
		if len(parts) >= 2 { // a URL without a caption would otherwise index out of range
			fileName = parts[1] + ".jpg"
		}
		// Decode the percent-encoded caption, otherwise the name is unreadable.
		if res, err := url.QueryUnescape(fileName); err == nil {
			fileName = res
		}
		// Commas in a file name cause errors, so replace them all with underscores.
		fileName = strings.Replace(fileName, ",", "_", -1)
		fmt.Printf("download -->%s\n", fileName)
		// Make sure the target directory exists before creating the file.
		if err := os.MkdirAll("./download", 0755); err != nil {
			fmt.Println(err)
			return
		}
		f, err := os.Create("./download/" + fileName)
		if err != nil {
			fmt.Println(err)
			return
		}
		defer f.Close()
		if _, err := io.Copy(f, bytes.NewReader(r.Body)); err != nil {
			fmt.Println(err)
		}
	})
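	// Build the search URLs.
	pageSize := 200 // pictures per request; the site returns at most 200
	pageNum := 10   // number of pages to request
	// Search keyword "卡通" (cartoon), percent-encoded as in the referer header.
	keyword := url.QueryEscape("卡通")
	for i := 0; i < pageNum; i++ {
		searchURL := fmt.Sprintf("https://www.quanjing.com/Handler/SearchUrl.ashx?t=1952&callback=searchresult&q=%s&stype=1&pagesize=%d&pagenum=%d&imageType=2&imageColor=&brand=&imageSType=&fr=1&sortFlag=1&imageUType=&btype=&authid=&_=1577435470818", keyword, pageSize, i)
		if err := c.Visit(searchURL); err != nil {
			fmt.Println(err)
		}
	}
	// Both collectors are asynchronous, so wait for each to finish.
	c.Wait()
	imageC.Wait()
	fmt.Printf("done,cost:%s\n", time.Since(t))
}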
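The response handler above cuts the JSONP wrapper off at a fixed offset and walks the result through `interface{}` assertions. As a possible alternative, here is a minimal sketch that finds the parentheses itself and decodes into typed structs. `unwrapJSONP`, `parseSearchResult`, and the struct names are hypothetical; the only field names taken from the article are `imglist`, `imgurl`, and `caption`, and the sample body is made up for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// image holds the two fields the article reads from each entry.
type image struct {
	URL     string `json:"imgurl"`
	Caption string `json:"caption"`
}

// searchResult mirrors only the part of the response that is used.
type searchResult struct {
	Images []image `json:"imglist"`
}

// unwrapJSONP returns the payload between the first "(" and the last ")",
// so it does not depend on the length of the callback name.
func unwrapJSONP(body []byte) ([]byte, error) {
	start := bytes.IndexByte(body, '(')
	end := bytes.LastIndexByte(body, ')')
	if start < 0 || end <= start {
		return nil, errors.New("not a JSONP response")
	}
	return body[start+1 : end], nil
}

// parseSearchResult unwraps a JSONP body and decodes its image list.
func parseSearchResult(body []byte) ([]image, error) {
	payload, err := unwrapJSONP(body)
	if err != nil {
		return nil, err
	}
	var res searchResult
	if err := json.Unmarshal(payload, &res); err != nil {
		return nil, fmt.Errorf("decode search result: %w", err)
	}
	return res.Images, nil
}

func main() {
	// Made-up sample body; the real response carries more fields.
	body := []byte(`searchresult({"imglist":[{"imgurl":"https://example.com/a.jpg","caption":"cat"}]})`)
	images, err := parseSearchResult(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, img := range images {
		fmt.Println(img.URL, img.Caption)
	}
}
```

Returning an error instead of panicking keeps one malformed page from stopping the whole crawl, which matters when the collectors run asynchronously.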
Tomorrow, let's see what other fun little projects colly can handle!