Go crawler framework colly in practice (IV): crawling Zhihu answers (I)
2022-06-25 00:17:00 【You're like an ironclad treasure】
Original link: Hzy Blog
1. Preface
I haven't written anything for a few days. Over the last two days I noticed that every time I write a crawler I have to copy and paste the cookie by hand, which is a hassle. colly has a SetCookies method; I didn't know how to use it before, but now I get it.
siteCookies := c.Cookies(url)
c.SetCookies(url, siteCookies)
As shown, this sets the cookies that are sent when a given URL is visited. The cookies are usually the ones from the previous request, and we can decide whether to modify them depending on the situation.
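When the cookie really does have to come from the browser, a small helper can at least turn the pasted string into the []*http.Cookie slice that SetCookies takes. This is only a sketch using the standard library: parseCookieHeader is a name I made up for illustration, and the cookie names and values are invented.

```go
package main

import (
	"fmt"
	"net/http"
)

// parseCookieHeader (hypothetical helper) turns a raw "name=value; name2=value2"
// string, as copied from the browser's request headers, into []*http.Cookie.
func parseCookieHeader(raw string) []*http.Cookie {
	header := http.Header{}
	header.Add("Cookie", raw)
	// http.Request.Cookies parses the Cookie header for us.
	req := http.Request{Header: header}
	return req.Cookies()
}

func main() {
	// Made-up names and values, for illustration only.
	cookies := parseCookieHeader("session_id=abc123; theme=dark")
	for _, ck := range cookies {
		fmt.Printf("%s=%s\n", ck.Name, ck.Value)
	}
	// With a colly collector c, the slice could then be installed with
	// c.SetCookies("https://www.zhihu.com", cookies).
}
```

The cookie string then only has to be pasted in one place (or read from a file) instead of being edited into every crawler by hand.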
2. There is a question on Zhihu asking for recommendations of good anime, so I thought of crawling the answers and then counting which shows come up. (I've seen almost all the good ones, so I'm running a bit short of things to watch…)
Question: What good anime (Japanese TV animation, web animation, OVA/OAD series) are there?
Today, let's crawl all the answers under this question. Tomorrow I'll clean the data and run the statistics!
Because it's already twelve o'clock… and I don't want to go bald.
3. It's the same colly framework as before: a simple request, write the result to a file, and you're done! Straight to the code.
Some considerations and the process:
- The limit of each request seems to be capped at 20.
- So my idea is: the first request returns totals, which tells us how many answers there are.
- Then fetch 20 at a time and write each batch to the file.
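The paging plan above can be sketched on its own, without any network access. This is a minimal illustration, assuming the total is already known; pageOffsets is a hypothetical helper name, not part of colly or the Zhihu API.

```go
package main

import "fmt"

// pageOffsets (hypothetical helper) returns the offset of every page needed
// to cover total items when each request returns at most limit items.
func pageOffsets(total, limit int) []int {
	if total <= 0 || limit <= 0 {
		return nil
	}
	offsets := make([]int, 0, (total+limit-1)/limit)
	// offset < total (not <=) avoids one extra, empty request when total
	// is an exact multiple of limit.
	for offset := 0; offset < total; offset += limit {
		offsets = append(offsets, offset)
	}
	return offsets
}

func main() {
	fmt.Println(pageOffsets(45, 20)) // [0 20 40]
	fmt.Println(pageOffsets(40, 20)) // [0 20]
}
```

Each offset goes into the offset query parameter of the request URL, with limit fixed at 20.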
The full program:
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/gocolly/colly"
	"github.com/gocolly/colly/extensions"
)

func main() {
	// Create the output file, truncating any previous run.
	file, err := os.OpenFile("./answer.txt", os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0644)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	total := 20 // placeholder (one page of 20) until the first response reports the real total
	i := 0      // offset of the current page, i.e. how many answers have been requested so far

	c := colly.NewCollector(func(collector *colly.Collector) {
		extensions.RandomUserAgent(collector)
	})
	c.OnRequest(func(request *colly.Request) {
		fmt.Printf("fetch --->%s\n", request.URL.String())
	})
	c.OnResponse(func(response *colly.Response) {
		var f map[string]interface{}
		if err := json.Unmarshal(response.Body, &f); err != nil { // deserialize
			fmt.Println(err)
			return
		}
		// Find the total number of answers under the question.
		if paging, ok := f["paging"].(map[string]interface{}); ok {
			if totals, ok := paging["totals"].(float64); ok {
				total = int(totals)
			}
		}
		// Walk all the answers returned for the current URL.
		data, _ := f["data"].([]interface{})
		for k, v := range data {
			answer, _ := v.(map[string]interface{})
			content, _ := answer["content"].(string)
			doc, err := goquery.NewDocumentFromReader(strings.NewReader(content))
			if err != nil {
				continue
			}
			// Keep only the text of the <p> elements of each answer.
			fmt.Fprintf(file, "%d:%s\n", i+k, doc.Find("p").Text())
		}
	})

	questionID := "319017029"
	// i < total: the last page starts at the largest multiple of 20 below total.
	for ; i < total; i += 20 {
		url := fmt.Sprintf("https://www.zhihu.com/api/v4/questions/%s/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp,is_labeled,is_recognized,paid_info,paid_info_content;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics&offset=%d&limit=%d&sort_by=updated", questionID, i, 20)
		if err := c.Visit(url); err != nil {
			fmt.Println(err)
		}
	}
}
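The chain of type assertions on interface{} can be replaced by decoding into a struct. The sketch below uses only the fields the program reads (paging.totals and data[].content); answerPage and decodeAnswerPage are names made up for illustration, and the sample JSON is invented rather than a real Zhihu response.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// answerPage (hypothetical type) mirrors only the fields used by the crawler.
type answerPage struct {
	Paging struct {
		Totals int `json:"totals"`
	} `json:"paging"`
	Data []struct {
		Content string `json:"content"`
	} `json:"data"`
}

// decodeAnswerPage (hypothetical helper) parses one API response body.
func decodeAnswerPage(body []byte) (answerPage, error) {
	var page answerPage
	err := json.Unmarshal(body, &page)
	return page, err
}

func main() {
	// Invented sample payload with the same shape as the fields above.
	sample := []byte(`{"paging":{"totals":2},"data":[{"content":"<p>first</p>"},{"content":"<p>second</p>"}]}`)
	page, err := decodeAnswerPage(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(page.Paging.Totals, len(page.Data)) // 2 2
}
```

A missing or differently typed field then shows up as a zero value or a decode error instead of a panic inside the callback.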
4. Tomorrow I'll take the captured data and do some visual analysis or statistics. Go should have a library for this too; I'll look around tomorrow!