
Go crawler framework colly in practice (IV): crawling Zhihu answers (part 1)

2022-06-25 00:17:00 · You're like an ironclad treasure

Original link: Hzy Blog

1. Preface

I haven't written anything for a few days. Over the last two days I realized that every time I write a crawler I end up copy-pasting my cookie by hand, which is tedious. colly has a SetCookies method that I never figured out how to use before; now I get it.

siteCookies := c.Cookies(url)
c.SetCookies(url, siteCookies)

So it seems you can set the cookies that will be used when visiting a URL. The cookies returned by Cookies are usually the ones from the last request, and you can decide whether to modify them before setting them again.
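As a rough illustration, here is a minimal sketch (not from the original post) of reading cookies back after one request and reusing them, possibly modified, for the next. The cookie name example_cookie is just a placeholder, and the errors returned by Visit are ignored for brevity:

package main

import (
	"fmt"
	"net/http"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// First request: the collector's cookie jar stores whatever cookies the site sets.
	c.Visit("https://www.zhihu.com")

	// Read back the cookies kept for that URL.
	cookies := c.Cookies("https://www.zhihu.com")

	// Optionally modify or append one before the next request
	// (example_cookie is a placeholder, not a real Zhihu cookie).
	cookies = append(cookies, &http.Cookie{Name: "example_cookie", Value: "example_value"})

	if err := c.SetCookies("https://www.zhihu.com", cookies); err != nil {
		fmt.Println("failed to set cookies:", err)
	}

	// Later visits to the same site will reuse the updated cookies.
	c.Visit("https://www.zhihu.com/hot")
}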

2. There is a question on Zhihu asking for recommendations of good anime series, and I figured I'd crawl the answers down and then tally up which shows are worth watching. (I've already seen pretty much every good one... a bit of a drought...)

The question: Are there any good anime series (Japanese TV animation, web animation, OVA/OAD serial works) to watch?

Today I'll just crawl down all the answers under this question; tomorrow I'll clean the data and run the statistics!!!
Because it's already twelve o'clock … and I don't want to go bald.

3. As usual with the colly framework, it's just a single request and a write to a file, and you're done! Straight to the code.

Some considerations and the overall flow:

  • The limit per request appears to be capped at 20 answers.
  • So the idea is: on the first request, read totals from the paging info to find out how many answers there are in total.
  • Then fetch them 20 at a time, page by page, and write each batch to the file.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/gocolly/colly"
	"github.com/gocolly/colly/extensions"
)

func main() {
	file, err := os.OpenFile("./answer.txt", os.O_RDWR|os.O_CREATE, 0766) // output file for the answers
	if err != nil {
		fmt.Println(err)
		return
	}
	defer file.Close()

	total := 20 // each request returns at most 20 answers; updated from paging.totals after the first response
	i := 0      // running index of the answers written so far
	c := colly.NewCollector(func(collector *colly.Collector) {
		extensions.RandomUserAgent(collector) // rotate the User-Agent on every request
	})
	c.OnRequest(func(request *colly.Request) {
		fmt.Printf("fetch --->%s\n", request.URL.String())
	})
	c.OnResponse(func(response *colly.Response) {
		var f interface{}
		if err := json.Unmarshal(response.Body, &f); err != nil { // deserialize the JSON body
			fmt.Println(err)
			return
		}
		// total number of answers under the question
		paging := f.(map[string]interface{})["paging"]
		total = int(paging.(map[string]interface{})["totals"].(float64))
		// walk every answer returned by the current page
		data := f.(map[string]interface{})["data"]
		for k, v := range data.([]interface{}) {
			content := v.(map[string]interface{})["content"]
			reader := strings.NewReader(content.(string))
			doc, _ := goquery.NewDocumentFromReader(reader) // the answer body is HTML; keep only the <p> text
			file.Write([]byte(fmt.Sprintf("%d:%s\n", i+k, doc.Find("p").Text())))
		}
	})

	questionID := "319017029"
	for ; i <= total; i += 20 {
		url := fmt.Sprintf("https://www.zhihu.com/api/v4/questions/%s/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp,is_labeled,is_recognized,paid_info,paid_info_content;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics&offset=%d&limit=%d&sort_by=updated", questionID, i, 20)
		c.Visit(url) // Visit is synchronous here, so total and i are updated before the next iteration
	}
}
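As a side note, the chain of map[string]interface{} type assertions in OnResponse could instead unmarshal into a small struct covering only the fields the crawler actually reads, paging.totals and data[*].content. Below is a minimal sketch under that assumption; answerPage and decodeAnswers are names made up for illustration:

package main

import (
	"encoding/json"
	"fmt"
)

// answerPage mirrors only the fields of the answer-list response that the crawler reads.
type answerPage struct {
	Paging struct {
		Totals int `json:"totals"`
	} `json:"paging"`
	Data []struct {
		Content string `json:"content"` // the answer body, as an HTML fragment
	} `json:"data"`
}

// decodeAnswers parses one page of the answer-list JSON and returns the total number
// of answers plus the HTML content of each answer on this page.
func decodeAnswers(body []byte) (total int, contents []string, err error) {
	var page answerPage
	if err = json.Unmarshal(body, &page); err != nil {
		return 0, nil, err
	}
	for _, d := range page.Data {
		contents = append(contents, d.Content)
	}
	return page.Paging.Totals, contents, nil
}

func main() {
	// Tiny hand-written sample in the same shape as the real response, just to exercise the decoder.
	sample := []byte(`{"paging":{"totals":2},"data":[{"content":"<p>answer one</p>"},{"content":"<p>answer two</p>"}]}`)
	total, contents, err := decodeAnswers(sample)
	fmt.Println(total, contents, err)
}

Inside the OnResponse callback, decodeAnswers would simply be fed response.Body.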

4. Tomorrow I'll take the crawled data and do some visual analysis, or at least some statistics. Go should have a library for that too; I'll look around tomorrow!!
