Crawling Sina Weibo data with a crawler
2022-06-25 03:54:00 【Blockchain research】
Tool: Cloud gathering crawler
Goal: capture all microblogs posted by one blogger
Analyzing the page structure:
The idea is to simulate a browser that visits each page automatically and crawls it.
Looking at the page structure, each Weibo list page needs three or four pull-down loads before it is complete. Once a page-turning button appears at the bottom, the page can be considered fully loaded.
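The loading rule above can be sketched in Python. This is only an illustration, not the tool's actual implementation: `load_full_page`, `scroll_down`, `has_pager`, and the `FakePage` stand-in are all hypothetical names. In a real crawler the two callables would wrap whatever browser automation is in use.

```python
def load_full_page(scroll_down, has_pager, max_scrolls=10):
    """Scroll until the page-turning button appears.

    scroll_down: hypothetical callable that triggers one pull-down load.
    has_pager:   hypothetical callable that returns True once the
                 page-turning button is present at the bottom.
    Returns the number of scrolls that were needed.
    """
    scrolls = 0
    while not has_pager():
        if scrolls >= max_scrolls:
            raise TimeoutError("page-turning button never appeared")
        scroll_down()
        scrolls += 1
    return scrolls


class FakePage:
    """Stand-in for a browser page, used only to demonstrate the loop."""

    def __init__(self, scrolls_needed):
        self.scrolls_needed = scrolls_needed
        self.scrolls = 0

    def scroll_down(self):
        self.scrolls += 1

    def has_pager(self):
        return self.scrolls >= self.scrolls_needed


page = FakePage(scrolls_needed=3)
print(load_full_page(page.scroll_down, page.has_pager))
```

The `max_scrolls` cap is a design choice that keeps the crawler from scrolling forever if the button never shows up.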

Login problem
Crawling requires being logged in, so how do we log in?
Logging in normally requires no verification code; you are asked for one only after a failed attempt. So logging in poses no technical difficulty.
We can create a [Login module]: log in once with a browser, and every later page is fetched with that browser's shared cookies.
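As a rough sketch of the cookie-sharing idea, assuming the cookies are copied out of the logged-in browser as a single `Cookie` header string: the helper names, the cookie names, and the example URL below are placeholders, not values taken from Weibo or from the tool.

```python
from http.cookies import SimpleCookie
from urllib.request import Request


def cookies_from_header(cookie_header):
    """Parse a 'name=value; name2=value2' string into a dict."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    return {name: morsel.value for name, morsel in jar.items()}


def build_request(url, cookies):
    """Build a request that carries the shared login cookies."""
    header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return Request(url, headers={"Cookie": header, "User-Agent": "Mozilla/5.0"})


# Placeholder cookie string; a real one would come from the logged-in browser.
shared = cookies_from_header("session=abc123; uid=42")
request = build_request("https://example.com/page/1", shared)
print(request.get_header("Cookie"))
```

Every later page fetch would reuse the same `shared` dict, so the login step only has to happen once.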

Flow chart design:

We do not need the detail page of each microblog, so the crawler flow has no detail-page step: all data is extracted from the list pages.
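To make list-only extraction concrete, here is a minimal sketch using Python's standard `html.parser`. The class name `post-text` and the sample markup are assumptions made up for illustration; the real Weibo list markup is different and would need its own selector.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PostListParser(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, post_class="post-text"):  # hypothetical class name
        super().__init__()
        self.post_class = post_class
        self.posts = []
        self._depth = 0   # 0 means we are outside any post element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif self.post_class in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.posts.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_posts(html, post_class="post-text"):
    parser = PostListParser(post_class)
    parser.feed(html)
    return parser.posts


sample = (
    '<div class="post-text">Hello <a href="#">world</a></div>'
    '<div class="other">skip me</div>'
    '<div class="post-text">Second post</div>'
)
print(extract_posts(sample))
```

Because each list entry already contains the full text, one pass over the list page yields every record without following any detail links.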
Crawling results:
The crawl took 5 minutes in total and captured 10 pages, 400 microblogs altogether, because I do not post very often.
The data is as follows:

Make a simple word cloud:
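A word cloud is driven by word frequencies, so the counting step might look like the sketch below. It is an assumption about how such input could be prepared, not the method used for the picture above. `word_frequencies` is a hypothetical helper, and the regex tokenizer only suits space-separated text; Chinese posts would first need a word segmenter.

```python
import re
from collections import Counter


def word_frequencies(posts, stopwords=frozenset(), top_n=50):
    """Count words across all posts and return the top_n (word, count) pairs."""
    counts = Counter()
    for post in posts:
        tokens = re.findall(r"\w+", post.lower())
        counts.update(token for token in tokens if token not in stopwords)
    return counts.most_common(top_n)


# Made-up sample posts, used only to show the call.
sample_posts = ["Crawl the web", "crawl Weibo"]
print(word_frequencies(sample_posts, stopwords={"the"}))
```

The resulting pairs can be handed to any word cloud renderer that accepts a word-to-weight mapping.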
