Crawling Adu's (Li Yifeng's) Weibo comments with Python
2022-06-24 07:40:00 【Python researcher】
Today's target is Weibo. We'll use Li Yifeng's Weibo page as the example:
https://weibo.com/liyifeng2007?is_all=1
Next, open a post's comment page and look under XHR in the browser's developer tools to find the real request address:
https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4353796790279702&from=singleWeiBo
The comments are clearly loaded dynamically. Fetching them follows the same approach as in my earlier posts, so I won't repeat it here. The key part is the string of digits in the id parameter: once we can find that number on the first page, we are halfway there. This calls for regular expressions via re. I'm not good at them, but it should work well enough:
import re
import requests

target = 'https://weibo.com/liyifeng2007?is_all=1'
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
'cookie': 'SUB=_2AkMowDDgf8NxqwJRmPoSyWnqao53ywzEieKenME7JRMxHRl-yT9kqnEjtRB6A0AeDzsLF_aeZGlWOMf4mEl-MBZZXqc_; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWcpq860VQlJcIRRNP9pzqS; SINAGLOBAL=1033839351635.7524.1604108279474; login_sid_t=c071efc77911ceace152df2be5986e09; cross_origin_proto=SSL; WBStorage=8daec78e6a891122|undefined; _s_tentry=-; Apache=8275565331127.246.1604195643561; ULV=1604195643568:3:1:1:8275565331127.246.1604195643561:1604122447982; wb_view_log=1920*10801; UOR=,,editor.csdn.net'
}
html = requests.get(target, headers=headers).text
for each in re.findall('<a name=(.*?)date=', html):
    real_id = each.split(" ")[0]
    filename = each.split("\\")[-2].replace('"', "").replace(":", ".")
    print(real_id, filename)

The output is as follows:
The first value is the ID we need. The second is the time the post was published, which we use as the file name for storing the comment data.
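To see how the split logic pulls both values out of one regex match, here is a minimal offline sketch. The `sample` string is a hypothetical fragment shaped like the escaped markup the code above appears to expect (quotes preceded by backslashes); the real page's attributes may differ, and `extract_posts` is a helper name introduced only for illustration.

```python
import re

# Hypothetical fragment; the attribute names and values are assumptions,
# chosen only to mimic escaped markup embedded in the profile page.
sample = r'<a name=4353796790279702 target=\"_blank\" title=\"2019-03-25 12:00\" date=\"1553486400000\"'


def extract_posts(html):
    """Return (post_id, file_name) pairs using the article's regex."""
    posts = []
    for chunk in re.findall(r'<a name=(.*?)date=', html):
        # The ID is everything before the first space.
        post_id = chunk.split(" ")[0]
        # The timestamp is the second-to-last piece when splitting on backslashes.
        # Colons are replaced because they are not allowed in Windows file names.
        stamp = chunk.split("\\")[-2].replace('"', "").replace(":", ".")
        posts.append((post_id, stamp))
    return posts


print(extract_posts(sample))
```

If the markup changes so that the title attribute is no longer the last one before date=, the `[-2]` index would pick up a different value, so it is worth printing the raw match when the output looks wrong.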
Then we pass the ID into the second URL:
comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&from=singleWeiBo'
This URL returns comments sorted by popularity. To get the latest replies instead, use this one:
comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page=1'
The response is JSON, so you can paste it into an online JSON viewer to find the data we need. Here is the code:
comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page=1'
res = requests.get(comment_url, headers=headers).json()["data"]["html"]
# Extract commenters and their comments
comments = re.findall('ucardconf="type=1">(.*?)</div>', res)
for comment in comments:
    # Strip HTML tags (including emoji images) from the content
    comment = re.sub('<.*?>', '', comment)
    print(comment)
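The two comment URLs used above differ only in their query parameters, so they can be built by one small helper. This is a sketch, not part of the original code: `build_comment_url` and `BASE` are names made up here, and the meaning of the parameters (from=singleWeiBo for popular comments, page=N for the latest) is taken from the description above rather than from any official documentation.

```python
from urllib.parse import urlencode

BASE = "https://weibo.com/aj/v6/comment/big"


def build_comment_url(post_id, page=None):
    """Build the comment URL; page=None gives the 'popular' variant."""
    params = {"ajwvr": 6, "id": post_id}
    if page is None:
        params["from"] = "singleWeiBo"  # popular comments, per the article
    else:
        params["page"] = page  # latest comments, paginated
    return f"{BASE}?{urlencode(params)}"


print(build_comment_url("4353796790279702"))
print(build_comment_url("4353796790279702", page=1))
```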
I'm not great with re, so this will have to do. The main thing is getting the data in hand, which is what matters most, ha-ha…
Compare with the original page:
With the emoji stripped out, comments that consisted only of emoji show just the commenter's name, which is expected. Everything else matches exactly.
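The emoji-only case can be reproduced offline. The sketch below runs the same two regular expressions on a hypothetical HTML fragment; `sample_html` is invented to resemble the "html" field of the response and is not copied from Weibo, and `extract_comments` is an illustrative helper name. It also adds a `strip()` call, which the original code does not have, to trim leftover whitespace.

```python
import re

# Hypothetical stand-in for the "html" field of the JSON response.
sample_html = (
    '<div class="WB_text"><a href="/u/1" ucardconf="type=1">Alice</a>'
    '：Great post <img alt="[heart]" src="x.png"> </div>'
    '<div class="WB_text"><a href="/u/2" ucardconf="type=1">Bob</a>'
    '：<img alt="[doge]" src="y.png"></div>'
)


def extract_comments(fragment):
    """Return 'name：text' strings with all HTML tags removed."""
    raw = re.findall(r'ucardconf="type=1">(.*?)</div>', fragment)
    # Removing every tag also removes <img> emoji, so an emoji-only
    # comment is reduced to the commenter's name and the colon.
    return [re.sub(r'<.*?>', '', item).strip() for item in raw]


for line in extract_comments(sample_html):
    print(line)
```

To keep the emoji as text instead of dropping them, one option would be to replace each `<img>` tag with its alt value before stripping the remaining tags, assuming the real markup carries an alt attribute.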
Now that we have the data, let's store it locally. Here is the complete code:
# -*- coding: utf-8 -*-
"""
Created on 2020-11-18
@author: Li Yunchen
"""
#https://weibo.com/liyifeng2007?is_all=1
import os
import re

import requests

url = 'https://s.weibo.com/?topnav=1&wvr=6'  # not used below
target = 'https://weibo.com/liyifeng2007?is_all=1'
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
'cookie': 'SUB=_2AkMowDDgf8NxqwJRmPoSyWnqao53ywzEieKenME7JRMxHRl-yT9kqnEjtRB6A0AeDzsLF_aeZGlWOMf4mEl-MBZZXqc_; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWcpq860VQlJcIRRNP9pzqS; SINAGLOBAL=1033839351635.7524.1604108279474; login_sid_t=c071efc77911ceace152df2be5986e09; cross_origin_proto=SSL; WBStorage=8daec78e6a891122|undefined; _s_tentry=-; Apache=8275565331127.246.1604195643561; ULV=1604195643568:3:1:1:8275565331127.246.1604195643561:1604122447982; wb_view_log=1920*10801; UOR=,,editor.csdn.net'
}
html = requests.get(target, headers=headers).text
# Make sure the output directory exists before writing to it
os.makedirs("./images", exist_ok=True)
for each in re.findall('<a name=(.*?)date=', html):
    real_id = each.split(" ")[0]
    filename = each.split("\\")[-2].replace('"', "").replace(":", ".")
    # print(real_id, filename)
    f_name = "./images/" + filename
    # Crawl the first ten pages of comments for this post
    for page in range(1, 11):
        comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page={page}'
        res = requests.get(comment_url, headers=headers).json()["data"]["html"]
        # Extract commenters and their comments
        comments = re.findall('ucardconf="type=1">(.*?)</div>', res)
        # comments = re.findall('</i></a>(.*?) </div>', res)
        for comment in comments:
            # Strip HTML tags (including emoji images) from the content
            comment = re.sub('<.*?>', '', comment)
            print(comment)
            with open(f_name + "_ Li Yunchen .txt", "a", encoding="utf-8") as f:
                f.write(comment)
                f.write("\n")
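The script above always requests ten pages and reopens the output file for every comment. One possible variation, sketched below under stated assumptions, opens the file once per post and stops early when a page yields no comments. `save_comments` and its `fetch` parameter are hypothetical names: `fetch(post_id, page)` stands for any callable that returns the HTML fragment of one page, which keeps the sketch runnable without network access. Treating an empty page as the end of the comments is an assumption, not documented Weibo behavior.

```python
import os
import re


def save_comments(post_id, filename, fetch, pages=10, out_dir="./images"):
    """Append up to `pages` pages of comments to one text file.

    `fetch(post_id, page)` is a hypothetical callable returning the HTML
    fragment for a single page of comments.
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename + ".txt")
    saved = 0
    with open(path, "a", encoding="utf-8") as f:
        for page in range(1, pages + 1):
            fragment = fetch(post_id, page)
            comments = re.findall(r'ucardconf="type=1">(.*?)</div>', fragment)
            if not comments:
                break  # assumption: an empty page means nothing is left
            for comment in comments:
                f.write(re.sub(r'<.*?>', '', comment) + "\n")
                saved += 1
    return path, saved
```

In a real run, `fetch` could wrap the `requests.get(comment_url, headers=headers).json()["data"]["html"]` call from the script, ideally with a short pause between requests so the crawl stays polite.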
This was just a test, so I only crawled ten or so pages of comments:
Once the crawl finishes, you can compare the saved files with the page yourself:
All done!