Crawling Li Yifeng's Weibo comments with Python
2022-06-24 07:40:00 【Python researcher】
Today's target is Weibo, using Li Yifeng's account as an example:
https://weibo.com/liyifeng2007?is_all=1
Then open the comment page of a post and look under XHR in the browser's developer tools to find the real request address:
https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4353796790279702&from=singleWeiBo
Obviously the comments are loaded dynamically. Capturing the request works the same way as in my earlier posts, so I won't repeat it here. The key part is the numeric ID in the URL: once we can find that ID in the first page, we are halfway there. This time we need the re module for regular expressions. I'm not very good at them, but it should still work:
import re
import requests

target = 'https://weibo.com/liyifeng2007?is_all=1'
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
'cookie': 'SUB=_2AkMowDDgf8NxqwJRmPoSyWnqao53ywzEieKenME7JRMxHRl-yT9kqnEjtRB6A0AeDzsLF_aeZGlWOMf4mEl-MBZZXqc_; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWcpq860VQlJcIRRNP9pzqS; SINAGLOBAL=1033839351635.7524.1604108279474; login_sid_t=c071efc77911ceace152df2be5986e09; cross_origin_proto=SSL; WBStorage=8daec78e6a891122|undefined; _s_tentry=-; Apache=8275565331127.246.1604195643561; ULV=1604195643568:3:1:1:8275565331127.246.1604195643561:1604122447982; wb_view_log=1920*10801; UOR=,,editor.csdn.net'
}
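html = requests.get(target,headers=headers).text
for each in re.findall('<a name=(.*?)date=',html):
    # The first space-separated token is the post ID
    real_id = each.split(" ")[0]
    # The second-to-last backslash-separated piece holds the post time
    filename = each.split("\\")[-2].replace('"',"").replace(":",".")
    print(real_id,filename)
The output is as follows: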
The first value is the ID we need. The second is the time the post was published, which we use as the file name for storing the comment data.
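The extraction step above can be tried offline. The following is a minimal sketch, not the exact page markup: `SAMPLE_HTML` is a hypothetical fragment written to mimic the escaped-quote form the regex expects (only the post ID is taken from the URL earlier in this post; the href, title and date values are made up), and `extract_posts` is a helper name introduced here for illustration.

```python
import re

# Hypothetical fragment: quotes are backslash-escaped, as they would be
# when the markup is embedded in a script block. Values are made up.
SAMPLE_HTML = (
    r'<a name=4353796790279702 target=\"_blank\" '
    r'href=\"/0000000000/Example\" title=\"2019-03-24 12:00\" '
    r'date=\"1553400000000\"'
)

def extract_posts(html):
    """Return (post_id, file_name) pairs found in the page source."""
    posts = []
    for each in re.findall('<a name=(.*?)date=', html):
        # The ID is the first space-separated token after name=
        real_id = each.split(" ")[0]
        # The post time is the second-to-last backslash-separated piece;
        # drop the quote and replace ':' so it is usable in a file name
        filename = each.split("\\")[-2].replace('"', "").replace(":", ".")
        posts.append((real_id, filename))
    return posts

print(extract_posts(SAMPLE_HTML))
```

If the real markup differs from this assumed shape (for example, if the attributes appear in a different order), the split indices would need adjusting.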
Then we pass the ID into the second URL:
comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&from=singleWeiBo'
That URL returns the comments sorted by popularity. To get the latest replies instead, use this one:
comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page=1'
The response is JSON, so it is easy to work with: paste it into an online JSON viewer, find the fields we need, and then write the code:
comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page=1'
# The comment markup is returned as an HTML string under data -> html
res = requests.get(comment_url,headers=headers).json()["data"]["html"]
# Extract commenter names and comment text
comments = re.findall('ucardconf="type=1">(.*?)</div>', res)
for each in comments:
    # Strip the HTML tags (including emoji images) from the content
    each = re.sub('<.*?>','',each)
    print(each)
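The two comment URLs used in this post differ only in their query string, so they can be built by one helper. This is a small sketch under that assumption: `build_comment_url` is a name introduced here, and only the parameters that appear in the URLs above (`ajwvr`, `id`, `from`, `page`) are used.

```python
from urllib.parse import urlencode

BASE = "https://weibo.com/aj/v6/comment/big"

def build_comment_url(post_id, page=None):
    """Build the comment URL for a post.

    Without a page number this yields the 'from=singleWeiBo' variant
    (comments by popularity); with one, the paged variant (latest replies).
    """
    params = {"ajwvr": 6, "id": post_id}
    if page is None:
        params["from"] = "singleWeiBo"
    else:
        params["page"] = page
    return f"{BASE}?{urlencode(params)}"

print(build_comment_url("4353796790279702"))
print(build_comment_url("4353796790279702", page=1))
```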
I'm not fluent with re, so this will have to do. The main thing is getting the data in hand, haha.
Compare the output with the page:
With the emoji stripped, comments that consisted only of emoji show just the commenter's name, which is expected. Everything else matches the page exactly.
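The emoji-only case is easy to reproduce offline. The sketch below applies the same two regular expressions to a hypothetical HTML fragment: `SAMPLE_FRAGMENT`, its class names, user names and image paths are all made up for illustration and are not copied from a real Weibo response, and `parse_comments` is a helper name introduced here.

```python
import re

# Hypothetical fragment mimicking the "html" field of the JSON response
SAMPLE_FRAGMENT = (
    '<div class="text"><a href="#" ucardconf="type=1">Alice</a>:great song'
    '<img class="face" src="heart.png" alt="[heart]"></div>'
    '<div class="text"><a href="#" ucardconf="type=1">Bob</a>:'
    '<img class="face" src="smile.png" alt="[smile]"></div>'
)

def parse_comments(fragment):
    """Return one 'name:text' string per comment, with all tags removed."""
    raw = re.findall('ucardconf="type=1">(.*?)</div>', fragment)
    # Removing every tag also removes the emoji <img> elements,
    # so an emoji-only comment collapses to just the commenter's name
    return [re.sub('<.*?>', '', item) for item in raw]

for line in parse_comments(SAMPLE_FRAGMENT):
    print(line)
```

Note that `.` does not match newlines by default, so a comment whose markup spans several lines would be missed by this pattern unless `re.S` is passed; whether that matters depends on the actual response.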
Now that we have the data, let's save it locally. Here is the complete code:
# -*- coding: utf-8 -*-
"""
Created on 2020-11-18
@author: Li Yunchen
"""
#https://weibo.com/liyifeng2007?is_all=1
import requests
import re,os
url = 'https://s.weibo.com/?topnav=1&wvr=6'
target = 'https://weibo.com/liyifeng2007?is_all=1'
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
'cookie': 'SUB=_2AkMowDDgf8NxqwJRmPoSyWnqao53ywzEieKenME7JRMxHRl-yT9kqnEjtRB6A0AeDzsLF_aeZGlWOMf4mEl-MBZZXqc_; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWcpq860VQlJcIRRNP9pzqS; SINAGLOBAL=1033839351635.7524.1604108279474; login_sid_t=c071efc77911ceace152df2be5986e09; cross_origin_proto=SSL; WBStorage=8daec78e6a891122|undefined; _s_tentry=-; Apache=8275565331127.246.1604195643561; ULV=1604195643568:3:1:1:8275565331127.246.1604195643561:1604122447982; wb_view_log=1920*10801; UOR=,,editor.csdn.net'
}
os.makedirs("./images", exist_ok=True)  # the output directory must exist before writing
html = requests.get(target,headers=headers).text
for each in re.findall('<a name=(.*?)date=',html):
    # The first token is the post ID; the post time becomes the file name
    real_id = each.split(" ")[0]
    filename = each.split("\\")[-2].replace('"',"").replace(":",".")
    # print(real_id,filename)
    for page in range(1,11):
        comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page={page}'
        res = requests.get(comment_url,headers=headers).json()["data"]["html"]
        # Extract commenter names and comment text
        comments = re.findall('ucardconf="type=1">(.*?)</div>', res)
        # comments = re.findall('</i></a>(.*?) </div>', res)
        f_name = "./images/"+filename
        with open(f_name+"_ Li Yunchen .txt","a",encoding="utf-8") as f:
            for comment in comments:
                # Strip the HTML tags (including emoji images) from the content
                comment = re.sub('<.*?>','',comment)
                print(comment)
                f.write(comment)
                f.write("\n")
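The storage step can also be separated from the crawling so it is easier to test on its own. This is one possible sketch, not the approach used in the script above: `append_comments` and its `suffix` parameter are names introduced here, and it uses `pathlib` to create the output directory if it is missing.

```python
from pathlib import Path

def append_comments(out_dir, post_time, comments, suffix="_comments.txt"):
    """Append comments, one per line, to a file named after the post time."""
    out_path = Path(out_dir)
    # Create the directory on first use instead of assuming it exists
    out_path.mkdir(parents=True, exist_ok=True)
    target = out_path / f"{post_time}{suffix}"
    # Append mode lets each page of comments be added to the same file
    with target.open("a", encoding="utf-8") as f:
        for comment in comments:
            f.write(comment + "\n")
    return target
```

Calling it once per page, for example `append_comments("./images", filename, comments)`, would keep all pages of one post in a single file.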
This was just a test, so I only crawled ten pages of comments per post:
Once the crawl finishes, you can compare the saved files against the page yourself:
Done!