Python: crawling the Weibo comments of Adu (Li Yifeng)
2022-06-24 07:40:00 【Python researcher】
Today's target is Weibo, using Li Yifeng's Weibo page as an example:
https://weibo.com/liyifeng2007?is_all=1
Then open a post's comment page and look under XHR in the browser's developer tools to find the real request address:
https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4353796790279702&from=singleWeiBo
The comments are clearly loaded dynamically. Scraping them follows the same approach I have covered in earlier posts, so I won't repeat it here. The key part is the string of digits in the URL (the post ID): once we can find that ID in the source of the first page, we are halfway there. For this we need re regular expressions. I'm not great with them, but it should still work:
import re
import requests

target = 'https://weibo.com/liyifeng2007?is_all=1'
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
'cookie': 'SUB=_2AkMowDDgf8NxqwJRmPoSyWnqao53ywzEieKenME7JRMxHRl-yT9kqnEjtRB6A0AeDzsLF_aeZGlWOMf4mEl-MBZZXqc_; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWcpq860VQlJcIRRNP9pzqS; SINAGLOBAL=1033839351635.7524.1604108279474; login_sid_t=c071efc77911ceace152df2be5986e09; cross_origin_proto=SSL; WBStorage=8daec78e6a891122|undefined; _s_tentry=-; Apache=8275565331127.246.1604195643561; ULV=1604195643568:3:1:1:8275565331127.246.1604195643561:1604122447982; wb_view_log=1920*10801; UOR=,,editor.csdn.net'
}
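html = requests.get(target, headers=headers).text
# Each match starts with the post ID, followed by escaped attributes that include the post time
for each in re.findall('<a name=(.*?)date=', html):
    real_id = each.split(" ")[0]
    # The post time is the second-to-last backslash-separated piece; drop the quote and make it filename-safe
    filename = each.split("\\")[-2].replace('"', "").replace(":", ".")
    print(real_id, filename)
The output is as follows: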
The first value is the ID we need. The second is the time the post was published, which we use as the file name for storing the comment data.
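To make the splitting logic easier to follow, here is a minimal offline sketch of the same parsing step. The SAMPLE fragment is a hypothetical imitation of the escaped markup embedded in the page (only the ID comes from the URL above; the title and date values are made up), and extract_posts is a helper name introduced just for this illustration. The real page source may look different.

```python
import re

# Hypothetical fragment imitating the escaped markup in the page source.
# Only the ID is taken from the article; the other values are invented.
SAMPLE = r'<a name=4353796790279702 target=\"_blank\" title=\"2019-03-24 12:00\" date=\"1553400000000\"'


def extract_posts(html):
    """Return (post_id, filename_stem) pairs found in the page source."""
    posts = []
    for match in re.findall(r'<a name=(.*?)date=', html):
        # The ID is everything before the first space
        post_id = match.split(" ")[0]
        # The post time is the second-to-last backslash-separated piece
        stamp = match.split("\\")[-2].replace('"', "").replace(":", ".")
        posts.append((post_id, stamp))
    return posts


print(extract_posts(SAMPLE))
# → [('4353796790279702', '2019-03-24 12.00')]
```

The colon is replaced with a dot because colons are not allowed in Windows file names.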
Then we pass the ID into the second URL:
comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&from=singleWeiBo'
This URL returns the comments sorted by popularity. To fetch the latest replies instead, use this one:
comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page=1'
The response is JSON, so you can paste it into an online JSON viewer to find the fields we need. Here is the code:
comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page=1'
res = requests.get(comment_url, headers=headers).json()["data"]["html"]
# Extract commenter names and comment text
comments = re.findall('ucardconf="type=1">(.*?)</div>', res)
for each in comments:
    # Strip the emoji image tags (and any other HTML tags) from the content
    each = re.sub('<.*?>', '', each)
    print(each)
I'm not fluent with re, so this is a rough approach. The main thing is getting the data, haha.
Compare with the page:
With the emoji removed, comments that consisted only of emoji show just the commenter's name, which is expected. Everything else matches the page exactly.
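The two regex steps can be tried without any network access. The sketch below runs them on a hypothetical stand-in for the "html" field of the JSON response. The markup, user names, and the clean_comments helper are invented for illustration, and the real fragment is likely more complex.

```python
import re

# Hypothetical stand-in for the "html" field of the JSON response
SAMPLE_HTML = (
    '<div class="WB_text"><a href="//weibo.com/u/1" ucardconf="type=1">user_a</a>'
    ':great post<img class="W_img_face" alt="[heart]"></div>'
    '<div class="WB_text"><a href="//weibo.com/u/2" ucardconf="type=1">user_b</a>'
    ':<img class="W_img_face" alt="[smile]"></div>'
)


def clean_comments(fragment):
    """Extract 'name:comment' strings and strip every remaining HTML tag."""
    raw = re.findall(r'ucardconf="type=1">(.*?)</div>', fragment)
    return [re.sub(r'<.*?>', '', item) for item in raw]


print(clean_comments(SAMPLE_HTML))
# → ['user_a:great post', 'user_b:']
```

The second entry shows the emoji-only case: once the image tag is stripped, only the commenter's name and the colon remain.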
Now that we have the data, let's store it locally. The full code:
# -*- coding: utf-8 -*-
"""
Created on 2020-11-18
@author: Li Yunchen
"""
#https://weibo.com/liyifeng2007?is_all=1
import requests
import re,os
url = 'https://s.weibo.com/?topnav=1&wvr=6'
target = 'https://weibo.com/liyifeng2007?is_all=1'
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
'cookie': 'SUB=_2AkMowDDgf8NxqwJRmPoSyWnqao53ywzEieKenME7JRMxHRl-yT9kqnEjtRB6A0AeDzsLF_aeZGlWOMf4mEl-MBZZXqc_; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWcpq860VQlJcIRRNP9pzqS; SINAGLOBAL=1033839351635.7524.1604108279474; login_sid_t=c071efc77911ceace152df2be5986e09; cross_origin_proto=SSL; WBStorage=8daec78e6a891122|undefined; _s_tentry=-; Apache=8275565331127.246.1604195643561; ULV=1604195643568:3:1:1:8275565331127.246.1604195643561:1604122447982; wb_view_log=1920*10801; UOR=,,editor.csdn.net'
}
html = requests.get(target, headers=headers).text
# Make sure the output directory exists before writing to it
os.makedirs("./images", exist_ok=True)
for each in re.findall('<a name=(.*?)date=', html):
    real_id = each.split(" ")[0]
    filename = each.split("\\")[-2].replace('"', "").replace(":", ".")
    # print(real_id, filename)
    f_name = "./images/" + filename
    # Fetch the first 10 pages of comments for this post
    for page in range(1, 11):
        comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page={page}'
        res = requests.get(comment_url, headers=headers).json()["data"]["html"]
        # Extract commenter names and comment text
        comments = re.findall('ucardconf="type=1">(.*?)</div>', res)
        # comments = re.findall('</i></a>(.*?) </div>', res)
        for comment in comments:
            # Strip the emoji image tags (and any other HTML tags) from the content
            comment = re.sub('<.*?>', '', comment)
            print(comment)
            # Append each comment on its own line to a file named after the post time
            with open(f_name + "_Li Yunchen.txt", "a", encoding="utf-8") as f:
                f.write(comment)
                f.write("\n")
This is only a test, so I crawled just ten or so pages:
Once the crawl finishes, you can compare the results with the page yourself:
Done!
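As a possible refinement, the page loop could stop as soon as a page returns no comments and pause between requests. The sketch below is only an illustration under stated assumptions: crawl_comments, fetch_page, and the fake pages are hypothetical names and data, and the network call is replaced by an offline stand-in so the example runs without access to Weibo.

```python
import re
import time


def crawl_comments(fetch_page, max_pages=10, delay=0.0):
    """Collect cleaned comments page by page, stopping at the first empty page.

    fetch_page is a hypothetical callable: page number -> HTML fragment (str).
    """
    collected = []
    for page in range(1, max_pages + 1):
        fragment = fetch_page(page)
        found = re.findall(r'ucardconf="type=1">(.*?)</div>', fragment)
        if not found:
            break  # no comments on this page, so assume we ran past the end
        collected.extend(re.sub(r'<.*?>', '', item) for item in found)
        time.sleep(delay)  # pause between requests to go easy on the server
    return collected


# Offline stand-in for the network call (invented data)
FAKE_PAGES = {
    1: '<a ucardconf="type=1">user_a</a>:first</div>',
    2: '<a ucardconf="type=1">user_b</a>:second<img alt="[ok]"></div>',
}


def fake_fetch(page):
    return FAKE_PAGES.get(page, "")


print(crawl_comments(fake_fetch))
# → ['user_a:first', 'user_b:second']
```

In a real run, fetch_page could wrap the requests.get(comment_url, headers=headers).json()["data"]["html"] call from the full code, with a delay of a second or so.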