
Scraping Li Yifeng's Weibo comments with Python

2022-06-24 07:40:00 【Python researcher】

Today's target: Weibo. We'll take Li Yifeng's Weibo page as the example:

https://weibo.com/liyifeng2007?is_all=1

Then open the comment page and look under XHR in the browser's developer tools to find the real request address:

https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4353796790279702&from=singleWeiBo

Clearly the comments are loaded dynamically. Fetching them works the same way as in my earlier posts, so I won't repeat that here. What matters most in this URL is the string of digits (the id parameter): once we can find that number in the first page, we are halfway there. This calls for regular expressions via re. I'm not great with them, but it should still work:

import re
import requests

target = 'https://weibo.com/liyifeng2007?is_all=1'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
    'cookie': 'SUB=_2AkMowDDgf8NxqwJRmPoSyWnqao53ywzEieKenME7JRMxHRl-yT9kqnEjtRB6A0AeDzsLF_aeZGlWOMf4mEl-MBZZXqc_; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWcpq860VQlJcIRRNP9pzqS; SINAGLOBAL=1033839351635.7524.1604108279474; login_sid_t=c071efc77911ceace152df2be5986e09; cross_origin_proto=SSL; WBStorage=8daec78e6a891122|undefined; _s_tentry=-; Apache=8275565331127.246.1604195643561; ULV=1604195643568:3:1:1:8275565331127.246.1604195643561:1604122447982; wb_view_log=1920*10801; UOR=,,editor.csdn.net'
}
 
 
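html = requests.get(target, headers=headers).text

# Each match starts with the post ID and also contains the post's publication time
for each in re.findall('<a name=(.*?)date=', html):
    real_id = each.split(" ")[0]
    # The timestamp becomes the file name; ":" is replaced because Windows file names cannot contain it
    filename = each.split("\\")[-2].replace('"', "").replace(":", ".")
    print(real_id, filename)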

Each line of output has two parts. The first is the ID we need, and the second is the time the post was published, which we use as the file name for storing the comment data.
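To make the extraction step easier to follow, here is a minimal sketch that pulls the same two values out with a single regular expression. The `SAMPLE` string is a hand-written, hypothetical fragment that only imitates the escaped markup of the profile page, and `POST_RE` and `extract_posts` are names made up for this illustration. The real markup may differ.

```python
import re

# Hypothetical fragment, hand-written to imitate the escaped markup (\" instead of ")
# embedded in the profile page source; the real markup may differ.
SAMPLE = (
    r'<a name=4353796790279702 target=\"_blank\" href=\"\/p\/abc\" '
    r'title=\"2019-03-25 12:00\" date=\"1553486400000\" class=\"S_txt2\">'
)

# Capture the numeric post ID and the escaped title attribute (the post time).
POST_RE = re.compile(r'<a name=(\d+)[^>]*?title=\\"([^"\\]+)\\"')


def extract_posts(html):
    """Return (post_id, filename) pairs found in the page source."""
    posts = []
    for post_id, posted_at in POST_RE.findall(html):
        # ":" is not allowed in Windows file names, so swap it for "."
        posts.append((post_id, posted_at.replace(":", ".")))
    return posts


print(extract_posts(SAMPLE))
```

Capturing both values in named positions avoids depending on how many backslashes happen to sit between the attributes, which is what the split-based version relies on.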

Then we pass the ID into the second URL:

comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&from=singleWeiBo'

Of course, this URL fetches the hot (most popular) comments. To fetch the latest replies instead, use this:

comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page=1'
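The two URL variants differ only in their last query parameter, so they can be built by one small helper. This is a sketch: `build_comment_url` is a hypothetical name, and the parameter set is simply the one visible in the two URLs above.

```python
from urllib.parse import urlencode

BASE = "https://weibo.com/aj/v6/comment/big"


def build_comment_url(post_id, page=None):
    """Build the hot-comments URL, or the latest-comments URL when page is given."""
    params = {"ajwvr": 6, "id": post_id}
    if page is None:
        params["from"] = "singleWeiBo"  # hot comments, as in the first URL
    else:
        params["page"] = page           # latest comments, paginated
    return f"{BASE}?{urlencode(params)}"


print(build_comment_url("4353796790279702"))
print(build_comment_url("4353796790279702", page=1))
```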

The response is easy to work with: it is JSON, so you can paste it into an online JSON viewer to inspect it and locate the data we need. Here is the code:

comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page=1'
# The comment markup is an HTML fragment stored under data -> html
res = requests.get(comment_url, headers=headers).json()["data"]["html"]

# Extract commenter names and comment text
comments = re.findall('ucardconf="type=1">(.*?)</div>', res)

for each in comments:
    # Strip the HTML tags (including emoticon images) from the content
    each = re.sub('<.*?>', '', each)
    print(each)

I'm not very skilled with re, so this will have to do. The main thing is getting the data in hand, which is what matters most, ha-ha…

Comparing the output with the page: with the emoticons stripped, commenters who posted nothing but emoticons show up as just their names, which is expected. Everything else matches the page exactly.
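The emoticon behaviour can be reproduced offline on a small fragment. The sketch below assumes a hand-written, hypothetical payload shaped like the `data` / `html` response described earlier. `parse_comments` is an illustrative name, and the `re.S` flag is an addition so that a comment spanning several lines would still match.

```python
import re

COMMENT_RE = re.compile(r'ucardconf="type=1">(.*?)</div>', re.S)
TAG_RE = re.compile(r'<.*?>', re.S)

# Hypothetical payload imitating the shape of the JSON response; real markup may differ.
SAMPLE_PAYLOAD = {
    "data": {
        "html": (
            '<div class="WB_text"><a usercard="id=1" ucardconf="type=1">Alice</a>'
            ':Great post<img alt="[heart]" src="x.png"></div>'
            '<div class="WB_text"><a usercard="id=2" ucardconf="type=1">Bob</a>'
            ':<img alt="[smile]" src="y.png"></div>'
        )
    }
}


def parse_comments(payload):
    """Return cleaned 'name:comment' strings from a response dictionary."""
    fragment = payload.get("data", {}).get("html", "")
    # Removing every tag also removes emoticon images, leaving only plain text.
    return [TAG_RE.sub("", raw).strip() for raw in COMMENT_RE.findall(fragment)]


for line in parse_comments(SAMPLE_PAYLOAD):
    print(line)
```

The second entry comes out as just the name and a colon, which is the same effect seen in the real output for emoticon-only comments.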

Now that we have the data, let's store it locally. Here is the full code:

# -*- coding: utf-8 -*-
"""
Created on 2020-11-18
@author:  Li Yunchen 
"""
 
 
#https://weibo.com/liyifeng2007?is_all=1
 
 
import os
import re

import requests
 
 
url = 'https://s.weibo.com/?topnav=1&wvr=6'
target = 'https://weibo.com/liyifeng2007?is_all=1'
 
 
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
    'cookie': 'SUB=_2AkMowDDgf8NxqwJRmPoSyWnqao53ywzEieKenME7JRMxHRl-yT9kqnEjtRB6A0AeDzsLF_aeZGlWOMf4mEl-MBZZXqc_; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWcpq860VQlJcIRRNP9pzqS; SINAGLOBAL=1033839351635.7524.1604108279474; login_sid_t=c071efc77911ceace152df2be5986e09; cross_origin_proto=SSL; WBStorage=8daec78e6a891122|undefined; _s_tentry=-; Apache=8275565331127.246.1604195643561; ULV=1604195643568:3:1:1:8275565331127.246.1604195643561:1604122447982; wb_view_log=1920*10801; UOR=,,editor.csdn.net'
}
 
 
html = requests.get(target,headers=headers).text
 
 
# Make sure the output directory exists; otherwise open() below would fail
os.makedirs("./images", exist_ok=True)

for each in re.findall('<a name=(.*?)date=', html):
    real_id = each.split(" ")[0]
    filename = each.split("\\")[-2].replace('"', "").replace(":", ".")
    f_name = "./images/" + filename

    # Fetch the first ten pages of comments for this post
    for page in range(1, 11):
        comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page={page}'
        res = requests.get(comment_url, headers=headers).json()["data"]["html"]

        # Extract commenter names and comment text
        comments = re.findall('ucardconf="type=1">(.*?)</div>', res)
        # comments = re.findall('</i></a>(.*?) </div>', res)
        for comment in comments:
            # Strip the HTML tags (including emoticon images) from the content
            comment = re.sub('<.*?>', '', comment)
            print(comment)
            with open(f_name + "_Li Yunchen.txt", "a", encoding="utf-8") as f:
                f.write(comment)
                f.write("\n")
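The loop above always requests a fixed number of pages. As a possible variation, the sketch below stops as soon as a page yields no comments and pauses between requests. `save_comments`, its parameters, and the `fetch_page` callable are all hypothetical names introduced here. The fetcher is passed in so that the demo at the bottom runs without touching the network.

```python
import os
import re
import tempfile
import time

COMMENT_RE = re.compile(r'ucardconf="type=1">(.*?)</div>', re.S)
TAG_RE = re.compile(r'<.*?>', re.S)


def save_comments(post_id, fetch_page, out_path, max_pages=10, delay=0.0):
    """Append cleaned comments to out_path page by page; stop on an empty page.

    fetch_page(post_id, page) is a hypothetical callable that returns the
    HTML fragment for one page of comments. Returns the number of lines written.
    """
    total = 0
    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
    with open(out_path, "a", encoding="utf-8") as f:
        for page in range(1, max_pages + 1):
            fragment = fetch_page(post_id, page)
            comments = [TAG_RE.sub("", c).strip() for c in COMMENT_RE.findall(fragment)]
            if not comments:
                break  # nothing on this page, so assume there are no further pages
            for comment in comments:
                f.write(comment + "\n")
            total += len(comments)
            time.sleep(delay)  # pause between requests to be gentle on the server
    return total


# Offline demo with a fake fetcher: two pages of data, then an empty page.
FAKE_PAGES = {
    1: '<a ucardconf="type=1">A</a>:one</div>',
    2: '<a ucardconf="type=1">B</a>:two</div>',
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "images", "demo.txt")
    written = save_comments("1", lambda pid, page: FAKE_PAGES.get(page, ""), path)
    print(written)
```

For a real run, `fetch_page` could wrap the `requests.get(...).json()["data"]["html"]` call from the full code, with a nonzero `delay`.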

This is only a test, so I crawled just ten or so pages:

Once the crawl finishes, you can compare the results with the page yourself:

All done!
