Web Crawler 2: Scraping NetEase Cloud Music Comment User IDs and Homepage URLs
2022-06-26 21:28:00 【浪漫的数据分析】
Goal of this post:
In the previous post we collected the IDs and URLs of hot artists' songs. This post goes a step further and collects the commenters' user IDs and homepage URLs. The overall plan:
1. Scrape song IDs from the hot-artist lists.
2. Scrape comment user IDs from the song pages.
3. Send targeted push messages to those user IDs.
The previous post finished step 1; this one finishes step 2.
An aside: the previous post fetched song IDs with plain requests (no rendered page), which is fast, but after roughly 2,000 requests the server flags the client as a crawler and bans the IP. Connecting through a phone hotspot and toggling airplane mode to get a new IP allows another ~2,000 requests.
As before, results are stored in MySQL. This post also adds restart-on-error support: after each record is processed, its flag column (clbz) is set to Y, much like a production batch job.
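The flag-based restart pattern can be sketched in a few lines. The sketch below uses sqlite3 in place of MySQL purely so it runs standalone (an assumption for illustration); the table and column names mirror the songinf table used later in this post.

```python
import sqlite3

# In-memory stand-in for the MySQL songinf table: rows start flagged 'N',
# each processed row is flagged 'Y', so a restarted run only sees 'N' rows.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE songinf (song_url TEXT, clbz TEXT DEFAULT 'N')")
conn.executemany("INSERT INTO songinf (song_url) VALUES (?)",
                 [('url-a',), ('url-b',), ('url-c',)])

def pending(conn):
    # rows not yet processed
    return [r[0] for r in
            conn.execute("SELECT song_url FROM songinf WHERE clbz = 'N'")]

def mark_done(conn, url):
    # flag one row as processed
    conn.execute("UPDATE songinf SET clbz = 'Y' WHERE song_url = ?", (url,))
    conn.commit()

mark_done(conn, 'url-a')   # simulate one row being processed before a crash
print(pending(conn))       # → ['url-b', 'url-c']
```

A rerun simply calls pending() again and picks up only the unfinished rows.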
Step 1: create the MySQL table
This time we need a table named userinf to store each user's ID, user name, comment time, and homepage URL.
The DDL:
DROP TABLE IF EXISTS `userinf`;
CREATE TABLE `userinf` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`user_id` varchar(30) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`user_name` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci ,
`user_time` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci ,
`user_url` varchar(400) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`clbz` varchar(1) CHARACTER SET utf8 COLLATE utf8_general_ci ,
`bysz` float(3, 0) NULL DEFAULT 0.00,
PRIMARY KEY (`id`) USING BTREE,
INDEX `user_id`(`user_id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;
Once the table exists, we need a Python program to insert rows into it.
The program is named useridSpiderSQL.py:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'luoji'

import pymysql


class Mysql_pq(object):
    def __init__(self):
        self.conn = pymysql.Connect(host='127.0.0.1',
                                    port=3306,
                                    user='root',
                                    passwd='root',
                                    db='python',
                                    # use utf8mb4 so emoji and other
                                    # non-BMP characters can be stored
                                    charset='utf8mb4'
                                    )
        self.cursor = self.conn.cursor()

    def modify_sql(self, sql, data):
        self.cursor.execute(sql, data)
        self.conn.commit()

    def __del__(self):
        self.cursor.close()
        self.conn.close()


def insert_userinf(user_id, user_name, user_time, user_url, clbz):
    helper = Mysql_pq()
    print('Connected to database python, inserting user info')
    insert_sql = ('insert into userinf(user_id,user_name,user_time,user_url,clbz) '
                  'values (%s,%s,%s,%s,%s)')
    data = (user_id, user_name, user_time, user_url, clbz)
    helper.modify_sql(insert_sql, data)


if __name__ == '__main__':
    user_id = '519250015'
    user_name = '请记住我'
    user_url = 'https://music.163.com/#/song?id=1313052960&lv=-1&kv=-1&tv=-1'
    user_time = '2021年2月18日'
    clbz = 'N'
    insert_userinf(user_id, user_name, user_time, user_url, clbz)
    print('test over')
Restart support: updating the songinf table afterwards
To support restart-on-error, we set the flag to Y after each songinf row is fully processed. When a run fails, the next run automatically skips rows flagged Y and only processes rows flagged N, so it picks up where the last one left off.
To make that work, after crawling one song's commenters we go back and update songinf. Again we need a Python program for this.
The program is named updateSongURLSQL.py:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'luoji'

import pymysql


class Mysql_pq(object):
    def __init__(self):
        self.conn = pymysql.Connect(host='127.0.0.1',
                                    port=3306,
                                    user='root',
                                    passwd='root',
                                    db='python',
                                    # use utf8mb4 so emoji can be stored
                                    charset='utf8mb4'
                                    )
        self.cursor = self.conn.cursor()

    def __del__(self):
        self.cursor.close()
        self.conn.close()


def updater_songurl(url):
    helper = Mysql_pq()
    print('Connected to database python, updating song flag')
    # parameterized query instead of string formatting, to avoid
    # SQL injection and quoting problems with characters in the URL
    sql = "UPDATE songinf SET clbz = 'Y' WHERE song_url = %s"
    print('sql is :', sql)
    helper.cursor.execute(sql, (url,))
    helper.conn.commit()


if __name__ == '__main__':
    url = 'https://music.163.com/#/song?id=569213220&lv=-1&kv=-1&tv=-1'
    updater_songurl(url)
    print('url = ', url)
    print('update over')
Scraping the comment users:
To avoid being banned by the server, this time we drive a real browser with the selenium automation module, so the server cannot tell the crawler apart from an ordinary user. The downside is speed: currently about 1,000 user records per hour.
After running overnight, it has collected 100k+ user IDs. The crawler consumes the song URLs gathered in the previous post, so we need a Python program to read that table.
The program is named getSongURLSQL.py:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'luoji'

import pymysql


class Mysql_pq(object):
    def __init__(self):
        self.conn = pymysql.Connect(host='127.0.0.1',
                                    port=3306,
                                    user='root',
                                    passwd='root',
                                    db='python',
                                    # use utf8mb4 so emoji can be stored
                                    charset='utf8mb4'
                                    )
        self.cursor = self.conn.cursor()

    def __del__(self):
        self.cursor.close()
        self.conn.close()


def select_songurl():
    helper = Mysql_pq()
    print('Connected to database python, reading unprocessed song URLs')
    urllist = []
    sql = "SELECT * FROM songinf WHERE clbz = 'N'"
    helper.cursor.execute(sql)
    results = helper.cursor.fetchall()
    for row in results:
        id = row[0]
        song_url = row[3]
        print('id =', id)
        print('song_url =', song_url)
        urllist.append(song_url)
    return urllist


if __name__ == '__main__':
    urllist = select_songurl()
    print('urllist = ', urllist)
    print('test over')
As you can see, MySQL does much of the bookkeeping here.
The main crawler code:
import re
import time
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from getSongURLSQL import *
from useridSpiderSQL import *
from updateSongURLSQL import *


def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        pass
    try:
        import unicodedata
        unicodedata.numeric(s)
        return True
    except (TypeError, ValueError):
        pass
    return False


def geturl(urllist):
    # If the driver binary is not on PATH, pass its location explicitly.
    # Verified on 2021-02-19.
    driver = webdriver.Firefox()
    # driver = webdriver.Chrome()
    driver.maximize_window()
    driver.set_page_load_timeout(30)
    driver.set_window_size(1124, 850)
    for url in urllist:
        print('now the url is :', url)
        driver.get(url)
        time.sleep(3)
        # NetEase renders the page inside an iframe -- switch into it first!
        driver.switch_to.frame('g_iframe')
        href_xpath = "//div[contains(@class,'cntwrap')]//div[contains(@class,'cnt f-brk')]//a[contains(@class,'s-fc7')]"
        songid = driver.find_elements(By.XPATH, href_xpath)
        useridlist = []
        usernamelist = []
        for i in songid:
            userurl = i.get_attribute('href')
            userid = userurl[35:]  # numeric user id (fixed-offset slice of the href)
            print('userid = ', userid)
            username = i.text
            print('username = ', username)
            if is_number(userid):  # keep only purely numeric ids
                print('user id is numeric, keeping it')
                useridlist.append(userid)
                usernamelist.append(username)
            else:
                print('user id is not numeric, dropping it')
        # collect the comment timestamps
        commenttimelist = []
        time_xpath = "//div[contains(@class,'cntwrap')]//div[contains(@class,'rp')]//div[contains(@class,'time s-fc4')]"
        songtime = driver.find_elements(By.XPATH, time_xpath)
        for itime in songtime:
            commenttime = itime.text
            print('commenttime = ', commenttime)
            commenttimelist.append(commenttime)
        # pad with a placeholder date when fewer timestamps than users were found
        if len(commenttimelist) < len(useridlist):
            for i in np.arange(0, len(useridlist) - len(commenttimelist), 1):
                commenttimelist.append('2021年2月18日')
        print('len(useridlist) is = ', len(useridlist))
        for i in np.arange(0, len(useridlist), 1):
            userid_i = useridlist[i]
            username_i = usernamelist[i]
            commenttime_i = commenttimelist[i]
            print('userid_i=', userid_i)
            print('username_i=', username_i)
            print('commenttime_i=', commenttime_i)
            userurl_i = 'https://music.163.com/#/user/home?id=' + str.strip(userid_i)
            print('userurl_i=', userurl_i)
            clbz = 'N'
            try:
                # insert into the database
                insert_userinf(userid_i, username_i, commenttime_i, userurl_i, clbz)
            except Exception:
                print('database insert failed')
        time.sleep(5)
        updater_songurl(url)


def is_login(source):
    rs = re.search(r"CONFIG\['islogin'\]='(\d)'", source)
    if rs:
        return int(rs.group(1)) == 1
    else:
        return False


if __name__ == '__main__':
    urllist = select_songurl()
    geturl(urllist)
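The fixed-offset slice userurl[35:] above breaks if the link format shifts by even one character. A regex over the id query parameter is more robust; this is a sketch I am adding for illustration, not part of the original script:

```python
import re

def extract_user_id(href):
    # Pull the numeric user id out of a NetEase homepage link such as
    # 'https://music.163.com/user/home?id=519250015'. Returns None when
    # the link carries no numeric id, instead of a garbled slice.
    m = re.search(r'[?&]id=(\d+)', href or '')
    return m.group(1) if m else None

print(extract_user_id('https://music.163.com/user/home?id=519250015'))
# → 519250015
```

With this helper, the is_number() filter in geturl() becomes unnecessary, since non-numeric hrefs simply return None.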
The scraped results are as follows:
A few notes:
1. Pagination of the latest comments is not handled; supporting it would mean locating and clicking the next-page button, then scraping the user IDs again.
2. The comment text itself is not stored for now.
3. The scraped comment dates come back in very irregular formats and will need cleaning later.
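The irregular dates from note 3 can be normalized after loading. A minimal sketch, assuming dates shaped like '2021年2月18日' or '2月18日' (the two shapes seen in this run; NetEase also shows relative times such as '昨天', which this deliberately does not cover):

```python
import re
from datetime import date

def normalize_comment_date(raw, default_year=2021):
    # Convert '2021年2月18日' or '2月18日' into an ISO 'YYYY-MM-DD' string.
    # Dates with no year get default_year; unrecognized formats return None
    # so callers can flag them for manual cleaning.
    m = re.match(r'(?:(\d{4})年)?(\d{1,2})月(\d{1,2})日', raw.strip())
    if not m:
        return None
    year = int(m.group(1)) if m.group(1) else default_year
    return date(year, int(m.group(2)), int(m.group(3))).isoformat()

print(normalize_comment_date('2021年2月18日'))  # → 2021-02-18
```

Storing the normalized form alongside user_time would let later steps sort or filter users by comment recency.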
The next post will finish step 3, at which point we can push songs to users at the 100k scale.