Web crawler 2: crawl the user IDs and homepage addresses of NetEase Cloud Music comments
2022-06-26 21:33:00 【Romantic data analysis】
The goal of this article:
In the previous article we grabbed the song IDs and URLs of a popular singer's tracks. This article goes one step further and obtains the IDs and homepage addresses of the users who commented on those songs. The ultimate goal is:
1. Starting from a popular singer, grab the song IDs.
2. From the song IDs, grab the IDs of the commenting users.
3. Using the commenting users' IDs, send targeted push messages.
The previous article completed step 1; this article completes step 2.
A digression: the pageless requests approach used in part 1 to fetch song IDs is quite fast, but after about 2000 records the server recognizes it as a crawler and bans the IP. By connecting through a mobile hotspot and toggling airplane mode before reconnecting, another 2000 records can be fetched.
As in part 1, MySQL is used to store the crawl results. This time the scripts also support restarting after an error: every time a record is processed, its processing flag is set to Y, much like in a production system.
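Put differently, every script in this series follows the same checkpoint pattern, roughly like the minimal, self-contained sketch below (the record list and URLs are made up for illustration):
# Minimal sketch of the checkpoint pattern used in this series: every record
# carries a flag (clbz); only 'N' records are processed, and the flag is
# flipped to 'Y' afterwards, so re-running after a crash simply resumes work.
records = [
    {'song_url': 'https://example.com/song?id=1', 'clbz': 'N'},
    {'song_url': 'https://example.com/song?id=2', 'clbz': 'Y'},  # already done
]
for rec in records:
    if rec['clbz'] != 'N':
        continue                              # skip records that were already processed
    print('processing', rec['song_url'])      # the real scripts crawl the comments here
    rec['clbz'] = 'Y'                         # mark as done so a restart skips it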
Step 1: create the MySQL table
Here we need to create another table, called userinf, to store the user ID, comment time and homepage address.
The DDL statement is as follows:
DROP TABLE IF EXISTS `userinf`;
CREATE TABLE `userinf` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`user_id` varchar(30) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`user_name` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci ,
`user_time` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci ,
`user_url` varchar(400) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`clbz` varchar(1) CHARACTER SET utf8 COLLATE utf8_general_ci ,
`bysz` float(3, 0) NULL DEFAULT 0.00,
PRIMARY KEY (`id`) USING BTREE,
INDEX `user_id`(`user_id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;
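To confirm the table was created correctly, a quick check can be run from Python (a minimal sketch, assuming the same local MySQL instance and credentials that the scripts below use):
import pymysql

# connect with the same credentials used throughout this article
conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                       passwd='root', db='python', charset='utf8mb4')
try:
    with conn.cursor() as cursor:
        cursor.execute('DESCRIBE userinf')   # list the columns of the new table
        for column in cursor.fetchall():
            print(column)
finally:
    conn.close()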
Once the table exists, we need a Python program that inserts rows into it.
The program is named useridSpiderSQL.py; the code is:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'luoji'

import pymysql

# from, where, group by, select, having, order by, limit

class Mysql_pq(object):
    def __init__(self):
        self.conn = pymysql.Connect(host='127.0.0.1',
                                    port=3306,
                                    user='root',
                                    passwd='root',
                                    db='python',
                                    # the database was created with utf8mb4 so that
                                    # emoji and other 4-byte characters can be stored
                                    charset='utf8mb4'
                                    )
        self.cursor = self.conn.cursor()

    def modify_sql(self, sql, data):
        self.cursor.execute(sql, data)
        self.conn.commit()

    def __del__(self):
        self.cursor.close()
        self.conn.close()

def insert_userinf(user_id, user_name, user_time, user_url, clbz):
    helper = Mysql_pq()
    print('Connected to database python, ready to insert user information')
    # insert data
    insert_sql = 'insert into userinf(user_id,user_name,user_time,user_url,clbz) value (%s,%s,%s,%s,%s)'
    data = (user_id, user_name, user_time, user_url, clbz)
    helper.modify_sql(insert_sql, data)

if __name__ == '__main__':
    # helper = Mysql_pq()
    # print('test db')
    # # test
    # insert_sql = 'insert into weibo_paqu(werbo) value (%s)'
    # data = ('222222xxxxxx2222 ',)
    # helper.modify_sql(insert_sql,data)
    user_id = '519250015'
    user_name = 'Please remember me'
    user_url = 'https://music.163.com/#/song?id=1313052960&lv=-1&kv=-1&tv=-1'
    user_time = '2021年2月18日'
    clbz = 'N'
    insert_userinf(user_id, user_name, user_time, user_url, clbz)
    print('test over')
Supporting restart after errors: updating the songinf table
To support restarting after an error, once a song has been fully processed we update its processing flag in songinf to Y. When something goes wrong, the program skips records whose flag is already Y and only handles records whose flag is still N, so the crawl can pick up where it left off.
To make this possible, after crawling the commenting users of a song we must write the flag back to songinf, so we need another Python program for this update.
The program is named updateSongURLSQL.py; the code is:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'luoji'

import pymysql

# from, where, group by, select, having, order by, limit

class Mysql_pq(object):
    def __init__(self):
        self.conn = pymysql.Connect(host='127.0.0.1',
                                    port=3306,
                                    user='root',
                                    passwd='root',
                                    db='python',
                                    # the database was created with utf8mb4 so that
                                    # emoji and other 4-byte characters can be stored
                                    charset='utf8mb4'
                                    )
        self.cursor = self.conn.cursor()

    def __del__(self):
        self.cursor.close()
        self.conn.close()

def updater_songurl(url):
    helper = Mysql_pq()
    print('Connected to database python, ready to update song information')
    sql = "UPDATE songinf SET clbz = 'Y' WHERE song_url= '%s'" % (url)
    print('sql is :', sql)
    helper.cursor.execute(sql)
    helper.conn.commit()

if __name__ == '__main__':
    url = 'https://music.163.com/#/song?id=569213220&lv=-1&kv=-1&tv=-1'
    updater_songurl(url)
    print('urllist = ', url)
    print('update over')
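One caveat: updater_songurl splices the URL straight into the SQL string, which breaks if the URL ever contains a quote character. A parameterized variant (a sketch only, reusing the Mysql_pq class above, not part of the original script) sidesteps that:
def updater_songurl_param(url):
    # same effect as updater_songurl, but pymysql does the quoting/escaping
    helper = Mysql_pq()
    sql = "UPDATE songinf SET clbz = 'Y' WHERE song_url = %s"
    helper.cursor.execute(sql, (url,))
    helper.conn.commit()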
Crawling the commenting users:
To avoid being banned by the server, this time we use the Selenium automation module to drive a real browser, so the server cannot tell whether the visitor is a crawler or an ordinary user. The drawback is that it is relatively slow: the current crawl speed is roughly 1000 user records per hour.
Running it all night yielded more than 100,000 user IDs. The crawler needs the song URLs collected earlier, so we need one more Python program to read them from the songinf table.
The program is named getSongURLSQL.py; the code is:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'luoji'

import pymysql

# from, where, group by, select, having, order by, limit

class Mysql_pq(object):
    def __init__(self):
        self.conn = pymysql.Connect(host='127.0.0.1',
                                    port=3306,
                                    user='root',
                                    passwd='root',
                                    db='python',
                                    # the database was created with utf8mb4 so that
                                    # emoji and other 4-byte characters can be stored
                                    charset='utf8mb4'
                                    )
        self.cursor = self.conn.cursor()

    # def modify_sql(self,sql,data):
    #     self.cursor.execute(sql,data)
    #     self.conn.commit()

    def __del__(self):
        self.cursor.close()
        self.conn.close()

def select_songurl():
    helper = Mysql_pq()
    print('Connected to database python, ready to read song information')
    urllist = []
    sql = "SELECT * FROM songinf WHERE clbz = 'N'"
    helper.cursor.execute(sql)
    results = helper.cursor.fetchall()
    for row in results:
        id = row[0]
        song_id = row[1]
        song_name = row[2]
        song_url = row[3]
        clbz = row[4]
        # print the results
        print('id =', id)
        print('song_url =', song_url)
        urllist.append(song_url)
    return urllist

if __name__ == '__main__':
    # helper = Mysql_pq()
    # print('test db')
    # # test
    # insert_sql = 'insert into weibo_paqu(werbo) value (%s)'
    # data = ('222222xxxxxx2222 ',)
    # helper.modify_sql(insert_sql,data)
    # song_id = '519250015'
    # song_name = 'Please remember me'
    # song_url = 'https://music.163.com/#/song?id=1313052960&lv=-1&kv=-1&tv=-1'
    # clbz = 'N'
    urllist = select_songurl()
    print('urllist = ', urllist)
    print('test over')
As you can see, the MySQL database plays a central role here.
The main crawler code is:
import re
import time
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ChromeOptions
from getSongURLSQL import *
from useridSpiderSQL import *
from updateSongURLSQL import *

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        pass
    try:
        import unicodedata
        unicodedata.numeric(s)
        return True
    except (TypeError, ValueError):
        pass
    return False

def geturl(urllist):
    # if geckodriver is not on the PATH, its path must be specified here
    # verified on 2021-02-19
    driver = webdriver.Firefox()
    # driver = webdriver.Chrome()
    driver.maximize_window()
    driver.set_page_load_timeout(30)
    driver.set_window_size(1124, 850)
    # locator = (By.)
    for url in urllist:
        print('now the url is :', url)
        driver.get(url)
        time.sleep(3)
        print('Start login')
        # NetEase Cloud Music renders the page content inside an iframe -- switch to it first
        driver.switch_to.frame('g_iframe')
        href_xpath = "//div[contains(@class,'cntwrap')]//div[contains(@class,'cnt f-brk')]//a[contains(@class,'s-fc7')]"
        # anchor elements linking to the commenting users' homepages
        songid = driver.find_elements_by_xpath(href_xpath)
        useridlist = []
        usernamelist = []
        for i in songid:
            userurl = i.get_attribute('href')
            userid = userurl[35:]  # the numeric part of the user id
            print('userid = ', userid)
            username = i.text
            print('username = ', username)
            try:
                print('userid is ', userid)
                if is_number(userid):  # purely numeric, so it is a real user id
                    print('user id is numeric, keep it')
                    useridlist.append(userid)
                    usernamelist.append(username)
                else:
                    continue
            except (TypeError, ValueError):
                print('user id is not numeric, discard it')
                continue
        # get the comment time of each user
        commenttimelist = []
        time_xpath = "//div[contains(@class,'cntwrap')]//div[contains(@class,'rp')]//div[contains(@class,'time s-fc4')]"
        songtime = driver.find_elements_by_xpath(time_xpath)
        for itime in songtime:
            # print(i.get_attribute('href'))
            commenttime = itime.text
            print('commenttime = ', commenttime)
            commenttimelist.append(commenttime)
        if len(commenttimelist) < len(useridlist):
            for i in np.arange(0, len(useridlist) - len(commenttimelist), 1):
                commenttimelist.append('2021年2月18日')
        print('len(useridlist) is = ', len(useridlist))
        for i in np.arange(0, len(useridlist), 1):
            userid_i = useridlist[i]
            username_i = usernamelist[i]
            commenttime_i = commenttimelist[i]
            # insert into the database
            print('userid_i=', userid_i)
            print('username_i=', username_i)
            print('commenttime_i=', commenttime_i)
            userurl_i = 'https://music.163.com/#/user/home?id=' + str.strip(userid_i)
            print('userurl_i=', userurl_i)
            clbz = 'N'
            try:
                insert_userinf(userid_i, username_i, commenttime_i, userurl_i, clbz)
            except:
                print('Error inserting into database')
                pass
        time.sleep(5)
        updater_songurl(url)

def is_login(source):
    rs = re.search(r"CONFIG\['islogin'\]='(\d)'", source)
    if rs:
        return int(rs.group(1)) == 1
    else:
        return False

if __name__ == '__main__':
    # url = 'https://music.163.com/#/discover/toplist?id=2884035'
    urllist = select_songurl()
    # urllist = ['https://music.163.com/#/song?id=569200214&lv=-1&kv=-1&tv=-1','https://music.163.com/#/song?id=569200213&lv=-1&kv=-1&tv=-1']
    geturl(urllist)
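Since the crawl runs unattended for hours, the browser window itself is only noise. With the Selenium version used here, Firefox can also be started headless; this is an optional tweak, not part of the original script:
from selenium import webdriver

# start Firefox without a visible window; the '-headless' flag is passed to geckodriver
options = webdriver.FirefoxOptions()
options.add_argument('-headless')
driver = webdriver.Firefox(options=options)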
The results are as follows:
A few notes:
1. The latest-comments list is not paged through here; to do that you would need to locate the page-turning button, click it, and then crawl the user IDs again.
2. The text of the comments themselves is not stored for now.
3. The format of the crawled comment dates is very irregular and needs follow-up processing (see the sketch below).
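As a starting point for note 3, a best-effort normalizer might look like the sketch below; the date patterns it recognizes ('2021年2月18日', '2月18日', '昨天 21:30', bare '21:30') are assumptions about what the page typically shows, not an exhaustive list:
import re
from datetime import date, timedelta

def normalize_comment_time(raw, year=2021):
    # best-effort normalization of a crawled comment timestamp; returns a
    # datetime.date, or None when the format is not recognized
    raw = raw.strip()
    m = re.match(r'(\d{4})年(\d{1,2})月(\d{1,2})日', raw)
    if m:                                     # full date, e.g. 2021年2月18日
        return date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    m = re.match(r'(\d{1,2})月(\d{1,2})日', raw)
    if m:                                     # no year given: assume the crawl year
        return date(year, int(m.group(1)), int(m.group(2)))
    if raw.startswith('昨天'):                 # "yesterday hh:mm"
        return date.today() - timedelta(days=1)
    if re.match(r'\d{1,2}:\d{2}$', raw):      # bare "hh:mm" means today
        return date.today()
    return None                               # unknown format: leave for manual handling

print(normalize_comment_time('2021年2月18日'))   # -> 2021-02-18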
The next article will complete step 3: pushing songs to this pool of 100,000+ users.