当前位置:网站首页>Crawler (III) crawling house prices in Tianjin
Crawler (III) crawling house prices in Tianjin
2022-07-04 06:51:00 【wei2023】
One 、 Basic concepts
Web crawler (Crawler): Also known as web spider , Or cyber robots (Robots). It's a rule of thumb , A program or script that automatically grabs information from the world wide web . In other words , It can automatically obtain the web content according to the link address of the web page . If the Internet is compared to a big spider web , There are many web pages in it , Web spiders can get the content of all web pages .
Crawler is a simulated human behavior of requesting websites , A program or automated script that downloads Web resources in batches .Reptiles : Use any technical means , A way to get information about a website in bulk . The key is volume .
The crawler : Use any technical means , A way to prevent others from obtaining their own website information in bulk . The key is also volume .Accidental injury : In the process of anti crawler , Wrong identification of ordinary users as reptiles . Anti crawler strategy with high injury rate , No matter how good the effect is .
Intercept : Successfully blocked crawler access . Here's the concept of interception rate . Generally speaking , Anti crawler strategy with higher interception rate , The more likely it is to be injured by mistake . So there's a trade-off .
Two 、 Basic steps
(1) Request web page :
adopt HTTP The library makes a request to the target site , Send a Request, The request can contain additional headers etc.
Information , Wait for the server to respond !
(2) Get the corresponding content :
If the server can respond normally , You'll get one Response,Response The content of is the content of the page , There may be HTML,Json character string , binary data ( Like picture video ) Other types .
(3) Parsing content :
What you get may be HTML, You can use regular expressions 、 Web page parsing library for parsing . May be Json, Sure
Go straight to Json Object parsing , It could be binary data , It can be preserved or further processed .
(4) Store parsed data :
There are various forms of preservation , Can be saved as text , It can also be saved to a database , Or save a file in a specific format
#========== guide package =============
import requests
#=====step_1 : finger set url=========
url = 'https://gy.fang.lianjia.com/ /'
#=====step_2 : Hair rise please seek :======
# send use get Fang Law Hair rise get please seek , The Fang Law Meeting return return One individual ring Should be Yes like . ginseng Count url surface in please seek Yes Should be Of url
response = requests . get ( url = url )
#=====step_3 : a take ring Should be Count According to the :===
# through too transfer use ring Should be Yes like Of text Belong to sex , return return ring Should be Yes like in save Store Of word operator strand shape type Of ring Should be Count According to the ( page Noodles Source Code number According to the )
page_text = response . text
#====step_4 : a For a long time turn save Store =======
with open (' House prices in Tianjin . html ','w', encoding ='utf -8') as fp:
fp.write ( page_text )
print (' climb take Count According to the End BI !!!')
3、 ... and 、 practice : Crawl data
# ================== Import related libraries ==================================
from bs4 import BeautifulSoup
import numpy as np
import requests
from requests.exceptions import RequestException
import pandas as pd
# ============= Read the web page =========================================
def craw(url, page):
try:
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36"}
html1 = requests.request("GET", url, headers=headers, timeout=10)
html1.encoding = 'utf-8' # Add code , important ! Convert to string encoding ,read() Get is byte Format
html = html1.text
return html
except RequestException: # Other questions
print(' The first {0} Failed to read page '.format(page))
return None
# ========== Parse the web page and save the data to the table ======================
def pase_page(url, page):
html = craw(url, page)
html = str(html)
if html is not None:
soup = BeautifulSoup(html, 'lxml')
"-- First determine the house information , namely li Tag list --"
houses = soup.select('.resblock-list-wrapper li') # List of houses
"-- Then determine the information of each house --"
for j in range(len(houses)): # Traverse every house
house = houses[j]
" name "
recommend_project = house.select('.resblock-name a.name')
recommend_project = [i.get_text() for i in recommend_project] # name Yinghua Tianyuan , Binxin Jiangnan imperial residence ...
recommend_project = ' '.join(recommend_project)
# print(recommend_project)
" type "
house_type = house.select('.resblock-name span.resblock-type')
house_type = [i.get_text() for i in house_type] # Office buildings , commercial real estate under residential buildings ...
house_type = ' '.join(house_type)
# print(house_type)
" Room shape "
house_com = house.select('.resblock-room span')
house_com = [i.get_text() for i in house_com] # 2 room 3 room
house_com = ' '.join(house_com)
# print(house_com)
" area "
house_area = house.select('.resblock-area span')
house_area = [i.get_text() for i in house_area] # 2 room 3 room
house_area = ' '.join(house_area)
# print(house_area)
" Sales status "
sale_status = house.select('.resblock-name span.sale-status')
sale_status = [i.get_text() for i in sale_status] # On sale , On sale , sell out , On sale ...
sale_status = ' '.join(sale_status)
# print(sale_status)
" Large address "
big_address = house.select('.resblock-location span')
big_address = [i.get_text() for i in big_address] #
big_address = ''.join(big_address)
# print(big_address)
" Specific address "
small_address = house.select('.resblock-location a')
small_address = [i.get_text() for i in small_address] #
small_address = ' '.join(small_address)
# print(small_address)
" advantage ."
advantage = house.select('.resblock-tag span')
advantage = [i.get_text() for i in advantage] #
advantage = ' '.join(advantage)
# print(advantage)
" Average price : How many? 1 flat "
average_price = house.select('.resblock-price .main-price .number')
average_price = [i.get_text() for i in average_price] # 16000,25000, The price is to be determined ..
average_price = ' '.join(average_price)
# print(average_price)
" The total price , Ten thousand units "
total_price = house.select('.resblock-price .second')
total_price = [i.get_text() for i in total_price] # The total price 400 ten thousand / set , The total price 100 ten thousand / set '...
total_price = ' '.join(total_price)
# print(total_price)
# ===================== Write table =================================================
information = [recommend_project, house_type, house_com, house_area, sale_status, big_address,
small_address, advantage,
average_price, total_price]
information = np.array(information)
information = information.reshape(-1, 10)
information = pd.DataFrame(information,
columns=[' name ', ' type ', ' size ', ' Main area ', ' Sales status ', ' Large address ', ' Specific address ', ' advantage ', ' Average price ',
' The total price '])
information.to_csv(' House prices in Tianjin .csv', mode='a+', index=False, header=False) # mode='a+' Append write
print(' The first {0} Page storage data succeeded '.format(page))
else:
print(' Parse failure ')
# ================== Two threads =====================================
import threading
for i in range(1, 100, 2): # Traverse the web 1-101
url1 = "https://tj.fang.lianjia.com/loupan/pg" + str(i) + "/"
url2 = "https://tj.fang.lianjia.com/loupan/pg" + str(i + 1) + "/"
pase_page(url1, i)
t1 = threading.Thread(target=pase_page, args=(url1, i)) # Threads 1
t2 = threading.Thread(target=pase_page, args=(url2, i + 1)) # Threads 2
t1.start()
t2.start()
Crawled data : House prices in Tianjin .csv
Four 、 Data analysis
边栏推荐
- Mysql 45讲学习笔记(十四)count(*)
- Mobile adaptation: vw/vh
- Tar source code analysis 9
- [number theory] fast power (Euler power)
- 【MySQL】数据库视图的介绍、作用、创建、查看、删除和修改(附练习题)
- 2022, peut - être la meilleure année économique de la prochaine décennie, avez - vous obtenu votre diplôme en 2022? Comment est - ce prévu après la remise des diplômes?
- CORS is not intended to protect API endpoints - nikofischer
- Mysql 45讲学习笔记(十一)字符串字段怎么加索引
- Uniapp custom environment variables
- tars源码分析之3
猜你喜欢

2022 wechat enterprise mailbox login entry introduction, how to open and register enterprise wechat enterprise mailbox?

Centos8 install mysql 7 unable to start up

【网络数据传输】基于FPGA的百兆网/兆网千UDP数据包收发系统开发,PC到FPGA

Campus network problems
![[GF (q) + LDPC] regular LDPC coding and decoding design and MATLAB simulation based on the GF (q) field of binary graph](/img/5e/7ce21dd544aacf23b4ceef1ec547fd.png)
[GF (q) + LDPC] regular LDPC coding and decoding design and MATLAB simulation based on the GF (q) field of binary graph

Common usage of time library

notepad++如何统计单词数量
![[Android reverse] function interception (CPU cache mechanism | CPU cache mechanism causes function interception failure)](/img/7e/02bb01480257cd56537914a7247733.jpg)
[Android reverse] function interception (CPU cache mechanism | CPU cache mechanism causes function interception failure)

Deep understanding of redis -- a new type of bitmap / hyperloglgo / Geo

selenium驱动IE常见问题解决Message: Currently focused window has been closed.
随机推荐
MySQL relearn 2- Alibaba cloud server CentOS installation mysql8.0
Wechat applet scroll view component scrollable view area
ABCD four sequential execution methods, extended application
ABCD four sequential execution methods, extended application
Selenium driver ie common problem solving message: currently focused window has been closed
GoogleChromePortable 谷歌chrome浏览器便携版官网下载方式
js 常用时间处理函数
高薪程序员&面试题精讲系列119之Redis如何实现分布式锁?
Appium foundation - appium installation (II)
How notepad++ counts words
The solution of win11 taskbar right click without Task Manager - add win11 taskbar right click function
About how idea sets up shortcut key sets
【问题记录】03 连接MySQL数据库提示:1040 Too many connections
1、 Relevant theories and tools of network security penetration testing
Campus network problems
2022 is probably the best year for the economy in the next 10 years. Did you graduate in 2022? What is the plan after graduation?
Cochez une colonne d'affichage dans une colonne de tableau connue
云Redis 有什么用? 云redis怎么用?
uniapp小程序分包
Responsive - media query