Crawler (III): Crawling house prices in Tianjin
2022-07-04 06:51:00 【wei2023】
One. Basic concepts
Web crawler (also called a web spider or web robot): a program or script that automatically fetches information from the World Wide Web according to a set of rules. In other words, it can retrieve page content automatically by following the link addresses of web pages. If the Internet is compared to a big spider web containing many web pages, a web spider can crawl across it and fetch the content of those pages.
A crawler simulates human browsing behavior to request websites and download web resources in batches. Crawling: any technical means of obtaining a website's information in bulk. The key word is bulk.
Anti-crawling: any technical means of preventing others from obtaining your own website's information in bulk. The key word, again, is bulk. Accidental injury: misidentifying an ordinary user as a crawler during anti-crawling. An anti-crawler strategy with a high false-positive rate is unusable, no matter how well it blocks crawlers.
Interception: successfully blocking crawler access; the corresponding metric is the interception rate. Generally speaking, the higher a strategy's interception rate, the more likely it is to injure ordinary users by mistake, so there is a trade-off.
Two. Basic steps
(1) Request the page: use an HTTP library to send a Request to the target site. The request can carry extra headers and other information; then wait for the server to respond.
(2) Get the response content: if the server responds normally, you get a Response whose body is the page content. It may be HTML, a JSON string, or binary data (such as images or video).
(3) Parse the content: HTML can be parsed with regular expressions or a page-parsing library; JSON can be parsed directly into a JSON object; binary data can be saved as-is or processed further.
(4) Store the parsed data: the result can be saved in many forms, e.g. as plain text, in a database, or in a file of a specific format.
The minimal script below covers steps (1), (2) and (4): it sends a GET request, reads the response text, and saves the page source to disk.
#========== imports =============
import requests

#===== step 1: specify the URL =========
url = 'https://tj.fang.lianjia.com/'

#===== step 2: send the request ======
# requests.get() sends a GET request and returns a Response object;
# the url parameter is the URL the request is sent to
response = requests.get(url=url)

#===== step 3: extract the response data ===
# the Response object's text attribute holds the response body as a
# string (the page source)
page_text = response.text

#==== step 4: persist the data =======
with open('House prices in Tianjin.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print('Crawling finished!!!')
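Step (3), parsing, is not exercised by this script; the practice section below does it with BeautifulSoup. As a minimal standalone sketch (not part of the original post), the saved page source could be parsed like this, with the title lookup serving only as a smoke test:

from bs4 import BeautifulSoup

# a minimal sketch of step 3: parse the page source saved above
with open('House prices in Tianjin.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp.read(), 'lxml')
# print the page title just to confirm the HTML parsed
print(soup.title.get_text() if soup.title else 'no <title> found')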
Three. Practice: Crawling the data
# ================== Import related libraries ==================================
from bs4 import BeautifulSoup
import numpy as np
import requests
from requests.exceptions import RequestException
import pandas as pd
# ============= Fetch the page =========================================
def craw(url, page):
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36"}
        html1 = requests.request("GET", url, headers=headers, timeout=10)
        html1.encoding = 'utf-8'  # set the encoding explicitly -- important: the raw response body is bytes
        html = html1.text
        return html
    except RequestException:  # any request-level problem
        print('Failed to read page {0}'.format(page))
        return None
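A quick single-page call is enough to check that the fetch works before building anything on top of it (a sketch, assuming the Lianjia Tianjin listing URL used later in the post):

# sanity check: fetch the first listing page
html = craw('https://tj.fang.lianjia.com/loupan/pg1/', 1)
if html is not None:
    print('fetched {0} characters'.format(len(html)))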
# ========== Parse the page and append the data to a table ======================
def pase_page(url, page):
    html = craw(url, page)
    if html is not None:
        soup = BeautifulSoup(html, 'lxml')
        # -- First locate the house entries, i.e. the list of li tags --
        houses = soup.select('.resblock-list-wrapper li')  # list of houses on this page
        # -- Then extract the details of each house --
        for j in range(len(houses)):  # iterate over every house
            house = houses[j]
            # name
            recommend_project = house.select('.resblock-name a.name')
            recommend_project = [i.get_text() for i in recommend_project]  # e.g. Yinghua Tianyuan, Binxin Jiangnan Imperial Residence ...
            recommend_project = ' '.join(recommend_project)
            # print(recommend_project)
            # type
            house_type = house.select('.resblock-name span.resblock-type')
            house_type = [i.get_text() for i in house_type]  # e.g. office building, commercial property under a residential building ...
            house_type = ' '.join(house_type)
            # print(house_type)
            # room layout
            house_com = house.select('.resblock-room span')
            house_com = [i.get_text() for i in house_com]  # e.g. 2 rooms, 3 rooms
            house_com = ' '.join(house_com)
            # print(house_com)
            # area
            house_area = house.select('.resblock-area span')
            house_area = [i.get_text() for i in house_area]  # building area range
            house_area = ' '.join(house_area)
            # print(house_area)
            # sales status
            sale_status = house.select('.resblock-name span.sale-status')
            sale_status = [i.get_text() for i in sale_status]  # e.g. on sale, sold out ...
            sale_status = ' '.join(sale_status)
            # print(sale_status)
            # district (broad address)
            big_address = house.select('.resblock-location span')
            big_address = [i.get_text() for i in big_address]
            big_address = ''.join(big_address)
            # print(big_address)
            # specific address
            small_address = house.select('.resblock-location a')
            small_address = [i.get_text() for i in small_address]
            small_address = ' '.join(small_address)
            # print(small_address)
            # advantages / selling points
            advantage = house.select('.resblock-tag span')
            advantage = [i.get_text() for i in advantage]
            advantage = ' '.join(advantage)
            # print(advantage)
            # average price: yuan per square meter
            average_price = house.select('.resblock-price .main-price .number')
            average_price = [i.get_text() for i in average_price]  # e.g. 16000, 25000, price to be determined ...
            average_price = ' '.join(average_price)
            # print(average_price)
            # total price, in units of 10,000 yuan
            total_price = house.select('.resblock-price .second')
            total_price = [i.get_text() for i in total_price]  # e.g. total price 4 million / unit, total price 1 million / unit ...
            total_price = ' '.join(total_price)
            # print(total_price)

            # ===================== Write to the table =================================================
            information = [recommend_project, house_type, house_com, house_area, sale_status, big_address,
                           small_address, advantage,
                           average_price, total_price]
            information = np.array(information)
            information = information.reshape(-1, 10)
            information = pd.DataFrame(information,
                                       columns=['name', 'type', 'room layout', 'area', 'sale status', 'district',
                                                'address', 'advantages', 'average price', 'total price'])
            information.to_csv('House prices in Tianjin.csv', mode='a+', index=False, header=False)  # mode='a+' appends
        print('Page {0} stored successfully'.format(page))
    else:
        print('Parse failure')
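The .resblock-* selectors are tied to the markup of Lianjia's new-home listing pages at the time of writing; if the site changes its class names, houses simply comes back empty and nothing is written. Verifying the parser against one live page before launching the full loop is cheap:

# verify the selectors on a single page before crawling everything
pase_page('https://tj.fang.lianjia.com/loupan/pg1/', 1)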
# ================== Two threads =====================================
import threading

for i in range(1, 100, 2):  # pages 1-100, two pages per iteration
    url1 = "https://tj.fang.lianjia.com/loupan/pg" + str(i) + "/"
    url2 = "https://tj.fang.lianjia.com/loupan/pg" + str(i + 1) + "/"
    t1 = threading.Thread(target=pase_page, args=(url1, i))      # thread 1
    t2 = threading.Thread(target=pase_page, args=(url2, i + 1))  # thread 2
    t1.start()
    t2.start()
    t1.join()  # wait for both threads, so at most two
    t2.join()  # requests are in flight at any time
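As an alternative sketch (not from the original post), a thread pool expresses the same bounded concurrency without pairing pages by hand:

from concurrent.futures import ThreadPoolExecutor

# same crawl with a bounded pool; max_workers=2 mirrors the
# two-threads-at-a-time pattern above
with ThreadPoolExecutor(max_workers=2) as pool:
    for i in range(1, 101):
        pool.submit(pase_page, "https://tj.fang.lianjia.com/loupan/pg" + str(i) + "/", i)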
Crawled data: House prices in Tianjin.csv
Four. Data analysis
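A minimal sketch (not from the original post) of loading the scraped CSV for a first look; the column names mirror the table layout above, and the district grouping is only an illustration:

import pandas as pd

# the rows were appended without a header, so supply the column names
cols = ['name', 'type', 'room layout', 'area', 'sale status', 'district',
        'address', 'advantages', 'average price', 'total price']
df = pd.read_csv('House prices in Tianjin.csv', names=cols, header=None)

# drop duplicates (re-runs append to the same file) and non-numeric prices
df = df.drop_duplicates()
df['average price'] = pd.to_numeric(df['average price'], errors='coerce')
df = df.dropna(subset=['average price'])

# average price per district, highest first
print(df.groupby('district')['average price'].mean().sort_values(ascending=False))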