Requests + BS4: crawling Beike (ke.com) listings
2022-07-05 13:48:00 【Weichi Begonia】

"""Crawl second-hand housing listings from Beike Zhaofang (ke.com)."""
import json

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_data(keyword):
    """Fetch the listings that match keyword and save them as JSON.

    :param keyword: search term appended to the listing URL
    :return: dict mapping each listing title to its scraped fields
    """
    ip = '114.100.0.229:9999'
    # requests picks a proxy by URL scheme; with only an "http" entry,
    # the https:// requests below are not routed through this proxy.
    proxy = {"http": ip}
    url = 'https://bj.ke.com/ershoufang/rs{}'.format(keyword)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"}
    res = requests.get(url, headers=headers, proxies=proxy)
    # Keep a copy of the raw search page for debugging.
    with open('beike.txt', 'w', encoding='utf-8') as f:
        f.write(res.text)
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(res.text, 'html.parser')
    item_list = soup.find_all('div', class_='title')
    house_info_list = soup.find_all('div', class_='houseInfo')
    total_price_list = soup.find_all('div', class_='totalPrice')
    # Keys for the '|'-separated fields of each houseInfo block, in order.
    info_keys = ('floor', 'year', 'jiegou', 'area', 'direction')
    # Labels as they appear on the Chinese detail page, mapped to result keys.
    detail_labels = {
        '梯户比例': 'tiHuBi',   # ratio of elevators to households
        '建筑结构': 'jiegou',   # building structure
        '建筑类型': 'leixing',  # building type
    }
    results = {}
    for i in range(len(house_info_list)):
        title = item_list[i].a.text.strip()
        results.setdefault(title, {})
        house_info = house_info_list[i].text.split('|')
        # zip() stops at the shorter sequence, so a block with fewer
        # fields than expected does not raise IndexError.
        for key, value in zip(info_keys, house_info):
            results[title][key] = value.replace('\n', '').replace(' ', '')
        price = total_price_list[i].text.strip().split('\n')[0].replace(' ', '')
        results[title]['price'] = price

        # Follow the link to the detail page for the unit price and building details.
        detail_url = item_list[i].a.attrs['href']
        detail_info = requests.get(detail_url, headers=headers)
        detail_soup = BeautifulSoup(detail_info.text, 'html.parser')
        detail_list = detail_soup.find_all('div', class_='content')
        # print(detail_soup.find_all('div', class_='unitPrice'))
        unit_price = detail_soup.find('div', class_='unitPrice').span.text
        results[title]['unit_price'] = unit_price + ' yuan/m²'
        results[title]['detail_url'] = detail_url
        for each in detail_list:
            try:
                # Only the first <li> of each content block is inspected.
                label = each.ul.li.span.text.strip()
            except AttributeError:
                # The block lacks the expected ul > li > span structure.
                continue
            if label in detail_labels:
                results[title][detail_labels[label]] = each.ul.li.text
    # Save the results so filter_data() can reload them without re-crawling.
    with open('fangyuan.txt', 'w', encoding='utf-8') as f:
        f.write(json.dumps(results, ensure_ascii=False))
    return results
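

def filter_data():
    """Load the saved results into a DataFrame for filtering.

    :return: DataFrame with one row per listing
    """
    with open('fangyuan.txt', 'r', encoding='utf-8') as f:
        results = json.loads(f.read())
    # results maps title -> fields, so orient='index' gives one row per listing.
    df = pd.DataFrame.from_dict(results, orient='index')
    print(df.head())
    return df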
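

if __name__ == '__main__':
    # Search term for Chaoyang district; ke.com expects the Chinese name.
    keyword = '朝阳'
    # get_data(keyword)  # uncomment to crawl and refresh fangyuan.txt
    filter_data()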
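
The script stops at loading the saved results, and the filtering step itself is left open. As one possible direction, here is a minimal standard-library sketch that keeps only listings under a price ceiling. It assumes the `price` field starts with a number (for example `'520'` or `'520万'`), which has not been checked against the live site. The helper names `leading_number` and `filter_by_max_price` and the sample records are hypothetical and not part of the original script.

```python
import re


def leading_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    return float(match.group()) if match else None


def filter_by_max_price(results, max_price):
    """Keep listings whose 'price' field parses to a number <= max_price."""
    kept = {}
    for title, info in results.items():
        price = leading_number(info.get('price', ''))
        # Listings with a missing or unparseable price are dropped.
        if price is not None and price <= max_price:
            kept[title] = info
    return kept


# Hypothetical records in the same title -> fields shape that get_data() saves.
sample = {
    'Listing A': {'price': '520', 'area': '89.5平米'},
    'Listing B': {'price': '1200万', 'area': '140平米'},
    'Listing C': {'price': '', 'area': '60平米'},
}

cheap = filter_by_max_price(sample, 600)
print(list(cheap))  # ['Listing A']
```

The same helper could be applied to the `area` field, since `leading_number('89.5平米')` yields `89.5`.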