Requests + BS4: scraping Beike (ke.com) listings
2022-07-05 13:48:00 【Weichi Begonia】
"""Scrape second-hand housing listings from Beike (ke.com)."""
import json
import requests
from bs4 import BeautifulSoup
import pandas as pd
def get_data(keyword):
    """Fetch the listings that match `keyword` and save them as JSON.

    :param keyword: search term appended to the listing URL
    :return: dict mapping each listing title to its details
    """
    ip = '114.100.0.229:9999'
    # Only an "http" proxy is configured, so requests does not route
    # the https URLs below through it.
    proxy = {"http": ip}
    url = 'https://bj.ke.com/ershoufang/rs{}'.format(keyword)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"}
    res = requests.get(url, headers=headers, proxies=proxy)
    # Keep a copy of the raw page for debugging.
    with open('beike.txt', 'w', encoding='utf-8') as f:
        f.write(res.text)
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(res.text, 'html.parser')
    item_list = soup.find_all('div', class_='title')
    house_info_list = soup.find_all('div', class_='houseInfo')
    total_price_list = soup.find_all('div', class_='totalPrice')
    results = {}
    for i in range(len(house_info_list)):
        title = item_list[i].a.text.strip()
        results.setdefault(title, {})
        # houseInfo is a '|'-separated string: floor | year | layout | area | direction
        house_info = house_info_list[i].text.split('|')
        results[title]['floor'] = house_info[0].replace('\n', '').replace(' ', '')
        results[title]['year'] = house_info[1].replace('\n', '').replace(' ', '')
        results[title]['jiegou'] = house_info[2].replace('\n', '').replace(' ', '')
        results[title]['area'] = house_info[3].replace('\n', '').replace(' ', '')
        results[title]['direction'] = house_info[4].replace('\n', '').replace(' ', '')
        price = total_price_list[i].text.strip().split('\n')[0].replace(' ', '')
        results[title]['price'] = price
        # Follow the link to the detail page for the remaining fields.
        detail_url = item_list[i].a.attrs['href']
        detail_info = requests.get(detail_url, headers=headers)
        detail_soup = BeautifulSoup(detail_info.text, 'html.parser')
        detail_list = detail_soup.find_all('div', class_='content')
        unit_price = detail_soup.find('div', class_='unitPrice').span.text
        results[title]['unit_price'] = unit_price + ' yuan per square meter'
        results[title]['detail_url'] = detail_url
        for each in detail_list:
            try:
                label = each.ul.li.span.text.strip()
            except AttributeError:
                # This block has no ul/li/span chain; skip it.
                continue
            # The machine-translated source shows these labels in English.
            # The Chinese strings below are the presumed originals, since
            # the live page is in Chinese.
            if label == '梯户比例':  # elevator-to-household ratio
                results[title]['tiHuBi'] = each.ul.li.text
            if label == '建筑结构':  # building structure
                results[title]['jiegou'] = each.ul.li.text
            if label == '建筑类型':  # building type
                results[title]['leixing'] = each.ul.li.text
    # Write (not read) the collected results, keeping Chinese text readable.
    with open('fangyuan.txt', 'w', encoding='utf-8') as f:
        f.write(json.dumps(results, ensure_ascii=False))
    return results
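def filter_data():
    """Load the saved listings for filtering (no filter is applied yet).

    :return: DataFrame with one row per listing
    """
    with open('fangyuan.txt', 'r', encoding='utf-8') as f:
        results = json.loads(f.read())
    # results maps title -> field dict, so orient='index' makes each listing a row.
    df = pd.DataFrame.from_dict(results, orient='index')
    print(df.head())
    return df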
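if __name__ == '__main__':
    # Presumably Chaoyang; the machine-translated source shows 'The rising sun'.
    keyword = '朝阳'
    # Run get_data(keyword) once to create fangyuan.txt before calling filter_data().
    # get_data(keyword)
    filter_data()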
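The field-splitting step inside get_data indexes house_info[0] through house_info[4] directly, so a listing with fewer than five fields raises IndexError. Below is a minimal sketch of a more tolerant helper using only the standard library. The name parse_house_info and the sample string are hypothetical, and the field order is simply the one the script above assumes.

```python
import re

# Field order assumed by the script above (not verified against the live page).
FIELDS = ("floor", "year", "jiegou", "area", "direction")


def parse_house_info(raw):
    """Split a '|'-separated houseInfo string into named fields.

    Hypothetical helper: all whitespace inside each part is removed,
    and fields missing from the end are left out instead of raising.
    """
    parts = [re.sub(r"\s+", "", part) for part in raw.split("|")]
    # zip stops at the shorter sequence, so a short string yields fewer keys.
    return dict(zip(FIELDS, parts))


if __name__ == "__main__":
    # Made-up sample text, for illustration only.
    sample = "\n Mid floor \n | 2008 | 2 rooms | 89.5 sqm | South \n"
    print(parse_house_info(sample))
```

Inside the loop, results[title].update(parse_house_info(house_info_list[i].text)) could then stand in for the five separate assignments.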