Requests + BS4: crawling Beike (ke.com) listings
2022-07-05 13:48:00 【Weichi Begonia】
"""Crawl second-hand house listings from Beike (ke.com)."""
import json

import pandas as pd
import requests
from bs4 import BeautifulSoup


def clean(text):
    """Remove newlines and spaces from a scraped text fragment."""
    return text.replace('\n', '').replace(' ', '')


def get_data(keyword):
    """Fetch the listings that match a search keyword.

    :param keyword: search term appended to the listing URL
    :return: dict mapping each listing title to its scraped fields
    """
    ip = '114.100.0.229:9999'
    # Only the "http" scheme is mapped, so the https requests below bypass this proxy.
    proxy = {"http": ip}
    url = 'https://bj.ke.com/ershoufang/rs{}'.format(keyword)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"}
    res = requests.get(url, headers=headers, proxies=proxy)
    # Keep a copy of the raw page for debugging.
    with open('beike.txt', 'w', encoding='utf-8') as f:
        f.write(res.text)

    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(res.text, 'html.parser')
    # The three lists are paired by index, which assumes the page lists them in the same order.
    item_list = soup.find_all('div', class_='title')
    house_info_list = soup.find_all('div', class_='houseInfo')
    total_price_list = soup.find_all('div', class_='totalPrice')

    # Detail-page labels mapped to result keys. The Chinese literals are assumed to be
    # the original on-page labels; the source showed only their machine translation.
    detail_labels = {
        '梯户比例': 'tiHuBi',   # elevator-to-household ratio
        '建筑结构': 'jiegou',   # building structure
        '建筑类型': 'leixing',  # building type
    }

    results = {}
    for i in range(len(house_info_list)):
        title = item_list[i].a.text.strip()
        info = results.setdefault(title, {})
        house_info = [clean(part) for part in house_info_list[i].text.split('|')]
        info['floor'] = house_info[0]
        info['year'] = house_info[1]
        info['jiegou'] = house_info[2]
        info['area'] = house_info[3]
        info['direction'] = house_info[4]
        # The first line of the totalPrice block holds the number.
        info['price'] = clean(total_price_list[i].text.strip().split('\n')[0])

        detail_url = item_list[i].a.attrs['href']
        # Send the same headers as the listing request.
        detail_info = requests.get(detail_url, headers=headers, proxies=proxy)
        detail_soup = BeautifulSoup(detail_info.text, 'html.parser')
        detail_list = detail_soup.find_all('div', class_='content')
        unit_price_div = detail_soup.find('div', class_='unitPrice')
        if unit_price_div is not None and unit_price_div.span is not None:
            info['unit_price'] = unit_price_div.span.text + ' yuan/m²'
        info['detail_url'] = detail_url
        for each in detail_list:
            if each.ul is None:
                continue
            # Check every <li>, not just the first one in each block.
            for li in each.ul.find_all('li'):
                if li.span is None:
                    continue
                key = detail_labels.get(li.span.text.strip())
                if key:
                    info[key] = li.text.strip()

    with open('fangyuan.txt', 'w', encoding='utf-8') as f:
        # The original called f.read() here, which fails on a file opened for writing.
        f.write(json.dumps(results, ensure_ascii=False))
    return results


def filter_data():
    """Load the saved raw data into a DataFrame for filtering.

    :return: DataFrame with one row per listing, indexed by title
    """
    with open('fangyuan.txt', 'r', encoding='utf-8') as f:
        results = json.loads(f.read())
    # orient='index' turns each title into a row instead of a column.
    df = pd.DataFrame.from_dict(results, orient='index')
    print(df.head())
    return df


if __name__ == '__main__':
    # Presumably the original keyword (Chaoyang); the source showed only
    # its machine translation, "The rising sun".
    keyword = '朝阳'
    # get_data(keyword)  # run once to create fangyuan.txt
    filter_data()
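
The `filter_data` function above loads the saved listings but does not apply any filter yet. The following is a minimal sketch of one way to do that with only the standard library, assuming the `area` and `price` fields are strings that contain a number (as the scraped values appear to). The sample data and the names `first_number` and `filter_listings` are hypothetical and not part of the original script.

```python
import re

# Hypothetical sample shaped like the `results` dict built by get_data();
# the titles and values are made up for illustration.
sample_results = {
    "Listing A": {"area": "89.5平米", "price": "520"},
    "Listing B": {"area": "120平米", "price": "980"},
    "Listing C": {"area": "n/a", "price": "450"},
}


def first_number(text):
    """Return the first decimal number found in text, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def filter_listings(results, max_price, min_area):
    """Keep listings priced at or below max_price with an area of at least min_area."""
    kept = {}
    for title, info in results.items():
        price = first_number(info.get("price", ""))
        area = first_number(info.get("area", ""))
        if price is None or area is None:
            continue  # skip rows whose fields could not be parsed
        if price <= max_price and area >= min_area:
            kept[title] = info
    return kept


selected = filter_listings(sample_results, max_price=600, min_area=80)
print(sorted(selected))  # prints ['Listing A']
```

Rows with unparseable fields are dropped rather than guessed at. The thresholds use whatever units the page reports, so they should be checked against the real data before use.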