MongoDB meets Spark (integration)
2022-07-07 11:17:00 【cui_yonghua】
I. Advantages of MongoDB over HDFS
1. Storage granularity: HDFS manages data at the file level, splitting files into 64 MB–128 MB blocks, while MongoDB works at the much finer granularity of individual documents.
2. MongoDB supports indexes, which HDFS lacks, so reads are faster.
3. Data stored in MongoDB is much easier to modify in place.
4. HDFS response times are on the order of minutes, while MongoDB responds in milliseconds.
5. MongoDB's powerful aggregation framework can be used to filter or preprocess data before it ever reaches the compute layer (see the sketch after this list).
6. With MongoDB there is no need for the traditional pattern of computing in an in-memory store such as Redis and then saving the results separately to HDFS.
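Point 5 is worth making concrete: the aggregation framework lets MongoDB filter and reshape data server-side, so only the documents you actually need leave the database. Below is a minimal sketch using pymongo (the mongo-python-driver referenced later); the connection string, the test.coll collection and the name/age fields mirror the example in section III and are assumptions here, not values from the original post.

from pymongo import MongoClient

# Sketch only (assumed connection string, database, collection and fields):
# run an aggregation pipeline inside MongoDB so only filtered, projected
# documents are returned to the client.
client = MongoClient("mongodb://127.0.0.1:27017/")
coll = client["test"]["coll"]

pipeline = [
    {"$match": {"age": {"$gte": 100}}},             # server-side filter
    {"$project": {"_id": 0, "name": 1, "age": 1}},  # keep only the needed fields
    {"$sort": {"age": -1}},                         # order by age, descending
]
for doc in coll.aggregate(pipeline):
    print(doc)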
II. Layered architecture of a big data platform
MongoDB can replace HDFS as the core storage component of a big data platform. The stack can be layered as follows (a configuration sketch follows the list):
Layer 1: storage — MongoDB or HDFS;
Layer 2: resource management — e.g. YARN, Mesos, K8S;
Layer 3: compute engine — e.g. MapReduce, Spark;
Layer 4: programming interfaces — e.g. Pig, Hive, Spark SQL, Spark Streaming, DataFrame, etc.
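To make the layering concrete, the sketch below wires the pieces together from the application side: MongoDB as layer 1, a cluster manager selected through the master URL as layer 2, Spark as the layer-3 compute engine, and the DataFrame/Spark SQL API as layer 4. The master URL, URIs and connector version are placeholder assumptions; the official, spark-submit-driven version of the same setup appears in section III.

from pyspark.sql import SparkSession

# Sketch only (assumed master URL, URIs and connector version): configure the
# MongoDB <-> Spark wiring in code instead of on the spark-submit command line.
spark = (
    SparkSession.builder
    .appName("mongo-spark-layering-sketch")
    .master("yarn")  # layer 2: resource manager (use "local[4]" on a laptop)
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.11:2.0.0")  # layer 4 glue
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.coll")   # layer 1
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll")  # layer 1
    .getOrCreate()
)

# layers 3 and 4: the compute engine driven through the DataFrame API
df = spark.read.format("com.mongodb.spark.sql").load()
df.show()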
Reference:
mongo-python-driver: https://github.com/mongodb/mongo-python-driver/
III. Source code walkthrough
mongo-spark/examples/src/test/python/introduction.py
# -*- coding: UTF-8 -*-
#
# Copyright 2016 MongoDB, Inc.
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# To run this example use:
# ./bin/spark-submit --master "local[4]" \
# --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll?readPreference=primaryPreferred" \
# --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll" \
# --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 \
# introduction.py
from pyspark.sql import SparkSession


if __name__ == "__main__":
    spark = SparkSession.builder.appName("Python Spark SQL basic example").getOrCreate()

    logger = spark._jvm.org.apache.log4j
    logger.LogManager.getRootLogger().setLevel(logger.Level.FATAL)

    # Save some data
    characters = spark.createDataFrame([
        ("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195),
        ("Balin", 178), ("Kili", 77), ("Dwalin", 169),
        ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)
    ], ["name", "age"])
    characters.write.format("com.mongodb.spark.sql").mode("overwrite").save()

    # print the schema
    print("Schema:")
    characters.printSchema()

    # read from MongoDB collection
    df = spark.read.format("com.mongodb.spark.sql").load()

    # SQL
    df.registerTempTable("temp")
    centenarians = spark.sql("SELECT name, age FROM temp WHERE age >= 100")
    print("Centenarians:")
    centenarians.show()
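When the script is launched with the spark-submit command from the header comment, reads and writes default to the collections named in spark.mongodb.input.uri and spark.mongodb.output.uri. The connector also accepts per-operation options, so a single job can touch several collections. The snippet below continues from the script above; the extra collection names are made-up placeholders, not values from this post.

# Sketch only: override the default input/output collections per operation.
# "other_coll" and "centenarians" are placeholder collection names.
other = (
    spark.read.format("com.mongodb.spark.sql")
    .option("uri", "mongodb://127.0.0.1/test.other_coll")
    .load()
)

(centenarians.write.format("com.mongodb.spark.sql")
    .option("uri", "mongodb://127.0.0.1/test.centenarians")
    .mode("append")
    .save())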