MongoDB Meets Spark (Integration)
2022-07-07 11:17:00 【cui_yonghua】
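I. Advantages of MongoDB over HDFS
1. Storage granularity: HDFS stores data as files split into large blocks (typically 64 MB to 128 MB each), whereas MongoDB stores individual documents and is therefore much finer-grained.
2. MongoDB supports indexes, which HDFS lacks, so selective reads are faster.
3. Data is easier to modify in MongoDB.
4. HDFS-based jobs typically respond in minutes, whereas MongoDB queries typically respond in milliseconds.
5. MongoDB's powerful Aggregate feature can be used to filter or preprocess data.
6. With MongoDB there is no need, as in the traditional approach, to compute in an in-memory database such as Redis and then save the results separately to HDFS.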
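As a rough illustration of item 5, the sketch below builds an aggregation pipeline and passes it to the connector so that filtering happens in MongoDB before the data reaches Spark. The helper names `build_match_pipeline` and `load_filtered` and the field names `name` and `age` are hypothetical, and the sketch assumes the connector version in use accepts a `pipeline` read option.

```python
import json


def build_match_pipeline(min_age):
    """Return an aggregation pipeline, serialized as JSON, that keeps
    documents whose age is at least min_age and projects two fields."""
    pipeline = [
        {"$match": {"age": {"$gte": min_age}}},
        {"$project": {"_id": 0, "name": 1, "age": 1}},
    ]
    return json.dumps(pipeline)


def load_filtered(spark, min_age):
    """Hypothetical helper: read a collection with the pipeline applied
    on the MongoDB side.

    Assumes the Spark session is already configured with
    spark.mongodb.input.uri and that the connector supports the
    "pipeline" read option.
    """
    return (
        spark.read.format("com.mongodb.spark.sql")
        .option("pipeline", build_match_pipeline(min_age))
        .load()
    )


print(build_match_pipeline(100))
```

Only `build_match_pipeline` runs without a cluster; `load_filtered` needs a live Spark session and a reachable MongoDB instance.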
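II. Layered architecture for big data
MongoDB can replace HDFS as the core storage component of a big data platform. The layers are as follows:
Layer 1: storage, MongoDB or HDFS;
Layer 2: resource management, e.g. YARN, Mesos, K8S;
Layer 3: compute engine, e.g. MapReduce, Spark;
Layer 4: programming interfaces, e.g. Pig, Hive, Spark SQL, Spark Streaming, DataFrame.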
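One way to see how the layers meet in practice is the Spark configuration itself: the master setting selects the resource manager (layer 2) and the connector URIs point the compute engine (layer 3) at MongoDB (layer 1). The following is a minimal sketch; `build_spark_conf` and `create_session` are hypothetical helper names, and the host, database, and collection values are placeholders.

```python
def build_spark_conf(master, database, collection, host="127.0.0.1"):
    """Collect the settings that tie the layers together.

    master      -- resource manager, e.g. "local[4]" or "yarn" (layer 2)
    database,
    collection  -- MongoDB namespace used as the storage layer (layer 1)
    """
    uri = "mongodb://{}/{}.{}".format(host, database, collection)
    return {
        "spark.master": master,
        "spark.mongodb.input.uri": uri,
        "spark.mongodb.output.uri": uri,
    }


def create_session(conf, app_name="mongo-spark-sketch"):
    """Build a SparkSession (layer 3) from the settings above.

    pyspark is imported lazily so the configuration helper can be used
    without a Spark installation. The connector jar must still be made
    available, e.g. via --packages on spark-submit.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = build_spark_conf("local[4]", "test", "coll")
for key in sorted(conf):
    print(key, "=", conf[key])
```

Keeping the settings in a plain dictionary makes it easy to swap the resource manager or the storage target without touching the job code.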
Reference:
mongo-python-driver: https://github.com/mongodb/mongo-python-driver/
III. Source code walkthrough
mongo-spark/examples/src/test/python/introduction.py
# -*- coding: UTF-8 -*-
#
# Copyright 2016 MongoDB, Inc.
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# To run this example use:
# ./bin/spark-submit --master "local[4]" \
# --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll?readPreference=primaryPreferred" \
# --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll" \
# --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 \
# introduction.py
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("Python Spark SQL basic example").getOrCreate()

    # Silence everything below FATAL in the JVM-side log4j logger
    logger = spark._jvm.org.apache.log4j
    logger.LogManager.getRootLogger().setLevel(logger.Level.FATAL)

    # Save some data
    characters = spark.createDataFrame(
        [("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178),
         ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82),
         ("Bombur", None)],
        ["name", "age"])
    characters.write.format("com.mongodb.spark.sql").mode("overwrite").save()

    # print the schema
    print("Schema:")
    characters.printSchema()

    # read from MongoDB collection
    df = spark.read.format("com.mongodb.spark.sql").load()

    # SQL: register the DataFrame as a temporary view
    # (createOrReplaceTempView replaces registerTempTable, deprecated since Spark 2.0)
    df.createOrReplaceTempView("temp")
    centenarians = spark.sql("SELECT name, age FROM temp WHERE age >= 100")
    print("Centenarians:")
    centenarians.show()
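To check what the `WHERE age >= 100` query should select without starting Spark or MongoDB, the same filter can be sketched in plain Python over the sample rows. The function name `select_centenarians` is hypothetical; the `None` check mirrors SQL semantics, where a comparison against NULL is never true, so the row with a missing age is dropped.

```python
# Same sample rows as the example script
characters = [
    ("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178),
    ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82),
    ("Bombur", None),
]


def select_centenarians(rows, min_age=100):
    """Keep (name, age) pairs whose age is known and at least min_age."""
    return [(name, age) for name, age in rows
            if age is not None and age >= min_age]


result = select_centenarians(characters)
for name, age in result:
    print(name, age)
```

Six rows qualify. Spark does not guarantee row order without an ORDER BY clause, so the order shown by `centenarians.show()` may differ from the list order here.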