MongoDB meets Spark (integration)
2022-07-07 13:12:00 【cui_yonghua】
Basics (enough to solve 80% of problems):
MongoDB data types, key concepts, and commonly used shell commands
Summary of insert, update, and delete operations on MongoDB documents
I. Advantages of MongoDB over HDFS
1. Storage model: HDFS stores data in units of files, each 64 MB to 128 MB in size, whereas MongoDB stores data at a much finer granularity.
2. MongoDB supports indexes, a concept HDFS does not have, so reads are faster.
3. Data is easier to modify in MongoDB.
4. HDFS response times are on the order of minutes, whereas MongoDB responds in milliseconds.
5. MongoDB's powerful aggregation framework can be used to filter or preprocess data.
6. With MongoDB there is no need, as in the traditional pattern, to do the computation in an in-memory database such as Redis and then save the results to HDFS.
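Points 2 and 5 can be illustrated with a minimal sketch using pymongo (the driver listed in the references). The `test.events` collection, its `status`, `amount`, and `user_id` fields, and the local URI are all hypothetical names chosen for illustration; they do not come from the original article.

```python
def build_pipeline(min_amount):
    """Return an aggregation pipeline that filters first, then pre-aggregates per user."""
    return [
        # Filter early so later stages see fewer documents (can use an index).
        {"$match": {"status": "ok", "amount": {"$gte": min_amount}}},
        # Pre-aggregate inside MongoDB instead of shipping raw documents out.
        {"$group": {"_id": "$user_id", "total": {"$sum": "$amount"}, "count": {"$sum": 1}}},
        {"$sort": {"total": -1}},
    ]


def run(uri="mongodb://127.0.0.1:27017", min_amount=10):
    # Imported here so build_pipeline can be used without pymongo installed.
    from pymongo import ASCENDING, MongoClient

    client = MongoClient(uri)
    coll = client["test"]["events"]  # hypothetical database and collection
    # A compound index lets the $match stage avoid a full collection scan.
    coll.create_index([("status", ASCENDING), ("amount", ASCENDING)])
    return list(coll.aggregate(build_pipeline(min_amount)))
```

Calling `run()` requires a MongoDB server reachable at the given URI; `build_pipeline` on its own only constructs the list of stages.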
II. Layered architecture of a big data platform
MongoDB can replace HDFS as the core storage of a big data platform, which can be layered as follows:
Layer 1: storage, MongoDB or HDFS;
Layer 2: resource management, such as YARN, Mesos, K8S;
Layer 3: compute engine, such as MapReduce, Spark;
Layer 4: programming interface, such as Pig, Hive, Spark SQL, Spark Streaming, DataFrame, etc.
References:
mongo-python-driver: https://github.com/mongodb/mongo-python-driver/
Official documentation: https://www.mongodb.com/docs/spark-connector/current/
III. Source code walkthrough
mongo-spark/examples/src/test/python/introduction.py
# -*- coding: UTF-8 -*-
#
# Copyright 2016 MongoDB, Inc.
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# To run this example use:
# ./bin/spark-submit --master "local[4]" \
#     --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll?readPreference=primaryPreferred" \
#     --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll" \
#     --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 \
#     introduction.py

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("Python Spark SQL basic example").getOrCreate()

    # Silence everything below FATAL in the JVM-side log4j root logger
    logger = spark._jvm.org.apache.log4j
    logger.LogManager.getRootLogger().setLevel(logger.Level.FATAL)

    # Save some data to the collection named in spark.mongodb.output.uri
    characters = spark.createDataFrame([("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)], ["name", "age"])
    characters.write.format("com.mongodb.spark.sql").mode("overwrite").save()

    # Print the schema
    print("Schema:")
    characters.printSchema()

    # Read from the collection named in spark.mongodb.input.uri
    df = spark.read.format("com.mongodb.spark.sql").load()

    # Query with SQL; createOrReplaceTempView replaces registerTempTable,
    # which has been deprecated since Spark 2.0
    df.createOrReplaceTempView("temp")
    centenarians = spark.sql("SELECT name, age FROM temp WHERE age >= 100")
    print("Centenarians:")
    centenarians.show()
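The SQL query above filters on `age` after the collection has been loaded as a DataFrame. As a possible variation, the same filter can be expressed as a MongoDB aggregation stage and handed to the connector at read time. This is a sketch under assumptions: the `pipeline` read option is documented for later 2.x releases of the connector, and whether it is available in the 2.0.0 build named in the run command should be verified. The helper names `match_stage_json` and `read_centenarians` are hypothetical.

```python
import json


def match_stage_json(min_age):
    """Serialize a single $match stage for the connector's `pipeline` read option."""
    return json.dumps({"$match": {"age": {"$gte": min_age}}})


def read_centenarians(spark, min_age=100):
    # `spark` is an existing SparkSession configured with spark.mongodb.input.uri,
    # as in the run command above. The `pipeline` option is an assumption about
    # the connector version in use.
    return (
        spark.read.format("com.mongodb.spark.sql")
        .option("pipeline", match_stage_json(min_age))
        .load()
        .select("name", "age")
    )
```

Inside the script above, `read_centenarians(spark).show()` would stand in for the temporary view and the `spark.sql` call, with the filtering done on the MongoDB side.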