
MongoDB meets Spark (integration)

2022-07-07 13:12:00 cui_ yonghua

Basics (covers about 80% of everyday problems):

  1. MongoDB overview, application scenarios, download, connection methods, and development history

  2. MongoDB data types, key concepts, and common shell commands

  3. Summary of MongoDB document insert, update, and delete operations

  4. Summary of MongoDB query operations

  5. Summary of MongoDB column (field) operations

  6. Summary of MongoDB index operations

Advanced:

  1. Summary of MongoDB aggregation operations

  2. Summary of MongoDB import and export, backup and recovery

  3. Summary of MongoDB user management

  4. Summary of MongoDB replication (replica sets)

  5. Summary of MongoDB sharding

  6. MongoDB meets Spark (integration)

  7. MongoDB internal storage principles

Other:

  1. Python 3 examples for working with MongoDB

  2. MongoDB command summary

One. Advantages of MongoDB compared with HDFS

1. In terms of storage, HDFS is file-oriented, with data stored in blocks of 64 MB to 128 MB, while MongoDB works at a much finer granularity (individual documents);
2. MongoDB supports indexes, a concept HDFS does not have, so reads are faster;
3. MongoDB makes it easier to modify data;
4. HDFS response times are on the order of minutes, while MongoDB response times are on the order of milliseconds;
5. MongoDB's powerful aggregation framework can be used for data filtering or preprocessing;
6. With MongoDB there is no need, as in the traditional approach, to do the computation in an in-memory database such as Redis and then save the results to HDFS.
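
Points 2 and 5 above can be illustrated with a small sketch. It only builds an aggregation pipeline as plain Python data; the function name `build_preprocessing_pipeline`, the field names `name` and `age`, and the connection details in the comments are assumptions chosen for illustration, not something taken from the original article.

```python
import json


def build_preprocessing_pipeline(min_age):
    """Return an aggregation pipeline as a list of stage documents.

    The field names ("name", "age") are hypothetical.
    """
    return [
        # Filter on the server; an index on "age" lets MongoDB avoid a full scan
        {"$match": {"age": {"$gte": min_age}}},
        # Keep only the fields needed downstream
        {"$project": {"_id": 0, "name": 1, "age": 1}},
        # Oldest first
        {"$sort": {"age": -1}},
    ]


pipeline = build_preprocessing_pipeline(100)
print(json.dumps(pipeline, indent=2))

# With pymongo installed and a MongoDB server running (both assumptions),
# the pipeline could be run like this:
#
#   from pymongo import MongoClient
#   coll = MongoClient("mongodb://localhost:27017")["test"]["characters"]
#   coll.create_index("age")              # supports the $match stage
#   for doc in coll.aggregate(pipeline):  # filtering happens inside MongoDB
#       print(doc)
```

Because the filtering and projection run inside MongoDB, only the reduced result set has to travel to the compute layer.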

Two. Layered architecture of big data

MongoDB can replace HDFS as the core of a big data platform, which can be layered as follows:
Layer 1: MongoDB or HDFS;
Layer 2: resource management, such as YARN, Mesos, and K8S;
Layer 3: compute engines, such as MapReduce and Spark;
Layer 4: programming interfaces, such as Pig, Hive, Spark SQL, Spark Streaming, and DataFrame.

References:

  1. GitHub (mongo-spark): https://github.com/mongodb/mongo-spark

  2. mongo-python-driver: https://github.com/mongodb/mongo-python-driver/

  3. Official documentation: https://www.mongodb.com/docs/spark-connector/current/

Three. Source code introduction


# -*- coding: UTF-8 -*-
# Copyright 2016 MongoDB, Inc.
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# To run this example use:
# ./bin/spark-submit --master "local[4]" \
# --conf "spark.mongodb.input.uri=mongodb://" \
# --conf "spark.mongodb.output.uri=mongodb://" \
# --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 \
# introduction.py
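from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("Python Spark SQL basic example").getOrCreate()
    # Quiet Spark's own logging so only the example output is shown
    logger = spark._jvm.org.apache.log4j
    logger.LogManager.getRootLogger().setLevel(logger.Level.FATAL)
    # Save some data to the collection named in spark.mongodb.output.uri
    characters = spark.createDataFrame([("Bilbo Baggins",  50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)], ["name", "age"])
    characters.write.format("com.mongodb.spark.sql").mode("overwrite").save()
    # Print the schema
    characters.printSchema()
    # Read from the MongoDB collection named in spark.mongodb.input.uri
    df = spark.read.format("com.mongodb.spark.sql").load()
    # SQL: register the DataFrame as a temporary view so the query can refer to it as "temp"
    df.createOrReplaceTempView("temp")
    centenarians = spark.sql("SELECT name, age FROM temp WHERE age >= 100")
    centenarians.show()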
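
The spark-submit command in the header comment of the script can also be assembled programmatically. The following is only a sketch: the helper name `build_spark_submit` is hypothetical, and the URI `mongodb://127.0.0.1/test.coll` (host, database `test`, collection `coll`) is a placeholder assumption, since the original comment leaves the URI incomplete.

```python
import shlex


def build_spark_submit(script, input_uri, output_uri,
                       package="org.mongodb.spark:mongo-spark-connector_2.11:2.0.0",
                       master="local[4]"):
    """Build the spark-submit argument list used to run the example script."""
    return [
        "./bin/spark-submit",
        "--master", master,
        # Collection that spark.read...load() reads from
        "--conf", f"spark.mongodb.input.uri={input_uri}",
        # Collection that DataFrame.write...save() writes to
        "--conf", f"spark.mongodb.output.uri={output_uri}",
        # Connector artifact that Spark downloads at launch
        "--packages", package,
        script,
    ]


# Placeholder URI (an assumption): adjust host, database, and collection as needed
uri = "mongodb://127.0.0.1/test.coll"
cmd = build_spark_submit("introduction.py", uri, uri)

# Print a shell-safe command line; it could also be passed to subprocess.run(cmd)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the arguments in a list rather than a single string avoids quoting problems with values such as `local[4]`.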
