SparkStreaming Real-Time Data Warehouse: Questions & Answers
2022-06-10 12:47:00 【WK-WK】
Overview
1、 What is real-time computing
Real-time computing generally targets massive data and requires second-level latency.
Characteristics of a real-time computing project:
1、 Once started, the program keeps holding resources until it is explicitly stopped
2、 The data source is continuous and arrives in real time
3、 The data volume is large and cannot (or need not) be fully pre-computed, yet responses to users must be real time
(Note: real-time computing emphasizes low latency and high throughput; it does not compute a huge batch of data at one time)
2、 Latency metrics
Batch processing: 5 seconds to 1 minute; the delay depends on the window size and the amount of data fetched
Real-time processing: micro-batch processing has second-level latency, true stream processing has millisecond-level latency
3、 Give some examples of real-time computing
1、 Visit details by region
2、 Transactions by region
3、 Orders by region
4、 Revenue by region
(Note: real-time computing mainly shows how the data trends over time)
4、 What technologies can implement real-time computing?
Flink: stream processing
ClickHouse
SparkStreaming: micro-batch processing
ElasticSearch
5、 Where are the bottlenecks of single-machine computing?
Memory and the number of threads are limited; once a relational database such as MySQL holds more than a certain amount of data, computation takes a long time
Note:
Compute resources: memory, CPU cores
Storage resources: disk
Fault tolerance: a single machine is prone to downtime when resources run short
6、 Real-time computing architecture
The project uses three stages: the collection stage, the real-time computing stage, and the ad-hoc query stage
The real-time computing stage is further split into data splitting (dimension data handled in real time) and data aggregation (joining supplementary business data)
Source code: https://gitee.com/wang0101gitee/spark.git
7、 Why is it designed like this?
To keep the data timely we want as few intermediate steps as possible, but the pipeline must not be oversimplified either: processing the incoming data directly, with this much data, makes the computation severely overrun its time budget. To balance the two, the pipeline is generally divided into three stages so that resources are used rationally.
Three-layer architecture:
ODS: stores raw data
DWD: subject-oriented detail data (plus DIM dimension data)
DWS: summary data
(Note: SparkStreaming lets you set the batch interval, so large amounts of data can be balanced across batches; see the sketch below)
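A minimal sketch of the point above (application name and master are assumed): the batch interval passed to the StreamingContext decides how much data each micro-batch accumulates before it is processed.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchIntervalSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("realtime-dwd").setMaster("local[4]")
    // A 5-second batch interval: each micro-batch collects 5 seconds of input
    val ssc = new StreamingContext(conf, Seconds(5))

    // ... build the DStream pipeline here (read from Kafka, split, write out) ...

    ssc.start()
    ssc.awaitTermination()
  }
}
```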
Chapter 1: The collection phase
Log data
1、 How is it collected?
Two options:
1、 Log data (incremental), no history kept: collect with Flume and send it to the next layer
2、 For maximum timeliness, send the data to the next layer directly
a. from the web service itself
b. via Nginx
2、 Advantages and disadvantages of the two options?
1、 Flume supports resuming from where it stopped: if the downstream node crashes, the data is retained until the downstream recovers
It is slower than option 2, and with a memory channel data is still lost if Flume itself crashes
2、 Fast, but less safe
a. the web service opens a connection and sends directly
b. Nginx can be clustered, with a node set aside for load balancing
3、 Where is the data sent? Why? Why not compute on it directly?
The data is sent directly to Kafka for storage; storage is not the goal, being pulled by consumers is.
If the data were computed directly, any mismatch between the send rate and SparkStreaming's processing rate would cause a data backlog.
The solution is to send the data into a message queue, whose key features are rate limiting, peak shaving and asynchronous processing; the only component that fits is Kafka, which therefore serves as the storage layer for the raw data.
(Note: rate limiting and peak shaving are properties of queues; Kafka is designed for big data, with high throughput and high availability [distributed architecture])
If SparkStreaming computed directly, it could only pull data; if the log data first landed on an ordinary disk store, efficiency would suffer, whereas Kafka reads and writes fast by writing to disk sequentially.
4、 How are logs stored in Kafka? Why store them this way?
1、 The log server sends directly to Kafka; this is a backend task and needs no processing
2、 The logs are sent to Kafka in their original structure as one topic (partitioned at the same time, for distributed computing)
5、 Why partition the topic?
To increase the parallelism of the computation
How is data skew avoided?
Use mid (the device id) as the partition key. Why not userid? A user who has not logged in may have an empty userid (see the producer sketch below).
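A minimal sketch of the point above (topic name, broker addresses and field names are assumed): sending a log record to Kafka with mid as the record key, so records from the same device land in one partition.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka1:9092,kafka2:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    // Producer-side exactly-once (see question 5 in Chapter 2): acks=all plus idempotence
    props.put("acks", "all")
    props.put("enable.idempotence", "true")

    val producer = new KafkaProducer[String, String](props)
    val mid = "mid_001"                                     // hypothetical device id
    val logJson = """{"mid":"mid_001","page":"home"}"""
    // mid is the record key, so Kafka's default partitioner hashes it to a fixed partition
    producer.send(new ProducerRecord[String, String]("ods_log", mid, logJson))
    producer.close()
  }
}
```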
Business data
1、 How is business data collected?
1、 Incremental collection
2、 Full collection
2、 How is incremental synchronization of business data implemented? Details?
– Technology
Enable MySQL's binlog; the binlog records every insert, update and delete the user performs in MySQL.
Incremental synchronization builds on this: enabling the binlog only produces local files, so synchronizing the data to Kafka requires specific tools such as Maxwell or Canal.
Implementation principle
Both tools disguise themselves as a MySQL slave; with MySQL master-slave replication enabled, the data is streamed to the corresponding component.
Does the data structure change before and after transmission?
Yes; the specific change is defined by each component.
3、 Can data be lost in this step?
Both components support resuming from where they stopped, each maintaining its own offset; after a crash, no data is lost.
What are the different binlog formats?
1、 row:
Records the content before and after every change
Advantage: absolute data consistency
Disadvantage: wastes space
2、 statement
Records each write statement
Advantage: saves space
Disadvantage: data consistency is not guaranteed, e.g. when the same statement is executed in two places and contains now()
3、 mixed
A mix of the two: statement by default, switching to row in cases such as:
1、 the statement contains UUID()
2、 a table with an AUTO_INCREMENT field is updated
3、 INSERT DELAYED statements
4、 UDFs are used
Advantage: saves space while largely preserving consistency
Disadvantage: data can still be inconsistent in some cases
4、 Why is there also a full import process?
Full import is mainly used to load dimension-table data; dimension data may be needed in later calculations, and incremental import cannot bring in historical data.
5、 How is full import implemented?
1、 DataX
The transferred data structure is adapted to the data format of the target store
2、 Maxwell's bootstrap
The synchronized data format changes; it only reads tables, and apart from the bootstrap start/end markers the operation type of every record is insert
6、 How often is full synchronization run?
For real-time computing, generally only one initial synchronization is needed,
plus whenever the dimension data changes
7、 If the program crashes during a full sync, will there be duplicate data?
There will be duplicates, but since the data volume is small they do not affect the calculation
(Note: idempotent writes can also be used when synchronizing, to avoid duplicates)
8、 How is data skew avoided?
Incremental sync can specify a partition key (e.g. the primary key)
Full sync does not move much data, so it need not be considered
Kafka storage
1、 What is Kafka's main role? Why choose Kafka?
Answered above.
Roles: decoupling, peak shaving, high throughput, high availability (distributed framework), partitioning (distributed computing, higher parallelism)
2、 How is the data stored in Kafka?
One topic; how many partitions and how many replicas?
The number of partitions depends on how many consumers sit behind the topic: peak throughput / per-consumer processing rate
The number of replicas depends on how much safety is required
3、 Can the data Kafka receives be duplicated?
Avoided with producer idempotence
and acks
4、 If a Kafka node goes down, is data lost?
Not with multiple replicas; but if the number of partitions exactly equals the number of Kafka nodes, processing capacity drops while the node is down
5、 Throughput
Use the bundled performance-test scripts to measure local read and write rates
6、 Kafka monitoring
A self-developed monitor, or the open-source Kafka Eagle
7、 How long is data kept in Kafka?
3 days, the same retention as the offline project
(Note: Kafka's default retention is 7 days)
Chapter 2: Real-time computing
About real-time computing
1、 Mainstream computing approaches?
Spark micro-batch processing: this project uses Spark's micro-batching (simulating a stream with small batches)
Flink: unified stream and batch processing
2、 Why not compute the final result directly?
With a small amount of data the result could be computed quickly,
but real-time computing is characterized by low latency and high throughput over large data
(answered above)
3、 Why still adopt a layered architecture?
To keep the data timely we want as few intermediate steps as possible, but the pipeline must not be oversimplified either: processing the incoming data directly, with this much data, makes the computation severely overrun its time budget. To balance the two, the pipeline is divided into three stages so resources are used rationally.
Three-layer architecture:
ODS: stores raw data
DWD: subject-oriented detail data
DWS: summary data
ODS -> DWD
1、 Main purpose of the ODS layer? What is its carrier?
Purpose: keep historical data and act as a buffer (decoupling the business systems from the computation)
Carrier: Kafka
Reasons: it is a message queue that writes and reads fast, handles volumes other databases cannot, is distributed, and supports partitioned parallel processing
2、 How does data flow from the ODS layer to the DWD layer?
SparkStreaming programs split the data; different programs handle different data sources
Log data: routed to different Kafka topics according to the page information
Business data: split into fact tables and dimension tables (dimension data flows into Redis)
3、 How is exactly-once consumption achieved?
As long as the program does not crash, committing the offset after processing guarantees no data loss
Duplicates are handled either by the idempotence of the target store itself, or by a transactional operation in which the insert and the offset update succeed together
This project uses idempotence to handle duplicates
(Note: if the target store does not do statistics, duplicates may be tolerated to reduce computing pressure)
4、 Where are offsets maintained? Why?
The project uses Redis: exactly-once consumption is required, and offset maintenance relies on the idempotence of the store itself
There are many candidate stores, e.g.
1、 MySQL: replace is idempotent
2、 HBase: put on a rowkey is idempotent
3、 Redis: the set and hset types are idempotent
4、 Hive: insert overwrite is idempotent
Redis is chosen for its fast reads and writes; the drawback is that data is lost if it crashes
Offsets are stored as a hash. Why a hash? A hash keyed by topic and consumer group can hold the partition-to-offset pairs in one structure and update them idempotently (see the sketch below).
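A minimal sketch of the point above (Redis key layout, host names and topic names are assumptions): the offsets are read from a Redis hash before the direct stream is created, and written back only after the batch has been stored.

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010._
import redis.clients.jedis.Jedis
import scala.collection.JavaConverters._

object OffsetInRedisSketch {
  // hash key like "offset:<topic>:<groupId>"; field = partition, value = offset
  def readOffsets(topic: String, groupId: String): Map[TopicPartition, Long] = {
    val jedis = new Jedis("redis-host", 6379)
    val saved = jedis.hgetAll(s"offset:$topic:$groupId").asScala
    jedis.close()
    saved.map { case (p, o) => new TopicPartition(topic, p.toInt) -> o.toLong }.toMap
  }

  def saveOffsets(topic: String, groupId: String, ranges: Array[OffsetRange]): Unit = {
    val jedis = new Jedis("redis-host", 6379)
    ranges.foreach { r =>
      jedis.hset(s"offset:$topic:$groupId", r.partition.toString, r.untilOffset.toString)
    }
    jedis.close()
  }

  def buildStream(ssc: StreamingContext, topic: String, groupId: String,
                  kafkaParams: Map[String, Object]): InputDStream[ConsumerRecord[String, String]] = {
    val offsets = readOffsets(topic, groupId)
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq(topic), kafkaParams, offsets))
    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch and write it to the target store (idempotently) ...
      saveOffsets(topic, groupId, ranges)   // commit only after the write succeeds
    }
    stream
  }
}
```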
5、 Exactly-once on the producer side?
Assuming the program does not crash: exactly-once = acks(-1) + producer idempotence enabled (these settings appear in the producer sketch in Chapter 1)
6、 In different scenarios, how would the store for the offsets be chosen?
7、 What are the consequences if Redis crashes?
If Redis crashes, the program fails to obtain a Redis connection and the job fails; will there be duplicate data when it is rerun?
8、 How is business data handled?
The program splits the dimension tables from the fact tables, and each table is stored separately
Dimension data: apart from the user table it changes little, and most tables are wide, so a database suited to long-term storage can be chosen
Fact data: high volume and constantly changing, so it stays in Kafka
(Note: dimension data is involved in many business calculations and is read frequently)
9、 Storage of dimension tables and fact tables
Candidate stores for dimension tables:
1、 Redis: fastest reads
2、 HBase: larger storage capacity
3、 MySQL: in between, cheapest
The project chooses Redis: fast reads, and not easy to bring down
Fact tables stay in Kafka
10、 Any points that could be optimized? Give an example
Running Redis as a cluster improves fault tolerance
11、 Main purpose of the DWD layer? What is its carrier?
It stores the constantly changing fact tables; the carrier is Kafka
12、 How does the program tell whether a business table is a dimension table or a fact table?
The program cannot distinguish them automatically; manual intervention is needed. You could hard-code in the program which tables are dimensions and which are facts,
but then every new table would require changing the original code, so instead the lists of fact tables and dimension tables are stored in another container such as MySQL, HBase or Redis.
The project uses Redis, with one key holding the dimension-table names and one key holding the fact-table names.
(Note: the point is that this information may be modified at any time, so it is re-read on every batch, which keeps it up to date.)
13、 Can the data written to Redis stay ordered?
A full analysis is needed:
1、 From the data source to Kafka, data is ordered within a partition but unordered across partitions
2、 When the program reads the data, multiple partitions and multiple consumers mean the overall consumption order cannot be guaranteed
Solution: only local order can be guaranteed, i.e. the data within one partition is ordered
When Canal sends the data to Kafka, a partition key can be defined for the transfer
(Note: send the data that must stay ordered to one partition; for example, the order of a user's page visits can be preserved by using mid as the partition key)
14、 Broadcast variables (optimizing the reading of shared resources)
If every RDD read the table-name lists from Redis, every partition of every RDD would issue a read, wasting resources
Optimization: read the table names once in the driver, wrap them in a broadcast variable and distribute them to each executor, as sketched below
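A minimal sketch of the point above (Redis keys and host are assumptions): the dimension- and fact-table name sets are read once per batch in the driver and broadcast, so the executors filter locally without touching Redis.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import redis.clients.jedis.Jedis
import scala.collection.JavaConverters._

object TableNamesBroadcastSketch {
  // batch record = (tableName, rowJson)
  def process(ssc: StreamingContext, batch: RDD[(String, String)]): Unit = {
    // Driver side: one Redis read per batch instead of one per partition
    val jedis = new Jedis("redis-host", 6379)
    val dimTables  = jedis.smembers("dim_table_names").asScala.toSet   // hypothetical key
    val factTables = jedis.smembers("fact_table_names").asScala.toSet  // hypothetical key
    jedis.close()
    val dimBc  = ssc.sparkContext.broadcast(dimTables)
    val factBc = ssc.sparkContext.broadcast(factTables)

    // Executor side: the broadcast value is read locally, no Redis round trip
    val facts = batch.filter { case (table, _) => factBc.value.contains(table) }
    val dims  = batch.filter { case (table, _) => dimBc.value.contains(table) }
    // ... route facts to their Kafka topics and dims to Redis ...
  }
}
```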
15、 In the logic code, different resources are acquired and released differently by design. How is this achieved?
External resources: Kafka, Redis
Specific uses: the data being computed, offsets, dimension data, static resources
Design principle: depending on the environment in which a resource is used (driver or executor), different operators are used flexibly to control where the code runs, e.g. connections created inside foreachPartition run on the executors, while offset reads and broadcasts happen in the driver.
DWD->DWS
Why write into DWS at all? Can't the program produce the result directly?
There is still a lot of data; if SparkStreaming did the aggregation itself, the shuffle involved would keep the computation slow.
To speed up query efficiency, the data is instead put into an OLAP engine.
What is the carrier of DWS? Why?
The carrier is an OLAP analysis tool; in this project, ES
Advantages:
1、 OLAP engines have strong compute power and handle large volumes quickly
2、 Aggregation
3、 Range filtering of data
4、 Developing a new business metric on OLAP is faster
5、 Columnar storage, convenient for aggregation
Disadvantages:
1、 Relational joins
2、 Complex semantic expressions
3、 Window functions and some algorithms
4、 Machine learning
5、 Outer joins
What does SparkStreaming do in this stage?
It handles what ES is not good at
How are the programs written according to the business? Two examples
1、 First define the measures to be presented
2、 Determine the dimension information attached to the measures
3、 Fetch the data and supplement the related data
4、 Run aggregate queries on the completed data
For example:
1、 Build the daily-active-user detail (deduplication, dimension join)
2、 Build the order detail (dimension join, double-stream join)
How is the data deduplicated?
Idea: keep state and check the records one by one
Implementation: 1、 Spark maintains state itself (e.g. updateStateByKey), which can also be persisted to HDFS, but reading it back is cumbersome
2、 Use another container: Redis, HBase or MySQL
Possible problem: the Spark program runs many tasks in parallel; if they all modify a shared resource, races can occur
Solution: locking is too cumbersome; use an atomic or transactional operation instead
Deduplication in the project only has to guarantee uniqueness, so idempotence suffices: Redis's set is chosen, whose sadd checks membership and writes in one consistent operation (see the sketch below)
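A minimal sketch of the point above (Redis key layout and record shape are assumptions): per-partition dedup of daily-active devices with SADD, which tests membership and inserts atomically and returns 1 only for a first-seen member.

```scala
import org.apache.spark.rdd.RDD
import redis.clients.jedis.Jedis

object DauDedupSketch {
  // record = (mid, pageLogJson); keep only the first visit of each device per day
  def dedup(batch: RDD[(String, String)], day: String): RDD[(String, String)] = {
    batch.mapPartitions { iter =>
      val jedis = new Jedis("redis-host", 6379)          // one connection per partition
      val key = s"dau:$day"                              // hypothetical key layout
      // sadd returns 1 when the member was not in the set yet, 0 otherwise
      val firstSeen = iter.filter { case (mid, _) => jedis.sadd(key, mid) == 1L }.toList
      jedis.expire(key, 24 * 3600)                       // the set only needs to live one day
      jedis.close()
      firstSeen.iterator
    }
  }
}
```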
How is the dimension join done?
Dimension data lives in another container; when joining, fetch it from that container, build a new combined object, then return it
Possible problems with the dimension join?
Field types may not match, or the dimension table may have no row matching the fact data;
calling object.property or object.method on the missing value can then throw a null pointer exception
How is the double-stream join implemented?
Idea:
A double-stream join splices together the data of two streams
If all the data for a given key sits in one partition of each stream, joining the two streams causes no shuffle; the project partitions by key, so all data with the same key is in one partition
Possible problem:
Late data: data that belongs to the first batch arrives in the second batch, so the two sides fail to match correctly
Solutions:
1、 Sliding windows, with duplicated data
2、 Cache the intermediate data: when one side has no match, put it into a container
Choice of container: a stream cannot wait, and the data may only be a few minutes late, so it need not be cached forever.
A container that supports an expiration time is therefore preferable; the project chooses Redis
Concrete implementation:
Joining two streams requires key-value streams, so before the join a map is used to reshape the records (see the sketch below)
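A minimal sketch of the point above (record shapes, Redis key layout and the 10-minute TTL are assumptions): the two streams are keyed by orderId, full-outer-joined, and the unmatched side is cached in Redis with an expiration time so late data from the other stream can still match.

```scala
import org.apache.spark.streaming.dstream.DStream
import redis.clients.jedis.Jedis

object DoubleStreamJoinSketch {
  def join(orders: DStream[(String, String)],        // (orderId, orderJson)
           details: DStream[(String, String)]        // (orderId, detailJson)
          ): DStream[(String, (String, String))] = {

    orders.fullOuterJoin(details).mapPartitions { iter =>
      val jedis = new Jedis("redis-host", 6379)
      val out = iter.flatMap {
        case (id, (Some(order), Some(detail))) =>     // both sides arrived in this batch
          Some((id, (order, detail)))
        case (id, (Some(order), None)) =>             // detail has not arrived yet
          Option(jedis.get(s"detail:$id")) match {
            case Some(cachedDetail) => Some((id, (order, cachedDetail)))
            case None =>
              jedis.setex(s"order:$id", 600, order)   // cache 10 minutes for a late detail
              None
          }
        case (id, (None, Some(detail))) =>            // order has not arrived yet
          Option(jedis.get(s"order:$id")) match {
            case Some(cachedOrder) => Some((id, (cachedOrder, detail)))
            case None =>
              jedis.setex(s"detail:$id", 600, detail)
              None
          }
        case _ => None
      }.toList
      jedis.close()
      out.iterator
    }
  }
}
```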
Can the double-stream join lose data? How is it avoided?
If data is so late that the corresponding cached entry has already expired when it finally arrives, that data is lost
Can the double-stream join produce duplicate data? How is it avoided?
One key in the main stream may correspond to several keys in the secondary stream,
and one key in the secondary stream corresponds to a key in the main stream;
the business logic has to decide how to handle this
Does the program shuffle, and when?
When the data with the same key is not already in the same partition, using join triggers a shuffle, and data skew can occur
from DWD->DWS There is no data loss in the process ?
There is , Write to the cache during de duplication , But writing ES Your program crashed , When the data comes in again , We'll check first redis Value , Find out redis It's worth it , Explain that the data came , The data will not continue to go down , This data will be lost ?
Solution :
Adopt the method of state restoration to solve , Not sure redis Is the value inside accurate , It can be queried at the beginning of the program ES Values that already exist , Rewrite write based on existing values redis. There will be no data loss
from DWD->DWS There is no data duplication in the process ?
Double current Join
SparkStreaming The program is broken , Will the restart affect the business
Can't , Normal execution , Benefit from checking first ES In the writing redis The mechanism of , Data will not be lost ,
What roles does Redis play in this part?
1、 Offsets
2、 Deduplication state
3、 Cached join data
4、 Dimension attributes for reads
When writing to ES, how is exactly-once guaranteed?
Write with put semantics (indexing by document id), which deduplicates the data (see the sketch below)
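A minimal sketch of the point above, assuming the elasticsearch-spark (elasticsearch-hadoop) connector is on the classpath and the index naming and field names are illustrative: each record is written with an explicit document id, so a replayed batch overwrites the same documents instead of duplicating them.

```scala
import org.apache.spark.rdd.RDD
import org.elasticsearch.spark._    // adds saveToEs to RDDs (elasticsearch-hadoop connector)

object EsIdempotentWriteSketch {
  // each record is a field map; "mid" is assumed to be the unique business key
  def write(batch: RDD[Map[String, String]], day: String): Unit = {
    // one index per day, e.g. gmall_dau_info_2022-06-10 (index naming is assumed);
    // es.nodes / es.port must already be set in the SparkConf
    val index = s"gmall_dau_info_$day"
    // es.mapping.id tells the connector to use the mid field as the document _id,
    // so retried writes become overwrites rather than new documents
    batch.saveToEs(index, Map("es.mapping.id" -> "mid"))
  }
}
```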
When writing to ES, what happens if Redis fails?
1、 The deduplication state is gone
2、 The dimension data is gone and must be re-imported
3、 The cached join data is gone
Deduplicated data may then be written to ES again, and cached join data may be lost
How is the ES data planned?
1、 Design the index template first
2、 One index per day
3、 Different businesses get different indexes
Is writing to ES the end?
The goal is to give users intuitive data; the data produced should be directly usable by them
Chapter 3: Ad-hoc query
1、 What is an ad-hoc query?
An ad-hoc query lets users flexibly choose query conditions based on their own needs; the system generates the corresponding statistical report from what the user selects.
2、 Besides ES, what other OLAP options are there?
ClickHouse, Kylin
3、 Briefly describe ES's structure and characteristics (inverted index, columnar storage)
Forward index: when a user issues a request, the search engine scans the documents in the index library and finds all documents that contain the keyword; checking each document for the keyword is a forward index, like LIKE in SQL
Inverted index: when data is inserted it is tokenized, and for each term a list of document IDs is generated; at query time, the term is looked up to obtain the matching document IDs
4、 Structure of the inverted index?
term index (word index) + term dictionary + posting list (document list)
term index: in memory
a tree-like structure for quickly locating a term
term dictionary: in memory
stores all existing terms
posting list: in disk files
stores the list of document IDs in which a term occurs
5、 Rules for ES sharding?
1、 Multiple shards can be set for one index
2、 The number of shards cannot be changed after the index is created
3、 Shard routing logic: hash(routing) % number_of_primary_shards, where routing is a variable value, by default the document id
4、 One ES node's heap should not exceed 32 GB, and preferably should not hold more than 1000 shards
6、 ES write process
1、 The user requests an ES node (the coordinating node)
2、 The coordinating node sends the write request to the index's primary shard
3、 After the primary shard writes successfully, it asks the replica shards to write
4、 When all writes succeed, the primary informs the coordinating node, which returns to the user
7、 ES read process
1、 The client sends a request to node1
2、 The data to be read is on shard 0, and every node has a copy of shard 0
3、 The request is forwarded to node2; node2 returns the result to node1, which returns it to the client
When reading from a replica, the primary shard may already have written the document while the replica has not; in that case the replica reports that the document does not exist, while the primary would return it
8、 ES search process
Two phases
1、 query phase
a. The client contacts node3 (the coordinating node), which creates a priority queue of size from+size
b. The request is sent to every shard copy in the index (primary or replica)
c. Each shard adds its local query results to its own sorted priority queue
d. Each shard returns the document IDs and all the field values involved in sorting to the coordinating node
e. The coordinating node merges these values into its own priority queue for a global sort
2、 fetch phase
a. The coordinating node identifies which documents need to be retrieved and sends GET requests to the relevant shards
b. Each shard enriches these documents (fills in the row content)
c. Once all documents are retrieved, the coordinating node returns the result to the client
9、 Physical commit process of ES data?
1、 A write request arrives; the data is placed into the in-memory buffer and simultaneously written to the translog
2、 The translog is synced to disk according to its default settings, and the requester is then told the write succeeded
3、 1-2 seconds later, a refresh collects the buffer data into a segment and moves it into the readable cache; it is not yet on disk and stays in the readable cache for a while
4、 After 30 minutes by default, or when the translog reaches its 512 MB limit, a flush writes the data to disk
5、 The disk then holds many segment files per shard; a background process, or a manual call, merges them into larger segment files
10、 Optimization of ES segment merging
1、 Background merging: affects performance, and the effect is not great
2、 Manual merging: trigger it once a day
11、 Difference between ES's default tokenizer and a Chinese tokenizer?
By default ES splits Chinese one character at a time, which does not match Chinese semantics, so a third-party tokenizer (such as IK) is used for word segmentation
12、 Difference between BI tools and visualization tools?
BI is a complete solution used to effectively integrate the data an enterprise already has
A visualization tool presents the data graphically, and users can customize how it is displayed
Chapter 4: Advanced
1、 How is data skew handled in Spark?
Symptom: some tasks run very slowly
Cause: some tasks process far more data than others, often because of a join
Solutions: 1、 Pre-aggregate the raw data
2、 Avoid the shuffle
If a shuffle is unavoidable, raise the shuffle parallelism
3、 Do ETL cleaning in Hive first
Filter out the data that may cause skew in advance
If the skewed data is null, give those records a random key so they spread out evenly
4、 Local aggregation + global aggregation (salting, see the sketch below)
5、 Increase the reduce parallelism (more reduce-side tasks, so each task gets less data)
6、 Turn the reduce join into a map join (broadcast the full data of the small RDD and join in map)
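A minimal sketch of item 4 above: two-phase ("salted") aggregation spreads a hot key by adding a random prefix before the first reduceByKey and stripping it for the second, so the hot key is aggregated by many tasks instead of one.

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

object SaltedAggregationSketch {
  def countByKeySalted(pairs: RDD[(String, Long)], salts: Int = 10): RDD[(String, Long)] = {
    pairs
      .map { case (k, v) => (s"${Random.nextInt(salts)}_$k", v) } // 1. add a random prefix
      .reduceByKey(_ + _)                                          // 2. local (salted) aggregation
      .map { case (saltedK, v) => (saltedK.split("_", 2)(1), v) }  // 3. strip the prefix
      .reduceByKey(_ + _)                                          // 4. global aggregation
  }
}
```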
2、 The kinds of Spark shuffle writers, and what decides which one is used
1、 BypassMergeSortShuffleWriter
1、 no map-side pre-aggregation
2、 the number of downstream partitions is at most 200 (the default, configurable)
2、 UnsafeShuffleWriter
1、 the serialized objects support relocation
2、 no map-side pre-aggregation
3、 the number of downstream partitions is at most 16777216
3、 SortShuffleWriter
used when neither of the above applies
Spark has three shuffle writers, namely BypassMergeSortShuffleWriter, UnsafeShuffleWriter and SortShuffleWriter, corresponding to three different ShuffleHandles.
3、 What aggregation operators does Spark have, and which kinds should be avoided?
groupByKey, reduceByKey, countByKey, sortByKey; try to avoid operators that cause a shuffle (groupByKey in particular does no map-side combine)
4、 Spark-on-YARN execution flow
Draw it yourself a couple of times
5、 Difference between yarn-client and yarn-cluster
yarn-cluster suits production; yarn-client suits interaction and debugging, i.e. when you want to see the application's output quickly.
The difference is where the driver runs: in client mode the driver is on the client machine;
in cluster mode the driver runs inside the ApplicationMaster.
6、 Why is Spark faster than MapReduce?
1、 It eliminates redundant HDFS reads and writes
2、 It eliminates redundant MapReduce stages
3、 JVM-level optimization (executors are reused instead of starting a new JVM per task)
7、 Which part of Spark hangs most easily? Why?
The shuffle stage hangs most easily.
Why? It comes down to how Spark's shuffle is designed. Suppose a shuffle involves N tasks: N nodes perform the shuffle write (the mappers) and N nodes perform the shuffle read (the reducers). Spark's design is all-to-all: each reducer has to pull its data from all N mappers, so the overall complexity is on the order of N*N. When a job has 2000 tasks, that single job alone produces millions of network transfers in the Spark cluster.
8、 Is Spark stream processing?
No, it is micro-batch processing; the amount of data processed per batch depends on the window (batch interval) size
9、 How to understand RDD, DAG and Stage?
An RDD (distributed dataset) is just a structure: one RDD has multiple partitions, and the real data is processed on the partitions.
- Partitionable: raises consumption capacity and suits concurrent computing, similar to Kafka consumers consuming data: "one partition corresponds to one task; within an executor, one core handles one task, which is how concurrency is achieved."
a. Resilient: able to change and adapt.
a、 Storage resilience: automatic switching between disk and memory; "in the shuffle stage data is saved to disk to avoid overload and task failure; a job is divided into many stages, and the computation within each stage is in memory."
b、 Fault-tolerance resilience: lost data can be recovered automatically;
c、 Computation resilience: a retry mechanism after a computation error;
d、 Partition resilience: the number of partitions can change dynamically with the computation results. "After each computation there may be less data, which would cause skew; dynamically changing the number of partitions keeps the data evenly distributed across partitions."
b. Immutable: similar to an immutable collection.
An RDD only stores the computation logic, not the data; the logic is immutable, and any change produces a new RDD;
c. RDD: an abstract class that must be implemented by subclasses, offering many data-processing methods
10、 How does an RDD achieve fault tolerance by recording updates?
Two ways
– the difference between caching and checkpointing
- A cache only saves the data and does not cut the lineage. A checkpoint cuts the lineage.
- Cached data is usually kept on disk or in memory, with low reliability. Checkpoint data is usually stored on a fault-tolerant, highly available file system such as HDFS, with high reliability.
- It is recommended to cache the RDD being checkpointed, so the checkpoint job only has to read from the cache instead of recomputing the RDD from scratch (see the sketch below)
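A minimal sketch of the point above (paths are assumed): cache and checkpoint the same RDD; the cache keeps the data for reuse without cutting lineage, the checkpoint persists it to HDFS and cuts lineage so recovery does not recompute the whole chain.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-vs-ckpt").setMaster("local[2]"))
    sc.setCheckpointDir("hdfs://namenode:8020/ckpt")          // assumed HDFS path

    val cleaned = sc.textFile("hdfs://namenode:8020/input")   // assumed input path
      .filter(_.nonEmpty)

    cleaned.persist(StorageLevel.MEMORY_AND_DISK)  // cache first, so the checkpoint job
    cleaned.checkpoint()                            // reads from the cache instead of
                                                    // recomputing from the source
    println(cleaned.count())                        // the first action triggers both
    sc.stop()
  }
}
```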
11、 How to understand wide and narrow dependencies?
Lineage refers to the execution order between operators
The distinction lies in the correspondence between downstream and upstream partitions
Wide dependency: one upstream partition feeds n downstream partitions; a shuffle occurs
Narrow dependency: one upstream partition feeds only one downstream partition; a downstream partition may depend on several upstream partitions
12、 How to understand Job and Task?
– A few concepts
- Application: an application; initializing a SparkContext creates one application
- Job: each action operator produces one job
- Stage: number of wide dependencies + 1
- Task: within one stage, the number of partitions of the last RDD is the number of tasks.
– Special case: if an action operator is called inside a transformation operator, a job is also submitted from inside that transformation.
application > job > stage > task, each level being 1-to-N with the next.
13、 The concept of lineage in Spark
Lineage means kinship, i.e. the execution relationships between operators
It is mainly used to achieve data fault tolerance in a distributed environment
An RDD's lineage, also called the RDD operation graph or RDD dependency graph, is the graph of all the RDD's parent RDDs. It is the result of applying transformations to an RDD and building a logical execution plan.
When the program crashes, the lineage can be relied on to recover the computation
14、 What are transformations and actions? The difference? List some common methods
Spark has two kinds of operators (a short example follows this list):
1、 Transformation operators
map -> transforms the data structure
flatMap -> flattens
filter -> filters
groupBy -> groups by a given rule
glom -> puts each partition's data into an array
2、 Action operators
reduce -> aggregates
collect -> pulls the data of all partitions back to the driver
count -> returns the number of records in the RDD
first -> returns the first element of the RDD
take -> returns the first few records of the RDD
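A minimal sketch of the distinction above: transformations are lazy and only build lineage; nothing runs until an action such as collect is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OperatorSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ops").setMaster("local[2]"))
    val words = sc.parallelize(Seq("a b", "b c", "c"))
      .flatMap(_.split(" "))                   // transformation: flatten lines into words
      .map(w => (w, 1))                        // transformation: reshape into (word, 1)
      .filter(_._1.nonEmpty)                   // transformation: drop empty words
    val counts = words.reduceByKey(_ + _).collect()  // the action triggers the job
    counts.foreach(println)
    sc.stop()
  }
}
```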
15、 Briefly describe the differences between cache, persist and checkpoint
Caching (cache/persist)
cache and persist are both RDD APIs, and cache calls persist underneath. One difference is that cache cannot explicitly specify the storage level and only caches in memory, while persist can specify it, e.g. memory only, memory and disk, serialized, and so on. By caching an RDD, later work on that RDD, or on other RDDs derived from it, can reuse the cached dataset.
Fault tolerance (checkpoint)
In essence, a checkpoint writes the RDD to disk (usually to HDFS, taking advantage of its high availability and reliability). Spark lineage was mentioned above, but in a real production environment a business requirement can be very complex, call many operators and produce many RDDs; the lineage chain between RDDs then becomes very long, and once something goes wrong the cost of recovery is very high. This is where checkpoint helps: the user checkpoints the important RDDs, and when something fails the computation simply restarts from the most recent checkpoint. Set the checkpoint directory with SparkContext.setCheckpointDir("<checkpoint dir>") and then call checkpoint on the RDD.
Comparison of checkpoint with cache/persist
1、 Both are lazy; the cache or checkpoint only happens once an action operator triggers it
(Lazy evaluation is an important property of Spark jobs, applying not only to Spark RDDs but also to Spark SQL and so on)
2、 cache only caches the data and does not change the lineage; the data is usually in memory and more easily lost
3、 checkpoint changes the original lineage and generates a new CheckpointRDD; the data is usually stored on HDFS, which is highly available
16、 Describe the relationship and difference between repartition and coalesce
Relationship: repartition calls coalesce underneath (with shuffle enabled)
Difference: repartition is generally used to increase the number of partitions and shuffles;
coalesce is generally used to decrease the number of partitions and does not shuffle by default (see the sketch below)
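A minimal sketch of the point above: coalesce narrows the partitions without a shuffle, while repartition (coalesce with shuffle = true) redistributes the data and can increase partitions.

```scala
import org.apache.spark.rdd.RDD

object RepartitionSketch {
  def resize(rdd: RDD[String]): Unit = {
    val fewer = rdd.coalesce(2)       // shrink to 2 partitions, no shuffle
    val more  = rdd.repartition(20)   // grow to 20 partitions, full shuffle
    println(s"${fewer.getNumPartitions} / ${more.getNumPartitions}")
  }
}
```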
17、 Broadcast variables and accumulators in Spark
Broadcast variable: a distributed, shared, read-only variable
Why it is needed:
An executor has many cores and can run multiple tasks at the same time. When the driver needs to ship an object with a large amount of data, every task contains its own copy of that variable, so the data exists in many copies inside the executor
Possible problems:
Redundant data in the executor can overflow memory, and in the shuffle stage the data-transfer efficiency becomes especially low
Effect: the driver sends the object to each executor only once, and all tasks share that one copy
Accumulator: a distributed, shared, write-only variable
Why it is needed: to accumulate data without going through a shuffle operator
Effect: the driver sends the accumulator to the executors, each executor holds one accumulator, and after execution it is returned to the driver
The driver keeps the accumulators from all executors and merges them pairwise; executors cannot access each other's accumulators (see the sketch below)
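A minimal sketch of the two variables above (the blacklist values are illustrative): a broadcast variable shares one read-only copy per executor, while a longAccumulator collects per-task counts back to the driver without a shuffle.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariableSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared").setMaster("local[2]"))

    val blacklist = sc.broadcast(Set("mid_bad_1", "mid_bad_2"))  // read-only, one copy per executor
    val blocked   = sc.longAccumulator("blockedRecords")          // write-only from the tasks

    val kept = sc.parallelize(Seq("mid_1", "mid_bad_1", "mid_2"))
      .filter { mid =>
        val bad = blacklist.value.contains(mid)
        if (bad) blocked.add(1)   // tasks only write; the driver reads the merged value
        !bad
      }

    println(s"kept = ${kept.count()}, blocked = ${blocked.value}")
    sc.stop()
  }
}
```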
18、 Memory optimization in Spark
Spark optimizes memory management on top of the JVM
Memory characteristic: memory can be occupied dynamically, between execution memory and storage memory
Mechanism:
1、 When both execution memory and storage memory are full, data is written to disk
2、 When storage memory is full, it can borrow execution memory; when the space is needed back, the borrowed portion is evicted or written to disk
3、 When execution memory is full, it can borrow storage memory; when the space is needed back, the execution memory can only be waited on until it is released, it cannot be evicted
Chapter 5: Scenario analysis
1、 In a very unusual case, what if the dimension data arrives later than the real-time data?
1、 If the real-time data can only be processed together with the dimension data, block the real-time data and spin-wait
2、 Put it into a buffer with a certain expiration time
2、 Cart additions by hour
Processing from DWD to DWS:
1、 What data is needed?
The number of cart additions within each hour
2、 Where does the data come from?
3、 How is it processed?
3、 Transactions by hour
4、 Order count by hour
5、 Payments by hour
6、 Comment count by hour
7、 Visits by region
8、 Transactions by region
9、 Orders by region
10、 Data by region
11、 Revenue by region
12、 Leaderboards
13、 Top 20
14、 SKU
15、 SPU
16、 Brand
17、 Category
18、 Purchase volume
19、 Search keywords
20、 Risk control: chargebacks
21、 Risk control: negative reviews
22、 Risk control: complaints
23、 Risk control: blacklist
24、 Maliciously claiming shopping coupons