Knowledge Review
2022-07-06 23:23:00 【Daily Xiaoxin】
1、 Please briefly describe HBase's data structure and storage architecture
Data structure: HBase's data model consists of namespaces, row keys, column families, columns, timestamps, and data cells.
- Namespace: analogous to a database in a relational system; a space that holds tables
- Row key: the rowkey, the unique identifier of a row
- Column family: a broad grouping of columns stored together; the set of families is fixed when the table is created
- Column: a column family can contain multiple columns, and columns can be added dynamically
- Timestamp: every update writes a new version with a new timestamp; the latest data is read via the timestamp, which also works around the fact that HDFS files cannot be modified in place
- Cell: the actual stored data; in HBase, all cell data is stored as (byte-array) strings
Storage architecture: Client, Master (master node), HRegionServer (maintains HRegions), HLog (write-ahead log), HRegion (maintains several Stores), Store (the actual data storage), StoreFile (files persisted to HDFS), MemStore (in-memory storage), and ZooKeeper (monitors the cluster and stores the location of the meta table)
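To make the logical model above concrete, here is a minimal sketch in plain Scala (an illustration only, not the HBase client API; all names are invented for this example). A cell is addressed by row key, column family, column qualifier, and timestamp, and a read returns the newest version:

```scala
// Sketch of HBase's logical data model (illustrative, not the real client API).
object HBaseModel {
  // A cell is addressed by (row key, column family, column qualifier, timestamp).
  final case class Cell(rowKey: String, family: String, qualifier: String,
                        timestamp: Long, value: String)

  // A read returns the newest version of the addressed cell, if any.
  def latest(cells: Seq[Cell], rowKey: String,
             family: String, qualifier: String): Option[String] =
    cells.filter(c => c.rowKey == rowKey && c.family == family && c.qualifier == qualifier)
      .sortBy(-_.timestamp) // newest version first
      .headOption
      .map(_.value)
}
```

This mirrors how a newer timestamp shadows older versions of the same cell, the property that lets HBase "update" data on top of append-only HDFS files.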
2、 Please briefly describe how HBase reads data
The client first contacts ZooKeeper to find which HRegionServer holds the meta table, loads the meta table into memory, and looks up the HRegion that serves the requested rowkey. Because the data may span multiple HRegions, an HRegionScanner (with StoreScanners underneath) is created for each; each scanner first checks whether the data is in the MemStore, then scans the StoreFiles, and finally the merged result is returned.
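The lookup order above (MemStore first, then StoreFiles) can be sketched in plain Scala; this is an illustration of the ordering only, not real HBase code:

```scala
// Sketch of the read path's lookup order: the in-memory MemStore is checked
// first; only on a miss do we fall back to the on-disk StoreFiles.
object ReadPathDemo {
  def read(memStore: Map[String, String],
           storeFiles: Seq[Map[String, String]],
           rowKey: String): Option[String] =
    memStore.get(rowKey) // newest data lives in memory
      .orElse(storeFiles.flatMap(_.get(rowKey)).headOption) // else scan disk files
}
```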
3、 Please briefly describe how HBase writes data
The client first connects and obtains the location of the meta table from ZooKeeper, then looks up in meta the HRegion that serves the rowkey. The write is appended to the HLog (write-ahead log) and then written into the MemStore. When the MemStore reaches the flush threshold (128 MB by default), it is flushed to disk as a StoreFile; as StoreFiles accumulate, they are merged (compacted).
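The MemStore flush step can be sketched as follows (plain Scala, illustrative only; 128 MB is the default region flush size mentioned above):

```scala
// Sketch of the MemStore flush rule: writes accumulate in memory until the
// flush threshold is reached, then the buffer is written out as a StoreFile.
object WritePathDemo {
  val FlushThresholdBytes: Long = 128L * 1024 * 1024 // default flush size

  // Returns (new MemStore size, whether a flush to a StoreFile happened).
  def write(memStoreBytes: Long, editBytes: Long): (Long, Boolean) = {
    val total = memStoreBytes + editBytes
    if (total >= FlushThresholdBytes) (0L, true) // flushed; MemStore is empty again
    else (total, false)                          // still buffered in memory
  }
}
```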
4、 Please describe the differences and connections among Spark's cache, persist, and checkpoint
cache: one of the control operators; cache() = persist() = persist(StorageLevel.MEMORY_ONLY), i.e. a special case of persist. It is lazily evaluated: the first action populates the cache, and only subsequent actions benefit from it (code example below).
persist: one of the control operators; supports several persistence levels, the common ones being MEMORY_ONLY and MEMORY_AND_DISK.
checkpoint: mainly used to persist an RDD, writing its result to files in a reliable store; also lazily evaluated (code example below).
- All three are control operators that persist data in different forms: cache is memory-based, checkpoint is disk-based, and persist is the most general, supporting multiple storage levels.
/* Control operator cache(): lazy caching */
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CtrlCache {
  def main(args: Array[String]): Unit = {
    // Create the Spark connection
    val context = new SparkContext(new SparkConf().setMaster("local").setAppName("cache" + System.currentTimeMillis()))
    // Load the data
    val value: RDD[String] = context.textFile("src/main/resources/user.log")
    // Mark the RDD for caching (lazy: nothing is cached yet)
    value.cache()
    // First action: reads the file and populates the cache
    val start: Long = System.currentTimeMillis()
    val count: Long = value.count()
    val end: Long = System.currentTimeMillis()
    println("Rows: " + count + ", elapsed: " + (end - start) + " ms")
    // Second action: served from the cache, so it should be much faster
    val start1: Long = System.currentTimeMillis()
    val count1: Long = value.count()
    val end1: Long = System.currentTimeMillis()
    println("Rows: " + count1 + ", elapsed: " + (end1 - start1) + " ms")
    context.stop()
  }
}

/* Control operator checkpoint */
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CheckPoint {
  def main(args: Array[String]): Unit = {
    // Create the Spark connection
    val context = new SparkContext(new SparkConf().setMaster("local").setAppName("checkpoint" + System.currentTimeMillis()))
    // Load the data
    val value: RDD[String] = context.textFile("src/main/resources/user.log")
    // Set the checkpoint directory (must be set before checkpoint() is called)
    context.setCheckpointDir("./point")
    // Split the data into words
    val partiton: RDD[String] = value.flatMap(_.split(" "))
    // Print the number of partitions
    println("Number of partitions: " + partiton.getNumPartitions)
    // Mark the RDD for checkpointing (lazy)
    value.checkpoint()
    // Run an action: the checkpoint is written after the job completes
    value.count()
    context.stop()
  }
}

5、 What are the five properties of an RDD? Please list the commonly used RDD operators.
- Five properties:
① An RDD is composed of a group of partitions
② RDDs depend on each other (each RDD records its dependencies on its parent RDDs)
③ An RDD knows the preferred (optimal) locations for computing each partition
④ A partitioner applies to key-value RDDs
⑤ A compute function is applied to each partition
- Common RDD operators and functions:
– Transformation operators:
map: one element in, one element out; used for per-record processing such as parsing
flatMap: like map followed by flatten; often used to split records into multiple elements
sortByKey: used on key-value RDDs; sorts by key
reduceByKey: aggregates the values that share the same key
– Action operators:
count: returns the number of elements in the dataset
foreach: iterates over each element in the dataset
collect: brings the computed results back to the Driver
– Control operators: cache, persist, checkpoint (covered above)
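Most of these operators mirror methods on ordinary Scala collections, so their behavior can be tried out without a cluster. A rough analogy in plain Scala (not RDD code; groupBy plus a size count stands in for reduceByKey here):

```scala
// Plain-Scala analogs of common RDD operators (word count on two lines).
object OperatorDemo {
  val lines = List("a b", "b c")

  // map: one element in, one element out
  val upper: List[String] = lines.map(_.toUpperCase)

  // flatMap: map then flatten (here: split each line into words)
  val words: List[String] = lines.flatMap(_.split(" "))

  // reduceByKey analog: group identical keys, then reduce each group
  val wordCount: Map[String, Int] =
    words.groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // action analogs: count and collect have direct collection counterparts
  val count: Int = words.length
}
```

On an RDD the same word count would be `value.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)`, with `collect()` bringing the result back to the Driver.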
6、 What are wide and narrow dependencies in Spark, and what are they used for?
Wide dependency: the partitions of the parent RDD map to the partitions of the child RDD one-to-many, which causes a shuffle.
Narrow dependency: the partitions of the parent RDD map to the partitions of the child RDD one-to-one or many-to-one, producing no shuffle.
Role: Spark uses wide dependencies to divide the lineage into stages.
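A tiny sketch of the stage rule above (plain Scala, illustrative only): each wide dependency cuts the lineage, so the number of stages is the number of wide dependencies plus one:

```scala
// Sketch: Spark cuts the lineage into stages at wide (shuffle) dependencies.
object StageDemo {
  sealed trait Dep
  case object Narrow extends Dep // e.g. map, filter: no shuffle
  case object Wide   extends Dep // e.g. reduceByKey: causes a shuffle

  // Each wide dependency starts a new stage.
  def stageCount(lineage: Seq[Dep]): Int = lineage.count(_ == Wide) + 1
}
```

For example, `textFile -> flatMap -> map -> reduceByKey` has one wide dependency (the reduceByKey shuffle), hence two stages.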