Knowledge Review
2022-07-06 23:23:00 【Daily Xiaoxin】
1、 Please briefly describe HBase's data structure and storage architecture
Data structure: HBase's data model consists of namespaces, row keys, column families, columns, timestamps, and data cells.
- Namespace: analogous to a database in a relational system; a space that holds tables
- Row key: the rowkey, the unique identifier of a row
- Column family: a broad grouping of columns stored together; the set of families is fixed when the table is created
- Column: a column family can contain multiple columns, and columns can be added dynamically
- Timestamp: every update writes a new version with a new timestamp; the latest data is read via the timestamp, which also works around the fact that HDFS files cannot be modified in place
- Cell: the actual stored data; in HBase, all cell data is stored as (byte-array) strings
Storage architecture: Client, Master (master node), HRegionServer (maintains HRegions), HLog (write-ahead log), HRegion (maintains several Stores), Store (the actual data storage), StoreFile (files persisted to HDFS), MemStore (in-memory storage), and ZooKeeper (monitors the cluster and stores the location of the meta table)
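To make the logical model above concrete, here is a minimal sketch in plain Scala (an illustration only, not the HBase client API; all names are invented for this example). A cell is addressed by row key, column family, column qualifier, and timestamp, and a read returns the newest version:

```scala
// Sketch of HBase's logical data model (illustrative, not the real client API).
object HBaseModel {
  // A cell is addressed by (row key, column family, column qualifier, timestamp).
  final case class Cell(rowKey: String, family: String, qualifier: String,
                        timestamp: Long, value: String)

  // A read returns the newest version of the addressed cell, if any.
  def latest(cells: Seq[Cell], rowKey: String,
             family: String, qualifier: String): Option[String] =
    cells.filter(c => c.rowKey == rowKey && c.family == family && c.qualifier == qualifier)
      .sortBy(-_.timestamp) // newest version first
      .headOption
      .map(_.value)
}
```

This mirrors how a newer timestamp shadows older versions of the same cell, the property that lets HBase "update" data on top of append-only HDFS files.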
2、 Please briefly describe how HBase reads data
The client first contacts ZooKeeper to find which HRegionServer holds the meta table, loads the meta table into memory, and looks up the HRegion that serves the requested rowkey. Because the data may span multiple HRegions, an HRegionScanner (with StoreScanners underneath) is created for each; each scanner first checks whether the data is in the MemStore, then scans the StoreFiles, and finally the merged result is returned.
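The lookup order above (MemStore first, then StoreFiles) can be sketched in plain Scala; this is an illustration of the ordering only, not real HBase code:

```scala
// Sketch of the read path's lookup order: the in-memory MemStore is checked
// first; only on a miss do we fall back to the on-disk StoreFiles.
object ReadPathDemo {
  def read(memStore: Map[String, String],
           storeFiles: Seq[Map[String, String]],
           rowKey: String): Option[String] =
    memStore.get(rowKey) // newest data lives in memory
      .orElse(storeFiles.flatMap(_.get(rowKey)).headOption) // else scan disk files
}
```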
3、 Please briefly describe how HBase writes data
The client first connects and obtains the location of the meta table from ZooKeeper, then looks up in meta the HRegion that serves the rowkey. The write is appended to the HLog (write-ahead log) and then written into the MemStore. When the MemStore reaches the flush threshold (128 MB by default), it is flushed to disk as a StoreFile; as StoreFiles accumulate, they are merged (compacted).
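The MemStore flush step can be sketched as follows (plain Scala, illustrative only; 128 MB is the default region flush size mentioned above):

```scala
// Sketch of the MemStore flush rule: writes accumulate in memory until the
// flush threshold is reached, then the buffer is written out as a StoreFile.
object WritePathDemo {
  val FlushThresholdBytes: Long = 128L * 1024 * 1024 // default flush size

  // Returns (new MemStore size, whether a flush to a StoreFile happened).
  def write(memStoreBytes: Long, editBytes: Long): (Long, Boolean) = {
    val total = memStoreBytes + editBytes
    if (total >= FlushThresholdBytes) (0L, true) // flushed; MemStore is empty again
    else (total, false)                          // still buffered in memory
  }
}
```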
4、 Please describe the differences and connections among Spark's cache, persist, and checkpoint
cache: one of the control operators; cache() = persist() = persist(StorageLevel.MEMORY_ONLY), i.e. a special case of persist. It is lazily evaluated: the first action populates the cache, and only subsequent actions benefit from it (code example below).
persist: one of the control operators; supports several persistence levels, the common ones being MEMORY_ONLY and MEMORY_AND_DISK.
checkpoint: mainly used to persist an RDD, writing its result to files in a reliable store; also lazily evaluated (code example below).
- All three are control operators that persist data in different forms: cache is memory-based, checkpoint is disk-based, and persist is the most general, supporting multiple storage levels.
/* Control operator cache(): lazy caching */
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CtrlCache {
  def main(args: Array[String]): Unit = {
    // Create the Spark connection
    val context = new SparkContext(new SparkConf().setMaster("local").setAppName("cache" + System.currentTimeMillis()))
    // Load the data
    val value: RDD[String] = context.textFile("src/main/resources/user.log")
    // Mark the RDD for caching (lazy: nothing is cached yet)
    value.cache()
    // First action: reads the file and populates the cache
    val start: Long = System.currentTimeMillis()
    val count: Long = value.count()
    val end: Long = System.currentTimeMillis()
    println("Rows: " + count + ", elapsed: " + (end - start) + " ms")
    // Second action: served from the cache, so it should be much faster
    val start1: Long = System.currentTimeMillis()
    val count1: Long = value.count()
    val end1: Long = System.currentTimeMillis()
    println("Rows: " + count1 + ", elapsed: " + (end1 - start1) + " ms")
    context.stop()
  }
}

/* Control operator checkpoint */
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CheckPoint {
  def main(args: Array[String]): Unit = {
    // Create the Spark connection
    val context = new SparkContext(new SparkConf().setMaster("local").setAppName("checkpoint" + System.currentTimeMillis()))
    // Load the data
    val value: RDD[String] = context.textFile("src/main/resources/user.log")
    // Set the checkpoint directory (must be set before checkpoint() is called)
    context.setCheckpointDir("./point")
    // Split the data into words
    val partiton: RDD[String] = value.flatMap(_.split(" "))
    // Print the number of partitions
    println("Number of partitions: " + partiton.getNumPartitions)
    // Mark the RDD for checkpointing (lazy)
    value.checkpoint()
    // Run an action: the checkpoint is written after the job completes
    value.count()
    context.stop()
  }
}

5、 What are the five properties of an RDD? Please list the commonly used RDD operators.
- Five properties:
① An RDD is composed of a group of partitions
② RDDs depend on each other (each RDD records its dependencies on its parent RDDs)
③ An RDD knows the preferred (optimal) locations for computing each partition
④ A partitioner applies to key-value RDDs
⑤ A compute function is applied to each partition
- Common RDD operators and functions:
– Transformation operators:
map: one element in, one element out; used for per-record processing such as parsing
flatMap: like map followed by flatten; often used to split records into multiple elements
sortByKey: used on key-value RDDs; sorts by key
reduceByKey: aggregates the values that share the same key
– Action operators:
count: returns the number of elements in the dataset
foreach: iterates over each element in the dataset
collect: brings the computed results back to the Driver
– Control operators: cache, persist, checkpoint (covered above)
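Most of these operators mirror methods on ordinary Scala collections, so their behavior can be tried out without a cluster. A rough analogy in plain Scala (not RDD code; groupBy plus a size count stands in for reduceByKey here):

```scala
// Plain-Scala analogs of common RDD operators (word count on two lines).
object OperatorDemo {
  val lines = List("a b", "b c")

  // map: one element in, one element out
  val upper: List[String] = lines.map(_.toUpperCase)

  // flatMap: map then flatten (here: split each line into words)
  val words: List[String] = lines.flatMap(_.split(" "))

  // reduceByKey analog: group identical keys, then reduce each group
  val wordCount: Map[String, Int] =
    words.groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // action analogs: count and collect have direct collection counterparts
  val count: Int = words.length
}
```

On an RDD the same word count would be `value.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)`, with `collect()` bringing the result back to the Driver.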
6、 What are wide and narrow dependencies in Spark, and what are they used for?
Wide dependency: the partitions of the parent RDD map to the partitions of the child RDD one-to-many, which causes a shuffle.
Narrow dependency: the partitions of the parent RDD map to the partitions of the child RDD one-to-one or many-to-one, producing no shuffle.
Role: Spark uses wide dependencies to divide the lineage into stages.
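A tiny sketch of the stage rule above (plain Scala, illustrative only): each wide dependency cuts the lineage, so the number of stages is the number of wide dependencies plus one:

```scala
// Sketch: Spark cuts the lineage into stages at wide (shuffle) dependencies.
object StageDemo {
  sealed trait Dep
  case object Narrow extends Dep // e.g. map, filter: no shuffle
  case object Wide   extends Dep // e.g. reduceByKey: causes a shuffle

  // Each wide dependency starts a new stage.
  def stageCount(lineage: Seq[Dep]): Int = lineage.count(_ == Wide) + 1
}
```

For example, `textFile -> flatMap -> map -> reduceByKey` has one wide dependency (the reduceByKey shuffle), hence two stages.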