Dataset and DataFrame in Spark SQL
2022-06-12 14:35:00 【A withered and yellow maple leaf】
Dataset
A Dataset is a strongly typed, type-safe data container. It offers both a structured query API and an imperative API similar to RDD's.
Consider the following query:
import org.apache.spark.sql.{Dataset, SparkSession}

// The case class that gives the Dataset its type (assumed here; not shown in the original post)
case class People(name: String, age: Int)

val spark: SparkSession = SparkSession.builder()
  .appName("hello")
  .master("local[6]")
  .getOrCreate()
import spark.implicits._

val dataset: Dataset[People] = spark.createDataset(Seq(People("zhangsan", 9), People("lisi", 15)))
// Approach 1: filter on the whole object
dataset.filter(item => item.age > 10).show()
// Approach 2: filter on a column
dataset.filter('age > 10).show()
// Approach 3: filter with a SQL-like expression
dataset.filter("age > 10").show()
The bottom layer of Dataset (InternalRow)
Under the hood, a Dataset processes a serialized form of its objects. By looking at the physical execution plan a Dataset generates, that is, the RDD it is finally executed as, you can tell what form of data the Dataset actually processes at the bottom.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow

val dataset: Dataset[People] = spark.createDataset(Seq(People("zhangsan", 9), People("lisi", 15)))
val internalRDD: RDD[InternalRow] = dataset.queryExecution.toRdd
dataset.queryExecution.toRdd exposes the RDD that the Dataset ultimately executes as, and the element type of that RDD is InternalRow. InternalRow, also called a Catalyst row, is the Dataset's underlying data structure. In other words, whatever the Dataset's type parameter is, Dataset[Person] or anything else, the data structure processed at the bottom is always InternalRow.
Therefore, before execution the Dataset's typed objects must be converted to InternalRow by an Encoder, and before results are handed back they must be converted from InternalRow back to typed objects by a Decoder.
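As a quick illustration (a minimal sketch, assuming the People case class defined earlier), you can inspect the Encoder Spark derives for a case class and the flat schema its objects are encoded into:

import org.apache.spark.sql.Encoders

// Encoders.product derives an Encoder for any case class (a Product type)
val peopleEncoder = Encoders.product[People]
// The schema shows the columnar form the objects are encoded into,
// e.g. StructType(StructField(name,StringType,true), StructField(age,IntegerType,false))
println(peopleEncoder.schema)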
- Dataset is a Spark SQL component whose bottom layer is still RDD.
- Dataset provides the ability to access individual fields of an object, instead of forcing you to operate on the whole object every time as RDD does.
- Dataset differs from RDD: converting a Dataset[T] to an RDD[T] requires converting the underlying InternalRow, which is a relatively heavyweight operation (see the sketch below).
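A minimal sketch of that cost difference, reusing the dataset from above: dataset.rdd decodes every InternalRow back into a People object, while queryExecution.toRdd exposes the raw rows without decoding.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow

// Heavier: each InternalRow is decoded back into a People object
val typedRDD: RDD[People] = dataset.rdd
// Lighter: the raw, undecoded representation the engine actually works on
val rawRDD: RDD[InternalRow] = dataset.queryExecution.toRdd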
DataFrame
DataFrame is SparkSQL's functional abstraction of a table in a relational database. Its purpose is to let Spark process large-scale structured data more easily. A DataFrame generally handles structured or semi-structured data, because schema information can be obtained for both kinds; in other words, a DataFrame carries Schema information, so you can operate on a DataFrame much like operating on a table.
A DataFrame consists of two parts: a collection of Row objects, where each Row object represents one row, and a Schema that describes the DataFrame's structure.
DataFrame supports the common SQL operations, for example select, filter, join, group, and sort; a sketch of these follows.
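A quick sketch of these operations (assuming a SparkSession with spark.implicits._ imported and the People case class from earlier; the cities data is made up for the example):

val people: DataFrame = Seq(People("zhangsan", 9), People("lisi", 15)).toDF()
val cities: DataFrame = Seq(("zhangsan", "Beijing"), ("lisi", "Shanghai")).toDF("name", "city")

people.select('name, 'age)
  .filter('age > 10)
  .join(cities, "name")       // join on the shared name column
  .groupBy('city)
  .count()                    // group + aggregate
  .sort('city)
  .show()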
Creating a DataFrame through implicit conversion
This approach essentially relies on the implicit conversions defined in SparkSession.
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder()
  .appName("hello")
  .master("local[6]")
  .getOrCreate()
// You must import the implicit conversions
// Note: spark here is not a package but the SparkSession object
import spark.implicits._

val peopleDF: DataFrame = Seq(People("zhangsan", 15), People("lisi", 15)).toDF()
- The toDF method can be used on both an RDD and a Seq (an RDD example appears after the next snippet).
- When creating a DataFrame from a collection, the collection may contain not only case classes but also plain data types; in that case, the DataFrame is created by specifying column names afterwards.
val spark: SparkSession = SparkSession.builder()
  .appName("hello")
  .master("local[6]")
  .getOrCreate()
import spark.implicits._

val df1: DataFrame = Seq("nihao", "hello").toDF("text")
/*
+-----+
| text|
+-----+
|nihao|
|hello|
+-----+
*/
df1.show()

val df2: DataFrame = Seq(("a", 1), ("b", 1)).toDF("word", "count")
/*
+----+-----+
|word|count|
+----+-----+
|   a|    1|
|   b|    1|
+----+-----+
*/
df2.show()
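As noted in the first bullet above, toDF works on an RDD as well once the implicits are in scope. A minimal sketch (assuming the same spark session and the People case class):

val peopleRDD = spark.sparkContext.parallelize(Seq(People("zhangsan", 9), People("lisi", 15)))
val df3: DataFrame = peopleRDD.toDF()
df3.show()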
- Creating a DataFrame from an external data source
val spark: SparkSession = SparkSession.builder()
  .appName("hello")
  .master("local[6]")
  .getOrCreate()

val df = spark.read
  .option("header", true)
  .csv("dataset/BeijingPM20100101_20151231.csv")
df.show(10)
df.printSchema()
- A DataFrame can be created not only from csv files, but also from a Table, JSON, Parquet, and other sources, for example:
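A sketch of a few other readers (the paths and table name are hypothetical):

val jsonDF = spark.read.json("dataset/people.json")
val parquetDF = spark.read.parquet("dataset/people.parquet")
val tableDF = spark.read.table("people")   // a table registered in the catalog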
Summary:
- 1. DataFrame is a functional component similar to a table in a relational database.
- 2. DataFrame generally handles structured and semi-structured data.
- 3. DataFrame carries the Schema information of its data objects.
- 4. A DataFrame can be operated on with the imperative API as well as with SQL.
- 5. A DataFrame can be created directly from an existing collection, or by reading an external data source.
Similarities and differences between Dataset and DataFrame
DataFrame is just Dataset[Row]
Similarities:
- 1. A Dataset can access data by column, and so can a DataFrame.
- 2. A Dataset's execution is optimized, and so is a DataFrame's (a sketch follows the list).
- 3. A Dataset has an imperative API and can also be queried with SQL; a DataFrame supports both styles of access as well.
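You can see the shared optimizer at work by printing the query plans. A sketch, reusing the dataset from the first example; both plans pass through the same Catalyst optimization phases:

dataset.filter('age > 10).explain(true)
dataset.toDF().filter('age > 10).explain(true)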
Differences:
- 1. A DataFrame represents a table that supports functional operations, while a Dataset represents something closer to an RDD; a Dataset can handle objects of any type.
- 2. A DataFrame stores Row objects, while a Dataset can store objects of any type.
- DataFrame is just Dataset[Row].
- A Dataset's type parameter can be any type.
val spark: SparkSession = SparkSession.builder()
  .appName("hello")
  .master("local[6]")
  .getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(People("zhangsan", 15), People("lisi", 15)).toDF()
val ds: Dataset[People] = Seq(People("zhangsan", 15), People("lisi", 15)).toDS()
3. A DataFrame is operated on in the same way as a Dataset, but for strongly typed operations the two handle different types, as the sketch below shows.
- When a DataFrame performs a strongly typed operation, for example the map operator, the type it processes is always Row.
- For a Dataset, whatever its type parameter is, that is exactly the type it processes.
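A minimal sketch, reusing the df and ds defined just above:

import org.apache.spark.sql.Row

// On a DataFrame, map always receives a Row; fields are fetched by name or index
val names1: Dataset[String] = df.map((row: Row) => row.getAs[String]("name"))
// On a Dataset[People], map receives a People object directly
val names2: Dataset[String] = ds.map(people => people.name)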
4. A DataFrame can only be type-checked at runtime, while a Dataset is type-checked both at compile time and at runtime (a sketch follows the list).
- 1. The data in a DataFrame is represented as Row: one Row per row of data, similar to a relational database.
- 2. When a DataFrame performs map and similar operations, it cannot work directly with a Scala object such as Person, so no compile-time check is possible.
- 3. A Dataset holds objects of a concrete class, for example Person, so in map and similar operations you pass in a concrete Scala object; calling a wrong method is caught at compile time.
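A sketch of the difference, again reusing df and ds (the failing lines are commented out):

// Compiles, but fails only at runtime: "age1" is not a column of df
// df.select("age1").show()

// Does not compile: Row has no member named `name`
// df.map(row => row.name)

// Checked at compile time: People is a concrete type, so typos are caught early
ds.map(people => people.name).show()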
What is Row?
- A Row object represents one row of data.
- Working with a Row is similar to working with Scala's Map type.
import org.apache.spark.sql.Row

// A People object represents one record
val p = People(name = "zhangsan", age = 10)
// The same record can also be represented by a Row object
val row = Row("zhangsan", 10)
// Access the contents of the Row by index
println(row.get(1))
println(row(1))
// A type can be specified when getting a value
println(row.getAs[Int](1))
// Row behaves like a case class here: it can be pattern matched
row match {
  case Row(name, age) => println(name, age)
}
Converting between DataFrame and Dataset
Code demonstration:
val spark: SparkSession = SparkSession.builder()
  .appName("hello")
  .master("local[6]")
  .getOrCreate()
import spark.implicits._

// DataFrame -> Dataset: attach a type with as[People]
val df: DataFrame = Seq(People("zhangsan", 15), People("lisi", 15)).toDF()
val ds_fdf: Dataset[People] = df.as[People]

// Dataset -> DataFrame: drop the type with toDF()
val ds: Dataset[People] = Seq(People("zhangsan", 15), People("lisi", 15)).toDS()
val df_fds: DataFrame = ds.toDF()
Summary:
- 1. A DataFrame is just a Dataset; they work the same way, and both support the API style and the SQL style of operation.
- 2. A DataFrame can only access data through expressions or by column; only a Dataset supports operating on the whole object.
- 3. The data in a DataFrame is represented as Row, the notion of a single row.
How to understand RDD, DataFrame and Dataset (summary)
Data structure: RDD
- RDD (Resilient Distributed Dataset) is Spark's most basic data abstraction. In the source code it is an abstract class representing an immutable, partitionable collection whose elements can be computed in parallel.
- RDD is type safe at compile time, but whether for communication within the cluster or for IO, both the structure and the data of each object must be serialized and deserialized. There is still considerable GC overhead, because objects are frequently created and destroyed.
Data structure: DataFrame
- Like RDD, DataFrame is a distributed data container, but it is more like a two-dimensional table in a database: besides the data it also records the data's structural information (the schema).
- DataFrame is also lazily executed, and its performance is higher than RDD's (mainly because the execution plan is optimized).
- Because every row of a DataFrame has the same structure, recorded in the schema, Spark can read the data through the schema. So for communication and IO only the data itself needs to be serialized and deserialized; the structural part does not.
- Spark can serialize data in binary form into off-heap memory, which is managed directly by the operating system and is therefore no longer subject to JVM limits or GC pauses. However, DataFrame is not type safe.
Data structure: Dataset
- Dataset is an extension of the DataFrame API and Spark's newest data abstraction, combining the advantages of RDD and DataFrame.
- DataFrame = Dataset[Row] (Row represents the table-structure information). A DataFrame knows only the field names, not the field types, whereas a Dataset is strongly typed: it knows both the fields and their types.
- A case class is used to define the structure of the data in a Dataset; each attribute name in the case class maps directly to a field name in the Dataset (a sketch follows the list).
- A Dataset has type-safety checks and also has DataFrame's query-optimization features. It also supports encoders and decoders, which make it possible to access off-heap data without deserializing the entire object, improving efficiency.
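A small sketch of the case-class-to-schema mapping (assuming a SparkSession with implicits imported and the People case class from earlier):

val ds = Seq(People("zhangsan", 9)).toDS()
ds.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)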
The conversions between RDD, DataFrame and Dataset are as follows. Suppose there is a case class:
case class Emp(name: String); then:
RDD to DataFrame: rdd.toDF("name")
RDD to Dataset: rdd.map(x => Emp(x)).toDS
DataFrame to Dataset: df.as[Emp]
DataFrame to RDD: df.rdd
Dataset to DataFrame: ds.toDF
Dataset to RDD: ds.rdd

To operate on an RDD as a DataFrame or a Dataset, you need to import the implicit conversions with import spark.implicits._, where spark is the name of the SparkSession object!
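Putting the whole conversion table together, here is a runnable sketch (assuming a local SparkSession; Emp is the case class named above):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

case class Emp(name: String)

val spark: SparkSession = SparkSession.builder()
  .appName("conversions")
  .master("local[6]")
  .getOrCreate()
import spark.implicits._

val rdd: RDD[String] = spark.sparkContext.parallelize(Seq("zhangsan", "lisi"))

val df: DataFrame = rdd.toDF("name")                 // RDD -> DataFrame
val ds: Dataset[Emp] = rdd.map(x => Emp(x)).toDS()   // RDD -> Dataset
val dsFromDf: Dataset[Emp] = df.as[Emp]              // DataFrame -> Dataset
val rddFromDf: RDD[Row] = df.rdd                     // DataFrame -> RDD (of Row)
val dfFromDs: DataFrame = ds.toDF()                  // Dataset -> DataFrame
val rddFromDs: RDD[Emp] = ds.rdd                     // Dataset -> RDD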