Dataset and DataFrame in Spark SQL
2022-06-12 14:35:00 【A withered and yellow maple leaf】
Dataset
A Dataset is a strongly typed, type-safe data container. It offers both a structured query API and an imperative API similar to RDD's.
Consider the following query:
import org.apache.spark.sql.{Dataset, SparkSession}

// The case class that gives the Dataset its type (assumed here; not shown in the original post)
case class People(name: String, age: Int)

val spark: SparkSession = SparkSession.builder()
  .appName("hello")
  .master("local[6]")
  .getOrCreate()
import spark.implicits._

val dataset: Dataset[People] = spark.createDataset(Seq(People("zhangsan", 9), People("lisi", 15)))
// Approach 1: filter on the whole object
dataset.filter(item => item.age > 10).show()
// Approach 2: filter on a column
dataset.filter('age > 10).show()
// Approach 3: filter with a SQL-like expression
dataset.filter("age > 10").show()
The bottom layer of Dataset (InternalRow)
Under the hood, a Dataset processes a serialized form of its objects. By looking at the physical execution plan a Dataset generates, that is, the RDD it is finally executed as, you can tell what form of data the Dataset actually processes at the bottom.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow

val dataset: Dataset[People] = spark.createDataset(Seq(People("zhangsan", 9), People("lisi", 15)))
val internalRDD: RDD[InternalRow] = dataset.queryExecution.toRdd
dataset.queryExecution.toRdd exposes the RDD that the Dataset ultimately executes as, and the element type of that RDD is InternalRow. InternalRow, also called a Catalyst row, is the Dataset's underlying data structure. In other words, whatever the Dataset's type parameter is, Dataset[Person] or anything else, the data structure processed at the bottom is always InternalRow.
Therefore, before execution the Dataset's typed objects must be converted to InternalRow by an Encoder, and before results are handed back they must be converted from InternalRow back to typed objects by a Decoder.
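As a quick illustration (a minimal sketch, assuming the People case class defined earlier), you can inspect the Encoder Spark derives for a case class and the flat schema its objects are encoded into:

import org.apache.spark.sql.Encoders

// Encoders.product derives an Encoder for any case class (a Product type)
val peopleEncoder = Encoders.product[People]
// The schema shows the columnar form the objects are encoded into,
// e.g. StructType(StructField(name,StringType,true), StructField(age,IntegerType,false))
println(peopleEncoder.schema)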
- Dataset is a Spark SQL component whose bottom layer is still RDD.
- Dataset provides the ability to access individual fields of an object, instead of forcing you to operate on the whole object every time as RDD does.
- Dataset differs from RDD: converting a Dataset[T] to an RDD[T] requires converting the underlying InternalRow, which is a relatively heavyweight operation (see the sketch below).
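A minimal sketch of that cost difference, reusing the dataset from above: dataset.rdd decodes every InternalRow back into a People object, while queryExecution.toRdd exposes the raw rows without decoding.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow

// Heavier: each InternalRow is decoded back into a People object
val typedRDD: RDD[People] = dataset.rdd
// Lighter: the raw, undecoded representation the engine actually works on
val rawRDD: RDD[InternalRow] = dataset.queryExecution.toRdd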
DataFrame
DataFrame is SparkSQL's functional abstraction of a table in a relational database. Its purpose is to let Spark process large-scale structured data more easily. A DataFrame generally handles structured or semi-structured data, because schema information can be obtained for both kinds; in other words, a DataFrame carries Schema information, so you can operate on a DataFrame much like operating on a table.
A DataFrame consists of two parts: a collection of Row objects, where each Row object represents one row, and a Schema that describes the DataFrame's structure.
DataFrame supports the common SQL operations, for example select, filter, join, group, and sort; a sketch of these follows.
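A quick sketch of these operations (assuming a SparkSession with spark.implicits._ imported and the People case class from earlier; the cities data is made up for the example):

val people: DataFrame = Seq(People("zhangsan", 9), People("lisi", 15)).toDF()
val cities: DataFrame = Seq(("zhangsan", "Beijing"), ("lisi", "Shanghai")).toDF("name", "city")

people.select('name, 'age)
  .filter('age > 10)
  .join(cities, "name")       // join on the shared name column
  .groupBy('city)
  .count()                    // group + aggregate
  .sort('city)
  .show()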
Creating a DataFrame through implicit conversion
This approach essentially relies on the implicit conversions defined in SparkSession.
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder()
  .appName("hello")
  .master("local[6]")
  .getOrCreate()
// You must import the implicit conversions
// Note: spark here is not a package but the SparkSession object
import spark.implicits._

val peopleDF: DataFrame = Seq(People("zhangsan", 15), People("lisi", 15)).toDF()
- The toDF method can be used on both an RDD and a Seq (an RDD example appears after the next snippet).
- When creating a DataFrame from a collection, the collection may contain not only case classes but also plain data types; in that case, the DataFrame is created by specifying column names afterwards.
val spark: SparkSession = SparkSession.builder()
  .appName("hello")
  .master("local[6]")
  .getOrCreate()
import spark.implicits._

val df1: DataFrame = Seq("nihao", "hello").toDF("text")
/*
+-----+
| text|
+-----+
|nihao|
|hello|
+-----+
*/
df1.show()

val df2: DataFrame = Seq(("a", 1), ("b", 1)).toDF("word", "count")
/*
+----+-----+
|word|count|
+----+-----+
|   a|    1|
|   b|    1|
+----+-----+
*/
df2.show()
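As noted in the first bullet above, toDF works on an RDD as well once the implicits are in scope. A minimal sketch (assuming the same spark session and the People case class):

val peopleRDD = spark.sparkContext.parallelize(Seq(People("zhangsan", 9), People("lisi", 15)))
val df3: DataFrame = peopleRDD.toDF()
df3.show()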
- Creating a DataFrame from an external data source
val spark: SparkSession = SparkSession.builder()
  .appName("hello")
  .master("local[6]")
  .getOrCreate()

val df = spark.read
  .option("header", true)
  .csv("dataset/BeijingPM20100101_20151231.csv")
df.show(10)
df.printSchema()
- A DataFrame can be created not only from csv files, but also from a Table, JSON, Parquet, and other sources, for example:
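A sketch of a few other readers (the paths and table name are hypothetical):

val jsonDF = spark.read.json("dataset/people.json")
val parquetDF = spark.read.parquet("dataset/people.parquet")
val tableDF = spark.read.table("people")   // a table registered in the catalog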
Summary:
- 1. DataFrame is a functional component similar to a table in a relational database.
- 2. DataFrame generally handles structured and semi-structured data.
- 3. DataFrame carries the Schema information of its data objects.
- 4. A DataFrame can be operated on with the imperative API as well as with SQL.
- 5. A DataFrame can be created directly from an existing collection, or by reading an external data source.
Similarities and differences between Dataset and DataFrame
DataFrame is just Dataset[Row]
Similarities:
- 1. A Dataset can access data by column, and so can a DataFrame.
- 2. A Dataset's execution is optimized, and so is a DataFrame's (a sketch follows the list).
- 3. A Dataset has an imperative API and can also be queried with SQL; a DataFrame supports both styles of access as well.
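You can see the shared optimizer at work by printing the query plans. A sketch, reusing the dataset from the first example; both plans pass through the same Catalyst optimization phases:

dataset.filter('age > 10).explain(true)
dataset.toDF().filter('age > 10).explain(true)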
Differences:
- 1. A DataFrame represents a table that supports functional operations, while a Dataset represents something closer to an RDD; a Dataset can handle objects of any type.
- 2. A DataFrame stores Row objects, while a Dataset can store objects of any type.
- DataFrame is just Dataset[Row].
- A Dataset's type parameter can be any type.
val spark: SparkSession = SparkSession.builder()
  .appName("hello")
  .master("local[6]")
  .getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(People("zhangsan", 15), People("lisi", 15)).toDF()
val ds: Dataset[People] = Seq(People("zhangsan", 15), People("lisi", 15)).toDS()
3. A DataFrame is operated on in the same way as a Dataset, but for strongly typed operations the two handle different types, as the sketch below shows.
- When a DataFrame performs a strongly typed operation, for example the map operator, the type it processes is always Row.
- For a Dataset, whatever its type parameter is, that is exactly the type it processes.
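A minimal sketch, reusing the df and ds defined just above:

import org.apache.spark.sql.Row

// On a DataFrame, map always receives a Row; fields are fetched by name or index
val names1: Dataset[String] = df.map((row: Row) => row.getAs[String]("name"))
// On a Dataset[People], map receives a People object directly
val names2: Dataset[String] = ds.map(people => people.name)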
4. A DataFrame can only be type-checked at runtime, while a Dataset is type-checked both at compile time and at runtime (a sketch follows the list).
- 1. The data in a DataFrame is represented as Row: one Row per row of data, similar to a relational database.
- 2. When a DataFrame performs map and similar operations, it cannot work directly with a Scala object such as Person, so no compile-time check is possible.
- 3. A Dataset holds objects of a concrete class, for example Person, so in map and similar operations you pass in a concrete Scala object; calling a wrong method is caught at compile time.
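A sketch of the difference, again reusing df and ds (the failing lines are commented out):

// Compiles, but fails only at runtime: "age1" is not a column of df
// df.select("age1").show()

// Does not compile: Row has no member named `name`
// df.map(row => row.name)

// Checked at compile time: People is a concrete type, so typos are caught early
ds.map(people => people.name).show()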
What is Row?
- A Row object represents one row of data.
- Working with a Row is similar to working with Scala's Map type.
import org.apache.spark.sql.Row

// A People object represents one record
val p = People(name = "zhangsan", age = 10)
// The same record can also be represented by a Row object
val row = Row("zhangsan", 10)
// Access the contents of the Row by index
println(row.get(1))
println(row(1))
// A type can be specified when getting a value
println(row.getAs[Int](1))
// Row behaves like a case class here: it can be pattern matched
row match {
  case Row(name, age) => println(name, age)
}
Converting between DataFrame and Dataset
Code demonstration:
val spark: SparkSession = SparkSession.builder()
  .appName("hello")
  .master("local[6]")
  .getOrCreate()
import spark.implicits._

// DataFrame -> Dataset: attach a type with as[People]
val df: DataFrame = Seq(People("zhangsan", 15), People("lisi", 15)).toDF()
val ds_fdf: Dataset[People] = df.as[People]

// Dataset -> DataFrame: drop the type with toDF()
val ds: Dataset[People] = Seq(People("zhangsan", 15), People("lisi", 15)).toDS()
val df_fds: DataFrame = ds.toDF()
Summary:
- 1. A DataFrame is just a Dataset; they work the same way, and both support the API style and the SQL style of operation.
- 2. A DataFrame can only access data through expressions or by column; only a Dataset supports operating on the whole object.
- 3. The data in a DataFrame is represented as Row, the notion of a single row.
How to understand RDD, DataFrame and Dataset (summary)
Data structure: RDD
- RDD (Resilient Distributed Dataset) is Spark's most basic data abstraction. In the source code it is an abstract class representing an immutable, partitionable collection whose elements can be computed in parallel.
- RDD is type safe at compile time, but whether for communication within the cluster or for IO, both the structure and the data of each object must be serialized and deserialized. There is still considerable GC overhead, because objects are frequently created and destroyed.
Data structure: DataFrame
- Like RDD, DataFrame is a distributed data container, but it is more like a two-dimensional table in a database: besides the data it also records the data's structural information (the schema).
- DataFrame is also lazily executed, and its performance is higher than RDD's (mainly because the execution plan is optimized).
- Because every row of a DataFrame has the same structure, recorded in the schema, Spark can read the data through the schema. So for communication and IO only the data itself needs to be serialized and deserialized; the structural part does not.
- Spark can serialize data in binary form into off-heap memory, which is managed directly by the operating system and is therefore no longer subject to JVM limits or GC pauses. However, DataFrame is not type safe.
Data structure: Dataset
- Dataset is an extension of the DataFrame API and Spark's newest data abstraction, combining the advantages of RDD and DataFrame.
- DataFrame = Dataset[Row] (Row represents the table-structure information). A DataFrame knows only the field names, not the field types, whereas a Dataset is strongly typed: it knows both the fields and their types.
- A case class is used to define the structure of the data in a Dataset; each attribute name in the case class maps directly to a field name in the Dataset (a sketch follows the list).
- A Dataset has type-safety checks and also has DataFrame's query-optimization features. It also supports encoders and decoders, which make it possible to access off-heap data without deserializing the entire object, improving efficiency.
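A small sketch of the case-class-to-schema mapping (assuming a SparkSession with implicits imported and the People case class from earlier):

val ds = Seq(People("zhangsan", 9)).toDS()
ds.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)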
The conversions between RDD, DataFrame and Dataset are as follows. Suppose there is a case class:
case class Emp(name: String); then:
RDD to DataFrame: rdd.toDF("name")
RDD to Dataset: rdd.map(x => Emp(x)).toDS
DataFrame to Dataset: df.as[Emp]
DataFrame to RDD: df.rdd
Dataset to DataFrame: ds.toDF
Dataset to RDD: ds.rdd

To operate on an RDD as a DataFrame or a Dataset, you need to import the implicit conversions with import spark.implicits._, where spark is the name of the SparkSession object!
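Putting the whole conversion table together, here is a runnable sketch (assuming a local SparkSession; Emp is the case class named above):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

case class Emp(name: String)

val spark: SparkSession = SparkSession.builder()
  .appName("conversions")
  .master("local[6]")
  .getOrCreate()
import spark.implicits._

val rdd: RDD[String] = spark.sparkContext.parallelize(Seq("zhangsan", "lisi"))

val df: DataFrame = rdd.toDF("name")                 // RDD -> DataFrame
val ds: Dataset[Emp] = rdd.map(x => Emp(x)).toDS()   // RDD -> Dataset
val dsFromDf: Dataset[Emp] = df.as[Emp]              // DataFrame -> Dataset
val rddFromDf: RDD[Row] = df.rdd                     // DataFrame -> RDD (of Row)
val dfFromDs: DataFrame = ds.toDF()                  // Dataset -> DataFrame
val rddFromDs: RDD[Emp] = ds.rdd                     // Dataset -> RDD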