Spark data format: UnsafeRow

2022-07-26 17:48:00 【InfoQ】

1. Introduction

UnsafeRow is a subclass of InternalRow. It represents a mutable binary row format backed by raw memory. Put simply, an UnsafeRow represents one row of records and is used in place of Java objects (it is part of Project Tungsten and reduces memory usage and GC overhead).

InternalRow: an abstract class that Spark SQL uses internally to represent rows. The corresponding externally facing row types are org.apache.spark.sql.Row/GenericRow/GenericRowWithSchema.

UnsafeRow is the underlying data model of Dataset; rows are encoded and decoded through an Encoder.

2. Class fields

private Object baseObject; // The whole row is stored in this object, usually a byte array (byte[]). Open question: in which cases is it another type?

private long baseOffset; // Although baseObject is an array, it is still a Java object. baseOffset records the memory taken by the object header of baseObject's type; for array objects on a 64-bit JVM this is generally 16

private int numFields; // The number of fields in the row

private int sizeInBytes; // The number of bytes occupied by the current row = total capacity of baseObject - baseOffset - unused capacity. If the row has variable-length fields such as strings, the allocated memory may be larger than what is actually used

private int bitSetWidthInBytes; // The number of bytes used to mark null fields. Each field takes 1 bit and the width is rounded up to whole 8-byte words, so up to 64 fields take 8 bytes, 65-128 fields take 16 bytes, and so on

public static final Set<DataType> mutableFieldTypes; // The field types that can be modified in place in an UnsafeRow. Values of these types are stored at a fixed position with a fixed length in baseObject, so they can be overwritten. The mutable types are: NullType, BooleanType, ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DateType, TimestampType, DecimalType

3. Memory layout

null bit set: marks which fields are null, one bit per field. Its total size is given by bitSetWidthInBytes: size = ((number of fields + 63) / 64) * 8;

values: in this region every field occupies 8 bytes, reserved for each field at initialization. For a field of a mutable type (mutableFieldTypes), the value itself is stored in the slot. For a field of an immutable (variable-length) type, the slot holds only an offset (relative to baseOffset, not to the base address of baseObject) and a size, merged into one long (high 32 bits for the offset, low 32 bits for the size); the actual value is stored in the variable length portion.

variable length portion: the actual values of all immutable fields are stored here next to each other; some space may be left unused.

Is the uniform 8 bytes per field chosen for memory alignment, so that each field's offset is easy to compute? The cost is that types such as ShortType also take 8 bytes, which wastes some memory.
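The layout arithmetic above can be sketched in plain Java. This is an illustration only, not Spark's code: the class and method names (UnsafeRowLayout, fieldSlotOffset, packOffsetAndSize and so on) are hypothetical, and offsets are counted from the start of the row data rather than from baseOffset.

```java
public class UnsafeRowLayout {
    // Width of the null bit set: one bit per field, rounded up to whole 8-byte words.
    public static int bitSetWidthInBytes(int numFields) {
        return ((numFields + 63) / 64) * 8;
    }

    // Offset of a field's 8-byte slot in the values region,
    // relative to the start of the row data (assumption of this sketch).
    public static long fieldSlotOffset(int numFields, int ordinal) {
        return bitSetWidthInBytes(numFields) + 8L * ordinal;
    }

    // Pack a relative offset (high 32 bits) and a size (low 32 bits) into one long.
    public static long packOffsetAndSize(int offset, int size) {
        return ((long) offset << 32) | (size & 0xFFFFFFFFL);
    }

    public static int unpackOffset(long offsetAndSize) {
        return (int) (offsetAndSize >> 32);
    }

    public static int unpackSize(long offsetAndSize) {
        return (int) offsetAndSize;
    }

    public static void main(String[] args) {
        System.out.println(bitSetWidthInBytes(3));   // null bit set width for 3 fields
        System.out.println(fieldSlotOffset(3, 2));   // slot of the third field
        long packed = packOffsetAndSize(32, 20);
        System.out.println(unpackOffset(packed) + " " + unpackSize(packed));
    }
}
```

Because every slot is 8 bytes wide, locating a field needs only one multiplication and one addition, whatever the field's type.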

4. How an UnsafeRow is created

The following code produces an UnsafeRow:

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class Person(id: Long, id2: Long, id3: String)
val e = Encoders.product[Person]
val personExprEncoder = e.asInstanceOf[ExpressionEncoder[Person]]
val person = Person(2, 7, "abcdefghijklmnopqrst")
// toRow is the ExpressionEncoder API in Spark 2.x
val row = personExprEncoder.toRow(person) // an UnsafeRow whose baseObject is a byte[64]; why 64 is analysed below
println(row.getLong(0))
println(row.getString(2))

Stepping into the toRow method shows that the UnsafeRow is produced by an UnsafeProjection:

abstract class UnsafeProjection extends Projection {
 override def apply(row: InternalRow): UnsafeRow
}

UnsafeProjection is an abstract class. The concrete subclass used here, SpecificUnsafeProjection, is not written by hand: it is generated dynamically and instantiated by GenerateUnsafeProjection#create:

class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {

private Object[] references;
private boolean resultIsNull_0;
private boolean globalIsNull_0;
private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder[] mutableStateArray_2 = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder[1];
private java.lang.String[] mutableStateArray_0 = new java.lang.String[1];
private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] mutableStateArray_3 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];
private UnsafeRow[] mutableStateArray_1 = new UnsafeRow[1]; 

public SpecificUnsafeProjection(Object[] references) {
this.references = references;
mutableStateArray_1[0] = new UnsafeRow(3); // create the UnsafeRow instance with 3 fields: id, id2, id3
mutableStateArray_2[0] = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(mutableStateArray_1[0], 32);
mutableStateArray_3[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(mutableStateArray_2[0], 3);

}
 public UnsafeRow apply(InternalRow i) {
mutableStateArray_2[0].reset(); 
mutableStateArray_3[0].zeroOutNullBytes();
writeFields_0_0(i); 
writeFields_0_1(i);
mutableStateArray_1[0].setTotalSize(mutableStateArray_2[0].totalSize());
return mutableStateArray_1[0];
}
// Write field id3
private void writeFields_0_1(InternalRow i) {
UTF8String value_13 = StaticInvoke_0(i);
if (globalIsNull_0) {
mutableStateArray_3[0].setNullAt(2);
} else {
mutableStateArray_3[0].write(2, value_13);
}
}
 // Write fields id and id2
private void writeFields_0_0(InternalRow i) {
boolean isNull_3 = i.isNullAt(0);
com.test.scala.EncoderScala$Person value_3 = isNull_3 ? null : ((com.test.scala.EncoderScala$Person) i.get(0, null));
long value_0 = value_3.id();
if (isNull_0) {
mutableStateArray_3[0].setNullAt(0);
} else {
mutableStateArray_3[0].write(0, value_0);
}
com.test.scala.EncoderScala$Person value_7 = isNull_7 ? null : ((com.test.scala.EncoderScala$Person) i.get(0, null));
long value_4 = value_7.id2();
if (isNull_4) {
mutableStateArray_3[0].setNullAt(1);
} else {
mutableStateArray_3[0].write(1, value_4);
}
}
}

Only part of the generated code is kept here. As you can see, when the UnsafeRow instance is created it is given only the number of attributes of Person, which is 3. It is then passed as a constructor argument to BufferHolder, a helper class for initializing the UnsafeRow that grows the memory dynamically and tracks how much of it is actually used (cursor).

public BufferHolder(UnsafeRow row, int initialSize) {
 int bitsetWidthInBytes = UnsafeRow.calculateBitSetWidthInBytes(row.numFields());
 if (row.numFields() > (ARRAY_MAX - initialSize - bitsetWidthInBytes) / 8) {
 throw new UnsupportedOperationException(
 "Cannot create BufferHolder for input UnsafeRow because there are " +
 "too many fields (number of fields: " + row.numFields() + ")");
 }
 this.fixedSize = bitsetWidthInBytes + 8 * row.numFields(); // size of the fixed-length part
 this.buffer = new byte[fixedSize + initialSize]; // this array becomes UnsafeRow.baseObject
 this.row = row;
 this.row.pointTo(buffer, buffer.length);
}

The initialSize passed in is 32. This value is produced in GenerateUnsafeProjection#createCode as numVarLenFields * 32, that is, 32 bytes are reserved for each variable-length field (32 is only an estimate: in the sample code the value of id3 takes only 20 bytes, and if the initial size is not enough the memory is grown dynamically).

 Initial memory = fixedSize + initialSize
 = (bitsetWidthInBytes + 8 * number of fields) + (number of variable-length fields * 32)
 = (8 + 8 * 3) + 1 * 32
 = 64
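The same calculation as a small Java sketch. The class and method names are hypothetical, and the 32-byte estimate per variable-length field is taken from the text above rather than read from Spark at run time.

```java
public class InitialBufferSize {
    // One bit per field, rounded up to whole 8-byte words.
    public static int bitSetWidthInBytes(int numFields) {
        return ((numFields + 63) / 64) * 8;
    }

    // fixedSize + initialSize, following the formula in the text.
    public static int initialBufferSize(int numFields, int numVarLenFields) {
        int fixedSize = bitSetWidthInBytes(numFields) + 8 * numFields;
        int initialSize = numVarLenFields * 32; // estimate per variable-length field
        return fixedSize + initialSize;
    }

    public static void main(String[] args) {
        // Person(id: Long, id2: Long, id3: String): 3 fields, 1 of them variable-length
        System.out.println(initialBufferSize(3, 1)); // 64
    }
}
```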

The cursor field of BufferHolder records the offset up to which memory is currently in use. After the object is built, cursor is reset to

baseOffset+fixedSize

At this point the UnsafeRow instance has been created and its initial memory allocated. The next step is to write the values of the three fields id, id2 and id3 into the UnsafeRow, that is

SpecificUnsafeProjection#writeFields_0_0/writeFields_0_1->UnsafeRowWriter#write

For a field of a mutable type, such as the 1st field id, the absolute offset is computed first:
offset=baseOffset + bitSetWidthInBytes + 0 * 8L
The value is then written directly at that position, which is the first 8-byte slot of the values region.

For a field of an immutable type, such as the 3rd field id3, the write goes as follows:

public void write(int ordinal, UTF8String input) {
 final int numBytes = input.numBytes(); // byte length of id3: 20 letters take 20 bytes
 final int roundedSize = ByteArrayMethods.roundNumberOfBytesToNearestWord(numBytes); // round up to a multiple of 8: 20 -> 24, so 24 bytes are reserved for the value of id3
 holder.grow(roundedSize); // grow the buffer if needed; initialSize is 32 >= 24, so no growth this time
 zeroOutPaddingBytes(numBytes);
 input.writeToMemory(holder.buffer, holder.cursor); // id3 is the first immutable field, so cursor points at the start of the variable length portion, 48
 setOffsetAndSize(ordinal, numBytes); // store the relative offset of id3 = (cursor - baseOffset) = 32 and size = numBytes = 20
 holder.cursor += roundedSize; // advance cursor by 24 bytes, to where the next immutable field would start
}

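Why does id3 store a relative offset, to which baseOffset is added again when the value is read, instead of storing the absolute offset directly? A likely reason is that relative offsets keep the row's bytes position-independent: they stay valid when the bytes are copied into another buffer or deserialized into a new array.

The UnsafeRow is now fully initialized. Given the values above, its memory (offsets relative to baseOffset) should look as follows:

 bytes 0-7 null bit set (all zero, no field is null)
 bytes 8-15 id = 2
 bytes 16-23 id2 = 7
 bytes 24-31 id3 slot: offset = 32 (high 32 bits), size = 20 (low 32 bits)
 bytes 32-51 id3 value "abcdefghijklmnopqrst"
 bytes 52-63 zero padding and unused space

With the memory layout and the write path in mind, reading is easy to understand: for mutable and immutable fields alike, the offset is determined first and the value is then read from memory.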
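To make the write and read paths concrete, here is a minimal sketch that lays out the same three fields in a plain byte[]. It is not Spark's implementation: MiniUnsafeRow and its methods are hypothetical, offsets are relative to index 0 of the array instead of baseOffset, the buffer is sized exactly instead of using the 32-byte estimate, and the byte order is simply ByteBuffer's default.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class MiniUnsafeRow {
    static final int NUM_FIELDS = 3;
    static final int BIT_SET_WIDTH = ((NUM_FIELDS + 63) / 64) * 8; // 8 bytes
    static final int FIXED_SIZE = BIT_SET_WIDTH + 8 * NUM_FIELDS;  // 32 bytes

    // Round a byte count up to the next multiple of 8.
    public static int roundUpToWord(int numBytes) {
        return (numBytes + 7) & ~7;
    }

    // Encode (id, id2, id3): null bit set, three 8-byte slots, then the string bytes.
    public static byte[] encode(long id, long id2, String id3) {
        byte[] str = id3.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(FIXED_SIZE + roundUpToWord(str.length)); // zero-filled
        buf.putLong(BIT_SET_WIDTH, id);          // slot 0 holds the value itself
        buf.putLong(BIT_SET_WIDTH + 8, id2);     // slot 1 holds the value itself
        int cursor = FIXED_SIZE;                 // start of the variable length portion
        buf.putLong(BIT_SET_WIDTH + 16, ((long) cursor << 32) | str.length); // slot 2: offset and size
        buf.position(cursor);
        buf.put(str);                            // the padding bytes after it stay zero
        return buf.array();
    }

    // Read a fixed-length field straight from its slot.
    public static long readLong(byte[] row, int ordinal) {
        return ByteBuffer.wrap(row).getLong(BIT_SET_WIDTH + 8 * ordinal);
    }

    // Read a variable-length field: unpack offset and size from the slot, then copy the bytes.
    public static String readString(byte[] row, int ordinal) {
        long offsetAndSize = readLong(row, ordinal);
        int offset = (int) (offsetAndSize >> 32);
        int size = (int) offsetAndSize;
        return new String(row, offset, size, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] row = encode(2L, 7L, "abcdefghijklmnopqrst");
        System.out.println(row.length);
        System.out.println(readLong(row, 0));
        System.out.println(readString(row, 2));
    }
}
```

Both reads follow the pattern described above: compute the slot offset, then either take the 8 bytes as the value or follow the stored offset and size into the variable length portion.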

5. Serialization

UnsafeRow implements Java's Externalizable interface and Kryo's KryoSerializable interface:

@Override
public void writeExternal(ObjectOutput out) throws IOException {
 byte[] bytes = getBytes();
 out.writeInt(bytes.length);
 out.writeInt(this.numFields);
 out.write(bytes);
}
@Override
public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
 this.baseOffset = BYTE_ARRAY_OFFSET;
 this.sizeInBytes = in.readInt();
 this.numFields = in.readInt();
 this.bitSetWidthInBytes = calculateBitSetWidthInBytes(numFields);
 this.baseObject = new byte[sizeInBytes];
 in.readFully((byte[]) baseObject);
}

Both the serialization and the deserialization method do their I/O directly on the byte array, so no Java object graph has to be converted into a byte stream, which greatly reduces the cost of serialization.

On serialization, the UnsafeRow object itself does not need to be turned into a binary stream: the byte array in baseObject is written to the output stream as it is.

On deserialization, the byte array is likewise read straight from the input stream into the UnsafeRow object.
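The same pattern can be tried outside Spark with a small class that wraps a byte array. This is a sketch under stated assumptions: RawRow and roundTrip are hypothetical names, and only the length, numFields and raw bytes are written, mirroring the writeExternal/readExternal pair shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

public class RawRow implements Externalizable {
    private byte[] bytes;
    private int numFields;

    public RawRow() { }  // public no-arg constructor required by Externalizable

    public RawRow(byte[] bytes, int numFields) {
        this.bytes = bytes;
        this.numFields = numFields;
    }

    public byte[] bytes() { return bytes; }
    public int numFields() { return numFields; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(bytes.length);  // size first, so the reader can allocate the array
        out.writeInt(numFields);
        out.write(bytes);            // the row bytes go to the stream unchanged
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        int size = in.readInt();
        numFields = in.readInt();
        bytes = new byte[size];
        in.readFully(bytes);         // read the row bytes straight into the new array
    }

    // Serialize and deserialize through Java serialization, wrapping checked exceptions.
    public static RawRow roundTrip(RawRow row) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(row);
            }
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
                return (RawRow) ois.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        RawRow copy = roundTrip(new RawRow(new byte[] {1, 2, 3, 4, 5, 6, 7, 8}, 1));
        System.out.println(copy.numFields() + " field(s), " + copy.bytes().length + " bytes");
    }
}
```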

6. Summary

Data is stored in byte arrays, which reduces the number of Java objects and the extra memory overhead they carry.

Fewer Java objects also means lower GC cost.

When shuffle data is sent over the network, the rows need no object serialization or deserialization, and the amount of data transferred is also much smaller.

7. References

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-UnsafeRow.html

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Encoder.html

https://zhuanlan.zhihu.com/p/160799966
