
Implementation of Spark lazy file listing

2022-07-27 15:38:00 【wankunde】

Background

For a Spark partitioned table, when HadoopFsRelation is generated and either partitionKeyFilters or subqueryFilters is non-empty, the relation's location: FileIndex attribute is a LazyFileIndex. The LazyFileIndex is converted to an InMemoryFileIndex only just before FileSourceScanExec calls listFiles.
However, if the partition filter of a partitioned table contains a subquery, Spark considers the filter impossible to push down and skips LazyFileIndex. prunePartitions, called during listFiles, then does not filter out any partitions, which leads to a lot of useless work.

In other words, if the partition filter is a subquery, all partitions are fetched by default and only then filtered.
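
The effect of a subquery filter on pruning can be illustrated with a small toy model. This is only a sketch: PartitionFilter, Equals, InSubquery and FilterSplitDemo are hypothetical names invented for illustration and are not part of Spark's Expression API.

```scala
// Toy model, not Spark's Expression API: every name here is hypothetical.
sealed trait PartitionFilter
final case class Equals(column: String, value: String) extends PartitionFilter
final case class InSubquery(column: String, subquery: String) extends PartitionFilter

object FilterSplitDemo {
  // Split filters into those that can be evaluated at planning time
  // and those that need a subquery result first.
  def split(filters: Seq[PartitionFilter]): (Seq[PartitionFilter], Seq[PartitionFilter]) =
    filters.partition {
      case _: InSubquery => false
      case _             => true
    }

  // Planning-time pruning: only the pushable filters can remove partitions.
  def prune(
      partitions: Seq[Map[String, String]],
      filters: Seq[PartitionFilter]): Seq[Map[String, String]] = {
    val (pushable, _) = split(filters)
    partitions.filter { p =>
      pushable.forall {
        case Equals(c, v) => p.get(c).contains(v)
        case _            => true
      }
    }
  }
}
```

With two partitions and only an InSubquery filter, prune returns both partitions, which mirrors the "fetch everything, filter later" behaviour described above.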

Generating HadoopFsRelation

In DataSourceStrategy, the FindDataSourceTable rule tries to resolve the CatalogTable held in an UnresolvedCatalogRelation. The DataSource.resolveRelation() method then tries to convert the table node into a HadoopFsRelation. If the table passed to DataSource is a partitioned table, fileCatalog is a CatalogFileIndex, which makes later partition pruning possible. Otherwise, the paths in the DataSource are traversed directly to build an InMemoryFileIndex.

// DataSource.resolveRelation()
      case (format: FileFormat, _) =>
        val useCatalogFileIndex = sparkSession.sqlContext.conf.manageFilesourcePartitions &&
          catalogTable.isDefined && catalogTable.get.tracksPartitionsInCatalog &&
          catalogTable.get.partitionColumnNames.nonEmpty
        val (fileCatalog, dataSchema, partitionSchema) = if (useCatalogFileIndex) {
          // Partitioned catalog table: use CatalogFileIndex so partitions can be pruned later.
          val defaultTableSize = sparkSession.sessionState.conf.defaultSizeInBytes
          val index = new CatalogFileIndex(
            sparkSession,
            catalogTable.get,
            catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(defaultTableSize))
          (index, catalogTable.get.dataSchema, catalogTable.get.partitionSchema)
        } else {
          // Otherwise glob the input paths and build an InMemoryFileIndex over them.
          val globbedPaths = checkAndGlobPathIfNecessary(
            checkEmptyGlobPath = true, checkFilesExist = checkFilesExist)
          val index = createInMemoryFileIndex(globbedPaths)
          val (resultDataSchema, resultPartitionSchema) =
            getOrInferFileFormatSchema(format, () => index)
          (index, resultDataSchema, resultPartitionSchema)
        }

        HadoopFsRelation(
          fileCatalog,
          partitionSchema = partitionSchema,
          dataSchema = dataSchema.asNullable,
          bucketSpec = bucketSpec,
          format,
          caseInsensitiveOptions)(sparkSession)

Partition pruning on HadoopFsRelation

The community version of the code prunes the partitions of a HadoopFsRelation in the PruneFileSourcePartitions rule.
Because a partitioned table produces a CatalogFileIndex, the filter conditions in the plan that reference partition columns are used for partition pruning: val prunedFileIndex = catalogFileIndex.filterPartitions(partitionKeyFilters)
If the filter conditions contain a subquery, they cannot be pushed down to the Hive metastore, so a very large number of partitions (each with its own entry in rootPaths) is returned.
The pruned partitions are then wrapped in an InMemoryFileIndex. On instantiation, InMemoryFileIndex runs refresh0(), which lists the information of every file under rootPaths. With too many files, plan resolution slows down and a large amount of Driver memory is consumed.

// InMemoryFileIndex
  private def refresh0(): Unit = {
    // Eagerly list every leaf file under all root paths and cache the result.
    val files = listLeafFiles(rootPaths)
    cachedLeafFiles =
      new mutable.LinkedHashMap[Path, FileStatus]() ++= files.map(f => f.getPath -> f)
    cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent)
    cachedPartitionSpec = null
  }
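
To make the cost of eager listing concrete, here is a minimal sketch that counts directory listings. FakeFs and EagerIndex are hypothetical stand-ins (an in-memory map replaces HDFS); the only behaviour borrowed from the source is that all root paths are listed at construction time, before any filter is known.

```scala
// Toy model of eager listing; FakeFs and EagerIndex are hypothetical names.
final class FakeFs(tree: Map[String, Seq[String]]) {
  var listCalls = 0 // number of directories listed so far

  def list(dir: String): Seq[String] = {
    listCalls += 1
    tree.getOrElse(dir, Seq.empty)
  }
}

// Lists every root path in the constructor, as refresh0() does on instantiation.
final class EagerIndex(fs: FakeFs, rootPaths: Seq[String]) {
  private val leafFiles: Map[String, Seq[String]] =
    rootPaths.map(p => p -> fs.list(p)).toMap

  // Filtering happens only after everything has already been listed.
  def listFiles(keep: String => Boolean): Seq[String] =
    rootPaths.filter(keep).flatMap(p => leafFiles(p))
}
```

In this model, an index built over three root paths performs three listings even when the later filter keeps a single partition, and the cached map holds the files of all three.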

LazyFileIndex optimization

The Carmel version optimizes this: CatalogFileIndex.lazyFilterPartitions(filters: Seq[Expression]) returns a LazyFileIndex, a subclass of PartitioningAwareFileIndex that does not actually traverse HDFS.
FileSourceStrategy converts the HadoopFsRelation into a FileSourceScanExec. FileSourceScanExec has three lazy variables: selectedPartitions, dynamicallySelectedPartitions and inputRDD.
- selectedPartitions: returns the selected partitions via InMemoryFileIndex.listFiles()
- dynamicallySelectedPartitions: filters selectedPartitions again, using the partition filters that could not be pushed down
- inputRDD: processes the final partitions and builds the FileScanRDD
When selectedPartitions is accessed, the LazyFileIndex is automatically replaced with an InMemoryFileIndex and the HDFS traversal is performed.
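
The deferred traversal can be sketched with a Scala lazy val. This is a toy model under stated assumptions, not the Carmel implementation: LazyIndex, ListedIndex and Scan are hypothetical stand-ins for LazyFileIndex, InMemoryFileIndex and FileSourceScanExec, and it is assumed that the predicate handed to createFileIndex narrows the set of directories to traverse.

```scala
// Toy model only: LazyIndex, ListedIndex and Scan are hypothetical stand-ins
// for LazyFileIndex, InMemoryFileIndex and FileSourceScanExec.
final case class ListedIndex(files: Map[String, Seq[String]])

final class LazyIndex(rootPaths: Seq[String], lister: String => Seq[String]) {
  var materialized = false // flips when the file system is actually traversed

  // Counterpart of createFileIndex: traverse only the directories
  // that survive the predicate (an assumption of this sketch).
  def createFileIndex(keep: String => Boolean): ListedIndex = {
    materialized = true
    ListedIndex(rootPaths.filter(keep).map(p => p -> lister(p)).toMap)
  }
}

final class Scan(index: LazyIndex, keep: String => Boolean) {
  // Nothing is listed until this lazy val is read for the first time.
  lazy val selectedPartitions: Map[String, Seq[String]] =
    index.createFileIndex(keep).files
}
```

Constructing a Scan touches no directory; the first read of selectedPartitions triggers the traversal, and only for the directories the predicate keeps.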

FileIndex class hierarchy

FileIndex
    CatalogFileIndex
        def lazyFilterPartitions(filters: Seq[Expression]): PartitioningAwareFileIndex

    PartitioningAwareFileIndex
        LazyFileIndex
            def createFileIndex(predicates: Seq[Expression]): InMemoryFileIndex

        InMemoryFileIndex
            def listFiles(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[PartitionDirectory]

        MetadataLogFileIndex // not the focus here, ignore for now

Discussion

The point of the optimization is that the traversal of file directories is moved to after plan analysis. Two things need attention: the time spent traversing files on HDFS, and the memory occupied by the traversal results.
TODO …
