
New Exploration at Meta | Reducing Presto Latency with an Alluxio Data Cache

2022-06-10 14:31:00 InfoQ

Overview

Meta's (formerly "Facebook", hereafter "Meta") Presto team has been working with Alluxio to provide an open-source data caching solution for Presto. The solution is used across multiple Meta use cases to reduce the query latency caused by scanning data from remote data sources such as HDFS. Experiments show that with Alluxio data caching, both query latency and IO scan volume improve significantly.
We found that multiple use cases in Meta's architectural environment benefit from Alluxio data caching. For one internal Meta use case, query latency at each percentile dropped by 33% (P50), 54% (P75), and 48% (P95). In addition, IO from remote data source scans was reduced by 57%.

Presto architecture

Presto's architecture lets storage and compute scale independently, but scanning data in remote storage can be expensive and makes it hard to meet the low-latency requirements of interactive queries.
A Presto worker executes query plan fragments only against data scanned from an independent (usually remote) data source; it does not store any of that data itself, so compute can scale elastically.
The architecture diagram below shows the path for reading data from remote HDFS. Each worker reads data from the remote data source independently. This article discusses only the optimization of reads from remote data sources.
[Figure: Presto architecture, with each worker reading data from remote HDFS]

Presto + Data cache architecture

To meet the sub-second latency requirements of these use cases, we implemented a variety of optimizations, the most important of which is data caching. Data caching is a traditional optimization technique: it brings the working set closer to the compute nodes, reducing accesses to remote storage and thereby lowering latency and saving IO cost.
The difficulty is how to cache effectively when petabyte-scale data is accessed from remote data sources with no fixed access pattern. A further prerequisite for effective data caching is data affinity in a distributed environment such as Presto.
With data caching added, the Presto architecture looks as follows:
[Figure: Presto architecture with the Alluxio data cache added to each worker]
More details on data caching follow below.

Soft affinity scheduling

Presto's current scheduler already takes worker load into account when assigning splits, so the scheduling policy spreads the workload evenly across workers. From a data locality perspective, however, splits are assigned randomly, with no guarantee of affinity, and affinity is a prerequisite for effective data caching. It is critical for the coordinator to assign a given split to the same worker each time, because that worker may already have the split's data cached.
[Figure: affinity scheduling assigning splits to workers]
The figure above illustrates how affinity scheduling assigns splits to workers.
Under the soft affinity scheduling policy, the scheduler assigns the same split to the same worker whenever possible. The soft affinity scheduler uses the hash of the split to pick a preferred worker for that split (a sketch of this selection logic follows the figure below):
*  Compute the split's preferred worker. If the preferred worker has enough resources available, the scheduler assigns the split to it.
*  If the preferred worker is busy, the scheduler picks an alternative worker; if that alternative worker has enough resources available, the split is assigned to it.
*  If the alternative worker is also busy, the scheduler assigns the split to the currently least busy worker.
[Figure: soft affinity worker selection flow]
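As a rough illustration of the selection steps above, here is a minimal sketch, assuming a simple hash-plus-offset rule for the preferred and alternative candidates and a hypothetical NodeLoad interface for the busy checks; it is not Presto's actual scheduler code:

```java
import java.util.List;

// Illustrative sketch of hash-based soft affinity worker selection.
// NodeLoad and the "+ 1" alternative-candidate rule are assumptions.
public class SoftAffinitySketch {

    interface NodeLoad {
        boolean isBusy(String worker);          // exceeds the split limits below
        String leastBusy(List<String> workers); // fallback target
    }

    static String pickWorker(String splitIdentifier, List<String> workers, NodeLoad load) {
        int hash = splitIdentifier.hashCode();
        // Preferred worker: a deterministic function of the split's identity,
        // so the same split routes to the same worker across queries.
        String preferred = workers.get(Math.floorMod(hash, workers.size()));
        if (!load.isBusy(preferred)) {
            return preferred;
        }
        // Alternative worker: a second deterministic candidate.
        String alternative = workers.get(Math.floorMod(hash + 1, workers.size()));
        if (!load.isBusy(alternative)) {
            return alternative;
        }
        // Both candidates busy: fall back to the least busy worker. In this
        // case the scheduler also signals the worker NOT to cache the split.
        return load.leastBusy(workers);
    }
}
```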
Whether a node is busy is determined by two configuration properties:
*  Maximum number of splits per node: node-scheduler.max-splits-per-node
*  Maximum number of pending splits per task: node-scheduler.max-pending-splits-per-task
When the number of splits on a node exceeds either of these limits, the node is considered busy. As this shows, node affinity is critical to cache effectiveness: without it, the same split may be processed by different workers at different times, leading to redundant caching of the split's data.
For this reason, if the affinity scheduler fails to assign a split to its preferred or alternative worker (because both are busy), the scheduler signals the worker that does receive the split not to cache the split's data. This means that only a split's preferred or alternative worker will cache that split's data.

Data caching

The Alluxio file system is an open-source data orchestration system frequently used as a distributed caching service for Presto. To achieve sub-second query latency in this architecture, we wanted to further reduce the communication overhead between Presto and Alluxio. To that end, the core teams of Alluxio and Presto co-developed a single-node embedded cache library derived from the Alluxio service.
Specifically, a Presto worker queries, through a standard HDFS interface, an Alluxio local cache that runs inside the same JVM. On a cache hit, the Alluxio local cache reads the data directly from local disk and returns the cached data to Presto; otherwise, Alluxio fetches the data from the remote data source and caches it on local disk for subsequent queries. The cache is completely transparent to Presto. If the cache has a problem (for example, a local disk failure), Presto can still read data directly from the remote data source. The workflow is shown in the figure below, followed by an illustrative sketch:
[Figure: Alluxio local cache read workflow on cache hit, cache miss, and cache failure]
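The hit/miss/fallback behavior described above can be summarized as a read-through cache. The following is a minimal sketch, not the real AlluxioCachingFileSystem: the RemoteSource interface, the flat cache-file naming, and whole-page byte[] reads are all simplifying assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Read-through cache sketch of the workflow above (illustrative only).
public class ReadThroughCacheSketch {

    interface RemoteSource {
        byte[] read(String file, long pageIndex) throws IOException;
    }

    private final Path cacheDir;
    private final RemoteSource remote;

    ReadThroughCacheSketch(Path cacheDir, RemoteSource remote) {
        this.cacheDir = cacheDir;
        this.remote = remote;
    }

    byte[] readPage(String file, long pageIndex) throws IOException {
        Path cached = cacheDir.resolve(file.hashCode() + "_" + pageIndex);
        try {
            if (Files.exists(cached)) {
                // Cache hit: serve directly from local disk.
                return Files.readAllBytes(cached);
            }
            // Cache miss: fetch from the remote source and cache locally.
            byte[] data = remote.read(file, pageIndex);
            Files.write(cached, data);
            return data;
        } catch (IOException cacheFailure) {
            // Cache problem (e.g., local disk failure): transparently fall
            // back to reading from the remote data source.
            return remote.read(file, pageIndex);
        }
    }
}
```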

Internal composition and configuration of local cache

The Alluxio data cache is a library that lives on the Presto worker node. It provides an HDFS-compatible interface, "AlluxioCachingFileSystem", which serves as the main interface for all of the Presto worker's data access operations. The Alluxio data cache embodies the following design choices:

Basic cache unit

Based on Alluxio's experience and the Meta team's early experiments, reading, writing, and evicting data at a fixed block size is most effective. To reduce the storage and serving pressure on its metadata service, the Alluxio system defaults to a 64MB cached block size. Because the Alluxio data cache only needs to manage data and metadata locally, we greatly reduced the cache granularity, setting the default to 1MB "pages".

Cache location and hierarchy

The Alluxio local cache stores data on the local file system by default. Each cached page is stored as a separate file under a directory with the following structure (a sketch of how a page's path can be derived follows the list below): <BASE_DIR>/LOCAL/1048576/<BUCKET>/<PAGE>
* BASE_DIR is the root directory of the cache storage; it can be set through the Presto configuration property "cache.base-directory".
* LOCAL indicates that the cache storage type is LOCAL. Alluxio also supports RocksDB as cache storage.
* 1048576 represents the block size of 1MB.
* BUCKET is the directory that serves as a bucket for page files. Buckets exist so that no single directory holds too many files, which could otherwise hurt performance.
* PAGE is the file named by page ID. In Presto, the ID is the md5 hash of the file name.
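Here is a minimal sketch of how a page's on-disk path could be derived under this layout. Only the md5-of-file-name page ID comes from the article; the bucket count, the bucket-selection rule, and the page-index suffix are illustrative assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of deriving <BASE_DIR>/LOCAL/1048576/<BUCKET>/<PAGE>.
public class CachePathSketch {
    static final long PAGE_SIZE_BYTES = 1048576; // 1MB pages
    static final int BUCKET_COUNT = 1024;        // assumed bucket count

    static Path pagePath(Path baseDir, String fileName, long pageIndex)
            throws NoSuchAlgorithmException {
        // Page ID: md5 hash of the file name (per the article), plus an
        // assumed page-index suffix to distinguish pages of one file.
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(fileName.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        String pageId = hex + "_" + pageIndex;
        // Bucket directory: caps the number of files per directory.
        int bucket = Math.floorMod(pageId.hashCode(), BUCKET_COUNT);
        return baseDir.resolve("LOCAL")
                .resolve(Long.toString(PAGE_SIZE_BYTES))
                .resolve(Integer.toString(bucket))
                .resolve(pageId);
    }
}
```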

Thread concurrency

Each Presto worker contains a set of threads, each executing different query tasks but all sharing the same data cache. The Alluxio data cache therefore needs high concurrency across threads to deliver high throughput: it must allow multiple threads to fetch the same page concurrently while remaining thread-safe when evicting cached data. A minimal sketch of one way to meet this requirement follows.
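This sketch memoizes page loads with ConcurrentHashMap.computeIfAbsent so that concurrent requests for the same page trigger at most one load. It illustrates the requirement rather than Alluxio's actual implementation, and the PageLoader interface is a hypothetical stand-in:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch: thread-safe, duplicate-free page loading shared by query threads.
public class ConcurrentPageLoadSketch {

    interface PageLoader {
        byte[] load(String pageId); // reads from local disk or the remote source
    }

    private final ConcurrentHashMap<String, byte[]> pages = new ConcurrentHashMap<>();
    private final PageLoader loader;

    ConcurrentPageLoadSketch(PageLoader loader) {
        this.loader = loader;
    }

    byte[] getPage(String pageId) {
        // computeIfAbsent runs the loader at most once per key, even when
        // multiple query threads request the same page at the same time.
        return pages.computeIfAbsent(pageId, loader::load);
    }

    void evict(String pageId) {
        // Removal is atomic with respect to concurrent getPage calls.
        pages.remove(pageId);
    }
}
```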

Cache recovery

When a worker starts (or restarts), the Alluxio local cache tries to reuse cached data already present in the local cache directory. If the cache directory structure is compatible, the cached data is reused.

Monitoring

Alluxio emits a variety of JMX metrics as it performs cache-related operations. Through these metrics, system administrators can easily monitor cache usage across the whole cluster.
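For in-process inspection, the standard JMX API can list whatever MBeans are registered under the cache's metrics domain ("com.facebook.alluxio", per the configuration section below). This is a generic sketch run inside the worker JVM; no specific metric names are assumed:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Sketch: enumerate MBeans under the cache's assumed metrics domain.
public class JmxMetricsSketch {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        for (ObjectName name : server.queryNames(new ObjectName("com.facebook.alluxio:*"), null)) {
            System.out.println(name);
        }
    }
}
```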

Benchmark results

We benchmarked queries from a production cluster, using that cluster as the test cluster.
*  Number of queries: 17,320
*  Cluster size: 600 nodes
*  Maximum cache capacity per node: 460GB
*  Eviction policy: LRU
*  Cache block size: 1MB (data is read, stored, and evicted in 1MB units)
Query execution time improvement (unit: milliseconds):
[Table: query execution time per percentile, before and after caching (ms)]
As the table shows, query latency improved significantly: P50 latency dropped by 33%, P75 by 54%, and P95 by 48%.

Cost savings

* Total data read during master-branch execution: 582 TB
* Total data read during cache-branch execution: 251 TB

Scan data saved: 57%

Cache hit rate

During the experiment, the cache hit rate stayed high, remaining between 0.9 and 1 most of the time. The occasional dips occurred when new queries scanned large amounts of new data. We still need additional algorithms to prevent blocks that are accessed less frequently from being cached in preference to frequently accessed blocks.

How to use it

Before using the data cache, you must first enable the soft affinity scheduling policy; the data cache does not work with the random node scheduling policy.
To enable soft affinity scheduling, set the following on the coordinator:
"hive.node-selection-strategy", "SOFT_AFFINITY"
To use the default (random) node scheduling policy, set:
"hive.node-selection-strategy", "NO_PREFERENCE"
To enable the Alluxio data cache, configure the worker nodes as follows (a consolidated example follows the list below):
*  Enable the data cache on worker nodes => "cache.enabled", "true"
*  Set the data cache type to Alluxio => "cache.type", "ALLUXIO"
*  Set the base directory of the cache data store => "cache.base-directory", "file:///cache"
*  Configure the maximum amount of data each worker's cache can use => "cache.alluxio.max-cache-size", "500GB"
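Put together, a minimal illustrative configuration might look like the following. The property names and values come from this article; the assumption that they live in catalog/hive.properties should be checked against the Presto and Alluxio documentation for your deployment:

```properties
# Coordinator (illustrative; typically catalog/hive.properties):
hive.node-selection-strategy=SOFT_AFFINITY

# Workers (illustrative values):
cache.enabled=true
cache.type=ALLUXIO
cache.base-directory=file:///cache
cache.alluxio.max-cache-size=500GB
```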

Other useful configurations

Coordinator configuration (used to define when a worker is busy):
*  Maximum number of pending splits per task: node-scheduler.max-pending-splits-per-task
*  Maximum number of splits per node: node-scheduler.max-splits-per-node
Worker configuration:
*  Enable Alluxio cache metrics (default: true): cache.alluxio.metrics-enabled
*  JMX sink class for Alluxio cache metrics (default: alluxio.metrics.sink.JmxSink): cache.alluxio.jmx-class
*  Metrics domain used by the Alluxio cache (default: com.facebook.alluxio): cache.alluxio.metrics-domain
*  Whether the Alluxio cache writes to the cache asynchronously (default: false): cache.alluxio.async-write-enabled
*  Whether the Alluxio cache validates its configuration (default: false): cache.alluxio.config-validation-enabled
The Alluxio data cache emits a variety of JMX metrics for its cache operations.

Future work

*  Control the rate of cache writes with a rate limiter (to avoid flash endurance problems).
*  Implement semantics-aware caching to improve cache efficiency.
*  Provide a mechanism to clean up the cache directory, for maintenance or restarts.
*  Add a "dry run" mode.
*  Support various capacity limits, such as per-table, per-partition, or per-schema cache quotas.
*  Build a more robust worker node scheduling mechanism.
*  Add algorithms that prevent infrequently accessed blocks from being cached more readily than frequently accessed ones.
*  Fault tolerance: currently, the hash-based node scheduling algorithm breaks down when the number of nodes in the cluster changes. We are working toward more robust algorithms, such as consistent hashing.
*  Better load balancing: once other factors such as split size and node resources are taken into account, we can define a "busy" node more precisely and make smarter load-balancing decisions.
*  Affinity granularity: currently, affinity within the Presto cluster is at file level. If this granularity does not yield the best performance, we need to make the affinity criteria finer-grained and strike a balance between load balancing and a good cache hit rate, for better overall performance.
*  Improve the resource utilization of the Alluxio cache library.

Article contributors

Meta: Rohit Jain, James Sun, Ke Wang, Shixuan Fan, Biswapesh Chattopadhyay, Baldeep Hira
Alluxio: Bin Fan, Calvin Jia, Haoyuan Li
Originally published: 2020-06-16