
Yunzhisheng Atlas Supercomputing Platform: Computing Acceleration Practice Based on Fluid + Alluxio (Part 2)

2022-06-25 20:31:00 Alluxio

image.png
Fluid + Alluxio introduced a new architecture to the cluster, but we still ran into problems when adapting it to specific scenarios. We fed these issues back to the community, which addressed our needs promptly. The important supporting features include:
Support for hostpath and nonroot

On the Atlas platform, we configured the distributed file system as nonroot: the client's root user has no permission to operate on users' directories. If the Alluxio cache engine accesses the underlying storage (UFS) as root, permission errors occur. Fluid provides nonroot support, allowing the UID and GID to be set separately for the cache engine and for the dataset. Setting the cache engine's user information guarantees that Alluxio can successfully read data from the underlying UFS. If the dataset is configured with the same UID and GID, the task side can read the data; if the dataset is configured with another user's information, dataset sharing becomes possible. This feature solves the platform's problems with permission control and redundant data storage.
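As a rough sketch, Fluid expresses this through the Dataset's `owner` field and the AlluxioRuntime's `runAs` field. The names, UID/GID values, and mount path below are hypothetical placeholders, not the platform's actual settings:

```yaml
# Hypothetical example: expose the dataset as, and run the cache
# engine as, an unprivileged user (all identifiers are illustrative).
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: user-dataset
spec:
  mounts:
    - mountPoint: local:///nfs/user-data   # underlying UFS directory
      name: user-data
  owner:            # identity the dataset is exposed as to tasks
    uid: 1005
    gid: 1005
    user: train-user
    group: train-group
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: user-dataset
spec:
  replicas: 2
  runAs:            # non-root identity the cache engine runs as
    uid: 1005
    gid: 1005
    user: train-user
    group: train-group
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi
```

With matching UID/GID in both places, the task reads its own data through the cache; pointing the Dataset's `owner` at a different user is what enables sharing a cached dataset across users.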
Support for multiple mount points

A user task's data is usually composed of different datasets, which may be different directories of the same storage system or different storage systems altogether. Alluxio provides applications with a unified namespace: through this abstraction, an application can access multiple independent storage systems via a single namespace and interface. Instead of communicating with each storage system separately, applications connect only to Alluxio, and through Alluxio FUSE users access the data cached from the different underlying systems via the POSIX interface.

The transparent naming mechanism keeps Alluxio's namespace consistent with the identity of the underlying storage systems: directories and file names from different underlying systems are mapped into Alluxio.

With this feature, a user can cache data from two storage systems within the same training task, sparing users a great deal of data migration. On the Atlas platform, compressing, packaging, migrating, and decompressing TB-scale collections of small files takes hours; with this feature, users only need to change the data's storage path in the next task, without modifying any source code, and the program is ready to run.
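A minimal sketch of a multi-mount Dataset: the two underlying directories below are hypothetical, standing in for two independent storage systems exposed side by side in one Alluxio namespace:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: multi-source
spec:
  mounts:
    # Two independent underlying paths (illustrative) unified
    # under a single namespace.
    - mountPoint: local:///lustre/datasets/imagenet
      name: imagenet
    - mountPoint: local:///ceph/datasets/speech-corpus
      name: speech-corpus
```

Under the task's FUSE mount, both then appear as sibling directories (`.../imagenet` and `.../speech-corpus`), so switching datasets is only a path change.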
image.png
Cache preheating

Computing resources on the platform are often scarcer than storage resources. If a user launches a training task over TB-scale small files, synchronizing metadata and data between the underlying storage system and the cache system takes a long time. Alluxio provides loadMetadata and loadData functionality, and Fluid integrates the two, letting users pull data from the remote storage system into the distributed cache engine near the compute nodes in advance, so that applications consuming the dataset enjoy the caching speedup even on their first run. This effectively increases cluster GPU utilization: it avoids the time otherwise spent on metadata synchronization during the first cache pass, gives the program good IO read speed from the very start of the run, and raises overall GPU utilization.
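Fluid exposes this preheating through a DataLoad resource. A minimal sketch, with a hypothetical dataset name and target path:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: warmup-job
spec:
  dataset:
    name: user-dataset    # the Dataset to preheat (illustrative name)
    namespace: default
  loadMetadata: true      # sync metadata from the UFS before loading data
  target:
    - path: /             # preheat the whole dataset
      replicas: 1         # number of cache replicas to load
```

Applying this before the training job starts moves the one-time metadata and data synchronization out of the first training epoch.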
Parameter tuning

Alluxio provides many tuning parameters. The platform configured and optimized them according to the characteristics of its workloads, which are almost entirely read-only: a set of general parameters, plus some tuning specific to individual datasets.

General parameters :
Enable kernel_cache and set alluxio.user.metadata.cache.enabled to true to turn on client-side caching of file and directory metadata. For full-read scenarios, alluxio.user.metadata.cache.max.size and alluxio.user.metadata.cache.expiration.time can be configured to adjust the maximum amount and the validity period of the cached metadata.

Set alluxio.user.file.passive.cache.enabled=false and alluxio.user.file.readtype.default=CACHE to avoid cache thrashing caused by frequent evictions (cache eviction).
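Under Fluid, these general parameters can be set on the AlluxioRuntime. The sketch below shows where they go; the size and expiration values are illustrative, not the platform's actual settings:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: user-dataset
spec:
  properties:
    # Client-side metadata cache for read-mostly workloads
    # (max.size and expiration.time values are illustrative).
    alluxio.user.metadata.cache.enabled: "true"
    alluxio.user.metadata.cache.max.size: "6000000"
    alluxio.user.metadata.cache.expiration.time: "30min"
    # Avoid cache thrashing from frequent evictions.
    alluxio.user.file.passive.cache.enabled: "false"
    alluxio.user.file.readtype.default: "CACHE"
  fuse:
    args:               # enable the kernel page cache on the FUSE mount
      - fuse
      - --fuse-opts=kernel_cache,ro
```

For a read-only training workload, mounting read-only alongside kernel_cache is a common combination, since the data never changes underneath the page cache.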

image.png
We divide the workloads into three kinds by dataset size: the first is small files, with a single file under 1 MB; the second is medium-to-large data, with a data volume of a few hundred GB; the third is TB-scale large files.
Speech noise reduction scenario

The model in this test is a DLSE model built on the PyTorch framework. The data comprises about 500,000 files totaling 183 GB, with memory used as the Alluxio cache medium.
This experiment uses a single-machine, 10-GPU task with PyTorch's native DDP framework for multi-GPU communication, and compares three setups: reading directly from the distributed file system, reading from the Alluxio cache, and reading from Alluxio after one round of warm-up.
image.png
As the first-epoch times show, the warmed-up cache task reads nearly 10x faster than either reading directly from the underlying file system or Alluxio's cold first-epoch read. In the first training epoch Alluxio must synchronize metadata and cache the data, so caching shows no advantage in the first epoch's reads. But by the second epoch all the data has landed in the cache medium, so what is being measured is Alluxio's cache hit rate; as the experimental results above show, the speedup is very pronounced.
image.png
With data reading made more efficient, overall GPU utilization also improved. Monitoring shows that with the warmed-up Alluxio cache, GPU utilization stays at roughly 90%, and we also see that caching the data in memory effectively reduces the load on the underlying storage.
image.png
This experiment is based on a CRNN character recognition model built with the PyTorch framework. The data source is 125 GB of self-collected image data, converted into one large lmdb file. We ran three comparative tests: reading directly from the underlying file system, reading from Alluxio without warm-up, and reading from a preheated Alluxio.
image.png
We found that with preheated Alluxio, the node's IO bandwidth traffic dropped from the 1300 Mb/s seen when reading directly from the underlying distributed storage to essentially 0. For our platform the benefit is substantial: without adding any underlying storage hardware, this is the fastest and a relatively inexpensive way to reduce storage system bandwidth usage.
image.png
Reading from the cache, relative to reading the underlying storage directly, raised the compute node's average GPU utilization from 69.59% to 91.46%, showing that eliminating the I/O bottleneck can improve resource utilization efficiency for large-file training tasks.
image.png
By introducing the new Fluid + Alluxio architecture, the platform has achieved a series of benefits.

Faster model training: the test results above show that the speedup on tasks is very pronounced. The architecture directly exploits the speed advantage of local storage, avoiding network transmission and resource contention, and thereby effectively reducing the time spent reading data during model training.

Reduced load on the underlying storage: the new architecture uses local caching to offload bandwidth and IOPS pressure from the underlying storage system, significantly lowering its load and effectively improving its availability.

Increased cluster GPU utilization: efficient IO reads eliminate the data-reading bottleneck in user programs, avoiding GPUs idling while waiting for data. This improves per-task GPU utilization and in turn the GPU usage rate of the whole cluster.

Avoiding IO contention on the same node: the new architecture fully resolves the problems we encountered earlier with IO resource contention on the same node, the storage system's bandwidth bottleneck, and the resulting low model training efficiency.

More efficient cache management: the new architecture manages the cache in a more cloud-native way. Engineers have moved from simply loading data into memory to treating the cache as a resource that can be managed and monitored; Kubernetes scheduling is cache-aware and makes corresponding policy decisions, letting tasks run more efficiently.
image.png
Fluid + Alluxio has brought us great benefits, and we continue to work closely with the community. Going forward, we will pursue in-depth work in the following areas:

Continue to feed test results and issues back, and provide richer usage scenarios to the community, to iteratively optimize Alluxio's performance;

Summarize and test more data types, and feed parameter tuning practice back to the community;

Add intelligent cache scheduling to Fluid.


Copyright notice: this article was created by [Alluxio]; please include the original link when reposting.
https://yzsam.com/2022/02/202202181403125659.html