DVC use case (VI): Data Registry
2022-07-04 11:31:00 【Li Guodong】
One of the main uses of a DVC repository is version control of data and model files. DVC also supports reusing these data artifacts across projects. This means that your project can depend on data from other DVC repositories, much like a package management system for data science.
We can build a DVC project dedicated to versioning datasets (or data features, ML models, and so on).
The Git remote repository holds all the metadata and change history of the data it tracks. We can see who changed what and when, and use pull requests to update data, just as we do with code. This is what we call a data registry: data management middleware between machine learning projects and cloud storage.
Advantages of a data registry:
- Reusability: reproduce and organize a feature store with a simple CLI (the dvc get and dvc import commands, similar to pip and other software package management systems).
- Persistence: remote storage tracked by the DVC registry (for example, an S3 bucket) improves data safety. For example, it becomes less likely that someone deletes or overwrites an ML model.
- Storage optimization: data shared by multiple projects is kept in one place. This simplifies data management and reduces space requirements.
- Data as code (managing data like code): leverage Git workflows for your data and model lifecycle, such as commits, branches, pull requests, reviews, and even CI/CD.
- Security: the registry can be set up to use read-only remote storage (for example, an HTTP server).
Building a data registry
Adding a dataset to the registry is simple: place the data file or directory in the workspace and track it with dvc add. The resulting .dvc files (for example, the music/songs.dvc file below) then go through the regular Git workflow. This lets the team collaborate on data at the same level as source code.
The sample dataset used here actually exists; download it in advance. Once the download is complete, run the following:
$ mkdir -p music/songs
$ cp ~/Downloads/millionsongsubset_full music/songs
$ dvc add music/songs/
$ git add music/songs.dvc music/.gitignore
$ git commit -m "Track 1.8 GB 10,000 song dataset in music/"
The actual data is stored in the project's cache and can be pushed to one or more remote storage locations, so that others can access the registry from elsewhere.
$ dvc remote add -d myremote s3://mybucket/dvcstore
$ dvc push
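To make the idea of a cache more concrete, here is a minimal Python sketch of content-addressed storage, the general technique behind this kind of cache. It is an illustration only, not DVC's implementation: the store_in_cache helper, the cache/ directory, and the two-character prefix layout are assumptions made for this example, and the real layout differs between DVC versions. Tracking a whole directory is also handled differently and is not covered here.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_in_cache(path: Path, cache_dir: Path) -> Path:
    """Copy a file into the cache under a name derived from its content."""
    checksum = file_md5(path)
    # Hypothetical layout: <cache>/<first 2 hex chars>/<remaining chars>
    target = cache_dir / checksum[:2] / checksum[2:]
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        song = workspace / "song.txt"
        song.write_text("la la la\n")
        cached = store_in_cache(song, workspace / "cache")
        # Show where the file ended up, relative to the workspace
        print(cached.relative_to(workspace))
```

Because the cache path depends only on the file's content, two projects that share the same file resolve to the same entry, which is what makes centralized storage in a registry economical.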
A good way to organize a DVC repository into a data registry is to group similar data in directories, for example images/, natural-language/, and so on.
For instance, DVC's own dataset registry has directories such as get-started/ and use-cases/, which match parts of the DVC website.
Using the data registry
The main ways to consume artifacts from a data registry are the dvc import and dvc get commands, as well as the Python API, dvc.api. First, though, we may want to explore its contents.
Listing data
To explore the contents of a DVC repository in search of suitable data, use the dvc list command (similar to ls and to third-party tools such as aws s3 ls):
$ dvc list -R https://github.com/iterative/dataset-registry
.gitignore
README.md
get-started/.gitignore
get-started/data.xml
get-started/data.xml.dvc
images/.gitignore
images/dvc-logo-outlines.png
...
The command above lists both the files tracked by Git and the data (or models) tracked by DVC.
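The listing can also be post-processed in a script. The Python sketch below is one possible approach, not an official recipe: it shells out to the same dvc list -R command and picks out the entries that sit next to a matching .dvc file, which is how individually tracked files appear in the sample output. The helper names (list_registry, parse_listing, dvc_tracked) are made up for this example, and it assumes the dvc executable is on the PATH.

```python
import subprocess
from typing import List


def parse_listing(output: str) -> List[str]:
    """Split the command output into non-empty, stripped lines."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def list_registry(repo_url: str) -> List[str]:
    """Run `dvc list -R <url>` and return one path per output line."""
    result = subprocess.run(
        ["dvc", "list", "-R", repo_url],
        capture_output=True,
        text=True,
        check=True,  # raise if the command fails
    )
    return parse_listing(result.stdout)


def dvc_tracked(paths: List[str]) -> List[str]:
    """Keep the paths that have a `<path>.dvc` companion in the listing."""
    known = set(paths)
    return [p for p in paths if p + ".dvc" in known]
```

Calling dvc_tracked(list_registry("https://github.com/iterative/dataset-registry")) would return entries such as get-started/data.xml. This simple check only matches individually tracked files; data tracked as a whole directory is expanded by the recursive listing and would need extra handling.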
Downloading data to the working directory
dvc get is similar to direct download tools such as wget (HTTP) or aws s3 cp (S3). To get a dataset from a DVC repository, we can run the following command:
$ dvc get https://github.com/example/registry music/songs
This downloads music/songs from the project's default remote storage and places it in the current working directory.
Importing data into workflows
dvc import uses the same syntax as dvc get:
$ dvc import https://github.com/example/registry images/faces
Besides downloading the data, importing also records the local project's dependency on the data source (the registry repository). This is done by generating a special import .dvc file, which contains that metadata.
Whenever the dataset changes in the registry, we can bring the data up to date with dvc update:
$ dvc update faces.dvc
This downloads new and changed files and removes deleted ones, based on the latest commit in the source repository. It also updates the .dvc file accordingly.
Note: dvc get, dvc import, and dvc update all accept a --rev option for downloading data from a specific commit in the source repository.
Downloading DVC data from Python code
The Python API ships in the dvc package that is installed together with DVC. It includes an open function for loading or streaming data directly from external DVC projects:
import pickle

import dvc.api

model_path = 'model.pkl'
repo_url = 'https://github.com/example/registry'

# Open in binary mode, because pickle reads bytes
with dvc.api.open(model_path, repo=repo_url, mode='rb') as fd:
    model = pickle.load(fd)
    # ... Use the model!
This opens model.pkl as a file object. The example shows a simple approach to ML model deployment, but it can be extended to more advanced scenarios, such as a model zoo.
See also the dvc.api.read() and dvc.api.get_url() functions.
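As a rough sketch of how these two functions might be used: dvc.api.read() returns the full contents of a tracked file, and dvc.api.get_url() returns the location of that file in the project's remote storage. Both accept a rev argument, which plays the same role as --rev on the command line. The wrapper names read_text and storage_url are hypothetical, and dvc.api is imported inside the functions only so that the snippet loads on machines where DVC is not installed.

```python
from typing import Optional


def read_text(path: str, repo_url: str, rev: Optional[str] = None) -> str:
    """Return the contents of a tracked text file, optionally at a revision."""
    import dvc.api  # deferred import: DVC is only needed when this is called

    return dvc.api.read(path, repo=repo_url, rev=rev, mode="r")


def storage_url(path: str, repo_url: str, rev: Optional[str] = None) -> str:
    """Return where the tracked file lives in the project's remote storage."""
    import dvc.api  # deferred import, as above

    return dvc.api.get_url(path, repo=repo_url, rev=rev)
```

For example, read_text("labels.csv", "https://github.com/example/registry", rev="v1.0") would fetch one file as it was at a given tag. The labels.csv path and the v1.0 tag are placeholders, and the repository URL is the same illustrative one used earlier.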
Updating the data registry
Datasets evolve, and DVC handles this easily. Just change the data in the registry, then apply the update by running dvc add again.
$ cp 1000/more/images/* music/songs/
$ dvc add music/songs/
DVC modifies the corresponding .dvc file to reflect the change, and Git picks it up:
$ git status
...
modified: music/songs.dvc
$ git commit -am "Add 1,000 more songs to music/ dataset."
Repeating this process for several datasets builds up a robust registry. The result is essentially a repository that versions a set of metafiles.
Let's look at an example:
$ tree --filelimit=10
.
├── images
│ ├── .gitignore
│ ├── cats-dogs [2800 entries] # Listed in .gitignore
│ ├── faces [10000 entries] # Listed in .gitignore
│ ├── cats-dogs.dvc
│ └── faces.dvc
├── music
│ ├── .gitignore
│ ├── songs [11000 entries] # Listed in .gitignore
│ └── songs.dvc
├── text
...
Don't forget to run dvc push so that the data changes reach remote storage and others can get them!
$ dvc push