DVC use case (VI): Data Registry
2022-07-04 11:31:00 【Li Guodong】
One of the main uses of DVC repositories is version control of data and model files. DVC also supports reusing these data artifacts across projects. This means your project can depend on data from other DVC repositories, much like a package management system for data science.

We can build a DVC project dedicated to versioning datasets (or data features, ML models, etc.).
The Git remote repository holds all the metadata and change history for the data it tracks. We can see who changed what and when, and use pull requests to update data, just as we do with code. This is what we call a data registry: data management middleware between machine learning projects and cloud storage.
Advantages of a data registry:
- Reusability: reproduce and organize a feature store with a simple CLI (the dvc get and dvc import commands, similar to package management systems such as pip).
- Persistence: remote storage tracked by the DVC registry (for example, an S3 bucket) improves data safety. For example, it becomes less likely that someone deletes or overwrites an ML model.
- Storage optimization: data shared by multiple projects is centralized in one place. This simplifies data management and reduces space requirements.
- Data as code (managing data like code): apply Git workflows to your data and model lifecycle, such as commits, branches, pull requests, reviews, and even CI/CD.
- Security: the registry can be set up with read-only remote storage (for example, an HTTP server).
Building a data registry
Adding a dataset to the registry is simple: place the data file or directory in the workspace and track it with dvc add. The resulting .dvc files (for example, music/songs.dvc below) follow the regular Git workflow. This lets the team collaborate on data at the same level as source code:
The sample dataset used here actually exists. Download it in advance, then run the following:
$ mkdir -p music/songs
$ cp ~/Downloads/millionsongsubset_full music/songs
$ dvc add music/songs/
$ git add music/songs.dvc music/.gitignore
$ git commit -m "Track 1.8 GB 10,000 song dataset in music/"
The actual data is stored in the project's cache and can be pushed to one or more remote storage locations, so that others can access the registry from elsewhere.
$ dvc remote add -d myremote s3://mybucket/dvcstore
$ dvc push
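To make the idea of the cache more concrete, here is a minimal sketch of content-addressed storage, the general technique DVC's cache is based on: a file is identified by a hash of its contents, and that hash determines where the copy lives. This is an illustration only, not DVC's own code. The function names and the two-character subdirectory layout are assumptions made for the example, and the real layout differs between DVC versions.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir: Path, md5_hex: str) -> Path:
    """Map a digest to a cache location (hypothetical layout).

    The first two hex characters become a subdirectory and the rest
    becomes the file name, so identical content always maps to the
    same path regardless of the original file name.
    """
    return cache_dir / md5_hex[:2] / md5_hex[2:]
```

Because the location depends only on the content, two projects that track the same file refer to the same cached object, which is what makes sharing data through a registry cheap.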
A good way to organize a DVC repository into a data registry is to use directories to group similar data, for example images/, natural-language/, and so on.
For instance, our dataset registry has directories such as get-started/ and use-cases/, which match parts of the website.
Using the data registry
The main ways to consume artifacts from a data registry are the dvc import and dvc get commands, as well as the Python API, dvc.api. First, though, we may want to explore its contents.
Listing data
To explore the contents of a DVC repository in search of suitable data, use the dvc list command (similar to ls and to third-party tools such as aws s3 ls):
$ dvc list -R https://github.com/iterative/dataset-registry
.gitignore
README.md
get-started/.gitignore
get-started/data.xml
get-started/data.xml.dvc
images/.gitignore
images/dvc-logo-outlines.png
...
The command above lists both Git-tracked files and DVC-tracked data (or models).
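When a registry grows, a flat recursive listing becomes hard to scan. The following sketch groups the plain-text output of dvc list -R by top-level directory. The helper name group_by_top_level is hypothetical, and the sample text simply repeats the listing shown above.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_top_level(listing: str) -> Dict[str, List[str]]:
    """Group recursive listing lines by their first path component."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in listing.splitlines():
        entry = line.strip()
        if not entry:
            continue
        top, sep, rest = entry.partition("/")
        if sep:
            groups[top].append(rest)
        else:
            # Files at the repository root are collected under "."
            groups["."].append(entry)
    return dict(groups)


sample = """\
.gitignore
README.md
get-started/.gitignore
get-started/data.xml
get-started/data.xml.dvc
images/.gitignore
images/dvc-logo-outlines.png
"""

groups = group_by_top_level(sample)
for name, entries in groups.items():
    print(name, len(entries))
```

In practice the listing text could come from running the command through subprocess.run with capture_output=True and text=True, assuming the dvc executable is on the PATH.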
Downloading data to the working directory
dvc get is similar to direct download tools such as wget (HTTP) or aws s3 cp (S3). To get a dataset from a DVC repository, we can run the following command:
$ dvc get https://github.com/example/registry music/songs
This downloads music/songs from the project's default remote storage and places it in the current working directory.
Importing data into a workflow
dvc import uses the same syntax as dvc get:
$ dvc import https://github.com/example/registry images/faces
In addition to downloading the data, importing also records the local project's dependency on the data source (the registry repository). This is done by generating a special import .dvc file that contains the metadata.
Whenever the dataset changes in the registry, we can bring the data up to date with dvc update:
$ dvc update faces.dvc
This downloads new and changed files, and removes deleted ones, based on the latest commit in the source repository. It also updates the .dvc file accordingly.
Note:
dvc get, dvc import, and dvc update all accept a --rev option for downloading data from a specific commit of the source repository.
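If these commands are driven from a script, the --rev option can be added conditionally. The sketch below only builds the argument list; the function name dvc_fetch_command and the tag v1.0 are hypothetical, while -o and --rev are the output and revision options of dvc get and dvc import.

```python
import shlex
from typing import List, Optional


def dvc_fetch_command(
    action: str,
    repo_url: str,
    path: str,
    rev: Optional[str] = None,
    out: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `dvc get` or `dvc import`."""
    if action not in ("get", "import"):
        raise ValueError(f"unsupported action: {action!r}")
    cmd = ["dvc", action, repo_url, path]
    if out is not None:
        cmd += ["-o", out]  # where to place the downloaded data
    if rev is not None:
        cmd += ["--rev", rev]  # commit, branch, or tag in the source repo
    return cmd


cmd = dvc_fetch_command(
    "import",
    "https://github.com/example/registry",
    "images/faces",
    rev="v1.0",  # hypothetical tag, used only for illustration
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True). Keeping the arguments as a list avoids shell quoting problems with unusual paths.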
Downloading DVC data with Python code
Our Python API is included in the dvc package that is installed with DVC. It provides an open function for loading or streaming data directly from external DVC projects:
import pickle

import dvc.api

model_path = 'model.pkl'
repo_url = 'https://github.com/example/registry'

# Binary mode is needed because pickle reads bytes
with dvc.api.open(model_path, repo=repo_url, mode='rb') as fd:
    model = pickle.load(fd)
    # ... Use the model!
This opens model.pkl as a file object. The example illustrates a simple ML model deployment method, but it can be extended to more advanced scenarios, such as a model zoo.
See also the dvc.api.read() and dvc.api.get_url() functions.
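For deployment it is often useful to pin the model to a specific revision of the registry. The following sketch wraps dvc.api.open in a small loader that accepts a rev argument. The function load_model and its opener parameter are assumptions added for this example (the parameter lets the function be exercised without network access); repo, rev, and mode are keyword arguments of dvc.api.open.

```python
import pickle
from typing import Any, Callable, Optional


def load_model(
    path: str,
    repo: str,
    rev: Optional[str] = None,
    opener: Optional[Callable[..., Any]] = None,
) -> Any:
    """Load a pickled model from a DVC repository at a given revision."""
    if opener is None:
        # Imported lazily so the module can be loaded without DVC installed
        import dvc.api

        opener = dvc.api.open
    # Binary mode is needed because pickle reads bytes
    with opener(path, repo=repo, rev=rev, mode="rb") as f:
        return pickle.load(f)
```

A call such as load_model('model.pkl', 'https://github.com/example/registry', rev='v1.0') would then read the model as it existed at that revision, where v1.0 is a hypothetical tag. As with any pickle file, this should only be used with registries you trust.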
Updating the data registry
Datasets evolve, and DVC makes this easy to handle. Just change the data in the registry and apply the update by running dvc add again:
$ cp 1000/more/images/* music/songs/
$ dvc add music/songs/
DVC modifies the corresponding .dvc file to reflect the changes, and Git picks this up:
$ git status
...
modified: music/songs.dvc
$ git commit -am "Add 1,000 more songs to music/ dataset."
Repeating this process for several datasets can form a robust registry. The result is essentially a repository that versions a set of metafiles.
Let's look at an example:
$ tree --filelimit=10
.
├── images
│ ├── .gitignore
│ ├── cats-dogs [2800 entries] # Listed in .gitignore
│ ├── faces [10000 entries] # Listed in .gitignore
│ ├── cats-dogs.dvc
│ └── faces.dvc
├── music
│ ├── .gitignore
│ ├── songs [11000 entries] # Listed in .gitignore
│ └── songs.dvc
├── text
...
Don't forget to run dvc push to send the data changes to remote storage, so that others can obtain them!
$ dvc push