DVC use case (VI): Data Registry
2022-07-04 11:31:00 【Li Guodong】
One of the main uses of DVC repositories is version control of data and model files. DVC also supports reusing these data artifacts across projects. This means your project can depend on data from other DVC repositories, much like a package management system for data science.

We can build a DVC project dedicated to version-controlling datasets (or data features, ML models, etc.).
The Git remote repository holds all the metadata and change history for the data it tracks. We can see who changed what and when, and use pull requests to update data, just as we do with code. This is what we call a data registry: data management middleware between machine learning projects and cloud storage.
Advantages of a data registry:
- Reusability: Reproduce and organize a feature store with an easy-to-use CLI (the dvc get and dvc import commands, similar to pip and other software package management systems).
- Persistence: Remote storage tracked by a DVC registry (for example, an S3 bucket) improves data safety. For example, it becomes less likely that someone deletes or overwrites an ML model.
- Storage optimization: Data shared by multiple projects is centralized in one place. This simplifies data management and reduces storage requirements.
- Data as code (managing data like code): Apply Git workflows to your data and model lifecycle, for example commits, branches, pull requests, reviews, and even CI/CD.
- Security: The registry can be set up to use read-only remote storage (for example, an HTTP server).
Building a data registry
Adding a dataset to the registry is simple: place the data file or directory in the workspace and track it with dvc add. The resulting .dvc file (for example, music/songs.dvc below) then follows the regular Git workflow. This lets teams collaborate on data at the same level as source code:
The sample dataset used here is a real one; download it in advance. Once the download is complete, run the following:
$ mkdir -p music/songs
$ cp ~/Downloads/millionsongsubset_full music/songs
$ dvc add music/songs/
$ git add music/songs.dvc music/.gitignore
$ git commit -m "Track 1.8 GB 10,000 song dataset in music/"
The actual data is stored in the project's cache and can be pushed to one or more remote storage locations, so that others can access the registry from elsewhere.
$ dvc remote add -d myremote s3://mybucket/dvcstore
$ dvc push
A good way to organize a DVC repository into a data registry is to use directories to group similar data, for example images/, natural-language/, etc.
For instance, the DVC project's own dataset registry has directories such as get-started/ and use-cases/, which match parts of the DVC website.
Using the data registry
The main ways to consume artifacts from a data registry are the dvc import and dvc get commands, as well as the Python API, dvc.api. First, though, we may want to explore its contents.
Listing data
To explore the contents of a DVC repository in search of suitable data, use the dvc list command (similar to ls and to third-party tools such as aws s3 ls):
$ dvc list -R https://github.com/iterative/dataset-registry
.gitignore
README.md
get-started/.gitignore
get-started/data.xml
get-started/data.xml.dvc
images/.gitignore
images/dvc-logo-outlines.png
...
The command above lists both Git-tracked files and DVC-tracked data (or models).
Downloading data to the working directory
dvc get is similar to direct download tools such as wget (HTTP) or aws s3 cp (S3). To fetch a dataset from a DVC repository, we can run the following command:
$ dvc get https://github.com/example/registry music/songs
This downloads music/songs from the project's default remote storage and places it in the current working directory.
Importing data into a workflow
dvc import uses the same syntax as dvc get:
$ dvc import https://github.com/example/registry images/faces
Besides downloading the data, importing also records the local project's dependency on the data source (the registry repository). This is done by generating a special import .dvc file that contains the metadata.
Whenever the dataset in the registry changes, we can bring the local copy up to date with dvc update:
$ dvc update faces.dvc
This downloads new and changed files based on the latest commit in the source repository and removes files that were deleted there; it also updates the .dvc file accordingly.
Note:
dvc get, dvc import, and dvc update all accept a --rev option for downloading data from a specific commit in the source repository.
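As a rough illustration of the --rev option, the Python sketch below assembles and runs a dvc get call pinned to a revision. The helper names build_dvc_get and dvc_get are hypothetical and not part of DVC; only the dvc get URL PATH form and the --rev option come from the text above. It assumes the dvc executable is available on PATH.

```python
import subprocess
from typing import List, Optional


def build_dvc_get(repo_url: str, path: str, rev: Optional[str] = None) -> List[str]:
    """Assemble the argument list for `dvc get` (hypothetical helper)."""
    cmd = ["dvc", "get", repo_url, path]
    if rev is not None:
        # --rev selects the commit in the source repository to download from.
        cmd += ["--rev", rev]
    return cmd


def dvc_get(repo_url: str, path: str, rev: Optional[str] = None) -> None:
    """Run `dvc get`; raises CalledProcessError if the command fails."""
    subprocess.run(build_dvc_get(repo_url, path, rev), check=True)
```

For instance, dvc_get("https://github.com/example/registry", "music/songs", rev="abc1234") would run dvc get with --rev abc1234 appended, where abc1234 is only a placeholder for a real commit in the registry.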
Downloading DVC data from Python code
The Python API ships in the dvc package that is installed together with DVC. It includes an open function for loading and streaming data directly from external DVC projects:
import pickle

import dvc.api

model_path = 'model.pkl'
repo_url = 'https://github.com/example/registry'

# Open in binary mode: pickle.load() needs a bytes stream.
with dvc.api.open(model_path, repo=repo_url, mode='rb') as fd:
    model = pickle.load(fd)
    # ... Use the model!
This opens model.pkl as a file-like object. The example demonstrates a simple ML model deployment method, but it can be extended to more advanced scenarios, for example a model zoo.
See also the dvc.api.read() and dvc.api.get_url() functions.
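As a minimal sketch of those two functions, the helper below reads a tracked file's contents and looks up where the file lives in remote storage. The function name inspect_artifact is hypothetical; it assumes the dvc package is installed and the registry is reachable over the network. dvc.api is imported inside the function so that the definition itself loads without DVC.

```python
from typing import Dict, Optional


def inspect_artifact(path: str, repo: str, rev: Optional[str] = None) -> Dict[str, str]:
    """Return the text contents and storage URL of a DVC-tracked file."""
    import dvc.api  # imported here so that defining the helper does not require DVC

    # read() returns the whole file contents as a string (text mode by default).
    text = dvc.api.read(path, repo=repo, rev=rev)
    # get_url() returns the location of the file in remote storage.
    url = dvc.api.get_url(path, repo=repo, rev=rev)
    return {"text": text, "url": url}
```

Calling inspect_artifact("get-started/data.xml", "https://github.com/iterative/dataset-registry") would fetch the file listed earlier by dvc list. Because read() loads the entire file into memory, the streaming dvc.api.open approach is probably the better fit for large files.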
Updating the data registry
Datasets evolve constantly, and DVC handles this easily. Just change the data in the registry, then run dvc add again to apply the update.
$ cp 1000/more/images/* music/songs/
$ dvc add music/songs/
DVC modifies the corresponding .dvc file to reflect the change, and Git picks it up:
$ git status
...
modified: music/songs.dvc
$ git commit -am "Add 1,000 more songs to music/ dataset."
Repeating this process for multiple datasets can build up a robust registry. The result is essentially a repository that versions a set of metafiles.
Here is an example:
$ tree --filelimit=10
.
├── images
│ ├── .gitignore
│ ├── cats-dogs [2800 entries] # Listed in .gitignore
│ ├── faces [10000 entries] # Listed in .gitignore
│ ├── cats-dogs.dvc
│ └── faces.dvc
├── music
│ ├── .gitignore
│ ├── songs [11000 entries] # Listed in .gitignore
│ └── songs.dvc
├── text
...
Don't forget to run dvc push to send the data changes to remote storage, so that others can fetch them!
$ dvc push