DVC use case (VI): Data Registry
2022-07-04 11:31:00 【Li Guodong】
One of the main uses of DVC repositories is version control of data and model files. DVC also supports reusing these data artifacts across projects. This means your project can depend on data from other DVC repositories, much like a package management system for data science.

We can build a DVC project dedicated to versioning datasets (or data features, ML models, etc.).
The Git repository holds all the metadata and change history of the data it tracks. We can see who changed what and when, and use pull requests to update data, just as we do with code. This is what we call a data registry: data management middleware between machine learning projects and cloud storage.
Advantages of a data registry:
- Reusability: reproduce and organize a feature store with a simple CLI (the dvc get and dvc import commands, similar to software package management systems such as pip).
- Persistence: the DVC remote storage tracked by the registry (for example, an S3 bucket) improves data safety. It becomes less likely that someone deletes or overwrites an ML model, for instance.
- Storage optimization: data shared by multiple projects is kept in one place. This simplifies data management and reduces space requirements.
- Data as code (managing data like code): apply Git workflows to your data and model lifecycle, such as commits, branches, pull requests, reviews, and even CI/CD.
- Security: the registry can be set up with read-only remote storage (for example, an HTTP server).
Building a data registry
Adding datasets to a registry is simple: place the data file or directory in the workspace and track it with dvc add. The resulting .dvc files (for example, music/songs.dvc below) then follow the regular Git workflow. This lets the team collaborate on data at the same level as source code:
This sample dataset really exists. Download it in advance, and once the download is complete, run the following:
$ mkdir -p music/songs
$ cp ~/Downloads/millionsongsubset_full music/songs
$ dvc add music/songs/
$ git add music/songs.dvc music/.gitignore
$ git commit -m "Track 1.8 GB 10,000 song dataset in music/"
The actual data is stored in the project's cache and can be pushed to one or more remote storage locations, so that others can access the registry from elsewhere.
$ dvc remote add -d myremote s3://mybucket/dvcstore
$ dvc push
A good way to organize a DVC repository into a data registry is to group similar data in directories, for example images/, natural-language/, etc.
For instance, our dataset registry has directories such as get-started/ and use-cases/, which match parts of the website's content.
Using the data registry
The main ways to consume artifacts from a data registry are the dvc import and dvc get commands, as well as the Python API, dvc.api. First, though, we may want to explore its contents.
Listing data
To explore the contents of a DVC repository in search of suitable data, use the dvc list command (similar to ls and to third-party tools such as aws s3 ls):
$ dvc list -R https://github.com/iterative/dataset-registry
.gitignore
README.md
get-started/.gitignore
get-started/data.xml
get-started/data.xml.dvc
images/.gitignore
images/dvc-logo-outlines.png
...
The command above lists both the files tracked by Git and the data (or models) tracked by DVC.
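Because the listing mixes Git-tracked and DVC-tracked entries, it can help to separate them when scanning a large registry. The sketch below is only an illustration: the function name dvc_tracked_paths is hypothetical, and it assumes the default naming where a file <name>.dvc sits next to the data <name> it tracks (data tracked by other means would not be detected).

```python
def dvc_tracked_paths(listing):
    """Return entries that have a sibling '<entry>.dvc' file in the listing.

    `listing` is an iterable of paths, e.g. the lines printed by
    `dvc list -R <url>`. Assumes the default '<name>.dvc' naming.
    """
    suffix = ".dvc"
    entries = {line.strip() for line in listing if line.strip()}
    return sorted(
        entry[: -len(suffix)]
        for entry in entries
        if entry.endswith(suffix) and entry[: -len(suffix)] in entries
    )


# Paths copied from the (truncated) listing shown above.
sample_listing = [
    ".gitignore",
    "README.md",
    "get-started/.gitignore",
    "get-started/data.xml",
    "get-started/data.xml.dvc",
    "images/.gitignore",
    "images/dvc-logo-outlines.png",
]
print(dvc_tracked_paths(sample_listing))
```

With the sample lines, only get-started/data.xml has a matching .dvc file; the logo image has none in the truncated listing, so it is not reported.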
Downloading data to the working directory
dvc get is similar to direct download tools such as wget (HTTP) or aws s3 cp (S3). To fetch a dataset from a DVC repository, we can run the following command:
$ dvc get https://github.com/example/registry music/songs
This downloads music/songs from the project's default remote storage and places it in the current working directory.
Importing data into a workflow
dvc import uses the same syntax as dvc get:
$ dvc import https://github.com/example/registry images/faces
Besides downloading the data, importing also records the dependency of the local project on the data source (the registry repository). This is done by generating a special import .dvc file that contains this metadata.
Whenever the dataset changes in the registry, we can bring the data up to date with dvc update:
$ dvc update faces.dvc
This downloads new and changed files, and removes deleted ones, based on the latest commit in the source repository; it also updates the .dvc file accordingly.
Note: dvc get, dvc import, and dvc update accept a --rev option for downloading data from a specific commit of the source repository.
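When these commands are driven from a script, pinning the revision explicitly keeps downloads reproducible. The following is a minimal sketch, not an official wrapper: the helper names dvc_download_command and run_dvc are hypothetical, and "v1.0" is an assumed tag name that the example registry may not have.

```python
import subprocess


def dvc_download_command(action, repo_url, path, rev=None, out=None):
    """Build the argument list for `dvc get` or `dvc import`."""
    if action not in ("get", "import"):
        raise ValueError(f"unsupported action: {action!r}")
    command = ["dvc", action, repo_url, path]
    if out is not None:
        command += ["-o", out]  # destination path in the workspace
    if rev is not None:
        command += ["--rev", rev]  # Git revision of the source repository
    return command


def run_dvc(command):
    """Execute the command; requires the dvc CLI to be installed."""
    subprocess.run(command, check=True)


# Only build and show the command here; call run_dvc(cmd) to execute it.
cmd = dvc_download_command(
    "get",
    "https://github.com/example/registry",
    "music/songs",
    rev="v1.0",  # hypothetical tag, used for illustration
)
print(" ".join(cmd))
```

Building the argument list separately from running it makes the command easy to log or inspect before anything is downloaded.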
Using DVC data from Python code
Our Python API, which ships in the dvc package installed together with DVC, includes the open function for loading or streaming data directly from external DVC projects:
import pickle
import dvc.api
model_path = 'model.pkl'
repo_url = 'https://github.com/example/registry'
# Binary mode, because pickle reads bytes.
with dvc.api.open(model_path, repo=repo_url, mode='rb') as fd:
    model = pickle.load(fd)
    # ... Use the model!
This opens model.pkl as a file object. The example demonstrates a simple ML model deployment method, but it can be extended to more advanced scenarios, such as a model zoo.
See also the dvc.api.read() and dvc.api.get_url() functions.
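As a rough illustration of those two functions, the sketch below wraps them in small helpers. The helper names and the optional api parameter are assumptions made for this example (the parameter only exists so the helpers can be exercised without network access); the calls themselves use dvc.api.read(path, repo=..., rev=..., mode=...) and dvc.api.get_url(path, repo=..., rev=...).

```python
def read_registry_text(path, repo_url, rev=None, api=None):
    """Return the full text content of a tracked file in a DVC repository."""
    if api is None:
        import dvc.api as api  # imported lazily; requires the dvc package
    return api.read(path, repo=repo_url, rev=rev, mode="r")


def registry_storage_url(path, repo_url, rev=None, api=None):
    """Return the remote storage URL where the tracked file is stored."""
    if api is None:
        import dvc.api as api  # imported lazily; requires the dvc package
    return api.get_url(path, repo=repo_url, rev=rev)


# Example usage (needs dvc installed and network access, so left commented):
# registry = "https://github.com/iterative/dataset-registry"
# xml_text = read_registry_text("get-started/data.xml", registry)
# print(registry_storage_url("get-started/data.xml", registry))
```

read() loads the whole file into memory, so for large files the streaming open() shown earlier is probably the better fit; get_url() only resolves the storage location and does not download anything.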
Updating the data registry
Datasets evolve, and DVC handles this easily. Just change the data in the registry, then apply the update by running dvc add again.
$ cp 1000/more/images/* music/songs/
$ dvc add music/songs/
DVC modifies the corresponding .dvc file to reflect the change, and Git picks this up:
$ git status
...
modified: music/songs.dvc
$ git commit -am "Add 1,000 more songs to music/ dataset."
Repeating this process for several datasets can produce a robust registry. The result is basically a repository that versions a set of metafiles.
Let's look at an example:
$ tree --filelimit=10
.
├── images
│ ├── .gitignore
│ ├── cats-dogs [2800 entries] # Listed in .gitignore
│ ├── faces [10000 entries] # Listed in .gitignore
│ ├── cats-dogs.dvc
│ └── faces.dvc
├── music
│ ├── .gitignore
│ ├── songs [11000 entries] # Listed in .gitignore
│ └── songs.dvc
├── text
...
Don't forget to run dvc push to send the data changes to remote storage, so that others can obtain them!
$ dvc push