---
author: James
date: 2020-11-27 15:43:48+00:00
featured_image: /wp-content/uploads/2020/11/pexels-panumas-nikhomkhai-1148820-825x510.jpg
medium_post:
- O:11:"Medium_Post":11:{s:16:"author_image_url";s:69:"https://cdn-images-1.medium.com/fit/c/200/200/0*naYvMn9xdbL5qlkJ.jpeg";s:10:"author_url";s:30:"https://medium.com/@jamesravey";s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:3:"yes";s:2:"id";s:12:"d44d231b648f";s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:6:"public";s:3:"url";s:103:"https://medium.com/@jamesravey/dvc-and-backblaze-b2-for-reliable-reproducible-data-science-d44d231b648f";}
post_meta:
- date
tags:
- data science
- devops
- machine learning
- work
title: DVC and Backblaze B2 for Reliable & Reproducible Data Science
type: posts
url: /2020/11/27/dvc-and-backblaze-b2-for-reliable-reproducible-data-science/
---
## Introduction
When you&#8217;re working with large datasets, storing them in git alongside your source code is usually not an optimal solution. Git is famously not [really suited to large files](https://opensource.com/life/16/8/how-manage-binary-blobs-git-part-7), and whilst general-purpose solutions exist ([Git LFS][1] being perhaps the most famous and widely adopted), [DVC][2] is a powerful alternative that does not require a dedicated LFS server. It can be used directly with a range of cloud storage systems as well as traditional NFS and SFTP-backed filestores, all [listed here][3].
Another point in DVC&#8217;s favour is its [powerful dependency system][4] and its [ability to precisely recreate data science projects down to the command line flag][5] &#8211; particularly desirable in academic and commercial R&D settings.
I use data buckets like S3 and Google Cloud Storage at work frequently and they&#8217;re very useful as an off-site backup for large quantities of training data. However, in my personal life my favourite S3-like vendor is [BackBlaze][6], who offer a professional, reliable service with [cheaper data access rates than Amazon and Google][7] and [an S3-compatible API][8] which you can use in many places &#8211; including DVC. If you&#8217;re new to remote storage buckets or you want to try before you buy, BackBlaze offer 10GB of remote storage free &#8211; plenty of room for a few hundred thousand pictures of [dogs and chicken nuggets][9] to train your classifier with.
## Setting up your DVC Project
Configuring DVC to use B2 instead of S3 is actually a breeze once you find the right incantation in the documentation. Our first step, if you haven&#8217;t done it already, is to install DVC. You can download an installer bundle, Debian package or RPM package from [their website][2] or, if you prefer, you can install it into your Python environment via `pip install "dvc[all]"` (the quotes stop shells such as zsh from treating the brackets as a glob pattern). The `[all]` on the end pulls in all the various DVC remote storage libraries &#8211; you could swap this for `[s3]` if you just want to use that.
Next you will want to create your data science project &#8211; I usually set mine up like this:
<pre class="wp-block-code"><code>- README.md
- .gitignore &lt;-- prefilled with pythonic ignore rules
- environment.yml &lt;-- my conda environment yaml
- data/
    - raw/ &lt;-- raw unprocessed data assets go here
    - processed/ &lt;-- partially processed and pre-processed data assets go here
</code></pre>
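If you&#8217;d rather not create that layout by hand, something like the following should do it. This is only a sketch: `my-project` is a placeholder name I&#8217;ve made up, and the files are created empty for you to fill in.

```shell
# "my-project" is a placeholder name; substitute your own project name.
# Create the raw and processed data folders (and their parents) in one go.
mkdir -p my-project/data/raw my-project/data/processed

# Create empty placeholder files at the top level of the project.
touch my-project/README.md my-project/.gitignore my-project/environment.yml
```

Remember to `cd my-project` afterwards so that the rest of the commands in this post run from the top level of the project.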
Now we can initialize git and dvc:
<pre class="wp-block-code"><code>git init
dvc init</code></pre>
## Setting up your Backblaze Bucket and Credentials
Now we&#8217;re going to create our bucket in backblaze. Assuming you&#8217;ve registered an account, you&#8217;ll want to go to &#8220;My Account&#8221; in the top right hand corner, then click &#8220;Create a new bucket&#8221;<figure class="wp-block-image size-large">
<img loading="lazy" width="660" height="210" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-1.png?resize=660%2C210&#038;ssl=1" alt="" class="wp-image-515" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-1.png?w=1008&ssl=1 1008w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-1.png?resize=300%2C96&ssl=1 300w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-1.png?resize=768%2C245&ssl=1 768w" sizes="(max-width: 660px) 100vw, 660px" data-recalc-dims="1" /></figure>
Enter a bucket name (little gotcha: the name must be unique across the whole of backblaze &#8211; not just your account) and click &#8220;Create a Bucket&#8221; taking the default options on the rest of the fields.<figure class="wp-block-image size-large">
<img loading="lazy" width="577" height="647" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-3.png?resize=577%2C647&#038;ssl=1" alt="" class="wp-image-517" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-3.png?w=577&ssl=1 577w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-3.png?resize=268%2C300&ssl=1 268w" sizes="(max-width: 577px) 100vw, 577px" data-recalc-dims="1" /></figure>
Once your bucket is created you&#8217;ll also need to copy down the &#8220;endpoint&#8221; value that shows up in the information box &#8211; we&#8217;ll need this later when we set up DVC.<figure class="wp-block-image size-large">
<img loading="lazy" width="631" height="281" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-7.png?resize=631%2C281&#038;ssl=1" alt="" class="wp-image-522" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-7.png?w=631&ssl=1 631w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-7.png?resize=300%2C134&ssl=1 300w" sizes="(max-width: 631px) 100vw, 631px" data-recalc-dims="1" /></figure>
We&#8217;re also going to need to create credentials for accessing the bucket. Go back to &#8220;My Account&#8221; and then &#8220;App Keys&#8221; and go for &#8220;Add a New Application Key&#8221;<figure class="wp-block-image size-large">
<img loading="lazy" width="660" height="145" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-4.png?resize=660%2C145&#038;ssl=1" alt="" class="wp-image-518" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-4.png?w=678&ssl=1 678w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-4.png?resize=300%2C66&ssl=1 300w" sizes="(max-width: 660px) 100vw, 660px" data-recalc-dims="1" /></figure>
Here you can enter a memorable name for this key &#8211; by convention I normally use the name of the experiment or model that I&#8217;m training. <figure class="wp-block-image size-large">
<img loading="lazy" width="570" height="574" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-5.png?resize=570%2C574&#038;ssl=1" alt="" class="wp-image-519" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-5.png?w=570&ssl=1 570w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-5.png?resize=298%2C300&ssl=1 298w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-5.png?resize=150%2C150&ssl=1 150w" sizes="(max-width: 570px) 100vw, 570px" data-recalc-dims="1" /></figure>
You can leave all of the remaining options with default/empty values or you can use these to lock down your security if you have multiple users accessing your account (or in the event that your key got committed to a public github repo) &#8211; for example we could limit this key to only the bucket we just created or only folders with a certain prefix within this bucket. For this tutorial I&#8217;m assuming you left these as they were and if you change them, your mileage may vary.<figure class="wp-block-image size-large">
<img loading="lazy" width="598" height="208" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-6.png?resize=598%2C208&#038;ssl=1" alt="" class="wp-image-521" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-6.png?w=598&ssl=1 598w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-6.png?resize=300%2C104&ssl=1 300w" sizes="(max-width: 598px) 100vw, 598px" data-recalc-dims="1" /></figure>
Once you create the key you will need to copy down the keyID and applicationKey values &#8211; heed the warning &#8211; they will only appear once, and as soon as you move off this page they will be gone forever unless you copy them somewhere safe. It&#8217;s not the end of the world since we can create more keys, but it&#8217;s still a bit annoying to have to go through it all again.
If you&#8217;ve got the name of your bucket, your endpoint, your keyID and applicationKey values stored somewhere safe then we&#8217;re done here and we can move on to the next step.
## Configuring your DVC &#8216;remote&#8217;
With our bucket all set up, we can configure DVC to talk to Backblaze. First we add a new remote to DVC. The `-d` flag sets this as the default (so that when we push, DVC will send the data to this location without being told explicitly).
<pre class="wp-block-code"><code>dvc remote add -d b2 s3://your-bucket-name/</code></pre>
So DVC knows about our bucket but unless we tell it otherwise it will assume that it&#8217;s an Amazon S3 bucket rather than a B2 bucket. We need to tell it our endpoint value:
<pre class="wp-block-code"><code>dvc remote modify b2 endpointurl https://s3.us-west-002.backblazeb2.com</code></pre>
You&#8217;ll see that I&#8217;ve copied and pasted my endpoint from when I set up my bucket and stuck &#8220;https://&#8221; on the front which dvc needs to know about to form a valid URL.
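If you script your project setup, forgetting the `https://` prefix is an easy mistake to make. Here is a small sketch of a guard against it; `b2_endpoint_url` is a hypothetical helper name of my own, not something provided by DVC or Backblaze.

```shell
# Hypothetical helper: turn the bare endpoint shown in the B2 web UI
# into a full URL, leaving it alone if it already has a scheme.
b2_endpoint_url() {
  case "$1" in
    http://*|https://*) printf '%s\n' "$1" ;;
    *) printf 'https://%s\n' "$1" ;;
  esac
}

# Example with the endpoint used in this post.
b2_endpoint_url s3.us-west-002.backblazeb2.com
```

You could then pass the result straight to DVC with `dvc remote modify b2 endpointurl "$(b2_endpoint_url s3.us-west-002.backblazeb2.com)"`.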
## Authenticating DVC
Next we need to tell DVC about our auth keys. [In the DVC manual][10] they show you that you can use the `dvc remote modify` command to permanently store your access credentials in the DVC config file. However this stores your super-duper secret credentials in plain text in a file called `.dvc/config` which gets stored in your git repository meaning that if you&#8217;re storing your work on GitHub then Joe Public could come along and start messing with your private bucket.
Instead I advocate the following approach. Firstly, in our `.gitignore` file at the top level of our project (create one if it doesn&#8217;t exist) add a line that says `.env`
Now we&#8217;re going to create a new file &#8211; again in the top level of our project directory called `.env` and paste in the following:
<pre class="wp-block-code"><code>export AWS_ACCESS_KEY_ID='&lt;keyID>'
export AWS_SECRET_ACCESS_KEY='&lt;applicationKey>'</code></pre>
Replace `<keyID>` and `<applicationKey>` with the values from the BackBlaze web UI that we copied earlier.
What we&#8217;ve just done is create a local file containing our credentials that git will not store in your repository. It&#8217;s easy enough to use these credentials with DVC from the terminal by running `source .env` first &#8211; don&#8217;t worry, I&#8217;ll show you shortly.
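If you prefer to do the `.gitignore` step from the terminal, here is a minimal sketch. It only appends `.env` if the line isn&#8217;t already there, so it is safe to run more than once. The `chmod` at the end is an extra precaution of my own rather than anything DVC requires.

```shell
# Make sure a .gitignore exists, then add .env only if it is not already listed.
touch .gitignore
grep -qxF '.env' .gitignore || echo '.env' >> .gitignore

# Optional hardening (my assumption, not part of the steps above):
# make the credentials file readable and writable by your user only.
if [ -f .env ]; then
  chmod 600 .env
fi
```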
Finally we can run `git add .dvc` followed by a `git commit` to lock in our dvc configuration in this git repository.
## Adding files to DVC
OK, so imagine you have a folder full of images for your neural model to train on, stored in `data/raw/training-data`. We&#8217;re going to add this to DVC with:
<pre class="wp-block-code"><code>dvc add data/raw/training-data</code></pre>
After you run this, you&#8217;ll get a message along these lines:
<pre class="wp-block-code"><code>100% Add|████████████████████████████████████████████████████████████|1/1 &#91;00:01, 1.36s/file]
To track the changes with git, run:
git add data/raw/.gitignore data/raw/training-data/001.jpg</code></pre>
Go ahead and execute the git command now. This will update your git repository so that the actual data (the pictures of dogs and chicken nuggets) will be gitignored but the .dvc files which contain metadata about those files and where to find them will be added to the repository. When you&#8217;re ready you can now `git commit` to save the metadata about the data to git permanently.
## Storing DVC data in backblaze
Now we have the acid test: this next step will push your data to your backblaze bucket if we have everything configured correctly. Simply run:
<pre class="wp-block-code"><code>source .env
dvc push</code></pre>
At this point you&#8217;ll either get error messages or a bunch of progress bars that will populate as the images in your folder are uploaded. Once the process is finished you&#8217;ll see a summary that says `N files pushed` where N is the number of pictures you had in your folder. If that happened then congratulations you&#8217;ve successfully configured DVC and backblaze.
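One common cause of errors at this stage is simply forgetting to run `source .env` in the current terminal. A small guard like the one below can catch that before DVC tries to talk to the bucket. It is only a sketch, and `check_b2_credentials` is a hypothetical function name of my own.

```shell
# Hypothetical helper: fail early if the B2 credentials are not
# present in the environment of the current shell.
check_b2_credentials() {
  if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "B2 credentials not set: run 'source .env' first" >&2
    return 1
  fi
}
```

With that defined, `check_b2_credentials && dvc push` will only attempt the push when both variables are set.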
## Getting the data back on another machine
If you want to work on this project with your friends, or you want to check out the project on your other laptop, then you or they will need to install git and dvc before checking out your project from GitHub (or wherever your project is hosted). Once they have a local copy they should be able to go into the `data/raw/training-data` folder and they will see all of the `*.dvc` files describing where the training data is.
Your git repository should already have all of your DVC configuration in it, including the endpoint URL for your bucket. However, in order to download this data they will first need to create a `.env` file of their own containing a key pair (ideally one that you&#8217;ve generated for them that is locked down as much as possible to just the project that you&#8217;d like to collaborate with them on). Then they will need to run:
<pre class="wp-block-code"><code>source .env
dvc pull</code></pre>
This should begin the process of downloading your files from backblaze and making local copies of them in `data/raw/training-data`.
## Streamlining Workflows
One final tip I&#8217;d offer is using `dvc install`, which adds hooks to git so that `dvc push` is triggered automatically whenever you `git push` &#8211; saving you from running that step manually. It also hooks `dvc checkout` up to `git checkout`, which helps if you&#8217;re working with different data assets on different project branches. As far as I can tell the hooks do not download data for you after a `git pull`, so you may still need to run `dvc pull` yourself on a fresh machine.
## Final Thoughts
Congratulations, if you got this far it means you&#8217;ve configured DVC and Backblaze B2 and have a perfectly reproducible data science workflow at the tips of your fingers. This workflow is well optimised for teams of people working on data science experiments that need to be repeatable or have large volumes of unwieldy data that needs a better home than git.
_If you found this post useful please leave claps and comments or follow me on twitter [@jamesravey][11] for more._
[1]: https://git-lfs.github.com/
[2]: https://dvc.org/
[3]: https://dvc.org/doc/command-reference/remote/add
[4]: https://dvc.org/doc/command-reference/dag
[5]: https://dvc.org/doc/command-reference/run
[6]: https://www.backblaze.com/
[7]: https://www.backblaze.com/b2/cloud-storage.html
[8]: https://www.backblaze.com/b2/docs/s3_compatible_api.html
[9]: http://www.mtv.com/news/2752312/bagel-or-dog-or-fried-chicken-or-dog/
[10]: https://dvc.org/doc/command-reference/remote/modify#example-customize-an-s3-remote
[11]: https://twitter.com/jamesravey