<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"><title>DVC and Backblaze B2 for Reliable &amp; Reproducible Data Science - Brainsteam</title><meta name="viewport" content="width=device-width, initial-scale=1">
<meta itemprop="name" content="DVC and Backblaze B2 for Reliable &amp; Reproducible Data Science">
<meta itemprop="description" content="Introduction When you're working with large datasets, storing them in git alongside your source code is usually not an optimal solution. Git is famously not well suited to large files, and whilst general purpose solutions exist (Git LFS being perhaps the most famous and widely adopted), DVC is a powerful alternative that does not require a dedicated LFS server and can be used directly with a range of cloud storage systems as well as traditional NFS and SFTP-backed filestores, all listed out here."><meta itemprop="datePublished" content="2020-11-27T15:43:48&#43;00:00" />
<meta itemprop="dateModified" content="2020-11-27T15:43:48&#43;00:00" />
<meta itemprop="wordCount" content="1659">
<meta itemprop="keywords" content="data science,devops,machine learning," /><meta property="og:title" content="DVC and Backblaze B2 for Reliable &amp; Reproducible Data Science" />
<meta property="og:description" content="Introduction When you're working with large datasets, storing them in git alongside your source code is usually not an optimal solution. Git is famously not well suited to large files, and whilst general purpose solutions exist (Git LFS being perhaps the most famous and widely adopted), DVC is a powerful alternative that does not require a dedicated LFS server and can be used directly with a range of cloud storage systems as well as traditional NFS and SFTP-backed filestores, all listed out here." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://brainsteam.co.uk/2020/11/27/dvc-and-backblaze-b2-for-reliable-reproducible-data-science/" /><meta property="article:section" content="posts" />
<meta property="article:published_time" content="2020-11-27T15:43:48&#43;00:00" />
<meta property="article:modified_time" content="2020-11-27T15:43:48&#43;00:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="DVC and Backblaze B2 for Reliable &amp; Reproducible Data Science"/>
<meta name="twitter:description" content="Introduction When you're working with large datasets, storing them in git alongside your source code is usually not an optimal solution. Git is famously not well suited to large files, and whilst general purpose solutions exist (Git LFS being perhaps the most famous and widely adopted), DVC is a powerful alternative that does not require a dedicated LFS server and can be used directly with a range of cloud storage systems as well as traditional NFS and SFTP-backed filestores, all listed out here."/>
<link href='https://fonts.googleapis.com/css?family=Playfair+Display:700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/normalize.css" />
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/main.css" />
<link id="dark-scheme" rel="stylesheet" type="text/css" href="https://brainsteam.co.uk/css/dark.css" />
<script src="https://brainsteam.co.uk/js/feather.min.js"></script>
<script src="https://brainsteam.co.uk/js/main.js"></script>
</head>
<body>
<div class="container wrapper">
<div class="header">
<div class="avatar">
<a href="https://brainsteam.co.uk/">
<img src="/images/avatar.png" alt="Brainsteam" />
</a>
</div>
<h1 class="site-title"><a href="https://brainsteam.co.uk/">Brainsteam</a></h1>
<div class="site-description"><p>The irregular mental expulsions of a PhD student and CTO of Filament, my views are my own and do not represent my employers in any way.</p><nav class="nav social">
<ul class="flat"><li><a href="https://twitter.com/jamesravey/" title="Twitter" rel="me"><i data-feather="twitter"></i></a></li><li><a href="https://github.com/ravenscroftj" title="Github" rel="me"><i data-feather="github"></i></a></li><li><a href="/index.xml" title="RSS" rel="me"><i data-feather="rss"></i></a></li></ul>
</nav></div>
<nav class="nav">
<ul class="flat">
<li>
<a href="/">Home</a>
</li>
<li>
<a href="/tags">Tags</a>
</li>
<li>
<a href="https://jamesravey.me">About Me</a>
</li>
</ul>
</nav>
</div>
<div class="post">
<div class="post-header">
<div class="meta">
<div class="date">
<span class="day">27</span>
<span class="rest">Nov 2020</span>
</div>
</div>
<div class="matter">
<h1 class="title">DVC and Backblaze B2 for Reliable &amp; Reproducible Data Science</h1>
</div>
</div>
<div class="markdown">
<h2 id="introduction">Introduction</h2>
<p>When you're working with large datasets, storing them in git alongside your source code is usually not an optimal solution. Git is famously <a href="https://opensource.com/life/16/8/how-manage-binary-blobs-git-part-7" data-type="URL" data-id="https://opensource.com/life/16/8/how-manage-binary-blobs-git-part-7">not well suited to large files</a>, and whilst general purpose solutions exist (<a href="https://git-lfs.github.com/">Git LFS</a> being perhaps the most famous and widely adopted), <a href="https://dvc.org/">DVC</a> is a powerful alternative that does not require a dedicated LFS server and can be used directly with a range of cloud storage systems, as well as traditional NFS and SFTP-backed filestores, all <a href="https://dvc.org/doc/command-reference/remote/add">listed here</a>.</p>
<p>Another point in DVC's favour is its <a href="https://dvc.org/doc/command-reference/dag">powerful dependency system</a> and its ability to <a href="https://dvc.org/doc/command-reference/run">precisely recreate data science projects down to the command line flag</a>, which is particularly desirable in academic and commercial R&amp;D settings.</p>
<p>I use data buckets like S3 and Google Cloud Storage at work frequently and they're very useful as an off-site backup for large quantities of training data. However, in my personal life my favourite S3-like vendor is <a href="https://www.backblaze.com/">Backblaze</a>, who offer a professional, reliable service with <a href="https://www.backblaze.com/b2/cloud-storage.html">cheaper data access rates than Amazon and Google</a> and <a href="https://www.backblaze.com/b2/docs/s3_compatible_api.html">an S3-compatible API</a> which you can use in many places, including DVC. If you're new to remote storage buckets or you want to try before you buy, Backblaze offer 10GB of remote storage free: plenty of room for a few hundred thousand pictures of <a href="http://www.mtv.com/news/2752312/bagel-or-dog-or-fried-chicken-or-dog/">dogs and chicken nuggets</a> to train your classifier with.</p>
<h2 id="setting-up-your-dvc-project">Setting up your DVC Project</h2>
<p>Configuring DVC to use B2 instead of S3 is actually a breeze once you find the right incantation in the documentation. Our first step, if you haven't done it already, is to install DVC. You can download an installer bundle, Debian package or RPM package from <a href="https://dvc.org/">their website</a>, or if you prefer you can install it into your Python environment via <code>pip install dvc[all]</code>. The <code>[all]</code> on the end pulls in all the various DVC remote storage libraries; you could swap this for <code>[s3]</code> if you only need S3 support, which is what this tutorial uses.</p>
<p>Next you will want to create your data science project. I usually set mine up like this:</p>
<pre class="wp-block-code"><code>- README.md
- .gitignore &lt;-- prefilled with pythonic ignore rules
- environment.yml &lt;-- my conda environment yaml
- data/
    - raw/ &lt;-- raw unprocessed data assets go here
    - processed/ &lt;-- partially processed and pre-processed data assets go here
</code></pre>
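<p>As a rough sketch, the layout above could be scaffolded from the terminal like this. The directory name <code>my-dvc-project</code> is just a placeholder, and the files are created empty for you to fill in:</p>

```shell
# Scaffold the project skeleton described above.
# "my-dvc-project" is a hypothetical name; pick your own.
project_dir="my-dvc-project"

# Raw and processed data live in separate folders under data/.
mkdir -p "$project_dir/data/raw" "$project_dir/data/processed"

# Create empty placeholder files to fill in later.
touch "$project_dir/README.md" "$project_dir/.gitignore" "$project_dir/environment.yml"
```

<p>Remember to <code>cd</code> into the project directory before running the commands that follow.</p>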
<p>Now we can initialize git and dvc:</p>
<pre class="wp-block-code"><code>git init
dvc init</code></pre>
<h2 id="setting-up-your-backblaze-bucket-and-credentials">Setting up your Backblaze Bucket and Credentials</h2>
<p>Now we're going to create our bucket in Backblaze. Assuming you've registered an account, you'll want to go to “My Account” in the top right hand corner, then click “Create a new bucket”.</p>
<figure class="wp-block-image size-large"><img loading="lazy" width="660" height="210" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-1.png?resize=660%2C210&#038;ssl=1" alt="" class="wp-image-515" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-1.png?w=1008&ssl=1 1008w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-1.png?resize=300%2C96&ssl=1 300w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-1.png?resize=768%2C245&ssl=1 768w" sizes="(max-width: 660px) 100vw, 660px" data-recalc-dims="1" /></figure>
<p>Enter a bucket name (little gotcha: the name must be unique across the whole of Backblaze, not just your account) and click “Create a Bucket”, taking the default options on the rest of the fields.</p>
<figure class="wp-block-image size-large"><img loading="lazy" width="577" height="647" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-3.png?resize=577%2C647&#038;ssl=1" alt="" class="wp-image-517" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-3.png?w=577&ssl=1 577w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-3.png?resize=268%2C300&ssl=1 268w" sizes="(max-width: 577px) 100vw, 577px" data-recalc-dims="1" /></figure>
<p>Once your bucket is created you'll also need to copy down the “endpoint” value that shows up in the information box; we'll need this later when we set up DVC.</p>
<figure class="wp-block-image size-large"><img loading="lazy" width="631" height="281" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-7.png?resize=631%2C281&#038;ssl=1" alt="" class="wp-image-522" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-7.png?w=631&ssl=1 631w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-7.png?resize=300%2C134&ssl=1 300w" sizes="(max-width: 631px) 100vw, 631px" data-recalc-dims="1" /></figure>
<p>We're also going to need to create credentials for accessing the bucket. Go back to “My Account”, then “App Keys”, and choose “Add a New Application Key”.</p>
<figure class="wp-block-image size-large"><img loading="lazy" width="660" height="145" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-4.png?resize=660%2C145&#038;ssl=1" alt="" class="wp-image-518" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-4.png?w=678&ssl=1 678w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-4.png?resize=300%2C66&ssl=1 300w" sizes="(max-width: 660px) 100vw, 660px" data-recalc-dims="1" /></figure>
<p>Here you can enter a memorable name for this key. By convention I normally use the name of the experiment or model that I'm training.</p>
<figure class="wp-block-image size-large"><img loading="lazy" width="570" height="574" src="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-5.png?resize=570%2C574&#038;ssl=1" alt="" class="wp-image-519" srcset="https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-5.png?w=570&ssl=1 570w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-5.png?resize=298%2C300&ssl=1 298w, https://i2.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-5.png?resize=150%2C150&ssl=1 150w" sizes="(max-width: 570px) 100vw, 570px" data-recalc-dims="1" /></figure>
<p>You can leave all of the remaining options with default/empty values, or you can use them to lock down your security if you have multiple users accessing your account (or in the event that your key gets committed to a public GitHub repo). For example, we could limit this key to only the bucket we just created, or only to folders with a certain prefix within this bucket. For this tutorial I'm assuming you left these as they were; if you change them, your mileage may vary.</p>
<figure class="wp-block-image size-large"><img loading="lazy" width="598" height="208" src="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-6.png?resize=598%2C208&#038;ssl=1" alt="" class="wp-image-521" srcset="https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-6.png?w=598&ssl=1 598w, https://i1.wp.com/brainsteam.co.uk/wp-content/uploads/2020/11/image-6.png?resize=300%2C104&ssl=1 300w" sizes="(max-width: 598px) 100vw, 598px" data-recalc-dims="1" /></figure>
<p>Once you create the key you will need to copy down the keyID and applicationKey values. Heed the warning: they will only appear once, and as soon as you move off this page they will be gone forever unless you copy the values somewhere safe. It's not the end of the world since we can create more keys, but it's still a bit annoying to have to go through this again.</p>
<p>If you've got the name of your bucket, your endpoint, and your keyID and applicationKey values stored somewhere safe then we're done here and we can move on to the next step.</p>
<h2 id="configuring-your-dvc-8216remote8217">Configuring your DVC remote</h2>
<p>With our bucket all set up, we can configure DVC to talk to Backblaze. First we add a new remote to DVC. The <code>-d</code> flag sets this as the default (so that when we push, DVC will send the data to this location by default without being told explicitly).</p>
<pre class="wp-block-code"><code>dvc remote add -d b2 s3://your-bucket-name/</code></pre>
<p>So DVC knows about our bucket, but unless we tell it otherwise it will assume that it's an Amazon S3 bucket rather than a B2 bucket. We need to tell it our endpoint value:</p>
<pre class="wp-block-code"><code>dvc remote modify b2 endpointurl https://s3.us-west-002.backblazeb2.com</code></pre>
<p>You'll see that I've copied and pasted my endpoint from when I set up my bucket and stuck “https://” on the front, which DVC needs in order to form a valid URL.</p>
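<p>If you script this step, a small sketch like the following can add the scheme for you. The variable names are my own, and the endpoint is the example value from this post; substitute the one shown for your bucket:</p>

```shell
# Endpoint as shown in the Backblaze bucket info box (example value from this post).
endpoint_host="s3.us-west-002.backblazeb2.com"

# Prepend https:// only if no scheme is present already.
case "$endpoint_host" in
  http://*|https://*) endpoint_url="$endpoint_host" ;;
  *)                  endpoint_url="https://$endpoint_host" ;;
esac

# Print the command rather than running it, so it can be checked first.
echo "dvc remote modify b2 endpointurl $endpoint_url"
```

<p>Printing the command first is a deliberate choice here: it lets you eyeball the URL before it ends up in <code>.dvc/config</code>.</p>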
<h2 id="authenticating-dvc">Authenticating DVC</h2>
<p>Next we need to tell DVC about our auth keys. <a href="https://dvc.org/doc/command-reference/remote/modify#example-customize-an-s3-remote">In the DVC manual</a> they show that you can use the <code>dvc remote modify</code> command to permanently store your access credentials in the DVC config file. However, this stores your super-duper secret credentials in plain text in a file called <code>.dvc/config</code>, which gets stored in your git repository, meaning that if you're storing your work on GitHub then Joe Public could come along and start messing with your private bucket.</p>
<p>Instead I advocate the following approach. Firstly, in the <code>.gitignore</code> file at the top level of our project (create one if it doesn't exist), add a line that says <code>.env</code>.</p>
<p>Now we're going to create a new file, again in the top level of our project directory, called <code>.env</code> and paste in the following:</p>
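<p>One way to add that line from the terminal without duplicating it on repeat runs, as a minimal sketch to be run from the top level of the project:</p>

```shell
# Make sure .gitignore exists, then append ".env" only if that exact line is absent.
touch .gitignore
grep -qxF '.env' .gitignore || echo '.env' >> .gitignore
```

<p>If the project is already a git repository, <code>git check-ignore -v .env</code> should then report the rule that matches.</p>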
<pre class="wp-block-code"><code>export AWS_ACCESS_KEY_ID='&lt;keyID>'
export AWS_SECRET_ACCESS_KEY='&lt;applicationKey>'</code></pre>
<p>Replace <code>&lt;keyID&gt;</code> and <code>&lt;applicationKey&gt;</code> with the values from the Backblaze web UI that we copied earlier.</p>
<p>What we've just done is create a local file containing our credentials that git is not permitted to store in the repository. It's easy to use these credentials with DVC from the terminal by running <code>source .env</code> first, as we'll see when we push our data.</p>
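<p>To check that the file loads as expected without printing the secrets themselves, something like the sketch below can help. It writes a throwaway <code>.env</code> with placeholder values into a temporary directory purely for demonstration; in a real project you would load your own <code>.env</code> instead:</p>

```shell
# Demo only: write a throwaway .env with placeholder values into a temp directory.
demo_dir="$(mktemp -d)"
cat > "$demo_dir/.env" <<'EOF'
export AWS_ACCESS_KEY_ID='example-key-id'
export AWS_SECRET_ACCESS_KEY='example-application-key'
EOF

# Restrict the file to the current user, since it holds secrets.
chmod 600 "$demo_dir/.env"

# "." is the portable spelling of "source"; it runs the file in the current shell.
. "$demo_dir/.env"

# Report whether each variable is set without echoing its value.
if [ -n "${AWS_ACCESS_KEY_ID:-}" ]; then echo "AWS_ACCESS_KEY_ID is set"; fi
if [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then echo "AWS_SECRET_ACCESS_KEY is set"; fi
```

<p>The <code>chmod 600</code> step is an extra precaution of mine rather than something DVC requires.</p>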
<p>Finally we can run <code>git add .dvc</code> followed by a <code>git commit</code> to lock in our dvc configuration in this git repository.</p>
<h2 id="adding-files-to-dvc">Adding files to DVC</h2>
<p>OK, so imagine you have a folder full of images for your neural model to train on, stored in <code>data/raw/training-data</code>. We're going to add this to DVC with:</p>
<pre class="wp-block-code"><code>dvc add data/raw/training-data</code></pre>
<p>After you run this, you'll get a message along these lines:</p>
<pre class="wp-block-code"><code>100% Add|████████████████████████████████████████████████████████████|1/1 &#91;00:01, 1.36s/file]
To track the changes with git, run:
git add data/raw/.gitignore data/raw/training-data.dvc</code></pre>
<p>Go ahead and execute the git command now. This will update your git repository so that the actual data (the pictures of dogs and chicken nuggets) is gitignored, while the <code>.dvc</code> files, which contain metadata about that data and where to find it, are added to the repository. When you're ready you can <code>git commit</code> to save the metadata about the data to git permanently.</p>
<h2 id="storing-dvc-data-in-backblaze">Storing DVC data in backblaze</h2>
<p>Now we have the acid test: if we have everything configured correctly, this next step will push your data to your Backblaze bucket. Simply run:</p>
<pre class="wp-block-code"><code>source .env
dvc push</code></pre>
<p>At this point you'll either get error messages or a bunch of progress bars that will populate as the images in your folder are uploaded. Once the process is finished you'll see a summary that says <code>N files pushed</code>, where N is the number of pictures you had in your folder. If that happened then congratulations: you've successfully configured DVC and Backblaze.</p>
<h2 id="getting-the-data-back-on-another-machine">Getting the data back on another machine</h2>
<p>If you want to work on this project with your friends, or you want to check out the project on your other laptop, then you or they will need to install git and DVC before cloning your project from GitHub (or wherever your project is hosted). Once they have a local copy they should be able to go into the <code>data/raw</code> folder, where they will see the <code>*.dvc</code> files describing where the training data is.</p>
<p>Your git repository should already have all of your DVC configuration in it, including the endpoint URL for your bucket. However, in order to download this data they will first need to create a <code>.env</code> file of their own containing a key pair (ideally one that you've generated for them that is locked down as much as possible to just the project that you'd like to collaborate with them on). Then they will need to run:</p>
<pre class="wp-block-code"><code>source .env
dvc pull</code></pre>
<p>This should begin the process of downloading your files from Backblaze and making local copies of them in <code>data/raw/training-data</code>. <code>dvc pull</code> fetches the data from the remote into the local DVC cache and then checks it out into the workspace; <code>dvc checkout</code> on its own only restores files that are already cached locally.</p>
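<p>For collaborators, the two steps can be wrapped in a small helper that fails early if the credentials are missing. This is only a sketch: <code>dvc_pull_with_env</code> is a hypothetical function name of my own, and it assumes <code>dvc pull</code> is the command used to download from the remote:</p>

```shell
# Hypothetical helper: load credentials from an env file, then pull data with DVC.
dvc_pull_with_env() {
  env_file="${1:-.env}"

  # A bare filename needs a ./ prefix so "." reads it from the current directory.
  case "$env_file" in
    */*) ;;
    *)   env_file="./$env_file" ;;
  esac

  if [ ! -f "$env_file" ]; then
    echo "missing $env_file: create it with your key pair first" >&2
    return 1
  fi

  # Load the exported AWS_* variables into the current shell.
  . "$env_file"

  if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "credentials not set after loading $env_file" >&2
    return 1
  fi

  # Fetch from the default remote and check the files out into the workspace.
  dvc pull
}
```

<p>Calling <code>dvc_pull_with_env</code> from the top level of the project would then load <code>.env</code> and pull in one go; pass a different path as the first argument if the credentials live elsewhere.</p>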
<h2 id="streamlining-workflows">Streamlining Workflows</h2>
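<p>One final tip I'd offer is using <code>dvc install</code>, which adds hooks to git so that <code>dvc push</code> is triggered automatically whenever you <code>git push</code>, saving you from running that step manually. It also hooks <code>dvc checkout</code> up to <code>git checkout</code>, in case you're working with different data assets on different project branches. Note that the hooks do not download anything on <code>git pull</code>, so you will still need to run <code>dvc pull</code> yourself to fetch data you don't already have locally.</p>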
<h2 id="final-thoughts">Final Thoughts</h2>
<p>Congratulations, if you got this far it means you've configured DVC and Backblaze B2 and have a perfectly reproducible data science workflow at your fingertips. This workflow is well optimised for teams of people working on data science experiments that need to be repeatable, or that have large volumes of unwieldy data that needs a better home than git.</p>
<p><em>If you found this post useful please leave claps and comments or follow me on twitter <a href="https://twitter.com/jamesravey">@jamesravey</a> for more.</em></p>
</div>
<div class="tags">
<ul class="flat">
<li><a href="/tags/data-science">data science</a></li>
<li><a href="/tags/devops">devops</a></li>
<li><a href="/tags/machine-learning">machine learning</a></li>
</ul>
</div><div id="disqus_thread"></div>
<script type="text/javascript">
(function () {
if (window.location.hostname == "localhost")
return;
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
var disqus_shortname = 'brainsteam';
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the comments.</noscript>
<a href="http://disqus.com/" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
</div>
</div>
<div class="footer wrapper">
<nav class="nav">
<div>2021 © James Ravenscroft 2020 | <a href="https://github.com/knadh/hugo-ink">Ink</a> theme on <a href="https://gohugo.io">Hugo</a></div>
</nav>
</div>
<script type="application/javascript">
var doNotTrack = false;
if (!doNotTrack) {
window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;
ga('create', 'UA-186263385-1', 'auto');
ga('send', 'pageview');
}
</script>
<script async src='https://www.google-analytics.com/analytics.js'></script>
<script>feather.replace()</script>
</body>
</html>