fix links

James Ravenscroft 2021-12-31 16:34:24 +00:00
parent f703598ed8
commit b06d7279da
1 changed file with 1 addition and 1 deletion


@@ -50,7 +50,7 @@ In September, my colleague Cynthia won the Best KTP Award for her collaboration
At work I led the adoption of [MLFlow](https://mlflow.org/) for storing all of our machine learning experiments and results. This was a huge win in terms of productivity, reproducibility and transparency for the data science team as it means that we always know which models were trained, when, by whom, with which data, where that data is, what parameters were used and what performance was achieved. I [wrote a post about some of the challenges of using MLFlow with NLP models](/2020/12/29/serving-nlp-models-with-mlflow/) earlier in the year.
-We've also adopted [DVC](https://dvc.org/) for tracking large data files (i.e. training data sets) without committing the data itself to git. This means that we know exactly which data was used for running a given script/model but that data is not clogging up our git repositories (which slows down checking projects out), it is secure (even if you have access to our git server, you also need credentials to access the data bucket) and access to the data is auditable in a pinch (we can use S3 buckets with paranoid logging). I also [wrote a little about using DVC with backblaze](http://localhost:1313/2020/11/27/dvc-and-backblaze-b2-for-reliable-reproducible-data-science/) which is something I do for personal projects and my PHD work at the end of last year. I've started using DVC for tracking and reproducing script runs as well but I've still got to write that up into a blog post and some guidelines for my team.
+We've also adopted [DVC](https://dvc.org/) for tracking large data files (i.e. training data sets) without committing the data itself to git. This means that we know exactly which data was used for running a given script/model but that data is not clogging up our git repositories (which slows down checking projects out), it is secure (even if you have access to our git server, you also need credentials to access the data bucket) and access to the data is auditable in a pinch (we can use S3 buckets with paranoid logging). I also [wrote a little about using DVC with backblaze](/2020/11/27/dvc-and-backblaze-b2-for-reliable-reproducible-data-science/) which is something I do for personal projects and my PHD work at the end of last year. I've started using DVC for tracking and reproducing script runs as well but I've still got to write that up into a blog post and some guidelines for my team.
I also formalised some guidelines on best practices for Python development within the data science team at work. Python dependency management can be a real PITA. I've been doing Python dev since 2005 and things have really come on leaps and bounds in the last few years with the introduction of tools like [Poetry](https://python-poetry.org/) and [pipenv](https://pipenv.pypa.io/en/latest/). Earlier in the year I published [some of my thoughts](/2021/04/01/opinionated-guide-to-virtualenvs/) on how best to handle python environments and dependencies that we've now adopted within Filament.