diff --git a/brainsteam/content/posts/2023/08/14/airbyte/index.md b/brainsteam/content/posts/2023/08/14/airbyte/index.md
index 65d3507..181138e 100644
--- a/brainsteam/content/posts/2023/08/14/airbyte/index.md
+++ b/brainsteam/content/posts/2023/08/14/airbyte/index.md
@@ -20,12 +20,12 @@ tags:
 
 ## Introduction
 
-Airbyte is an ELT tool that allows you to periodically extract data from one database and then load and transform it into another. Airbyte provides a performant way to clone data between databases and gives us the flexibility to dictate what gets shared at field level (for example we can copy the users table but we can omit name, address, phone number etc). There are a bunch of use cases where this kind of thing might be useful. For example, say you have a data science team who want to generate reports on how many sales your e-shop made this month and train predictive models for next month's revenue. In this case, you wouldn't want to give your data team direct access to your e-shop database because:
+Airbyte is a tool that allows you to periodically extract data from one database and then load and transform it into another. It provides a performant way to clone data between databases and gives us the flexibility to dictate what gets shared at field level (for example we can copy the users table but we can omit name, address, phone number etc). There are a bunch of use cases where this kind of thing might be useful. For example, say you have a data science team who want to generate reports on how many sales your e-shop made this month and train predictive models for next month's revenue. You wouldn't want to give your data team direct access to your e-shop database because:
 
 1. there might be sensitive personal information (SPI) in there (think names, addresses, bank details, links to what individual users ordered)
 2. running this kind of analysis might impact the performance of your shop database and customers might start churning.
 
-Instead, we can use a tool, such as airbyte, regularly make copies of a subset of the production database minus the SPI and load it into an external analytics database that the data team can use to make complex queries all day long without affecting application performance. This pattern is called Extract Load Transform or ELT.
+Instead, we can use a tool, such as airbyte, regularly make copies of a subset of the production database minus the SPI and load it into an external analytics database. The data team can then use this external database to make complex queries all day long without affecting application performance. This pattern is called Extract Load Transform or ELT.
 
 In this post I'll summarise some strategies for using airbyte in a production environment and share some tips for navigating some of it's "rough edges" based on my own experience of setting up this tool as a pilot project for some of my clients in my day job.
 
@@ -58,7 +58,7 @@ I have a risk averse client-base who have strong data protection requirements so
 
 If you opt for the self-hosted version like we did you'll need to pick a VM that has enough resources to run Airbyte. We went for google's `n2-standard-4` machine spec which has 4 CPU cores and 16GB of RAM. This was actually our second choice after picking an `e2-standard-2` which only had 8GB of RAM which was not enough to run Airbyte optimally and caused thrashing/spiking issues.
 
-Although all the data does pass through the VM, it's done in buffered chunks so your VM doesn't need a lot of storage space - 50GiB is sufficient for our setup.
+Although all the data does pass through the VM, it's done in buffered chunks so your VM doesn't need a lot of storage space - 50GiB was sufficient for our setup.
 
 ## Setting up Airbyte
 
@@ -171,6 +171,10 @@ export OCTAVIA_ENABLE_TELEMETRY=True
 alias octavia="docker run -i --rm -v \$(pwd):/home/octavia-project --network host --env-file \${OCTAVIA_ENV_FILE} --user \$(id -u):\$(id -g) airbyte/octavia-cli:0.50.7"
 ```
 
+### Octavia and SSL
+
+Unfortunately I couldn't find an easy way to get Octavia to play nicely with self-signed SSL certificates which meant we had to load in an externally "signed" SSL cert. Octavia is written in Python and uses [requests](https://pypi.org/project/requests/) to interact with Airbyte so you could theoretically configure it to trust a self-signing certificate authority (as per [this stackoverflow post](https://stackoverflow.com/questions/30405867/how-to-get-python-requests-to-trust-a-self-signed-ssl-certificate)).
+
 ## Keeping an Eye on Things
 
 Once you have your sync up and running you probably want to make sure it keeps running regularly. Airflow has slack webhook integration which means that it's easy enough to have it automatically notify you when sync has passed or failed.