---
date: 2022-11-20T11:18:31
hypothesis-meta:
  created: "2022-11-20T11:18:31.041323+00:00"
  document:
    title: "Data Engineering in 2022: ELT tools"
  flagged: false
  group: __world__
  hidden: false
  id: EF4wWGjFEe2zrM9D4rCx-g
  links:
    html: https://hypothes.is/a/EF4wWGjFEe2zrM9D4rCx-g
    incontext: https://hyp.is/EF4wWGjFEe2zrM9D4rCx-g/rmoff.net/2022/11/08/data-engineering-in-2022-elt-tools/
    json: https://hypothes.is/api/annotations/EF4wWGjFEe2zrM9D4rCx-g
  permissions:
    admin:
      - acct:ravenscroftj@hypothes.is
    delete:
      - acct:ravenscroftj@hypothes.is
    read:
      - group:__world__
    update:
      - acct:ravenscroftj@hypothes.is
  tags:
    - data-engineering
    - data-science
    - ELT
  target:
    selector:
      - endContainer: /main[1]/article[1]/div[3]/ul[1]/li[1]/div[2]/p[1]
        endOffset: 383
        startContainer: /main[1]/article[1]/div[3]/ul[1]/li[1]/div[2]/p[1]
        startOffset: 0
        type: RangeSelector
      - end: 2093
        start: 1710
        type: TextPositionSelector
      - exact: >-
          Working with the raw data has lots of benefits, since at the point
          of ingest you dont know all of the possible uses for the data. If
          you rationalise that data down to just the set of fields and/or
          aggregate it up to fit just a specific use case then you lose the
          fidelity of the data that could be useful elsewhere. This is one of
          the premises and benefits of a data lake done well.
        prefix: "keep it at a manageable size. "
        suffix: "Of course, despite what the "
        type: TextQuoteSelector
    source: https://rmoff.net/2022/11/08/data-engineering-in-2022-elt-tools/
  text: >-
    absolutely right - there's also a data provenance angle here - it is
    useful to be able to point to a data point that is 5 or 6 transformations
    from the raw input and be able to say "yes I know exactly where this came
    from, here are all the steps that came before"
  updated: "2022-11-20T11:18:31.041323+00:00"
  uri: https://rmoff.net/2022/11/08/data-engineering-in-2022-elt-tools/
  user: acct:ravenscroftj@hypothes.is
  user_info:
    display_name: James Ravenscroft
in-reply-to: https://rmoff.net/2022/11/08/data-engineering-in-2022-elt-tools/
tags:
  - data-engineering
  - data-science
  - ELT
  - hypothesis
type: annotation
url: /annotation/2022/11/20/1668943111
---
> Working with the raw data has lots of benefits, since at the point of ingest you dont know all of the possible uses for the data. If you rationalise that data down to just the set of fields and/or aggregate it up to fit just a specific use case then you lose the fidelity of the data that could be useful elsewhere. This is one of the premises and benefits of a data lake done well.
Absolutely right - there's also a data provenance angle here: it's useful to be able to point to a data point that is 5 or 6 transformations away from the raw input and say "yes, I know exactly where this came from; here are all the steps that came before."
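To make that provenance idea concrete, here's a toy sketch of the pattern: each value carries a record of every transformation applied since ingest, so any downstream number can be traced back to the raw input. This isn't how any particular lineage tool works - `TrackedValue` and the step labels are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedValue:
    """A value that remembers every transformation applied to it."""
    value: object
    lineage: list = field(default_factory=list)

    def apply(self, fn, label):
        """Apply fn to the value, recording the step in the lineage."""
        return TrackedValue(fn(self.value), self.lineage + [label])


# A raw field as it might arrive at ingest time, then a short pipeline.
raw = TrackedValue(" 42 ", ["ingested from source CSV"])
clean = (raw
         .apply(str.strip, "strip whitespace")
         .apply(int, "cast to int")
         .apply(lambda x: x * 2, "double for reporting"))

print(clean.value)    # the final figure: 84
print(clean.lineage)  # every step back to the raw ingest
```

Real pipelines get this from tooling (dbt's lineage graph, for example) rather than hand-rolled wrappers, but the principle is the same: the transformed value is only trustworthy because the chain of steps behind it is inspectable.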