brainsteam.co.uk/brainsteam/content/annotations/2022/11/20/1668943111.md at c2ab2a61a1048e0d2720d2c1e72df98946e8ac4a

3.0 KiB

date

hypothesis-meta

in-reply-to

tags

target

text

updated

uri

user

user_info

2022-11-20T11:18:31.041323+00:00

title

Data Engineering in 2022: ELT tools

false

__world__

false

EF4wWGjFEe2zrM9D4rCx-g

html	incontext	json
https://hypothes.is/a/EF4wWGjFEe2zrM9D4rCx-g	https://hyp.is/EF4wWGjFEe2zrM9D4rCx-g/rmoff.net/2022/11/08/data-engineering-in-2022-elt-tools/	https://hypothes.is/api/annotations/EF4wWGjFEe2zrM9D4rCx-g

admin

delete

read

update

acct:ravenscroftj@hypothes.is

group:__world__

acct:ravenscroftj@hypothes.is

data-engineering

data-science

ELT

selector

source

endContainer	endOffset	startContainer	startOffset	type
/main[1]/article[1]/div[3]/ul[1]/li[1]/div[2]/p[1]	383	/main[1]/article[1]/div[3]/ul[1]/li[1]/div[2]/p[1]	0	RangeSelector

end	start	type
2093	1710	TextPositionSelector

exact	prefix	suffix	type
Working with the raw data has lots of benefits, since at the point of ingest you don’t know all of the possible uses for the data. If you rationalise that data down to just the set of fields and/or aggregate it up to fit just a specific use case then you lose the fidelity of the data that could be useful elsewhere. This is one of the premises and benefits of a data lake done well.	keep it at a manageable size.	Of course, despite what the	TextQuoteSelector

https://rmoff.net/2022/11/08/data-engineering-in-2022-elt-tools/

Absolutely right. There's also a data provenance angle here: it is useful to be able to point to a data point that is 5 or 6 transformations from the raw input and say "yes, I know exactly where this came from; here are all the steps that came before".

2022-11-20T11:18:31.041323+00:00

https://rmoff.net/2022/11/08/data-engineering-in-2022-elt-tools/

acct:ravenscroftj@hypothes.is

display_name
James Ravenscroft

https://rmoff.net/2022/11/08/data-engineering-in-2022-elt-tools/

data-engineering

data-science

ELT

hypothesis

annotation

/annotation/2022/11/20/1668943111

Working with the raw data has lots of benefits, since at the point of ingest you don’t know all of the possible uses for the data. If you rationalise that data down to just the set of fields and/or aggregate it up to fit just a specific use case then you lose the fidelity of the data that could be useful elsewhere. This is one of the premises and benefits of a data lake done well.
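
The provenance point in the annotation can be illustrated with a minimal sketch. Everything here is hypothetical and for illustration only: the `Traced` class, the step names, and the sample rows are assumptions, not something taken from the linked article or from any particular ELT tool. The idea is simply that the raw input is kept untouched, and each derived value carries the ordered list of steps that produced it.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class Traced:
    """A value plus the ordered names of the steps that produced it (hypothetical helper)."""
    value: Any
    lineage: Tuple[str, ...] = ()

    def apply(self, step_name: str, fn: Callable[[Any], Any]) -> "Traced":
        # Build a new Traced object; the value we started from is left as-is,
        # so the raw data stays available for other, later use cases.
        return Traced(fn(self.value), self.lineage + (step_name,))


# Made-up raw rows standing in for data as it arrived at ingest.
raw = Traced(
    [
        {"sku": "A", "qty": 2, "price": 3.0},
        {"sku": "B", "qty": 1, "price": 10.0},
    ],
    ("raw_ingest",),
)

# Two illustrative transformations: per-row totals, then an aggregate.
totals = raw.apply("line_totals", lambda rows: [r["qty"] * r["price"] for r in rows])
revenue = totals.apply("sum_revenue", sum)

print(revenue.value)    # 16.0
print(revenue.lineage)  # ('raw_ingest', 'line_totals', 'sum_revenue')
print(len(raw.value))   # 2, the raw rows still carry every original field
```

The aggregate alone (`16.0`) could not be turned back into per-SKU quantities, which is the loss of fidelity the quoted passage describes. Keeping `raw` alongside the lineage tuple is one simple way to retain both the detail and the trail of steps.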