Organizing Inbound Data
Organizing your preboarding data layout upfront makes the onboarding process efficient
Preboarding consists of:
Data collection
Dataset identification (a.k.a. file registration)
Validation
Upgrading
Staging and archiving
After data is preboarded, it is considered known and trustworthy, and it is ready for ETL/ELT, workflow processes, applications, and the data lake.
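The steps above can be sketched as an ordered checklist that a file must clear before it counts as preboarded. This is only a minimal illustration: the `Stage` and `InboundFile` names are hypothetical, and the strict one-after-another ordering is an assumption rather than a stated rule.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    """The five preboarding steps, in the order they are listed."""
    COLLECTION = 1
    IDENTIFICATION = 2
    VALIDATION = 3
    UPGRADING = 4
    STAGING_AND_ARCHIVING = 5


@dataclass
class InboundFile:
    """Hypothetical record tracking one file through preboarding."""
    name: str
    completed: list = field(default_factory=list)

    def complete(self, stage: Stage) -> None:
        """Mark a stage done, rejecting skipped or repeated stages."""
        if self.is_preboarded:
            raise ValueError(f"{self.name} is already preboarded")
        # Assumption: stages must be completed strictly in order.
        expected = list(Stage)[len(self.completed)]
        if stage is not expected:
            raise ValueError(f"expected {expected.name}, got {stage.name}")
        self.completed.append(stage)

    @property
    def is_preboarded(self) -> bool:
        """True only once every stage has been completed."""
        return len(self.completed) == len(Stage)


inbound = InboundFile("orders.csv")
for stage in Stage:
    inbound.complete(stage)
print(inbound.is_preboarded)  # True
```

Only a file that has cleared every stage would be handed to ETL/ELT or downstream applications; anything else remains in preboarding.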
All onboarding processes include these data preboarding steps. The only question is how effective, efficient, and low-risk their implementation is. Companies often underinvest in the preboarding stage, which results in manual validation and handling, more support issues, and rework. Shortcuts in preboarding have substantial long-term costs to the business.
Underinvestment in preboarding exposes your company to:
Manual handling costs
Support time
Rework by developers
Refunds to annoyed customers
Files arrive through a Managed File Transfer (MFT) process. MFT is a broad topic. It includes:
SFTP / FTPS
MFT servers providing a range of protocols and limited workflow support
AS2 / AS4 file transfer, often used with EDI-related files
Cloud buckets attached to cloud functions or other compute
etc.
Occasionally we see files dropped into a common area, differentiated only by file name or content, but that is rare. Files are typically stored in one of a few ways:
By time of arrival
By data partner
By target application
By jurisdiction
By transaction or business process
These organizing concepts will be layered one on the other. For example, a layout aligned to an orders business process may nest date and sales region in a hierarchy of file system directories holding CSV files.
Clearly, a data layout like that makes it easy to find all orders in 2025 but much harder to find all the files holding a particular salesperson's orders. Another layout might provide easy access in a different way.
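To make the trade-off concrete, here is a minimal sketch that places the same files under two different layouts. Everything in it is hypothetical: the `orders` root, the region and salesperson values, the file names, and the helper functions `layout_path` and `under_prefix` are illustrative assumptions, not a prescribed structure.

```python
from pathlib import PurePosixPath

# Hypothetical order files; all names and values are illustrative only.
ORDERS = [
    {"year": "2025", "region": "west", "salesperson": "alice", "file": "orders-001.csv"},
    {"year": "2025", "region": "east", "salesperson": "bob", "file": "orders-002.csv"},
    {"year": "2024", "region": "west", "salesperson": "alice", "file": "orders-003.csv"},
]


def layout_path(record, keys, root="orders"):
    """Build a file path by nesting the given keys as directories, in order."""
    parts = [record[k] for k in keys]
    return PurePosixPath(root, *parts, record["file"])


def under_prefix(paths, prefix):
    """Return the paths that live somewhere below one directory prefix."""
    prefix = PurePosixPath(prefix)
    return [p for p in paths if prefix in p.parents]


# Layout 1: date first, then sales region.
by_date = [layout_path(r, ["year", "region"]) for r in ORDERS]
# Layout 2: salesperson first, then date.
by_person = [layout_path(r, ["salesperson", "year"]) for r in ORDERS]

# All 2025 orders sit under a single directory in the date-first layout.
print(under_prefix(by_date, "orders/2025"))
# One salesperson's orders sit under a single directory in the other layout.
print(under_prefix(by_person, "orders/alice"))
```

In the date-first layout, the salesperson does not appear in the path at all, so finding one person's orders means opening files rather than listing a directory. The second layout reverses that trade-off.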
How you arrange your data matters! Of course, at different times you may need different layouts. Since your data will eventually land in a database or warehouse or application, presumably your needs will ultimately be met. However, you have to consider how you store your data in preboarding in order to make onboarding and long-term reference to the source data efficient.