Data Lake

A data lake is nothing more than a file system used as a common data-sharing platform. The idea is that the data should be easily available and have a short path from its source to the consumer. Unlike a relational database, it does not matter what format the data is in or how it is structured. The data also does not have to fit into a data model or be bound by any other constraints.

A data lake is, as mentioned, a file system, and Azure Data Lake is in fact built on HDFS technology, where data is distributed and replicated across multiple drives. Thanks to this replication, the solution is highly available by default.

On top of the HDFS technology there is a partition layer responsible for distributing the file partitions over the network. The partitioning and distribution ensure that roughly the same performance is achieved regardless of the file size you end up with. However, the partitioning scheme performs better on larger files than on smaller ones, because of the overhead of scanning and merging.

Data Lake Organization

A data lake is a hierarchical file system, like the one you have on your computer. This makes it very flexible to use, but that flexibility also opens the door to an unorganized storage area that is more often compared to a swamp, where we risk not finding what we want, than to a lake where everything is nice and structured. It is therefore important to add an organizational framework to the data lake, one that should be enforced with access privileges.

In our data lake we have defined six zones, four of which are covered in this post. Each zone has its own structure, created to support the usage related to that zone.

  • Stage (not included in this post)
  • Raw
  • Cleansed (not included in this post)
  • Curated
  • Analytics
  • Consumption

Since the data lake is, as stated multiple times, nothing but a file system, everything mentioned in this post can be thought of as folders and files on your computer. The only difference is the underlying technology and the fact that multiple users are going to work on the same files, and therefore have to possess the same mindset, in order to keep everything intuitive for the time to come.

Raw data zone

This is where source files are stored. It is important that no changes have been made to the files in this area. We want the pure data as created by the source that owns it.

The Raw data zone consists of a hierarchy focused on the data source and the entity.

  • Data source (crm)
    • Entity (incident)
      • Version (v01)
        • Year loaded (2019)
          • Month loaded (1)
            • Day Loaded (1)
              • Files (incident_2019_01_01_uniqueKey.csv)
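Because the hierarchy above is just a path convention, it can be captured in a small helper that builds the target path for a load. This is a minimal sketch; the zone root name `raw` and the unpadded month/day format follow the example hierarchy, and the function name is my own.

```python
from datetime import date
from pathlib import PurePosixPath

def raw_zone_path(source: str, entity: str, version: str,
                  load_date: date, file_name: str) -> str:
    """Build a Raw zone path: source/entity/version/year/month/day/file."""
    return str(PurePosixPath(
        "raw", source, entity, version,
        str(load_date.year), str(load_date.month), str(load_date.day),
        file_name,
    ))

# The CRM incident file loaded on 2019-01-01:
print(raw_zone_path("crm", "incident", "v01", date(2019, 1, 1),
                    "incident_2019_01_01_uniqueKey.csv"))
# raw/crm/incident/v01/2019/1/1/incident_2019_01_01_uniqueKey.csv
```

Keeping path construction in one function like this is also what makes it easy to mirror the integration code structure mentioned below: every load writes through the same convention.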

Other data lake architects argue for structuring the raw data zone by subject, meaning that you add a top directory to the hierarchy explaining what kind of data is stored in it. This is intuitive in the case where you would like to make the raw layer available to multiple users for fast analytics. In my experience, however, the data in the raw data zone has to go through cleansing and structuring before it is efficiently readable for analysts, so I treat the raw layer more like a technical layer whose structure mirrors my integration code structure. This makes it easier for me to see which code generated which result, and to troubleshoot efficiently.

Analytic data zone

The analytics zone is created with the simple purpose of shielding the other layers from the development and test process. When starting an analysis, you would usually consume data from either an already cleansed and structured file in the curated zone or a new file in the raw zone. You use the analytics zone as an area where you store your files during development, validating the steps you need until you can deploy your code. The output from the analysis will usually end up in either the curated or the consumption zone, and the code used is migrated to a production environment.

  • Type (Sandbox)
    • Name (UserName/ProgramName)
      • Self-structured (*)

Curated data zone

This zone is where prepared and organized data is stored. For those used to working with databases, this is the closest you will get to a structured data area. The data stored in this layer should have been cleansed, structured, and validated. This layer is often modelled from a purpose point of view.

  • Purpose (SalesAnalysis)
    • Type (Aggregates)
      • Snapshot date (20190601)
        • Files (SalesbyMonth.parquet)

But I prefer to structure it more like a relational database.

  • Entity
    • Version
      • Source
        • Year
          • Month
            • Day
              • File

By doing this, I get to use well-known modelling principles that facilitate having multiple sources for the same data over time. This is common in bigger companies that have had multiple systems supporting the same business function over the years.
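The entity-first ordering can be sketched the same way as the raw layer: because the source sits below the entity, two systems can deliver the same entity side by side. The entity and source names here (`customer`, `crm_old`, `crm_new`) are illustrative placeholders, as is the zone root `curated`.

```python
from datetime import date
from pathlib import PurePosixPath

def curated_zone_path(entity: str, version: str, source: str,
                      load_date: date, file_name: str) -> str:
    """Entity-first path, so several sources can feed the same entity."""
    return str(PurePosixPath(
        "curated", entity, version, source,
        str(load_date.year), str(load_date.month), str(load_date.day),
        file_name,
    ))

# A legacy and a new system delivering the same entity on the same day:
print(curated_zone_path("customer", "v01", "crm_old", date(2019, 6, 1),
                        "customer.parquet"))
print(curated_zone_path("customer", "v01", "crm_new", date(2019, 6, 1),
                        "customer.parquet"))
```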

Consumption data zone

This is what can be thought of as a data mart or API layer. In this layer you will find files that aim to serve a dedicated purpose for the business. The files stored in this layer are usually non-relational and made for direct consumption. The file format used in this area may differ, but it is usually chosen to serve the limitations of the consumer.

  • Consumer
    • DescribingDataContentName (SalesbyMonthPrediction)
      • Version (v1)
        • PartitionKeyName (YYYYMM)
          • FileName (SalePrediction.csv)
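A consumption path differs from the other zones mainly in the partition key, so a sketch is mostly about formatting that key consistently. The consumer name `sales_app` and the zone root `consumption` are assumptions for illustration; the YYYYMM format follows the example above.

```python
from datetime import date
from pathlib import PurePosixPath

def month_partition(d: date) -> str:
    """Format a date as a YYYYMM partition key, e.g. 201906."""
    return f"{d.year}{d.month:02d}"

def consumption_path(consumer: str, content: str, version: str,
                     partition_key: str, file_name: str) -> str:
    """Build a Consumption zone path: consumer/content/version/partition/file."""
    return str(PurePosixPath("consumption", consumer, content, version,
                             partition_key, file_name))

print(consumption_path("sales_app", "SalesbyMonthPrediction", "v1",
                       month_partition(date(2019, 6, 1)),
                       "SalePrediction.csv"))
# consumption/sales_app/SalesbyMonthPrediction/v1/201906/SalePrediction.csv
```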

Access to the data lake

The zone organization must be enforced with access privileges through governed systems like Active Directory. There is no easy way to handle privileges on a file system, but a good start involves an RBAC setup with the following groups in Azure AD:

  • dl_admin (full access)
  • dl_sys_contributor
  • dl_raw_data_reader_<source>
  • dl_curated_data_reader_<purpose>
  • dl_consume_data_reader_<DescribingDataContentName>

With this setup we aim for a minimum-maintenance solution in AD as well as a sufficient privilege split to enforce the framework. You may notice that the analytics zone is not listed. This is because it is a more personal working space and, by its function, is more maintainable through direct access than through groups. Remember to create the groups before developing on the data lake, and make sure to grant access on the outer directories, as the privileges will be copied down from outer to inner directories.

An ACL is a user- or group-specific right to write, read, or execute on a file or directory. ACLs work by setting the permissions on the actual object (directory or file). The permission type called default lets the rights set on a directory be inherited by all underlying directories and files created in the future. This inheritance does not, however, apply permissions to already existing objects.
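The default-entry mechanic can be made concrete with a small helper that composes a POSIX-style ACL specification of the kind HDFS-compatible stores accept: one access entry for the directory as it exists now, and one `default:` entry that newly created children inherit. This is a sketch only; in Azure AD the group would normally be referenced by its object ID, and the group name below is used purely for readability.

```python
def group_acl_spec(group_id: str, perms: str = "r-x") -> str:
    """Compose an ACL spec with a group access entry and a matching
    default entry so future children inherit the same permission."""
    return ",".join([
        "user::rwx",                          # owning user
        "group::r-x",                         # owning group
        "other::---",                         # everyone else: no access
        f"group:{group_id}:{perms}",          # applies to this object now
        f"default:group:{group_id}:{perms}",  # inherited by NEW children only
    ])

print(group_acl_spec("dl_raw_data_reader_crm"))
# user::rwx,group::r-x,other::---,group:dl_raw_data_reader_crm:r-x,default:group:dl_raw_data_reader_crm:r-x
```

Note that applying this spec to a directory does nothing for files that already exist underneath it, which is exactly the inheritance limitation described above: existing objects must be updated separately.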
