UK Retail Bank implements an enterprise-scale, cloud-native data lake
UK retail bank supports flexible ML workloads in the cloud
ECS had already designed and delivered an AWS-hosted environment to replace an aging retail risk analytics platform for this leading UK retail bank. Building on that work, the bank embarked on a two-year programme to implement a cloud-native, enterprise-scale data lake, and engaged ECS to deliver both the technical design and the implementation.
The bank’s existing data lake ran on Hadoop clusters in its on-premises data centres. This presented significant logistical constraints: it was difficult to spin up the right resources quickly for analytical workloads, particularly for Machine Learning training and execution. Catering for this fluctuating demand would have required planning for and provisioning enough data-centre capacity to support peak activity, which would have left expensive resources frequently underused or idle; at other times, the available capacity was inadequate to service peak data and analytical workloads in a timely manner. It was also important to the bank that access to existing information was not interrupted while the new data lake was being built.
ECS recognised that success depended on properly understanding the types of data being handled and the different source patterns the data arrived from. A framework was developed that defined a series of paths into the data lake, together with the criteria determining which path each dataset should follow.
Data was categorised as transactional or table-copy, and originated from one of three source patterns:
- data already curated in Parquet in the on-premise data lake;
- golden source data on premises, to be ingested directly into the cloud;
- un-curated data already published in the cloud.
At this stage all of the data in scope for the project was structured, but the categories and patterns apply equally to unstructured data.
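The categorisation framework above can be sketched as a simple routing table. This is an illustrative reconstruction, not the bank's actual implementation: the pattern names, stage names and the mapping between them are invented for the example.

```python
from dataclasses import dataclass

CATEGORIES = {"transactional", "table-copy"}

# The three source patterns described in the text (names are hypothetical).
SOURCE_PATTERNS = {
    "curated-parquet-onprem",  # already curated as Parquet in the on-premise lake
    "golden-source-onprem",    # golden source data, ingested directly into the cloud
    "uncurated-cloud",         # un-curated data already published in the cloud
}

# Each source pattern maps to an ordered ingestion path into the lake.
# Already-curated Parquet can land directly; the other patterns pass through
# a transform-to-Parquet step first (stage names are illustrative).
INGESTION_PATHS = {
    "curated-parquet-onprem": ["sftp-transfer", "s3-landing", "lake"],
    "golden-source-onprem": ["sftp-transfer", "s3-landing", "glue-etl-to-parquet", "lake"],
    "uncurated-cloud": ["kinesis-stream", "s3-landing", "glue-etl-to-parquet", "lake"],
}

@dataclass
class Dataset:
    name: str
    category: str        # "transactional" or "table-copy"
    source_pattern: str  # one of SOURCE_PATTERNS

def route(dataset: Dataset) -> list:
    """Return the ordered list of stages this dataset passes through."""
    if dataset.category not in CATEGORIES:
        raise ValueError(f"unknown category: {dataset.category}")
    if dataset.source_pattern not in SOURCE_PATTERNS:
        raise ValueError(f"unknown source pattern: {dataset.source_pattern}")
    return INGESTION_PATHS[dataset.source_pattern]
```

Defining the paths as data rather than code keeps the framework easy to extend as new source patterns appear.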
Using AWS-native tools, a pathway into the cloud-based data lake was established for each category of data from each source. AWS SFTP and AWS Kinesis ingested data into an S3 bucket, where AWS Glue ETL jobs transformed it into Parquet format for storage in the data lake. AWS Glue was also used to create and maintain the data schemas through which data scientists access the data via AWS Athena.
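As a rough sketch of how data catalogued this way is typically laid out and queried, the helpers below build a Hive-style partitioned S3 key for a Parquet object and an Athena query that prunes to one date partition. The bucket layout, database and table names are assumptions for illustration; the bank's actual naming conventions are not described in the source.

```python
from datetime import date

def parquet_key(dataset: str, load_date: date) -> str:
    """S3 object key for a Parquet file, using Hive-style date partitioning
    (year=/month=/day=), which Athena and Glue can exploit for partition pruning."""
    return (f"lake/{dataset}/"
            f"year={load_date.year}/month={load_date.month:02d}/day={load_date.day:02d}/"
            f"{dataset}.parquet")

def athena_query(database: str, table: str, load_date: date) -> str:
    """Athena SQL restricted to a single date partition, so only that
    partition's Parquet files are scanned."""
    return (f'SELECT * FROM "{database}"."{table}" '
            f"WHERE year = {load_date.year} "
            f"AND month = {load_date.month} AND day = {load_date.day}")
```

Partitioning by load date like this is a common way to keep Athena scan costs proportional to the data actually queried rather than the whole lake.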
With the design, infrastructure and cloud services in place, it was then possible to begin filling the lake in a controlled manner and supporting analytical workloads. Less than five months elapsed from the outset of the project to the first production data loads being ingested into the data lake.
- Accelerated implementation: access to, and provisioning of, enterprise-grade SaaS platforms and services enables faster change and adoption to keep pace with competitors; zero-day provisioning of platforms and services has accelerated the adoption of artificial intelligence and machine learning, which are key to the bank's strategic objectives.
- Scalability: elastic storage and compute combine to support fluctuating demand; scaling on demand has dramatically decreased compute and storage costs while increasing application performance and service levels, improving overall ROI and NPS scores.
- Repeatability: Infrastructure as Code has enabled true DevOps practices across the business, driving high-quality solutions at speed; release cycles have been reduced from weeks or months to hours or days.
- Fully managed: as-a-Service (aaS) platforms and services allow the bank to focus on customer value-add activities.
- Data driven: data is available and consumable to support decision making for all data consumers across the organisation. Removing data silos in legacy systems and co-locating data with cloud compute power and a variety of cloud tools supports a truly data-driven culture; dynamic, interactive dashboards and real-time reporting, with enterprise security and governance, give staff self-service insight through familiar, easy-to-consume applications and tools.
- Performance: a large join query that used to take approximately 13 minutes now completes in 3 seconds.
- The agile approach utilised allowed the bank to fail fast, learn quickly and rapidly increase the pace of innovation.
- Significant reduction in run costs, due to the elastic scaling and pay-per-use service model.