Create an AWS Glue Job to Convert the Raw Data into a Delta Table – Data Lake, Lake House, and Delta Lake

Create an AWS Glue Job to Convert the Raw Data into a Delta Table

Go to AWS Glue and click on Go to workflows as shown in Figure 3-60. We will get navigated to the AWS Glue studio, where we can create the AWS Glue job to convert the raw CSV file into a delta table.

Figure 3-60.  Go to workflows

Click on Jobs as in Figure 3-61. We need to create the workflow for the job and execute it to convert the raw CSV file into a delta table.

Figure 3-61.  Click on Jobs

We will be using the delta lake connector in the AWS Glue job. We need to activate it.

Go to the Marketplace, as in Figure 3-62.

Figure 3-62.  Click on Marketplace

Search for delta lake as in Figure 3-63, and then click on Delta Lake Connector for AWS Glue.

Figure 3-63.  Activate delta lake connector

Click on Continue to Subscribe as in Figure 3-64.

Figure 3-64.  Continue to Subscribe

Click on Continue to Configuration as in Figure 3-65.

Figure 3-65.  Continue to Configuration

Click on Continue to Launch as in Figure 3-66.

Figure 3-66.  Continue to Launch

Click on Usage instructions as in Figure 3-67. The usage instructions will open up. We have some steps to perform on that page.

Figure 3-67.  Click Usage Instructions

Read the usage instructions once and then click on Activate the Glue connector from AWS Glue Studio, as in Figure 3-68.

Figure 3-68.  Activate Glue connector

Provide a name for the connection, as in Figure 3-69, and then click on Create a connection and activate connector.

Figure 3-69.  Create connection and activate connector

Once the connector gets activated, go back to the AWS Glue studio and click on Jobs, as in Figure 3-70.

Figure 3-70.  Click on Jobs

Select option Visual with a source and target and depict the source as Amazon S3, as in Figure 3-71.

Figure 3-71.  Configure source

Select the target as Delta Lake Connector 1.0.0 for AWS Glue 3.0, as in Figure 3-72. We activated the connector in the marketplace and created a connection earlier. Click on Create.

Figure 3-72.  Configure target

Go to the S3 bucket and copy the S3 URI for the raw-data folder, as in Figure 3-73.

Figure 3-73.  Copy URI

Provide the S3 URI you copied and configure the data source, as in Figure 3-74. Make sure you mark the data format as CSV and check Recursive.

Figure 3-74.  Source settings

Scroll down and check the First line of the source file contains column headers option, as in Figure 3-75.

Figure 3-75.  Source settings

Click on ApplyMapping as in Figure 3-76 and provide the data mappings and correct data types. The target parquet file will be generated with these data types.

Figure 3-76.  Configure mappings

Click on Delta Lake Connector, and then click on Add new option, as in Figure 3-77.

Figure 3-77.  Add new option

Provide Key as path and Value as URI for the delta-lake folder in the S3 bucket, as in Figure 3-78.

Figure 3-78.  Configure target

Go to the Job details as in Figure 3-79 and set IAM Role as the role we created as a prerequisite with all necessary permissions.

Figure 3-79.  Provide IAM role

Click on Save as in Figure 3-80 and then run the job.

Figure 3-80.  Save and run

Once the job runs successfully, go to the delta-lake folder in the S3 bucket. You can see the delta table parquet file generated in Figure 3-81.

Figure 3-81.  Generated delta table in the delta-lake folder

Leave a Reply

Your email address will not be published. Required fields are marked *