In this modern era, data has become a game-changer. It helps organizations make critical and informed business decisions.
While many companies collect data from various sources, the real differentiator is storing and managing that data effectively. As data volumes escalate and become harder to handle, organizations are shifting from traditional databases toward more contemporary architectural layers, including cloud-based solutions, data lakes, and delta lakes.
What is Delta Lake in Databricks?
Delta Lake is an open-source storage framework that stores data and tables in Parquet files. Delta Lake is compatible with Apache Spark, which supports batch and stream data processing. Azure Databricks uses Delta Lake as its default storage and refers to its tables as Delta tables.
The underlying storage layer for Databricks Delta Lake can be AWS S3, Google Cloud Storage (GCS), or Azure Blob Storage. This article discusses the Databricks Delta Lake implementation in Azure and shares best practices to optimize reads and writes in Delta Lake for better efficiency.
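As a quick illustration, a Delta table that lives in an Azure storage account can be queried directly by its path from a Databricks notebook. This is only a minimal sketch; the storage account, container, and folder names below are placeholders, not values from a real workspace.

-- Query a Delta table directly by its (hypothetical) storage path
SELECT *
FROM delta.`abfss://my-container@mystorageaccount.dfs.core.windows.net/delta/sales`
LIMIT 10;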
Benefits of Using Delta Lake
The Databricks Delta Lake implementation has many benefits. It is built on a hybrid model that combines data warehousing principles with Spark’s distributed processing.
A typical implementation provides support for ACID transactions, schema enforcement, time travel (data versioning), and unified batch and streaming processing.
For many people, the difference between “Delta Lake” and “Data Lake” is not very clear. Our blog, “Delta Lake vs. Data Lake - Which One Provides High-Quality Data You Can Trust?”, explores this distinction in detail.
Implementing Delta Lake
The strength of Delta Lake lies in its compatibility with multiple languages and APIs, such as Spark SQL, Scala, and Python. Getting started requires installing a few prerequisite libraries, depending on the programmer’s preference. Once these libraries are in place, the process is straightforward: creating a Delta table is as easy as executing a Data Definition Language (DDL) statement, just as in any relational database.
Here’s a skeleton statement to create a Delta table:
CREATE TABLE <delta-table-name> (
  <column-name> <data-type>,
  ...
)
USING DELTA
OPTIONS (path '<Azure-blob-storage-path>');
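For instance, a minimal order table might look like the following sketch. The table name, columns, and storage path are illustrative assumptions, not values from a real workspace.

-- Hypothetical example: an external Delta table backed by Azure storage
CREATE TABLE sales_orders (
  order_id    STRING,
  customer_id STRING,
  amount      DECIMAL(10, 2),
  order_date  DATE
)
USING DELTA
OPTIONS (path 'abfss://my-container@mystorageaccount.dfs.core.windows.net/delta/sales_orders');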
Databricks Delta tables support common Data Manipulation Language (DML) operations such as SELECT, INSERT, UPDATE, DELETE, and UPSERT, which you can write in SQL, Python, Scala, and Java.
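As a rough sketch of these operations in SQL against the hypothetical sales_orders table above (the staging view staged_orders is also an assumption introduced for illustration):

-- Update and delete rows in place
UPDATE sales_orders SET amount = amount * 1.1 WHERE order_date = '2024-01-01';
DELETE FROM sales_orders WHERE customer_id = 'C-000';

-- Upsert (MERGE) new or changed rows from a hypothetical staging view
MERGE INTO sales_orders AS target
USING staged_orders AS source
  ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;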
Delta Lake processing is powered by a Databricks compute engine, which is part of a Databricks cluster. The cluster needs to be up and running to read or write data from the Delta Lake tables. You can use the same cluster to run any Python, Scala, or SQL commands in the Databricks notebooks.
Optimizing and Implementing Best Practices in Delta Lake
Delta Lake is a great solution for storing and managing data. However, following the right methodology is crucial to avoid bad database design and performance bottlenecks.
Here are some best practices you can follow while designing Delta Lake tables; a short SQL sketch illustrating these commands follows the table:
Practice | Description |
---|---|
Right Partition Column | When designing a Delta Lake table, selecting the appropriate partition column is crucial to enhance read performance. Avoid using high-cardinality columns, like ‘transactionId’ in a sales table or ‘orderId’ in an order table, as your partition choice; these columns have many distinct values, which leads to an excessive number of small partitions and potential performance issues. Instead, consider partitioning by columns such as Year, Month, or Date. These are often used in query predicates and have low cardinality, making them a more efficient choice. |
Compact Files | When a table receives regular write operations in small batches, it can gradually accumulate a large number of small files in the storage layer. This accumulation degrades read efficiency: any query on the Delta Lake table must scan and read numerous small files, which slows processing. It is therefore recommended to compact Delta Lake tables periodically by merging the excessive number of small files into a smaller set of larger ones with the ‘OPTIMIZE’ command. |
Overwrite Instead of Delete | Delta Lake’s overwrite operations are significantly quicker than its delete operations. Overwrites are atomic, and previous table versions remain accessible and retrievable through the Time Travel feature. Avoid going to the storage layer directly and deleting the underlying files, as this permanently erases data that cannot be recovered. Notably, the storage layer is cost-effective and priced independently of compute, so retaining older versions is inexpensive. |
Z-Order for Large Tables | If a table holds a vast dataset and has a high-cardinality column that is frequently used in query predicates, applying Z-Ordering to the data can substantially enhance read performance. Z-Ordering co-locates related rows in the same files rather than scattering them across all files. Columns like ‘transactionId,’ ‘orderId,’ or ‘userId’ are excellent candidates for Z-Ordering. It is applied as part of file compaction (via the ‘OPTIMIZE’ command) and complements column partitioning, since it suits exactly the high-cardinality columns that make poor partition choices. |
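Here is a minimal SQL sketch of how these practices translate into commands, reusing the hypothetical sales_orders layout from earlier. The table name, partition column, storage path, and version number are illustrative assumptions.

-- Partition by a low-cardinality, date-derived column at creation time
CREATE TABLE sales_orders_partitioned (
  order_id    STRING,
  customer_id STRING,
  amount      DECIMAL(10, 2),
  order_date  DATE
)
USING DELTA
OPTIONS (path 'abfss://my-container@mystorageaccount.dfs.core.windows.net/delta/sales_orders_partitioned')
PARTITIONED BY (order_date);

-- Compact small files; ZORDER BY co-locates rows on a high-cardinality column
OPTIMIZE sales_orders_partitioned ZORDER BY (order_id);

-- Time Travel: read an earlier version of the table, e.g. as it was before an overwrite
SELECT * FROM sales_orders_partitioned VERSION AS OF 3;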
Conclusion
You now know what performance optimization techniques to apply when building a Delta Lake table. However, remember that these optimizations need a Databricks compute engine, which can incur high costs. You should choose what works best for your organizational needs and determine the best trade-off between performance and cost. Databricks Delta Lake is an excellent tool for small to large enterprises’ data needs and is one of the most popular and reliable big data solutions.