Tech Book
Dell Data Lakehouse Sizing
and Configurations Guide
Optimize Performance and Efficiency with Expert Sizing and Configuration Guidance
for the Dell Data Lakehouse.
Abstract
This comprehensive guide provides essential insights and practical
recommendations for sizing and configuring the Dell Data Lakehouse,
designed for data and IT professionals. With a focus on maximizing
performance and efficiency, readers will gain a deep understanding of
component roles, factors influencing sizing decisions, configuration best
practices, deployment options, performance optimization techniques, and
security considerations. This guide equips users with the knowledge needed
to unlock the full potential of their modern data platform.
Table of contents
Executive Summary
Introduction
Understanding the Dell Data Lakehouse Components
Sizing Guidelines
Configuration Recommendations
Deployment Options
Performance Optimization
Security and Governance
Conclusion
References
The information in this publication is provided as is. Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any software described in this publication requires an applicable software license.
Copyright © 2024 Dell Inc. or its subsidiaries. Published in the USA March 2024 [H19974].
Dell Inc. believes the information in this document is accurate as of its publication date. The information is subject to
change without notice.
Executive Summary
Overview
The Dell Data Lakehouse provides the best experience for a modern data platform. It is a fully integrated data
lakehouse, built on Dell hardware with a full-service software suite. Its distributed query processing approach enables
organizations to federate data for analytics with minimal data movement. They can then centralize the most important
parts of their data estate into a data lake and still benefit from performant SQL on top of that data. The Dell Data
Lakehouse employs the Dell Data Analytics Engine powered by Starburst, a unique, high-performance, distributed
query engine. It enables the discovery, querying, and processing of all enterprise data, irrespective of location. The Dell
Data Analytics Engine reduces data movement and enhances query performance and efficiency.
The Dell Data Lakehouse includes:
• Lakehouse Compute cluster comprising compute hardware, Dell Data Analytics Engine Software and Dell Data
Lakehouse System Software
• Lakehouse Storage cluster
Audience
This document is intended for enterprises that have data lakes or a data lake strategy and are interested in empowering their organizations to act more quickly, effectively, and efficiently on their data, as well as in modernizing to a data lakehouse.
Audience roles include:
• Data and application administrators
• Data engineers
• Data scientists
• Hadoop administrators
• IT decision-makers
A data lakehouse can not only assist traditional analytics customers looking to modernize their data collection but also help them get more value from their data or standardize their data for modern analytics workloads.
Revisions
Date          Part Number/Revision    Description
March 2024    H19974                  Initial release
July 2024     H19974.1                V1.1 feature updates
Note: This document may contain language from third-party content that is not under Dell Technologies’ control and is
not consistent with current guidelines for Dell Technologies’ own content. When such third-party content is updated by
the relevant third parties, this document will be revised accordingly.
Note: This document may contain language that is not consistent with Dell Technologies’ current guidelines. Dell
Technologies plans to update the document over subsequent future releases to revise the language accordingly.
We value your feedback.
Dell Technologies and the authors of this document welcome your feedback on this document. Contact the Dell
Technologies team by email.
Author: Kirankumar Bhusanurmath, DA / AI Specialist | TME, Dell Technologies
Contributors:
Note: For links to other documentation for this topic, see dell.com/datamanagement.
Introduction
Dell Data Lakehouse
The Dell Data Lakehouse is a turnkey solution consisting of the Dell Data Analytics Engine, a powerful federated data lake query engine powered by Starburst; the Dell Data Lakehouse System Software, which provides life cycle management; and tailor-made compute hardware, all integrated into one. For storing and processing large datasets in open file and table formats, Dell's leading S3 storage platforms such as ECS, ObjectScale, and PowerScale offer exceptional performance, reliability, and security.
At the core of the Dell Data Lakehouse lies the Dell Data Analytics Engine, which facilitates the discovery, querying, and processing of enterprise-wide data assets regardless of their physical location. By reducing data movement
requirements and enhancing query efficiency, the Dell Data Lakehouse sets a new benchmark in data platform
optimization and performance.
Figure 1. Dell Data Lakehouse Diagram
Why Sizing and Configurations Matter
Proper sizing and configuration are important for achieving optimal performance and efficiency in the Dell Data
Lakehouse. They impact a number of important factors that are critical to the success of any data platform, such as:
• Resource Utilization: Adequate allocation of CPU, memory, and storage ensures efficient handling of workloads,
preventing wastage or bottlenecks.
• Performance Optimization: Fine-tuning hardware and software settings enhances processing speed and reduces latency, facilitating faster analytics and decision-making.
• Scalability: Anticipating future growth enables seamless expansion without compromising performance
or efficiency.
• Cost-Effectiveness: Optimized resource usage minimizes expenses, maximizing the ROI of the data
management platform.
• Reliability and Stability: Proper configuration reduces the risk of system failures, ensuring uninterrupted data
access and business continuity.
• Data Governance and Security: Secure configurations protect sensitive data, maintaining compliance with
regulations and safeguarding against breaches.
• User Experience: Efficient systems improve productivity and decision-making, enhancing the overall user experience
across the organization.
Prioritizing proper sizing and configuration is vital for optimizing performance, scalability, cost-effectiveness, reliability,
security, and user experience in Dell Data Lakehouse.
Understanding the Dell Data Lakehouse Components
Compute cluster and its role
The Dell Data Lakehouse is designed to handle data processing and analytics tasks with unparalleled efficiency and
reliability. Powered by the Dell Data Analytics Engine, this cluster leverages cutting-edge hardware, including tailor-made
compute hardware based on Dell PowerEdge R660 servers, to ensure optimal performance and scalability.
Its primary role is to facilitate seamless data extraction, analysis, and exploration for users across the organization.
By abstracting the complexities of underlying data sources and formats, the Compute cluster simplifies the data
consumption experience, enabling users to derive actionable insights from their data effortlessly. Whether running
business intelligence queries, analytics workloads, or data science tasks, the Compute cluster provides a robust and
versatile environment for handling diverse data processing requirements. Overall, the Lakehouse Compute cluster plays
a vital role in driving innovation and informed decision-making within organizations by empowering users to harness the
full potential of their data assets.
Storage cluster and its role
The Dell Storage cluster forms the backbone of data storage within the Dell Data Lakehouse, providing robust, scalable,
and secure storage solutions for organizations’ data assets.
The Storage cluster supports multiple storage options; organizations can choose any one of the following storage types as the primary storage cluster: Dell ECS, Dell ObjectScale, or Dell PowerScale. Each of these storage solutions offers distinct features and benefits, catering to different storage needs and requirements. The Lakehouse Storage cluster can therefore be configured with the storage solution that best matches the organization's specific use case and preferences.
The Storage cluster's primary role is to serve as a centralized repository for all types of data, including structured, semi-structured, and unstructured data, ensuring accessibility and reliability for users across the organization. With rich S3
compatibility and globally distributed architectures, the Storage cluster empowers organizations to support diverse
workloads such as AI, analytics, and archiving at scale while reducing total cost of ownership (TCO). Whether storing
transactional data, log files, or multimedia content, the Storage cluster provides enterprise-grade storage capabilities,
ensuring data integrity, resilience, and compliance with regulatory requirements. By seamlessly integrating with the
Dell Data Lakehouse Compute cluster, the Storage cluster facilitates efficient data processing and analysis workflows,
enabling organizations to derive actionable insights and drive innovation from their data assets.
Overall, the Dell Data Lakehouse Storage cluster plays a critical role in supporting the modern data platform, ensuring
organizations can effectively manage and leverage their data to achieve their business objectives.
Richly Integrated Compute and Storage
The Dell Data Lakehouse seamlessly integrates the Compute and Storage clusters to provide a comprehensive and
modern data platform, while maintaining the ability to independently scale each cluster and avoid lock-in. Here is how these components work together:
• Data Processing and Analytics: The Compute cluster, powered by the Dell Data Analytics Engine, serves as the
processing powerhouse of the platform. It handles data processing and analytics tasks efficiently, leveraging its
high-performance hardware and distributed query processing capabilities. Users can run complex queries and
analytics workloads across diverse data sources, thanks to the Compute cluster’s ability to abstract complexities
and provide a unified querying experience.
• Data Storage and Accessibility: The Storage cluster, consisting of ECS, ObjectScale, or PowerScale, offers
secure and scalable storage solutions for all types of data. It acts as the central repository for storing vast amounts
of data in various formats, ensuring accessibility and reliability. With rich S3 compatibility and globally distributed
architectures, the Storage cluster enables organizations to support diverse workloads while reducing total cost of
ownership.
• Unified Data Management: Together, the Compute and Storage clusters form the backbone of unified data
management within the Dell Data Lakehouse. Data can seamlessly flow between the Compute and Storage
clusters, enabling efficient data processing and analysis workflows. Users can explore, analyze, and derive insights
from their data without worrying about underlying complexities or storage constraints.
• Scalability and Flexibility: The integrated nature of the Lakehouse platform allows organizations to scale their
infrastructure seamlessly as data volumes and user demands grow. Whether expanding the Compute cluster for
additional processing power or increasing storage capacity with the Storage cluster, organizations can adapt their
infrastructure to meet evolving business needs without compromising performance or efficiency.
• Data Governance and Security: The Lakehouse platform incorporates robust data governance and security features,
ensuring data integrity, resilience, and compliance with regulatory requirements. Administrators can define
policies, manage access controls, and track data usage across the platform, mitigating risks associated with data
management and ensuring data privacy and security.
Overall, by leveraging the synergies between the Compute and Storage clusters, the Dell Data Lakehouse provides
organizations with a modern and comprehensive Data Lakehouse that empowers them to efficiently manage, analyze,
and derive insights from their data assets, driving innovation and competitive advantage.
Performance Accelerator and its role
Dell Data Analytics Engine’s Warp Speed transparently adds an indexing and caching layer to enable higher performance.
You can take advantage of the performance improvements by enabling Warp Speed. Your DDAE cluster nodes are provisioned with suitable hardware and configurations to set up the Warp Speed utility connector for any catalog accessing object storage with the Hive, Iceberg, or Delta Lake connector.
The Warp Speed performance acceleration directly impacts cluster sizing: performance benchmarking shows a 3x to 5x increase in query performance. As a result, only about half the number of DDAE worker nodes is required to run the same workload on a Warp Speed-enabled cluster.
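As a rough illustration of how such a speedup translates into node count, the following sketch estimates the number of worker nodes needed to sustain a target query throughput. The throughput and per-worker figures are illustrative assumptions, not Dell-published benchmarks.

import math

def workers_needed(target_queries_per_hour, queries_per_hour_per_worker, speedup=1.0):
    # Workers required to sustain the target throughput at a given per-worker rate.
    return math.ceil(target_queries_per_hour / (queries_per_hour_per_worker * speedup))

# Illustrative assumptions: 1,200 queries/hour target, 100 queries/hour per worker.
print(workers_needed(1200, 100))              # baseline: 12 workers
print(workers_needed(1200, 100, speedup=2))   # 2x effective speedup: 6 workers (half)
print(workers_needed(1200, 100, speedup=3))   # conservative 3x Warp Speed estimate: 4 workers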
Sizing Guidelines
Factors to size Compute Cluster
Workload requirements
When determining the size of the Compute cluster within the Dell Data Lakehouse, several critical factors must be
considered to ensure optimal performance and efficiency. These factors revolve around understanding how workload
requirements impact cluster sizing:
• Query Complexity: The complexity of analytical queries is a key determinant of the required computational
resources. Complex queries involving multiple joins, aggregations, or transformations may necessitate a larger
Compute cluster to process data efficiently and deliver timely results.
• Data Volume: The volume of data being processed directly influences the computational workload. Large datasets
require more processing power to handle data-intensive operations such as sorting, filtering, and aggregation.
Therefore, organizations must scale the Compute cluster appropriately to accommodate varying data volumes.
• Concurrency: Workloads with high levels of concurrency, characterized by multiple users or applications
accessing the system simultaneously, impose additional demands on computational resources. A larger Compute
cluster can help manage concurrency effectively, ensuring smooth performance and responsiveness across
concurrent queries.
• Query Frequency: The frequency at which queries are executed also impacts cluster sizing. Environments with
frequent ad-hoc queries or real-time analytics demand a Compute cluster capable of handling rapid query
execution without compromising performance. Adjusting the cluster size based on query frequency helps maintain
optimal resource utilization and responsiveness.
Note: Query Frequency and Concurrency can be easily confused with each
other so it’s important to understand the differences. Concurrency is the
number of users who execute queries (all identical or all unique) at the exact
same time. Performance for concurrent users is directly proportional to
CPU and RAM. Query Frequency is how often an identical SQL statement is
executed. Memory cache and table redirection capabilities of the Dell Data
Analytics Engine play a crucial role. The performance of frequent queries is
directly proportional to the amount of memory.
By carefully assessing these workload requirements, organizations can tailor the size of the Compute cluster to meet
specific processing needs while optimizing resource allocation and cost-efficiency. Leveraging the distributed query
processing capabilities of Dell Data Analytics Engine, adjustments to the Compute cluster size can be dynamically
made to adapt to changing workload demands. This flexibility ensures consistent performance and responsiveness
across diverse analytical tasks, empowering organizations to derive maximum value from their data assets.
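To make these workload factors concrete, the following back-of-the-envelope sketch derives a worker count from peak concurrency and per-query resource figures. The per-query CPU and memory values are placeholders to be replaced with measurements from your own workload; they are not DDAE defaults.

import math

WORKER_VCPUS = 64        # per prebuilt DDAE worker node
WORKER_MEMORY_GB = 256

def estimate_workers(concurrent_queries, avg_query_memory_gb, avg_query_vcpus):
    # Assume roughly 80% of node memory goes to the JVM heap and 70% of that to
    # query memory (see the Memory allocations section later in this guide).
    usable_query_memory_gb = WORKER_MEMORY_GB * 0.8 * 0.7
    workers_by_memory = concurrent_queries * avg_query_memory_gb / usable_query_memory_gb
    workers_by_cpu = concurrent_queries * avg_query_vcpus / WORKER_VCPUS
    return max(1, math.ceil(max(workers_by_memory, workers_by_cpu)))

# Illustrative peak: 40 concurrent queries, 20 GB and 8 vCPUs per query on average.
print(estimate_workers(40, 20, 8))   # 6 workers for this example workload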
Future scalability needs
When sizing the Compute cluster within the Dell Data Lakehouse platform, it’s essential to consider future scalability
needs as a crucial factor. Anticipating future growth and scalability requirements ensures that the Compute cluster can
effectively accommodate increasing data volumes and evolving workload demands over time.
• Projected Data Growth: Organizations must assess their projected data growth trajectory to estimate future
computational requirements accurately. As data volumes increase, the Compute cluster must scale accordingly to
maintain optimal performance and efficiency. By understanding the expected rate of data growth, organizations can
plan for additional computational resources to scale the Compute cluster proactively.
• Workload Expansion: Future scalability needs also encompass the expansion of analytical workloads and business
requirements. As organizations introduce new analytics initiatives, such as advanced analytics, machine learning, or
real-time analytics, the Compute cluster may need to handle more complex and resource-intensive tasks. Scaling
the Compute cluster to accommodate workload expansion ensures that it can meet evolving business needs and
support new use cases effectively.
By considering future scalability needs, organizations can effectively size the Compute cluster to support long-
term growth and expansion. Leveraging the distributed query processing capabilities of Dell Data Analytics Engine,
adjustments to the Compute cluster size can be made dynamically to accommodate changing scalability requirements.
This proactive approach ensures that the Compute cluster remains responsive, efficient, and scalable, empowering
organizations to derive maximum value from their data assets and drive innovation in their analytics workflows.
Based on all the above factors, Dell has created recommended sizing options for the Dell Data Lakehouse.
Figure 2. Dell Data Lakehouse Sizing recommendation
Factors to size Storage Cluster
Total data capacity requirements
The total data capacity requirements serve as a fundamental determinant in sizing the Storage cluster within the Dell
Data Lakehouse. Here is a detailed breakdown of how this factor influences storage cluster sizing:
• Understanding Current Data Volume: The first step in assessing total data capacity requirements is understanding
the organization’s current data volume. This involves quantifying the amount of data generated and stored
across various data sources, including structured databases, semi-structured data lakes, and unstructured data
repositories. By analyzing the current data volume, organizations can determine the baseline storage capacity
needed to accommodate existing data assets.
• Estimating Future Data Growth: Anticipating future data growth is essential for sizing the Storage cluster effectively.
Organizations must analyze historical data growth trends and consider factors such as business expansion, data
acquisition initiatives, and regulatory requirements to estimate future data volume. By projecting data growth over
time, organizations can ensure that the Storage cluster provides sufficient capacity to accommodate future data
storage needs without encountering capacity constraints.
• Accounting for Data Variability and Seasonality: Data volume may exhibit variability and seasonality patterns over
time, influenced by factors such as business cycles, marketing campaigns, or seasonal trends. Organizations must
account for these fluctuations when sizing the Storage cluster to ensure that it can handle peak data loads without
performance degradation. By considering data variability and seasonality, organizations can allocate additional
storage capacity to accommodate peak data volumes while maintaining optimal performance and responsiveness.
• Aligning with Data Storage Policies: Organizations must align storage capacity with data storage policies and
retention requirements. This involves ensuring that the Storage cluster provides sufficient capacity to store data for
the required retention periods dictated by regulatory compliance or business needs. By aligning storage capacity
with data storage policies, organizations can ensure compliance with data retention requirements while optimizing
storage utilization and efficiency.
• Scalability and Flexibility: Building scalability and flexibility into the Storage cluster architecture is essential for
accommodating future data growth and changing storage requirements. Organizations should leverage scalable
storage solutions and architectures that allow for seamless expansion of storage capacity as needed. By adopting
scalable storage solutions, organizations can future-proof the Storage cluster and ensure that it can scale
dynamically to meet evolving data storage needs without disruption.
By carefully considering total data capacity requirements, organizations can effectively size the Storage cluster to
meet current and future storage needs while ensuring scalability, flexibility, and compliance with data retention policies.
Leveraging storage solutions such as Dell ECS, ObjectScale, or PowerScale integrated with Dell Data Lakehouse,
adjustments to the Storage cluster size can be made dynamically to accommodate changing capacity requirements,
ensuring seamless scalability and efficiency of the Storage infrastructure.
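A minimal capacity-projection sketch follows, assuming a simple compound annual growth rate and a fixed free-space headroom; the growth rate, planning horizon, and headroom are illustrative inputs rather than Dell recommendations.

def projected_capacity_tb(current_tb, annual_growth, years, headroom=0.20):
    # Usable capacity to provision so free-space headroom remains after `years` of growth.
    future_tb = current_tb * (1 + annual_growth) ** years
    return future_tb / (1 - headroom)

# Illustrative: 500 TB today, 35% annual growth, 3-year horizon, 20% headroom.
print(round(projected_capacity_tb(500, 0.35, 3)))   # ~1538 TB usable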
Data retention policies
Data retention policies play a crucial role in determining the size of the Storage cluster within the Dell Data Lakehouse.
Here is a detailed exploration of how data retention policies influence storage cluster sizing:
• Defining Retention Periods: Data retention policies specify the duration for which data must be retained within the
system based on regulatory requirements, business needs, and data governance considerations. Organizations
must clearly define retention periods for different types of data, including transactional data, historical records, and
archival data.
• Calculating Storage Capacity Requirements: Once data retention periods are established, organizations can calculate
the storage capacity needed to retain data for the specified durations. This involves estimating the volume of data
generated within each retention period and aggregating storage capacity requirements across all data categories. By
aligning storage capacity with data retention policies, organizations ensure compliance with regulatory requirements
and business mandates while optimizing storage utilization.
• Accounting for Different Data Types: Data retention policies may vary depending on the type of data, with some data
requiring longer retention periods than others. For example, transactional data may have shorter retention periods,
while historical records or compliance-related data may require longer retention periods. Organizations must account
for these differences when sizing the Storage cluster, allocating sufficient storage capacity to accommodate data
with varying retention requirements.
• Implementing Data Lifecycle Management: Effective data retention policies often include data life cycle management
strategies, such as data archiving and deletion practices, to optimize storage utilization and minimize storage costs.
Organizations must consider data life cycle management practices when sizing the Storage cluster, ensuring that it
can support data archiving and deletion workflows efficiently. By implementing data life cycle management practices,
organizations can ensure that the Storage cluster remains lean, agile, and cost-effective over time.
• Scalability and Compliance: Building scalability and compliance into the Storage cluster architecture is essential
for accommodating changing data retention policies and regulatory requirements. Organizations should leverage
scalable storage solutions and architectures that allow for seamless expansion of storage capacity as needed
while ensuring compliance with data retention mandates. By adopting scalable and compliant storage solutions,
organizations can future-proof the Storage cluster and ensure that it can adapt to evolving data retention policies and
regulatory changes without disruption.
By carefully considering data retention policies, organizations can effectively size the Storage cluster to meet
compliance requirements, optimize storage utilization, and support efficient data life cycle management practices.
Leveraging storage solutions such as Dell ECS, ObjectScale, or PowerScale integrated within Dell Data Lakehouse,
adjustments to the Storage cluster size can be made dynamically to accommodate changing data retention policies
and regulatory requirements, ensuring seamless scalability and compliance of the Storage infrastructure.
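To translate retention periods into capacity, the short sketch below multiplies daily ingest by retention days for each data category and sums the result; the categories, ingest rates, and retention windows shown are purely illustrative.

# (category, daily ingest in GB, retention in days) -- illustrative values only
retention_policy = [
    ("transactional", 200, 90),
    ("clickstream", 800, 365),
    ("compliance", 50, 7 * 365),
]

steady_state_gb = sum(daily_gb * days for _, daily_gb, days in retention_policy)
print(round(steady_state_gb / 1024, 1), "TB retained at steady state")   # ~427.5 TB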
Replication and redundancy need
Considering replication and redundancy needs is vital when sizing the Storage cluster within the Dell Data Lakehouse.
Here is a detailed exploration of how these factors influence storage cluster sizing:
• Ensuring Data Durability and Resilience: Replication and redundancy mechanisms are essential for ensuring data
durability and resilience against hardware failures, data corruption, or other unforeseen events. By maintaining
redundant copies of data across multiple storage nodes or data centers, organizations can minimize the risk of data
loss and ensure continuous availability of critical data assets.
• Determining Replication Factors: Organizations must determine the appropriate replication factors based on their
redundancy requirements and resilience objectives. This involves specifying the number of redundant copies of data
to be maintained and the distribution of replicas across storage nodes or data centers. Higher replication factors
provide greater redundancy and resilience but may require additional storage capacity to accommodate redundant
copies of data.
• Calculating Storage Capacity for Replication: Replication factors directly impact storage capacity requirements, as
each additional replica consumes storage space. Organizations must calculate the storage capacity needed to
accommodate redundant copies of data based on the specified replication factors. This involves multiplying the
original data volume by the replication factor to determine the total storage capacity required for replication.
• Balancing Redundancy and Storage Costs: Balancing redundancy needs with storage costs is essential for
optimizing storage utilization and minimizing infrastructure expenses. Organizations must strike a balance between
achieving the desired level of data redundancy and minimizing storage overhead. This may involve evaluating trade-
offs between replication factors, storage efficiency techniques, and cost-effective storage solutions to optimize
redundancy while controlling storage costs.
• Scalability and Flexibility: Building scalability and flexibility into the Storage cluster architecture is crucial for
accommodating future replication needs and evolving data protection requirements. Organizations should leverage
scalable storage solutions and architectures that allow for seamless expansion of storage capacity and replication
capabilities as needed. By adopting scalable and flexible storage solutions, organizations can ensure that the Storage
cluster can adapt to changing replication needs and support data protection strategies effectively over time.
By carefully considering replication and redundancy needs, organizations can effectively size the Storage cluster to
ensure data durability, resilience, and continuous availability of critical data assets. Leveraging storage solutions such
as Dell ECS, ObjectScale, or PowerScale integrated within Dell Data Lakehouse, adjustments to the Storage cluster size
and replication configurations can be made dynamically to accommodate changing redundancy requirements and
ensure seamless scalability and resilience of the Storage infrastructure.
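Building on the replication-factor calculation described above, the sketch below converts a usable-capacity requirement into the raw capacity the Storage cluster must provide; the protection overheads shown are examples, and the actual factor depends on the replication or erasure-coding scheme of the chosen storage platform.

def raw_capacity_tb(usable_tb, protection_overhead):
    # Raw storage needed to hold `usable_tb` of data with the given protection overhead.
    return usable_tb * protection_overhead

print(raw_capacity_tb(1000, 3.0))   # 3-way replication: 3000 TB raw
print(raw_capacity_tb(1000, 1.5))   # illustrative erasure-coding overhead: 1500 TB raw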
Growth projection over time
Considering growth projections over time is essential when sizing the Storage cluster within the Dell Data Lakehouse.
Here is a detailed exploration of how this factor influences storage cluster sizing:
• Anticipating Data Growth Trends: Organizations must analyze historical data growth trends and project future data
growth patterns to estimate storage capacity requirements accurately. By understanding data growth trends over
time, organizations can anticipate the rate of data expansion and plan for additional storage capacity accordingly.
• Forecasting Future Storage Needs: Forecasting future storage needs involves predicting the volume of data that
will be generated and stored within the organization over time. This includes considering factors such as business
expansion, data acquisition initiatives, and regulatory requirements that may influence data growth. By forecasting
future storage needs, organizations can ensure that the Storage cluster provides sufficient capacity to accommodate
future data growth without encountering capacity constraints.
• Scaling Storage Capacity Proactively: Proactively scaling storage capacity is essential for accommodating future
data growth and ensuring seamless scalability of the Storage cluster over time. Organizations should allocate
additional storage capacity beyond current requirements to account for projected data growth and avoid the need for
frequent expansions or upgrades. By scaling storage capacity proactively, organizations can future-proof the Storage
cluster and ensure that it can adapt to evolving data storage needs without disruption.
• Adapting to Changing Requirements: Building flexibility into the Storage cluster architecture is crucial for adapting
to changing data storage requirements and growth projections over time. Organizations should leverage scalable
storage solutions and architectures that allow for seamless expansion of storage capacity as needed. By adopting
flexible storage solutions, organizations can ensure that the Storage cluster remains agile and responsive to changing
data growth trends and business needs.
• Aligning with Business Objectives: Aligning storage capacity with business objectives is essential for ensuring that
the Storage cluster supports organizational growth and strategic initiatives effectively. Organizations should consider
how future data growth may impact business operations, analytics initiatives, and decision-making processes. By
aligning storage capacity with business objectives, organizations can ensure that the Storage cluster provides the
necessary resources to support data-driven innovation and growth.
By carefully considering growth projections over time, organizations can effectively size the Storage cluster to
accommodate future data growth and support long-term scalability and resilience. Leveraging storage solutions such as
Dell ECS, ObjectScale, or PowerScale integrated within Dell Data Lakehouse, adjustments to the Storage cluster size can
be made dynamically to accommodate changing growth projections and ensure seamless scalability and efficiency of the
Storage infrastructure.
Configuration Recommendations
Best practices for configuring the Compute cluster:
Hardware Specifications
Understanding Compute Nodes: The Dell Data Lakehouse Compute cluster consists of two distinct node types: Control Plane nodes and Analytics Engine nodes. Control Plane nodes are dedicated to running the Lakehouse system software, while Analytics Engine nodes handle data processing and analytics tasks.
Pre-Configured Nodes: Both Control Plane and Analytics Engine nodes are meticulously prebuilt and configured by Dell implementation engineers. These nodes are optimized to ensure seamless compatibility and optimal performance with the Lakehouse platform.
Customer Benefit: Customers can rest assured that their Compute cluster is equipped with optimized hardware nodes, meticulously crafted by Dell experts. This prebuilt infrastructure eliminates the need for customers to delve into hardware configurations, allowing them to focus their efforts on maximizing the value of their data assets.
Performance Accelerator: Analytics Engine nodes are provisioned with the necessary hardware and configuration to enable Warp Speed, which increases performance by 100%. It is recommended to enable the Warp Speed feature on the cluster to significantly enhance performance and accommodate double the workloads within the same SLA.
Network configurations
Network configuration is seamlessly integrated into the prebuilt nodes of the Lakehouse solution, ensuring
comprehensive networking capabilities out of the box. Dell’s implementation engineers meticulously design and configure
the networking infrastructure within the Dell Data Lakehouse, eliminating the need for customers to undertake any
networking configurations. This includes setting up network connectivity between Control Plane, Analytics Engine nodes
and storage nodes, as well as optimizing network settings for high-speed data transfer and low latency. By incorporating
networking into the prebuilt nodes, customers benefit from a hassle-free deployment experience, allowing them to focus
on leveraging the Dell Data Lakehouse to drive insights and innovation from their data assets.
Software dependencies and versions
All essential software dependencies and versions are prepackaged within the prebuilt nodes of the Dell Data Lakehouse,
encompassing both infrastructure components and Lakehouse management software. Dell’s implementation engineers
meticulously curate and integrate the required software stack to ensure seamless compatibility and optimal performance
of the Dell Data Lakehouse. This includes operating systems, database management systems, analytics frameworks,
and other critical software components necessary for the operation and management of the Lakehouse environment.
By prepackaging software dependencies and versions, customers benefit from a streamlined deployment process,
eliminating the complexities associated with software installation and version management. This approach enables
customers to leverage the full capabilities of the Lakehouse from day one, empowering them to focus on deriving insights
and value from their data assets without the burden of software configuration tasks.
Configuration Recommendations
Best practices for configuring the Lakehouse Storage cluster:
The Storage cluster within the Dell Data Lakehouse plays a critical role in storing and managing vast volumes of data efficiently. Here are the configuration recommendations tailored to ensure optimal performance and scalability:
Selecting the primary storage type
Choose the primary storage type for the Dell Data Lakehouse from options such as Dell ECS, ObjectScale, or PowerScale. Evaluate each storage solution based on factors such as performance, scalability, resilience, and cost-effectiveness to determine the most suitable option for your organization's needs.
Storage type nodes: All Flash, Hybrid, or Archival
• Capacity dense All Flash nodes: multi-PB high-density Data Lakehouse; multitenant with no compromise on performance; high-performance data discovery and visualization
• Capacity sparse All Flash nodes: flash tier for an existing cluster; small edge cluster
• Capacity dense Hybrid nodes: capacity centric; migration target for existing clusters; data discovery and visualization
• Capacity sparse Hybrid nodes: starter pack with lower initial storage deployment and lower performance
• Archival nodes: archive for a traditional Data Lake cluster, used only as a cold tier
Data replication and backup strategies
• Implementing Data Replication: Configure data replication strategies to ensure data durability and resilience against hardware failures or data loss events. Determine the appropriate replication factors and distribution of data replicas across storage nodes or data centers to achieve the desired level of redundancy and resilience.
• Developing Backup Policies: Establish backup policies and schedules to safeguard critical data assets against accidental deletion, corruption, or other unforeseen events. Define backup retention periods and recovery objectives based on data criticality and business continuity requirements to ensure timely and reliable data restoration.
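As one hedged example of putting such policies into practice on an S3-compatible Storage cluster, the snippet below enables bucket versioning and a lifecycle expiration rule using the standard AWS SDK for Python (boto3). The endpoint, credentials, bucket name, and retention value are placeholders, and the features actually available depend on the ECS, ObjectScale, or PowerScale release in use.

import boto3

# Placeholder endpoint and credentials for an S3-compatible Dell object store.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.internal",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

bucket = "lakehouse-raw"

# Keep prior object versions so accidental deletes or overwrites are recoverable.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Expire noncurrent versions after 90 days (illustrative retention value).
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
            }
        ]
    },
)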
Deployment Options
Deploying the Dell Data Lakehouse involves strategic considerations for on-premises deployment, leveraging its prebuilt
nature and Dell’s expertise in deployments and servicing. Here is how organizations can benefit from on-premises
deployment with minimal overhead:
Guidance on On-Premises Deployment
Prebuilt Infrastructure
The Dell Data Lakehouse arrives as a prebuilt cluster, meticulously assembled and configured by Dell’s expert engineers.
This turnkey solution streamlines the deployment process, ensuring rapid setup and minimal disruption to your operations.
With Dell handling the deployment, organizations can focus their efforts on leveraging the platform’s capabilities rather
than managing deployment logistics.
Integration Exploration
While Dell Data Lakehouse primarily targets on-premises deployment, organizations may explore hybrid cloud integration
scenarios to extend their data analytics capabilities. Dell offers seamless integration support, enabling organizations to
bridge their on-premises Lakehouse deployment with cloud-based services for enhanced scalability and flexibility.
Integration with Other Dell Data Lakehouses through the Data Analytics Engine's Stargate Integration
The Dell Data Lakehouse offers seamless integration capabilities with other Dell Data Lakehouse instances through the Dell Data Analytics Engine's Stargate Integration. This integration enables data federation and query federation across multiple Lakehouse deployments. Here's how organizations can benefit from this integration:
• Data Federation: Stargate Integration facilitates data federation by allowing Lakehouse instances to access and
query data from each other seamlessly. Organizations can unify their data assets across distributed Lakehouse
deployments, enabling comprehensive data analysis and insights generation.
• Query Federation: With query federation capabilities, organizations can execute queries that span multiple Lakehouse
instances, leveraging the combined data resources for enhanced analytics. This enables organizations to break down
data silos and gain holistic insights from their data assets, regardless of their location or deployment environment.
• Unified Data Access: By integrating multiple Lakehouse instances through Stargate Integration, organizations can
provide unified data access to their users and applications. This enables users to query and analyze data from
diverse sources without the need for data movement or duplication, streamlining data access and enhancing
collaboration.
• Scalability and Flexibility: Stargate Integration enhances the scalability and flexibility of Lakehouse deployments
by enabling seamless expansion across distributed environments. Organizations can scale their data analytics
capabilities horizontally by adding new Lakehouse instances and integrating them with existing deployments,
ensuring scalability and flexibility to meet evolving business requirements.
• Optimized Performance: By federating queries across multiple Lakehouse instances, organizations can
optimize query performance and resource utilization. Dell Data Analytics engine ensures efficient query execution and
resource allocation, enabling organizations to achieve high performance and responsiveness for their data analytics
workloads.
In summary, integration with other Dell Data Lakehouses through Dell Data Analytics Engine’s Stargate Integration
empowers organizations to unlock the full potential of their data assets by facilitating data federation, query federation,
unified data access, scalability, flexibility, and optimized performance across distributed Lakehouse deployments. This
integration paves the way for comprehensive data analysis, insights generation, and decision-making, driving innovation
and competitive advantage for organizations leveraging the Lakehouse platform.
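To illustrate what query federation looks like from the SQL side, here is a hedged example that joins tables exposed through two catalogs using the open-source Trino Python client. The host, catalog, schema, and table names are placeholders; in practice the remote catalog is configured as part of the Stargate integration rather than by end users.

import trino  # open-source Trino client (pip install trino)

conn = trino.dbapi.connect(
    host="ddae-coordinator.example.internal",  # placeholder coordinator host
    port=8080,
    user="analyst",
    catalog="lake",      # local data lake catalog (placeholder name)
    schema="sales",
)

cur = conn.cursor()
# Federated join: the second table lives in a catalog backed by a remote Lakehouse.
cur.execute("""
    SELECT l.region, sum(l.revenue) AS revenue, count(r.ticket_id) AS tickets
    FROM lake.sales.orders AS l
    JOIN remote_lakehouse.support.tickets AS r
      ON l.customer_id = r.customer_id
    GROUP BY l.region
""")
for row in cur.fetchall():
    print(row)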
Performance Optimization
General Strategy
One favored approach is to begin by creating a sizable cluster, larger than initially estimated, and gradually fine-tune its
performance. By provisioning ample resources upfront and focusing initially on stability rather than resource efficiency,
organizations can establish a solid foundation for their Dell Data Lakehouse clusters. This method allows for a smoother
onboarding process of data sources and users, providing a cushion of resources to accommodate growth and changes.
However, it’s worth noting that while this strategy simplifies the initial setup and tuning process, it may not always
align with budget constraints. Nonetheless, it serves as a practical starting point for sizing and configuring Dell Data
Lakehouse’s Analytics Engine clusters effectively. Throughout this guide, we’ll delve into more detailed strategies for
determining cluster size and optimizing performance to meet specific requirements.
Baseline advice
The following advice, derived from extensive research and troubleshooting of Trino clusters, applies to data lakehouse
environments utilizing prebuilt nodes with pre-tuned and configured hardware and software layers that align with the
baseline advice. Enabling spilling at the OS level is a common misstep that’s incompatible with Trino’s design. Spilling
involves writing memory to disk, disrupting performance due to the compacting garbage collector used by JVM and
leading to network congestion, especially in cloud environments. Disabling spilling is imperative to maintain cluster
stability. Additionally, Kubernetes’ Runaway process detector, designed to prevent excessive CPU usage, can misinterpret
Trino’s CPU-intensive nature, leading to process termination. Therefore, it’s crucial to adjust these settings accordingly.
Regular upgrades are also essential for staying ahead of performance issues and security vulnerabilities. With frequent
releases containing critical fixes and enhancements, it’s advisable to plan for upgrades at least once a quarter to ensure
optimal cluster performance and security compliance. The good news is that the Dell Data Lakehouse comes with these
tunings out of the box.
Cluster sizing – CPU and Memory
When sizing a Dell Data Analytics Engine cluster, focus on CPU and memory allocation. CPU handles data processing tasks such as joins and aggregations, and more CPU usually means faster queries. Memory is binary: either there is enough for the query, or there isn't. More CPU without enough memory limits concurrent processing. DDAE's CPU requirements per
query are stable, making resource planning easier. However, adding more CPU doesn’t always lead to linear performance
improvements due to resource contention. Memory allocation is crucial, especially for operations like joins and group by.
Optimize query plans to minimize memory usage. Determining the right cluster size depends on workload characteristics
and user expectations. Engage with vendors or Dell Experts for systematic cluster sizing or take an iterative approach,
scaling resources based on observed workload demands. Keep monitoring for optimal performance. Cluster sizing may
seem straightforward but requires careful planning and ongoing optimization for the best Dell Data Lakehouse cluster
performance.
Machine Sizing
Given the total CPU and memory requirements, the challenge is to allocate these resources across individual machines.
Memory
Understanding memory allocation for DDAE on machines is essential. Machines have various layers of memory allocation,
including the kernel, network buffers, and system processes. The Java Virtual Machine (JVM) manages memory for DDAE,
but it’s not solely determined by the heap size (Xmx). Memory is also allocated for thread stacks, native code structures,
garbage collector data structures, and native buffers for IO operations. Configuring JVM options is crucial for optimizing
memory allocation, especially for controlling the size of native buffers. Additionally, heap memory caters to user queries,
but a significant portion remains unattributed or untracked, including shared buffers and temporary result pages.
Understanding these nuances is vital for effective memory management and optimizing DDAE performance.
Figure 3. Machine Memory stack of DDAE
Memory allocations
Allocate 80% of available memory to the JVM heap for DDAE, reserving 20% for the OS and other system overheads.
Within the JVM heap, assign 70% to DDAE’s query memory pool and the remaining 30% for shared resources and
untracked memory. It’s worth noting that DDAE nodes come prebuilt with recommended memory allocation configurations,
ensuring optimal performance out of the box. Compared to distributed frameworks like YARN and Spark, DDAE’s single
JVM architecture enables efficient memory sharing across queries, potentially leading to better memory utilization.
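Applying those percentages to a 256GB worker node gives the rough breakdown below; this is simple arithmetic on the stated ratios, not the exact prebuilt JVM settings.

node_memory_gb = 256

jvm_heap_gb = node_memory_gb * 0.80          # JVM heap; remainder left for OS and overhead
query_pool_gb = jvm_heap_gb * 0.70           # DDAE query memory pool
shared_untracked_gb = jvm_heap_gb * 0.30     # shared buffers and untracked allocations

print(round(jvm_heap_gb), round(query_pool_gb), round(shared_untracked_gb))   # ~205, ~143, ~61 GB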
Shared Join Hash
In DDAE, joins are executed by building a large hash table, a method unique in the Big Data space. Multiple threads
share this hash table, enabling efficient probing. Each thread reads data into a buffer and performs lookups in the hash
table, such as converting zip codes to city names. This parallel processing allows for scaling up or down the number of
threads based on workload. The key advantage lies in leveraging memory efficiently; with faster processing, memory can
be cleared more rapidly for reuse, maximizing its utilization. Hence, speed directly impacts memory efficiency, as slower
operations tie up memory longer, necessitating additional memory resources.
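The following toy sketch mirrors that idea in plain Python: one hash table is built from the smaller (build) side and probed concurrently by several threads reading batches of the larger (probe) side. Real DDAE workers use far more efficient data structures, so treat this purely as a conceptual illustration.

from concurrent.futures import ThreadPoolExecutor

# Build side: small dimension table (zip code -> city), shared by all probe threads.
build_side = {"94016": "San Francisco", "10001": "New York", "73301": "Austin"}

# Probe side: batches of fact rows (order_id, zip code).
probe_batches = [
    [(1, "94016"), (2, "10001")],
    [(3, "73301"), (4, "94016")],
]

def probe(batch):
    # Each thread looks its rows up in the shared hash table.
    return [(order_id, build_side.get(zip_code)) for order_id, zip_code in batch]

with ThreadPoolExecutor(max_workers=2) as pool:
    for result in pool.map(probe, probe_batches):
        print(result)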
Distributed join
In a distributed join scenario, multiple DDAE machines are involved. The smaller table is read by these machines, and each
row is hashed by its key and distributed across the available machines. For instance, if the distribution is based on user
ID and there are two machines, user one’s data goes to one machine, user two’s to another, and so forth. Parallel threads
then construct hash tables based on this distribution.
On the probe side, the same hash key is applied, directing related data to the same machines. This enables multi-threaded
processing for efficient join operations. Ultimately, each machine holds a fraction of the total rows, determined by the
number of machines involved.
In a DDAE cluster, each machine undertakes specific tasks, which can be distributed across the same or different machines,
and executed in parallel threads. As a result, each machine ends up handling a portion of both the build and probe rows.
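A minimal sketch of the redistribution step follows: rows from both sides are routed to a machine by hashing the join key, so matching keys always land on the same machine. The modulo-of-hash scheme below is a simplification of the exchange DDAE actually performs.

NUM_MACHINES = 2

def target_machine(join_key):
    # The same hash function on both build and probe sides co-locates matching keys.
    return hash(join_key) % NUM_MACHINES

build_rows = [(1, "alice"), (2, "bob"), (3, "carol")]        # (user_id, name)
probe_rows = [(1, "login"), (3, "purchase"), (2, "logout")]  # (user_id, event)

partitions = {m: {"build": [], "probe": []} for m in range(NUM_MACHINES)}
for user_id, name in build_rows:
    partitions[target_machine(user_id)]["build"].append((user_id, name))
for user_id, event in probe_rows:
    partitions[target_machine(user_id)]["probe"].append((user_id, event))

for machine, part in partitions.items():
    print(machine, part)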
Skew
Skew occurs when data distribution is uneven, disrupting the balance of processing resources across DDAE machines.
For example, assigning user ID zero to logged-out users can lead to disproportionate data allocation, causing memory
overflow and query failures on some machines while others remain underutilized. Skewed distribution can also result
in some DDAE worker nodes processing data much slower than others, leading to query delays and increased resource
usage. While skew can sometimes benefit operations like group-by by spreading out partial aggregation tasks, it can
exacerbate issues with certain types of aggregations, such as arrays.
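One practical way to spot this kind of skew before it causes failures is to measure how unevenly the join key is distributed; the sketch below flags keys whose row count far exceeds a fair per-worker share, with the threshold chosen arbitrarily for illustration.

from collections import Counter

# Illustrative event stream where user_id 0 stands in for logged-out users.
user_ids = [0] * 9_000 + list(range(1, 1_001))

counts = Counter(user_ids)
total = len(user_ids)
num_workers = 4
fair_share = total / num_workers

# Flag any single key holding more rows than half a worker's fair share.
skewed = {uid: n for uid, n in counts.items() if n > 0.5 * fair_share}
print(skewed)  # {0: 9000} -- one key would overload whichever worker receives it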
Use bigger machines
Unlike systems favoring numerous small machines, adopting larger machines offers significant advantages.
Consolidating your total cluster requirement into fewer, more robust machines reduces memory overhead per machine.
This stems from fixed-size data structures at both the OS and JVM levels, which don’t scale proportionally with machine
size. The primary advantage lies in mitigating issues such as skew and scheduler inefficiencies.
Larger machines, equipped with more cores, compensate for scheduling inaccuracies by efficiently handling additional
workloads. Consequently, errors in workload distribution become less impactful, particularly when addressing skew. By
allocating more resources to process skewed data, larger machines ensure swift query completion, mitigating the impact
of data imbalance.
Moreover, using bigger machines reduces overhead associated with operations like broadcast joins, streamlining resource
allocation and enhancing system efficiency. Dell Data Lakehouse nodes are prebuilt with hardware and software, offer
64 vCPUs and 256GB Memory on each worker and coordinator node, providing robust infrastructure for data-intensive
workloads. These nodes are optimized for balance of query completion and query performance based on extensive
expertise in Trino workloads.
Machine types
The machine types used in the Dell Data Lakehouse closely resemble larger machine types, featuring 256GB of memory and 64 virtual cores each. Notably, other machine types, such as those available in the public cloud, typically align network capabilities with core count, which is crucial for workload distribution. However, opting for fewer cores may result in network limitations, hindering data processing efficiency. Balanced and larger machines are preferred to mitigate such issues. The worker and coordinator nodes of the Dell Data Lakehouse are built with ample resources, as recommended above.
Additional thoughts
Hash join vs (sort) merge join
Sort merge joins are advantageous only when sorting isn’t required, as sorting is resource-intensive and typically involves
disk usage for temporary storage. Despite being efficiently executed by systems like Hive and Spark due to their inherent
support for map reduce, sort merge joins face challenges with parallelism and dynamic workload division. In contrast,
hash joins, as implemented by DDAE, offer greater flexibility and scalability through distributed hash tables, making them
the preferred choice for efficient join operations.
Spilling
Spilling is a technique used in join operations where part of the hash table is temporarily stored on disk, along with
matching rows, to alleviate memory constraints. However, spilling has significant drawbacks, as disk access is slow
and SSD storage is costly. Spilling across networks exacerbates the problem. Moreover, writing to disk consumes CPU
resources and slows down queries, frustrating users and wasting time. Instead of investing in disks, it's more cost-effective to add more memory to the cluster or to scale out by adding more machines. Analyzing and optimizing large queries to reduce memory usage is also advisable to minimize the need for spilling. To address spilling, DDAE nodes are provisioned with 256GB of memory and with SSDs as the spill target if any spilling does occur. Adding more DDAE nodes is also recommended to eliminate performance issues.
Small clusters
Small clusters, defined as those capable of running only a few queries at a time, present unique challenges. They are
susceptible to performance issues if the workload grows beyond their capacity. For instance, if a query grows in size,
it can hit the cluster’s limits, causing delays or failures. Moreover, small clusters struggle with concurrency, leading to
queuing problems when multiple users submit queries simultaneously. Issues like skew can exacerbate problems in such
constrained environments.
Handling small clusters effectively requires careful planning and management. Ideally, aim for a diverse mix of concurrent
workloads to fully utilize the cluster’s resources. However, if workload growth is anticipated or experienced, scaling out
by adding more nodes is recommended. Additionally, consider implementing phased execution strategies to optimize
resource utilization and mitigate memory constraints.
While small clusters pose challenges, DDAE nodes offer larger memory configurations of 256GB and SSD disks, ensuring
optimal performance even in constrained environments.
Tuning the workload
Tuning the workload involves optimizing query performance and resource utilization in the cluster. This can be achieved by
reducing the workload’s computational complexity to make queries faster and more efficient. Additionally, implementing
resource sharing mechanisms ensures fair allocation of resources among users. These strategies improve overall cluster
efficiency and user satisfaction.
Query plan
Understanding query plans is crucial as they can significantly impact query performance, potentially reducing execution
time from hours to minutes. Factors such as join order and join type (e.g., right side join, left side join, bushy join) can
drastically affect performance. Choosing between distributed joins and broadcast joins can also have a substantial
impact on query execution time.
In addition to optimizing the query plan, leveraging advanced SQL constructs can improve efficiency. Knowledge of
constructs like grouping sets, top and row number functions, and efficient filtering techniques (e.g., count of a column
with filtering) can lead to significant performance gains. Utilizing approximate functions for operations like count distinct
can also enhance query performance while maintaining reasonable accuracy.
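As a small example of that last point, the approx_distinct function available in Trino-based engines such as DDAE keeps a fixed-size HyperLogLog sketch instead of the full distinct set, so memory stays bounded regardless of cardinality; the table and column names below are placeholders.

# Exact distinct count: builds the full distinct state in memory, expensive at scale.
exact_sql = "SELECT count(DISTINCT user_id) AS users FROM lake.web.events"

# Approximate distinct count: bounded memory, with a small default standard error
# (about 2.3% in open-source Trino).
approx_sql = "SELECT approx_distinct(user_id) AS users FROM lake.web.events"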
Pre-computation is another valuable technique for optimizing queries, allowing certain computations to be performed in
advance to reduce query execution time. By incorporating these strategies, query performance can be greatly improved,
resulting in faster and more efficient data processing.
Precomputing
Precomputing is a powerful technique for optimizing queries, especially for expected or recurrent queries with common
patterns. One approach involves running the query once and storing the result in a fast store, which can be reused instead
of rerunning the query each time. For instance, if a query is regularly executed, a scheduled job can precompute the
results for future use.
Another straightforward method is to optimize the source tables of common queries by storing them in a faster format.
For example, switching from text files with JSON to more efficient formats like ORC files can significantly improve query
performance by eliminating the need for parsing large JSON documents.
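A minimal sketch of such a conversion, assuming a Hive catalog named hive and hypothetical schema and table names, is a CREATE TABLE AS statement that rewrites the JSON-backed table into ORC:

```sql
-- Rewrite a slow JSON/text table into ORC so queries no longer pay the cost
-- of parsing large JSON documents (catalog, schema, and table names are
-- hypothetical).
CREATE TABLE hive.analytics.events_orc
WITH (format = 'ORC')
AS
SELECT *
FROM hive.analytics.events_json;
```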
For more complex queries, precomputing expensive parts of the query and storing them can be beneficial. By identifying
and computing costly components of the query separately and storing them for reuse, overall query performance can
be enhanced. However, this approach may require modifications to the application to reference the precomputed results
effectively.
Additionally, storing partial aggregations or rollups can be valuable for optimizing dashboards or reporting systems.
By computing and storing aggregated data periodically, queries can be simplified, resulting in improved performance.
Although powerful, this technique often entails substantial changes to the application and dashboard logic.
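The sketch below illustrates a precomputed daily rollup under the same hypothetical naming; the refresh schedule is left to whatever job scheduler is already in place:

```sql
-- Precompute a daily rollup that dashboards query directly instead of
-- scanning the raw fact table on every refresh (all names are hypothetical).
CREATE TABLE hive.analytics.sales_daily
WITH (format = 'ORC')
AS
SELECT
  sale_date,
  store_id,
  sum(amount) AS total_amount,
  count(*)    AS txn_count
FROM hive.analytics.sales
GROUP BY sale_date, store_id;
```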
While some may consider sampling data as a means of optimization, it is a challenging task that requires a deep
understanding of statistics. Sampling queries can lead to inaccurate results if not executed properly, making it a risky
optimization strategy. Instead, investing in additional hardware resources may be more effective than attempting complex
sampling methodologies, which often require specialized expertise and significant time investment.
Connectors
Understanding the data store you're connecting to is crucial for workload sizing. The most common data source
discussed is Hive, which encompasses object stores like S3 with file formats such as ORC and Parquet. Despite the
misconception that DDAE sends queries to Hive, it reads data directly, making the Hive connector the primary CPU user in
the cluster. This connector is optimized for parallel processing and efficiently decompresses the data it reads over the network.
Another common data source is JDBC-based, including MySQL, PostgreSQL, SQL Server, and Redshift. DDAE interacts
with these using their JDBC drivers, which can be slower due to uncompressed data transfer over a single connection.
DDAE has optimized connectors for parallel processing, but there are still limitations on data transfer rates per connection.
While there are other connectors in DDAE, most are either single-threaded or connect to smaller data stores that don’t
heavily utilize CPU. Therefore, the focus of the discussion will primarily be on the Hive connector.
Configuration Recommendations
Organize the data for the Hive connector
When organizing data for the Hive connector in the Dell Data Analytics Engine, it's essential to leverage the physical
organization of data from connectors. DDAE can optimize query execution based on the data’s organization, such as
partitioning and sorting. Hive’s unique partitioning strategy involves organizing data into directories based on column
values, commonly used for date partitions. Additionally, bucketing involves hashing data on a column and dividing it into
files. Both partitioning and bucketing strategies can be used simultaneously to optimize query performance in DDAE.
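As a hedged example of declaring both strategies through the Hive connector (the table, columns, and bucket count are hypothetical; verify the exact table property names for your DDAE version):

```sql
-- Hive-connector table that is partitioned by date and bucketed by user_id.
-- Partition columns must be the last columns in the definition.
CREATE TABLE hive.analytics.clicks (
  user_id     bigint,
  url         varchar,
  click_time  timestamp,
  click_date  date
)
WITH (
  format         = 'ORC',
  partitioned_by = ARRAY['click_date'],
  bucketed_by    = ARRAY['user_id'],
  bucket_count   = 32
);
```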
Hive partitioning
Hive partitioning offers significant advantages by reducing scan size, making queries more efficient. In most organizations,
queries typically are focused on specific date ranges, allowing the system to access only relevant directories and files.
This dramatically cuts down query time compared to scanning through all available data.
However, partitioning is most effective for lower cardinality data; for instance, partitioning by Boolean values might
not provide substantial benefits. In DDAE, functions can be applied to partitioned data, leveraging the enumerated values
from the metadata store. While powerful, this approach has limitations when dealing with complex expressions involving
partition columns.
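A short sketch of both behaviors, reusing the hypothetical clicks table from the previous example:

```sql
-- A filter directly on the partition column lets the engine read only the
-- matching date directories.
SELECT count(*)
FROM hive.analytics.clicks
WHERE click_date BETWEEN DATE '2024-03-01' AND DATE '2024-03-07';

-- Wrapping the partition column in an expression can defeat pruning; prefer
-- predicates on the raw partition column, for example avoid:
--   WHERE date_format(click_date, '%Y-%m') = '2024-03'
```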
Hive bucketing
Data bucketing is a powerful technique used to divide data into buckets based on specific criteria, such as user IDs.
By assigning each data item to a bucket, queries can be more targeted, reducing the amount of data that needs to be
scanned. In DDAE, enabling bucketed execution allows for even greater efficiency in operations like group by or join. With
bucketed data, operations can be performed on individual buckets without the need for data redistribution across the
cluster, significantly reducing processing costs. However, caution must be exercised to avoid skew, where one bucket
may become disproportionately large, leading to inefficient resource utilization. While bucketing requires additional effort
up front, and the bucket layout is fixed once written (changing it requires rewriting the data), it can significantly improve query performance, especially when combined with file formats like
ORC or Parquet.
ORC and Parquet
ORC (Optimized Row Columnar) and Parquet are file formats designed for efficient data storage and retrieval, particularly
suited for relational databases. While Parquet is more popular, ORC is known to offer dramatically faster performance.
Parquet, originally designed for document stores, may not be as efficient for relational databases but still provides
adequate performance. It’s crucial to compress data to enhance performance, with Zstandard compression being highly
recommended for its superior efficiency in both CPU utilization and read speed. Avoid using algorithms like Snappy or
LZO for compression.
ORC and Parquet files are optimized for efficient reading, resulting in faster query execution. They allow for selective retrieval of specific columns without the need to decompress the entire dataset, making them well-suited for queries targeting specific data fields. However, tables with very many columns produce small column chunks, which increases I/O latency and slows down the system. Sorting data can help improve performance by facilitating faster data access. Overall, ORC offers more than a 2x performance improvement compared to Parquet in Trino, making it the preferred choice for optimal performance.
File size
In HDFS clusters, file size is a critical factor affecting performance. In Trino, any file less than 8 megabytes is considered
small, leading to inefficient I/O operations. Both ORC and Parquet readers will load these small files entirely into memory,
negating the benefits of selective data retrieval. While other systems like Hive and Spark can handle small files better,
DDAE’s I/O optimizations are geared towards larger file sizes. Having numerous small files can slow down file listing
operations, especially in S3. Additionally, small files result in poorer compression efficiency. To optimize performance, it’s
essential to avoid small files and aim for reasonably sized ones, preferably around 100 megabytes. Rewriting tables to
consolidate small files can lead to significant performance improvements, sometimes up to 10 times faster.
Bad parquet files
Beware of bad Parquet files, which can cause performance issues. Some systems like Snowflake and Greenplum may
export data to S3 in Parquet format, but the files are internally divided into 4K chunks. This microscopic size can severely
impact performance, especially in HDFS, Hive, and Spark environments. While DDAE’s Trino has optimizations to handle
this better, the files remain inefficient. If receiving dumps from non-Hadoop systems, check them using the Parquet tool
and notify the vendors to address the issue for better performance.
Rewrite table with Trino ORC writer
One effective method to enhance system performance is to rewrite your tables using DDAE's Trino ORC writer. Unlike
the Hive or Spark writers, our custom ORC writer is optimized for easy file reading, resulting in better file organization and
improved performance. Additionally, automatic statistics collection is enabled when writing tables, providing valuable
data insights. Consider switching to the Zstandard compression algorithm for better compression efficiency and explore
partitioning and bucketing options for further optimization. While not always feasible for all datasets, rewriting the most
used ones can yield significant performance gains.
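A minimal sketch of such a rewrite, with hypothetical table names; ANALYZE is shown in case statistics are not collected automatically in your configuration:

```sql
-- Rewrite a heavily used table with the Trino ORC writer; this also
-- consolidates small files into larger, better-organized ones.
CREATE TABLE hive.analytics.orders_orc
WITH (format = 'ORC')
AS
SELECT * FROM hive.analytics.orders_raw;

-- Recompute table and column statistics explicitly if needed.
ANALYZE hive.analytics.orders_orc;
```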
Making queries faster
Optimizing query performance is key, and we’ve explored various strategies like analyzing query plans, using approximate
functions, and pre-computing data. Yet, practical considerations are vital too.
Monitor resource usage closely. If CPU usage peaks during high traffic, consider scaling out by adding more machines.
Address network bandwidth constraints promptly, especially in fixed-capacity cloud environments. Similarly, beware of
storage system saturation, which may require expanding storage.
Don’t overlook the Hive Metastore database, as it can also impact performance. Watch its CPU usage closely.
Now, when faced with slow queries, prioritize resource optimization, and consider scaling up resources as needed.
What to look for in a query
When troubleshooting slow queries, start by requesting a concise reproduction of the problematic section instead of
sifting through lengthy query texts. Look out for common issues like excessive DISTINCT clauses or UNIONs without
UNION ALL, which can slow down queries. Regular expressions and JSON parsing are also CPU-intensive operations that
should be minimized.
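Two hedged illustrations of these points, using hypothetical tables and JSON paths:

```sql
-- UNION deduplicates rows and forces an expensive global aggregation; when
-- duplicates are acceptable or impossible, UNION ALL is much cheaper.
SELECT order_id FROM hive.sales.orders_2023
UNION ALL
SELECT order_id FROM hive.sales.orders_2024;

-- Parse each JSON document once and reuse the result instead of calling the
-- JSON functions repeatedly on the same payload.
WITH parsed AS (
  SELECT json_parse(payload) AS doc
  FROM hive.events.raw_events
)
SELECT
  json_extract_scalar(doc, '$.user.id')   AS user_id,
  json_extract_scalar(doc, '$.user.plan') AS plan
FROM parsed;
```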
Check the query plan using `EXPLAIN ANALYZE` to identify costly operations, such as table scans or expensive joins.
Ensure partition filters are correctly applied and watch out for expanding joins, where the number of output rows
far exceeds the input. Evaluate filter conditions, especially those involving JSON processing, for potential
performance bottlenecks.
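A short, hypothetical example of the kind of query to run through EXPLAIN ANALYZE when investigating a slow report:

```sql
-- EXPLAIN ANALYZE executes the query and annotates each plan stage with
-- actual CPU time, row counts, and memory, which makes expanding joins and
-- missing partition filters easy to spot (table names are hypothetical).
EXPLAIN ANALYZE
SELECT c.region, count(*) AS orders
FROM hive.sales.orders o
JOIN hive.sales.customer c ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2024-01-01'
GROUP BY c.region;
```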
If the issue persists, focus on expensive functions or operations in the query and experiment with optimizations to
improve performance.
More hardware
Adding more hardware can sometimes be a quick solution to improve performance. Spin up additional machines
temporarily to see if it alleviates the issue. However, there are limits to how much adding hardware can improve
performance, especially if there are bottlenecks in other areas like I/O.
Keep in mind that simply adding more hardware won’t always make queries faster if there are limitations in resource
utilization, such as insufficient splits or cores. Consider techniques like partitioning and bucketing to optimize resource
allocation and avoid the small file problem.
Overall, while adding more hardware can be effective for handling concurrent workloads, it’s essential to understand the
underlying constraints and optimize resource utilization accordingly.
Underutilization
If you find yourself with idle CPU despite having plenty of work, there are common causes to investigate. Often, these
issues stem from external systems being overloaded, such as the Hive Metastore or slow file listing.
• Hive Metastore: Check if the Hive Metastore is slow, which can bottleneck operations like partition listing. Use JMX stats to pinpoint slow operations and address database overload issues if necessary (see the JMX query sketch at the end of this section).
• File Listing: Slow file listing, particularly in S3 or overloaded HDFS name nodes, can cause delays. Utilize JMX stats to
monitor file listing times and optimize file organization if needed.
• Read Speeds: Slow reads are often caused by numerous tiny files or network overload. Ensure files are compressed to
reduce network usage, and monitor JMX stats to diagnose network-related issues.
• Skew: Skew, where certain data is disproportionately distributed, can be challenging to resolve. Consider filtering
skewed data or distributing it more evenly to improve performance. DDAE does not handle skew automatically, so manual
adjustments may be necessary.
To address underutilization, consider increasing task concurrency, adjusting thread counts, or upgrading to larger
machines. These measures can help maximize resource usage and improve overall system performance.
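If the Trino JMX connector is configured as a catalog (assumed here to be named jmx; it is not enabled by default), these statistics can be explored with plain SQL. The java.lang MBean below is standard; connector- and metastore-specific MBean names vary by version, so list them first:

```sql
-- List the available MBeans, then query one of the standard JVM MBeans.
SHOW TABLES FROM jmx.current;

SELECT node, vmname, vmversion
FROM jmx.current."java.lang:type=runtime";
```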
Hive caching
Caching can mitigate many of the earlier issues but must be used judiciously due to its limitations.
• Metastore Data Caching: Caching Metastore data is effective but may not reflect new tables or partitions until
refreshed. However, it’s a powerful solution for improving performance, especially for Metastore-related issues.
• File Listing Caching: While file listing caching works well, it may miss new files. Yet, for systems employing
partitioning for data management, this limitation is negligible.
• File Data Caching: Exercise caution with file data caching as it's relatively new and requires local disks on DDAE
nodes. Although it can enhance performance, it necessitates additional resources and may not significantly benefit
S3 storage due to its design.
Consider caching options carefully, focusing on addressing specific performance bottlenecks rather than relying on
caching alone. While it can provide substantial improvements, particularly in Hadoop environments, it’s essential to
assess its suitability for your specific use case and infrastructure.
Sharing resources / resource groups
• User-Centric Approach: Prioritize user satisfaction when configuring resource groups, focusing on optimizing their
experience over technical considerations.
• Quick Query Queue: Dedicate a special queue for fast and trivial queries, such as ‘explain,’ ensuring immediate
execution and enhancing user satisfaction.
• Tag-Based Limits: Implement query limits based on user-provided tags, allowing fast queries to run within a shorter
execution time limit.
• Group-Based Resource Allocation: Organize users into groups based on teams or departments and assign resource
limits accordingly, empowering them to manage resources collaboratively.
• User Empowerment: Enable users to terminate each other’s queries, fostering a culture of self-management and
reducing dependence on administrators.
• Psychological Insights: Utilize knowledge of human behavior and social dynamics to optimize resource utilization and
minimize user complaints effectively.
Security and Governance
Security and governance are paramount considerations in any data environment, and the Dell Data Lakehouse product is
no exception. Built on the foundation of Dell Data Lakehouse System Software, the Lakehouse is a prebuilt cluster that integrates
hardware, network, storage, and software to provide a comprehensive data solution. This section provides an overview of
our recommendations for securing data within the Dell Data Lakehouse environment, compliance considerations for data
governance and privacy regulations, and access control mechanisms and authentication methods.
Recommendations for securing data within the Lakehouse environment
Securing data within a Dell Data Lakehouse involves a comprehensive approach encompassing several key steps:
• Data Identification and Classification: Begin by meticulously identifying and classifying data based on its source, type,
value, and risk level. This initial step lays the foundation for applying tailored security policies and controls to different
data categories, ranging from public to highly confidential. Leveraging the Dell Data Lakehouse for outcomes like
metadata management, data cataloging, and data quality checks facilitates this classification process. Additionally, you
can bring this metadata into data products to promote better reuse by ensuring proper governance and trust in data.
• Encryption of Data: To fortify data security, encrypt both data at rest and in transit. Encryption transforms data into
an unreadable format that requires decryption with a key, rendering it inaccessible to unauthorized users. Employ
encryption software, encryption keys, and robust key management systems to implement encryption effectively,
safeguarding data even in the event of a breach. The Dell Data Lakehouse storage cluster supports all industry-standard encryption mechanisms to secure data at rest and in transit.
• Access Control and Authentication Mechanisms: Implement stringent access control and authentication
mechanisms to regulate data access based on predefined rules and user roles. Access control dictates data access
permissions, while authentication verifies the identity of users seeking access. Utilize Dell Data Lakehouse capabilities such as identity and access management, role-based access control (RBAC), and multi-factor authentication to enforce robust access controls and authentication protocols. The Dell Data Lakehouse user management tool provides all industry-standard authentication mechanisms, and the Dell Data Analytics Engine's Access Control feature provides the best RBAC capabilities.
• Data Masking and Anonymization: Protect sensitive information by applying data masking and anonymization
techniques. Data masking obscures or substitutes sensitive data elements with fictitious or random values, while
anonymization alters or removes identifying information from datasets. These measures ensure data privacy
and compliance with regulations like GDPR. Employ data masking software, anonymization algorithms, and
pseudonymization methods to implement these techniques effectively. The Dell Data Analytics Engine's access control mechanism also provides row- and column-level data masking and anonymization; a generic SQL sketch of view-based masking follows this list.
• Monitoring and Auditing Data Activity: Continuously monitor and audit data activity and events within the Dell Data
Lakehouse environment. Monitoring involves collecting and analyzing data on performance, health, and usage, while
auditing entails recording and reviewing data access and modification history. This proactive approach helps detect
and prevent anomalies, breaches, or compliance violations. Utilize Dell Data Lakehouse dashboards, logs, alerts, and
reports to facilitate effective monitoring and auditing processes.
• Regular Update and Testing of Data Security Measures: Data security is an ongoing endeavor that requires regular
updates and testing. Continuously refine data security policies and tools to adapt to evolving Dell Data Lakehouse
environments, sources, types, and regulatory standards. Regularly test data security systems and processes to
ensure their effectiveness and reliability. Employ Dell Data Lakehouse tools such as data security audits, reviews,
and simulations to validate the robustness of security measures and identify areas for improvement.
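The sketch below is a generic, view-based illustration of column masking in SQL, not the DDAE Access Control feature itself; all catalog, schema, table, and column names are hypothetical:

```sql
-- Analysts query the masked view while direct access to the underlying
-- table stays restricted. DDAE's Access Control feature provides row- and
-- column-level masking natively through policies; this is only a concept
-- sketch.
CREATE VIEW hive.crm.customer_masked AS
SELECT
  customer_id,
  regexp_replace(email, '(^.)[^@]*', '$1***') AS email,  -- keep first character and domain
  substr(phone, -4)                           AS phone_last4
FROM hive.crm.customer;
```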
Conclusion
The Dell Data Lakehouse has been developed to address the needs of organizations deploying advanced analytics and
AI and ML workloads. It incorporates the concepts of a lakehouse architecture along with a container platform using
decoupled compute and storage.
The technical solution guide offers detailed product information and design guidance for Dell Data Lakehouse. It targets
data analytics infrastructure managers and architects. It describes a predesigned, validated, and scalable architecture for
advanced analytics and machine learning on Dell hardware infrastructure. Topics that were discussed include:
• The Dell Data Lakehouse cluster architecture, including cluster server and storage infrastructure and its role
in the system.
• Sizing guidelines, including the factors to take into consideration when sizing compute and storage clusters.
• Configuration recommendations and best practices for configuring hardware, network, and software dependencies.
• Deployment options, including guidance on on-premises deployment and integration with other Dell Data Lakehouse components.
• Performance optimization, covering general strategy, baseline advice, cluster sizing, machine sizing, workload tuning, Hive data organization, making queries faster, and sharing resources.
• Security and governance recommendations.
References
Dell Technologies documentation
The following Dell Technologies documentation provides other information related to this document. Access to
these documents depends on your login credentials. If you do not have access to a document, contact your Dell
Technologies representative.
Additional information can be obtained at the Dell Technologies Info Hub for Data Analytics. If you need additional
services or implementation help, contact your Dell Technologies sales representative.
• Dell Data Lakehouse: Dell Data Lakehouse Sizing and Configuration Guide; Dell Data Lakehouse Spec Sheet
• Server specification sheets: PowerEdge R660 Spec Sheet
• Storage specification sheets: ECS EX500 Spec Sheet; ECS EXF900 Spec Sheet; ObjectScale Solution Overview; PowerScale H7000 Spec Sheet
• Switch specification sheets: PowerSwitch S3100 Series Spec Sheet; PowerSwitch S5200-ON Series Spec Sheet; PowerSwitch Z9264F-ON Spec Sheet
• Server manuals: PowerEdge R660 Manuals and Documents
• Storage manuals: ECS EX500 Manuals and Documents; ECS EXF900 Manuals and Documents; ObjectScale Overview and Architecture; PowerScale H7000 Manuals and Documents
• Switch manuals: PowerSwitch S3100 Manuals and Documents; PowerSwitch S5200-ON Series Manuals and Documents; PowerSwitch Z9264F-ON Manuals and Documents
Delta Lake documentation
The following documentation on the Delta Lake documentation website provides additional and relevant information.
• Lakehouse architecture introductory paper: Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
• Delta Lake project: Delta Lake Project website
• Delta Lake documentation: Delta Lake documentation website
Table 1. Delta Lake Documentation
DDAE documentation
The following documentation on the DDAE documentation website provides additional and relevant information.
• DDAE reference documentation: Dell Data Analytics Engine Reference Documentation
Table 2. Dell Data Analytics Engine documentation
Apache Iceberg documentation
The following documentation on the Apache Iceberg documentation website provides additional and relevant information.
• Iceberg table format: Apache Iceberg
Table 3. Apache Iceberg documentation
Dell Technologies Info Hub
The Dell Technologies Info Hub is your one-stop destination for the latest information about Dell Solutions products. New
material is frequently added, so browse often to keep up to date on the expanding Dell portfolio of cutting-edge products
and solutions.
More information
For more information, including sizing guidance, technical questions, or sales assistance, email
[email protected], or contact your Dell Technologies or authorized partner sales representative.