OBenner / data-engineering-interview-questions Public
More than 2000 Data Engineer interview questions.
More than 2000 questions to help you prepare for a Data Engineer interview.
Full list of questions
Interview questions for Data Engineer
Databases and Data Warehouses

| Questions | Description | Useful links |
| --- | --- | --- |
| Apache Cassandra | Cassandra is a distributed, wide-column store, NoSQL database management system. | Awesome Cassandra |
| Greenplum | Greenplum is a big data technology based on MPP architecture and the Postgres open source database technology. | Awesome Greenplum |
| MongoDB | MongoDB is a document-oriented database. | Awesome MongoDB |
| Apache HBase | HBase is an open-source non-relational distributed database. | Awesome HBase |
| Apache Hive | Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. | Awesome Hive |
| Amazon DynamoDB | Amazon DynamoDB is a fully managed proprietary NoSQL database service. | Awesome DynamoDB, Awesome AWS |
| Amazon Redshift | Amazon Redshift is a data warehouse product. | Amazon Redshift Utilities, Awesome AWS |
| BigQuery GCP | BigQuery is a fully-managed, serverless data warehouse. | Awesome BigQuery |
| Bigtable GCP | Bigtable is a fully managed wide-column and key-value NoSQL database service. | Awesome Bigtable |

Data Formats

| Questions | Description | Useful links |
| --- | --- | --- |
| Apache Avro | Avro is a row-oriented remote procedure call and data serialization framework. | Awesome Avro |
| Apache Parquet | Apache Parquet is a column-oriented data file format designed for efficient data storage and retrieval. | TODO |
| Delta | Delta Lake is a storage framework that enables building a Lakehouse architecture with compute engines | Delta examples |

Big Data Frameworks

| Questions | Description | Useful links |
| --- | --- | --- |
| Apache Airflow | Apache Airflow is a workflow management platform for data engineering pipelines. | Awesome Airflow |
| Apache Flume | Apache Flume is a distributed, reliable, and available software for efficiently collecting, aggregating, and moving large amounts of log data. | TODO |
| Apache Hadoop | Apache Hadoop is a collection of software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. | Awesome Hadoop |
| Apache Impala | Apache Impala is a parallel processing SQL query engine for data stored in a computer cluster running Apache Hadoop. | TODO |
| Apache Kafka | Apache Kafka is a distributed event store and stream-processing platform. | Awesome Kafka |
| Apache NiFi | Apache NiFi is a software project designed to automate the flow of data between software systems. | Awesome NiFi |
| Apache Spark | Apache Spark is a unified analytics engine for large-scale data processing. | Awesome Spark |
| Apache Flink | Apache Flink is a unified stream-processing and batch-processing framework. | Awesome Flink |
| Kubernetes | Kubernetes is a system for managing containerized applications across multiple hosts. | Awesome Kubernetes |

Cloud providers

| Questions | Description | Useful links |
| --- | --- | --- |
| Amazon Web Services | Amazon Web Services is an online platform that provides scalable and cost-effective cloud computing solutions. | Awesome AWS |
| Microsoft Azure | Microsoft Azure is Microsoft's public cloud computing platform. | Awesome Azure |
| Google Cloud Platform | Google Cloud Platform is a suite of cloud computing services. | Awesome GCP |

Theory

| Questions | Description | Useful links |
| --- | --- | --- |
| DWH Architectures | A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. | Awesome databases |
| Data Structures | A data structure is a specialized format for organizing, processing, retrieving, and storing data. | TODO |
| SQL | SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS). | Awesome SQL |

Data visualization tools/BI

| Questions | Description | Useful links |
| --- | --- | --- |
| Tableau | Tableau is a powerful data visualization tool used in business intelligence. | TODO |
| Looker | Looker is an enterprise platform for BI, data applications, and embedded analytics that helps you explore and share insights in real time. | TODO |
| Apache Superset | Superset is a modern data exploration and data visualization platform. | TODO |

Contribution
Please contribute to this repository to help make it better. Any change, such as a new question or a code improvement, is welcome.
Contributors 4
OBenner Oleg Miagkov
wingkwong աӄա
khoramism Alireza Khorami
piyush-an Piyush