
2000+ Data Engineering Interview Questions !!


OBenner / data-engineering-interview-questions

More than 2000 data engineering interview questions.

More than 2000 questions for preparing for a Data Engineer interview.

Full list of questions

Interview questions for Data Engineer


Databases and Data Warehouses

| Questions | Description | GitHub Repo / Official page | Useful links |
|---|---|---|---|
| Cassandra | Apache Cassandra is a distributed, wide-column store, NoSQL database management system. | | Awesome Cassandra |
| Greenplum | Greenplum is a big data technology based on MPP architecture and the Postgres open source database technology. | | Awesome Greenplum |
| MongoDB | MongoDB is a document-oriented database. | | Awesome MongoDB |
| Apache Hbase | HBase is an open-source non-relational distributed database. | | Awesome HBase |
| Apache Hive | Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. | | Awesome Hive |
| Amazon DynamoDB | Amazon DynamoDB is a fully managed proprietary NoSQL database service. | | Awesome DynamoDB, Awesome AWS |
| Amazon Redshift | Amazon Redshift is a data warehouse product. | Amazon Redshift Utilities | Awesome AWS |
| BigQuery | BigQuery is a fully-managed, serverless data warehouse. | GCP | Awesome BigQuery |
| Bigtable | Bigtable is a fully managed wide-column and key-value NoSQL database service. | GCP | Awesome Bigtable |

Data Formats

| Questions | Description | GitHub Repo / Official page | Useful links |
|---|---|---|---|
| Apache Avro | Avro is a row-oriented remote procedure call and data serialization framework. | | Awesome Avro |
| Apache Parquet | Apache Parquet is a column-oriented data file format designed for efficient data storage and retrieval. | | TODO |
| Delta | Delta Lake is a storage framework that enables building a Lakehouse architecture with compute engines | | Delta examples |
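
The row-oriented versus column-oriented distinction in the Avro and Parquet descriptions can be illustrated with a minimal pure-Python sketch. It does not use either library or reflect their on-disk formats; the `records` data and the helper names `to_columns` and `column_sum` are hypothetical, chosen only to show why a columnar layout lets a query read one field without touching the others.

```python
# Illustrative only: plain Python lists and dicts, not the Avro or Parquet APIs.

# Row-oriented layout: each record keeps all of its fields together.
records = [
    {"id": 1, "city": "Oslo", "amount": 10.0},
    {"id": 2, "city": "Lima", "amount": 5.5},
    {"id": 3, "city": "Oslo", "amount": 7.25},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def column_sum(columns, name):
    """Aggregate one field by reading only that column's values."""
    return sum(columns[name])


# Column-oriented layout: all values of one field are stored together.
columns = to_columns(records)
total = column_sum(columns, "amount")
```

Appending a new record is cheapest in the row layout, while an aggregate such as `column_sum` only has to scan a single list in the column layout.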

Big Data Frameworks

| Questions | Description | GitHub Repo / Official page | Useful links |
|---|---|---|---|
| Apache Airflow | Apache Airflow is a workflow management platform for data engineering pipelines. | | Awesome Airflow |
| Apache Flume | Apache Flume is a distributed, reliable, and available software for efficiently collecting, aggregating, and moving large amounts of log data. | | TODO |
| Apache Hadoop | Apache Hadoop is a collection of software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. | | Awesome Hadoop |
| Apache Impala | Apache Impala is a parallel processing SQL query engine for data stored in a computer cluster running Apache Hadoop. | | TODO |
| Apache Kafka | Apache Kafka is a distributed event store and stream-processing platform. | | Awesome Kafka |
| Apache NiFi | Apache NiFi is a software project designed to automate the flow of data between software systems. | | Awesome NiFi |
| Apache Spark | Apache Spark is a unified analytics engine for large-scale data processing. | | Awesome Spark |
| Apache Flink | Apache Flink is a unified stream-processing and batch-processing framework. | | Awesome Flink |
| Kubernetes | Kubernetes is a system for managing containerized applications across multiple hosts. | | Awesome Kubernetes |

Cloud providers

| Questions | Description | GitHub Repo / Official page | Useful links |
|---|---|---|---|
| Amazon Web Services | Amazon Web Services is an online platform that provides scalable and cost-effective cloud computing solutions. | | Awesome AWS |
| Microsoft Azure | Microsoft Azure is Microsoft's public cloud computing platform. | | Awesome Azure |
| Google Cloud Platform | Google Cloud Platform is a suite of cloud computing services. | | Awesome GCP |

Theory

| Questions | Description | GitHub Repo / Official page | Useful links |
|---|---|---|---|
| DWH Architectures | A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. | | Awesome databases |
| Data Structures | A data structure is a specialized format for organizing, processing, retrieving and storing data. | | TODO |
| SQL | SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS). | | Awesome SQL |
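
As a small illustration of the SQL description above, the following sketch runs a few statements against an in-memory SQLite database through Python's standard `sqlite3` module. The `orders` table, its columns, and the sample rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.5), ("alice", 7.25)],
)

# Declarative query: total amount per customer, in alphabetical order.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()
```

The query states what result is wanted (a total per customer) and leaves the choice of how to compute it to the database engine.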

Data visualization tools/BI

| Questions | Description | GitHub Repo / Official page | Useful links |
|---|---|---|---|
| Tableau | Tableau is a powerful data visualization tool used in business intelligence. | | TODO |
| Looker | Looker is an enterprise platform for BI, data applications, and embedded analytics that helps you explore and share insights in real time. | Looker | TODO |
| Apache Superset | Superset is a modern data exploration and data visualization platform. | Apache Superset | TODO |

Contribution

Please contribute to this repository to help make it better. Any change, such as a new question or a code improvement, is welcome.

Contributors 4

OBenner Oleg Miagkov

wingkwong աӄա
khoramism Alireza Khorami

piyush-an Piyush