Python Cheatsheet — Data Engineer Interview (Fresher)
A comprehensive, example-driven cheatsheet: core data structures, sorting & searching algorithms,
libraries, and practical snippets you’ll use in interviews.
Contents
1. Core Data Structures & Complexity
2. Collections & Useful Libraries (pandas, numpy, heapq, bisect)
3. Sorting Algorithms (code + analysis)
4. Searching & Graph Traversals (code + examples)
5. Generators, Itertools, File I/O, Concurrency basics
6. Practical Data-Engineer Snippets (CSV/Parquet, Pandas ops, chunking, streaming)
7. Interview Tips & Common Questions
1. Core Data Structures & Complexity
Structure | Description | Common Ops (avg)
List | Ordered, mutable, allows duplicates. Backed by a dynamic array. | Index O(1), append O(1) amortized, insert/delete O(n)
Tuple | Ordered, immutable. Use for fixed records, keys in dict. | Access O(1)
Set | Unordered, unique items, hash-based. | Add/remove/membership O(1) average
Dict | Key-value map, hash-based. | Lookup/insert/delete O(1) average
Deque (collections.deque) | Double-ended queue: fast appends/pops at both ends. | append/pop O(1) at either end
Heap (heapq) | Binary min-heap via list. | push/pop O(log n)
Array (numpy.ndarray) | Contiguous typed array, vectorized ops. | Element access O(1); vector ops compact & fast
Examples

List
    nums = [1, 2, 3]
    nums.append(4)
    first_two = nums[:2]   # slicing

Dict
    student = {'name': 'Alice', 'age': 21}
    age = student.get('age')
    student['grade'] = 'A'

Set
    s = set([1, 2, 3])
    s.add(4)
    if 2 in s:
        ...

Deque
    from collections import deque
    d = deque([1, 2, 3])
    d.appendleft(0)
    val = d.popleft()
2. Collections & Key Libraries for Data Engineering
collections: Counter, defaultdict, namedtuple, deque — useful for logs, counts, grouping.
heapq: min-heap; to simulate a max-heap, push negated values (see the sketch after this list, which also covers bisect).
bisect: binary search helpers (bisect_left/right) for insertion points.
itertools: combinations, permutations, groupby, islice, chain — great for streaming data.
pandas: DataFrame/Series — core for ETL, aggregations, joins, resampling.
numpy: numerical arrays, vectorized ops—fast.
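A minimal sketch of the max-heap trick and bisect insertion points, with toy values for illustration:

    import heapq
    import bisect

    # max-heap via heapq: push negated values, negate again on pop
    h = []
    for x in [5, 1, 4]:
        heapq.heappush(h, -x)
    print(-heapq.heappop(h))           # 5 (largest)

    # bisect: find insertion points in a sorted list
    a = [1, 2, 4, 4, 7]
    print(bisect.bisect_left(a, 4))    # 2 (before existing 4s)
    print(bisect.bisect_right(a, 4))   # 4 (after existing 4s)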
Pandas — Quick Practical Examples
import pandas as pd

# read CSV in chunks (memory efficient)
for chunk in pd.read_csv('large.csv', chunksize=10_000):
    process(chunk)

# common ops
df = pd.read_parquet('data.parquet')
df = df.dropna(subset=['user_id'])
agg = df.groupby('country')['revenue'].sum().reset_index()

# merge
merged = df1.merge(df2, how='left', on='id')
3. Sorting Algorithms (with Python examples & complexity)
Selection Sort — O(n^2); Insertion Sort — O(n^2) (good for nearly-sorted input); Merge Sort — O(n log n),
stable; Quick Sort — O(n log n) average (classically in-place, though the example below uses a simpler
out-of-place version); Heap Sort — O(n log n). Built-in sorted()/list.sort() uses Timsort (stable,
O(n log n) worst case, optimized for existing runs).
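Since the summary calls out insertion sort for nearly-sorted input, here is a minimal sketch:

    def insertion_sort(arr):
        # move each element left until everything before it is <= it
        for i in range(1, len(arr)):
            key = arr[i]
            j = i - 1
            while j >= 0 and arr[j] > key:
                arr[j + 1] = arr[j]   # shift larger elements right
                j -= 1
            arr[j + 1] = key
        return arr

    # Usage
    print(insertion_sort([2, 1, 3, 5, 4]))   # [1, 2, 3, 4, 5]

On nearly-sorted input the inner while loop barely runs, giving close to O(n) behavior.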
def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])
    i = j = 0
    merged = []
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

# Usage
print(merge_sort([5, 2, 9, 1]))
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quick_sort(left) + middle + quick_sort(right)

# Usage
print(quick_sort([3, 6, 8, 10, 1, 2, 1]))
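Heap sort is listed above without code; a minimal sketch via heapq (out-of-place, but quick to write in an interview):

    import heapq

    def heap_sort(arr):
        h = list(arr)
        heapq.heapify(h)   # O(n) heap construction
        # n pops at O(log n) each => O(n log n) total
        return [heapq.heappop(h) for _ in range(len(h))]

    # Usage
    print(heap_sort([5, 2, 9, 1]))   # [1, 2, 5, 9]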
4. Searching & Graph Traversals
Binary Search (sorted array) — O(log n). DFS/BFS for graphs/trees: O(V+E). Use iterative (stack) or
recursive (watch recursion depth).
def binary_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

# Usage
print(binary_search([1, 2, 4, 5, 9], 5))
from collections import deque

def bfs(graph, start):
    visited = set([start])
    q = deque([start])
    order = []
    while q:
        node = q.popleft()
        order.append(node)
        for nb in graph.get(node, []):
            if nb not in visited:
                visited.add(nb)
                q.append(nb)
    return order

# Usage
graph = {'A': ['B', 'C'], 'B': ['D'], 'C': [], 'D': []}
print(bfs(graph, 'A'))
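DFS is mentioned above but only BFS is shown; a minimal iterative DFS with an explicit stack (sidesteps Python's recursion limit) might look like:

    def dfs(graph, start):
        visited = set()
        stack = [start]
        order = []
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            order.append(node)
            # push neighbors reversed so they are visited in listed order
            for nb in reversed(graph.get(node, [])):
                if nb not in visited:
                    stack.append(nb)
        return order

    # Usage (same graph as the BFS example)
    graph = {'A': ['B', 'C'], 'B': ['D'], 'C': [], 'D': []}
    print(dfs(graph, 'A'))   # ['A', 'B', 'D', 'C']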
5. Generators, Itertools, File I/O, Concurrency basics
Generators: lazy evaluation, memory efficient.
itertools: groupby, islice, chain, tee.
File I/O: use with open(...) as f; process in chunks for large files.
Concurrency: threading for I/O-bound, multiprocessing for CPU-bound, asyncio for async I/O (see the thread-pool sketch after the code below).
# generator example
def read_lines(path):
    with open(path, 'r') as f:
        for line in f:
            yield line.strip()

# itertools example (records must be sorted by the same key before groupby)
import itertools
for k, group in itertools.groupby(sorted(data, key=lambda x: x['user']),
                                  key=lambda x: x['user']):
    handle_group(k, list(group))
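The concurrency line above has no example; a minimal thread-pool sketch for the I/O-bound case, where fetch is a hypothetical stand-in for a network call:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def fetch(url):
        time.sleep(0.1)   # simulates network latency (I/O wait)
        return f'done: {url}'

    urls = [f'https://example.com/{i}' for i in range(8)]   # placeholder URLs
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(fetch, urls))   # I/O waits overlap across threads
    print(results[:2])

For CPU-bound work, swap in ProcessPoolExecutor so the GIL does not serialize the workers.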
6. Practical Data-Engineer Snippets
- Reading large CSV in chunks (pandas) and writing Parquet.
- Streaming from S3: use boto3, smart_open, or s3fs with pandas.
- Efficient joins: ensure indexing; use categorical dtypes for memory savings.
- Use the dtype argument in read_csv to reduce memory usage (see the dtype/categorical sketch after the snippet below).
- Use Parquet for faster I/O and compression.
- Use vectorized operations in pandas (avoid row-wise loops).
# read csv in chunks and write a parquet dataset partitioned by country
import pandas as pd

for chunk in pd.read_csv('big.csv', chunksize=100_000, dtype={'user_id': str}):
    # partition_cols writes out_parquet/country=<value>/ and adds new files
    # per call, so chunks accumulate instead of overwriting a single file
    chunk.to_parquet('out_parquet', index=False, compression='snappy',
                     engine='pyarrow', partition_cols=['country'])
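Two of the bullets above (the dtype argument and categorical dtypes) in one minimal sketch; the file and column names are hypothetical:

    import pandas as pd

    # declare narrow dtypes up front instead of letting pandas infer object/int64
    df = pd.read_csv('events.csv',   # hypothetical file
                     dtype={'user_id': 'string', 'clicks': 'int32'})

    # low-cardinality text columns shrink a lot as categoricals
    df['country'] = df['country'].astype('category')
    print(df.memory_usage(deep=True))   # compare before/after on real data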
7. Interview Tips & Common Questions
• Explain tradeoffs (time vs memory); use Big-O.
• Talk about stability of sorting when relevant.
• For data engineering: discuss data formats (CSV/Parquet/ORC), partitioning, schema, null handling.
• Be ready to write code (two-pointer, sliding window) and small ETL tasks (reading, grouping, aggregating).
• Common questions: implement an LRU cache, merge k sorted lists (see the sketch after this list), streaming median, deduplicate a large file (external sort).
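For one of the listed questions, merging k sorted lists, the standard library already has a lazy k-way merge; a minimal sketch:

    import heapq

    lists = [[1, 4, 7], [2, 5], [3, 6, 8]]
    merged = list(heapq.merge(*lists))   # O(N log k), streams lazily
    print(merged)   # [1, 2, 3, 4, 5, 6, 7, 8]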
Generated for: Freshers preparing for Data Engineer interviews — concise, practical, and example-driven. Good luck!