Python Cheatsheet v2 Data Engineer
Uploaded by Shreyan Saha

Python Cheatsheet — Data Engineer Interview (Fresher)

Comprehensive, example-driven cheatsheet: core data structures, sorting & searching algorithms, libraries, and practical snippets you'll use in interviews.

Contents
1. Core Data Structures & Complexity
2. Collections & Useful Libraries (pandas, numpy, heapq, bisect)
3. Sorting Algorithms (code + analysis)
4. Searching & Graph Traversals (code + examples)
5. Generators, Itertools, File I/O, Concurrency basics
6. Practical Data-Engineer Snippets (CSV/Parquet, Pandas ops, chunking, streaming)
7. Interview Tips & Common Questions

1. Core Data Structures & Complexity


Structure                 | Description                                                     | Common Ops (avg)
List                      | Ordered, mutable, allows duplicates; backed by a dynamic array. | Index O(1); append O(1) amortized; insert/delete O(n)
Tuple                     | Ordered, immutable; use for fixed records, keys in dict.        | Access O(1)
Set                       | Unordered, unique items, hash-based.                            | Add/remove/membership O(1) average
Dict                      | Key-value map, hash-based.                                      | Lookup/insert/delete O(1) average
Deque (collections.deque) | Double-ended queue: fast appends/pops at both ends.             | append/pop O(1) at either end
Heap (heapq)              | Binary min-heap via list.                                       | push/pop O(log n)
Array (numpy.ndarray)     | Contiguous typed array, vectorized ops.                         | Element access O(1); vector ops compact & fast

Examples

List
nums = [1, 2, 3]
nums.append(4)
first_two = nums[:2]  # slicing

Dict
student = {'name': 'Alice', 'age': 21}
age = student.get('age')
student['grade'] = 'A'

Set
s = {1, 2, 3}
s.add(4)
if 2 in s:
    ...

Deque
from collections import deque
d = deque([1, 2, 3])
d.appendleft(0)
val = d.popleft()
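The Heap row in the table above has no snippet; a minimal sketch of heapq usage, including the push-negatives trick for a max-heap mentioned in the next section:

```python
import heapq

nums = [5, 1, 8, 3]

# min-heap: smallest element always at heap[0]
heap = []
for n in nums:
    heapq.heappush(heap, n)
smallest = heapq.heappop(heap)   # 1

# max-heap via negated values
max_heap = [-n for n in nums]
heapq.heapify(max_heap)
largest = -heapq.heappop(max_heap)   # 8
```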

2. Collections & Key Libraries for Data Engineering


collections: Counter, defaultdict, namedtuple, deque — useful for logs, counts, grouping.
heapq: min-heap. To simulate max-heap push negatives.
bisect: binary search helpers (bisect_left/right) for insertion points.
itertools: combinations, permutations, groupby, islice, chain — great for streaming data.
pandas: DataFrame/Series — core for ETL, aggregations, joins, resampling.
numpy: numerical arrays, vectorized ops—fast.
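A quick sketch tying Counter, defaultdict, and bisect together (the sample data is illustrative):

```python
from collections import Counter, defaultdict
import bisect

# Counter: top-N counts from a log-like list
events = ['login', 'click', 'login', 'buy', 'login']
top = Counter(events).most_common(1)   # [('login', 3)]

# defaultdict: grouping without key-existence checks
groups = defaultdict(list)
for user, amount in [('a', 10), ('b', 5), ('a', 7)]:
    groups[user].append(amount)

# bisect: insertion point in a sorted list, O(log n)
scores = [10, 20, 30, 40]
idx = bisect.bisect_left(scores, 25)   # 2
```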

Pandas — Quick Practical Examples


import pandas as pd

# read CSV in chunks (memory efficient)
for chunk in pd.read_csv('large.csv', chunksize=10_000):
    process(chunk)

# common ops
df = pd.read_parquet('data.parquet')
df = df.dropna(subset=['user_id'])
agg = df.groupby('country')['revenue'].sum().reset_index()

# merge
merged = df1.merge(df2, how='left', on='id')
3. Sorting Algorithms (with Python examples & complexity)
Selection Sort — O(n^2); Insertion Sort — O(n^2) (good for nearly-sorted input); Merge Sort — O(n log n), stable; Quick Sort — O(n log n) average, O(n^2) worst (the version below trades in-place partitioning for clarity); Heap Sort — O(n log n). Built-in sorted()/list.sort() uses Timsort (stable, O(n log n) worst-case, optimized for runs).
def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])
    i = j = 0
    merged = []
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

# Usage
print(merge_sort([5, 2, 9, 1]))  # [1, 2, 5, 9]

def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quick_sort(left) + middle + quick_sort(right)

# Usage
print(quick_sort([3, 6, 8, 10, 1, 2, 1]))  # [1, 1, 2, 3, 6, 8, 10]
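Insertion sort is cited above for nearly-sorted input but has no code; a minimal version:

```python
def insertion_sort(arr):
    # shift each element left until it sits in its sorted position
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        while j >= 0 and arr[j] > key:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key
    return arr

# Usage
print(insertion_sort([4, 2, 7, 1]))  # [1, 2, 4, 7]
```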

4. Searching & Graph Traversals


Binary Search (sorted array) — O(log n). DFS/BFS for graphs/trees: O(V+E). Use iterative (stack) or
recursive (watch recursion depth).
def binary_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

# Usage
print(binary_search([1, 2, 4, 5, 9], 5))  # 3

from collections import deque

def bfs(graph, start):
    visited = set([start])
    q = deque([start])
    order = []
    while q:
        node = q.popleft()
        order.append(node)
        for nb in graph.get(node, []):
            if nb not in visited:
                visited.add(nb)
                q.append(nb)
    return order

# Usage
graph = {'A': ['B', 'C'], 'B': ['D'], 'C': [], 'D': []}
print(bfs(graph, 'A'))  # ['A', 'B', 'C', 'D']
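DFS is listed above without code; an iterative (stack-based) sketch that mirrors the BFS version and avoids recursion-depth limits:

```python
def dfs(graph, start):
    visited = set()
    stack = [start]
    order = []
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # push neighbors in reverse so they pop in listed order
        for nb in reversed(graph.get(node, [])):
            if nb not in visited:
                stack.append(nb)
    return order

# Usage
graph = {'A': ['B', 'C'], 'B': ['D'], 'C': [], 'D': []}
print(dfs(graph, 'A'))  # ['A', 'B', 'D', 'C']
```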

5. Generators, Itertools, File I/O, Concurrency basics


Generators: lazy evaluation, memory efficient. itertools: groupby, islice, chain, tee. File I/O: use with open(...) as f; process large files in chunks. Concurrency: threading for I/O-bound work, multiprocessing for CPU-bound work, asyncio for async I/O.
# generator example
def read_lines(path):
    with open(path, 'r') as f:
        for line in f:
            yield line.strip()

# itertools example (groupby needs input sorted by the same key)
import itertools
for k, group in itertools.groupby(sorted(data, key=lambda x: x['user']),
                                  key=lambda x: x['user']):
    handle_group(k, list(group))
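The threading-for-I/O note above can be sketched with concurrent.futures; the fetch function here is a stand-in for a real I/O-bound call:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # stand-in for an I/O-bound call (HTTP request, S3 read, ...)
    return len(url)

urls = ['https://a.example', 'https://bb.example']

# threads overlap the waiting; map preserves input order
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, urls))
```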

6. Practical Data-Engineer Snippets


- Reading large CSV in chunks (pandas) and writing Parquet.
- Streaming from S3: use boto3, smart_open, or s3fs with pandas.
- Efficient joins: ensure indexing; use categorical dtypes for memory savings.
- Use the dtype argument in read_csv to reduce memory usage.
- Use Parquet for faster I/O and compression.
- Use vectorized operations in pandas (avoid row-wise loops).
# read csv in chunks and write each chunk to its own parquet file
# (a single fixed filename would be overwritten on every iteration)
import pandas as pd

for i, chunk in enumerate(pd.read_csv('big.csv', chunksize=100_000, dtype={'user_id': str})):
    chunk.to_parquet(f'out/part-{i:05d}.parquet', index=False,
                     compression='snappy', engine='pyarrow')
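The categorical-dtype tip above in action; the column name and values are illustrative:

```python
import pandas as pd

# low-cardinality string column: repeats a few distinct values many times
df = pd.DataFrame({'country': ['IN', 'US', 'IN', 'US', 'IN'] * 1000})

before = df['country'].memory_usage(deep=True)
df['country'] = df['country'].astype('category')
after = df['country'].memory_usage(deep=True)
# category stores each distinct string once plus small integer codes,
# so memory drops substantially for low-cardinality columns
```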

7. Interview Tips & Common Questions


• Explain tradeoffs (time vs memory); use Big-O.
• Talk about stability of sorting when relevant.
• For data engineering: discuss data formats (CSV/Parquet/ORC), partitioning, schema, null handling.
• Be ready to write code (two-pointer, sliding window) and small ETL tasks (reading, grouping, aggregating).
• Common questions: implement an LRU cache, merge k sorted lists, streaming median, deduplicate a large file (external sort).
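One common question from the list above, sketched with collections.OrderedDict (functools.lru_cache covers the decorator case, but interviews usually want the structure built by hand):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return -1
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

# Usage
cache = LRUCache(2)
cache.put('a', 1)
cache.put('b', 2)
cache.get('a')       # 1 ('a' is now most recent)
cache.put('c', 3)    # evicts 'b'
```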

Generated for: Freshers preparing for Data Engineer interviews — concise, practical, and example-driven. Good luck!
