
Cheat Sheet - Exploratory Data Analysis (EDA) with Pandas


1. Data Loading
 Read CSV File: df = pd.read_csv('filename.csv')
 Read Excel File: df = pd.read_excel('filename.xlsx')
 Read from SQL Database: df = pd.read_sql(query, connection)

2. Basic Data Inspection


 Display Top Rows: df.head()
 Display Bottom Rows: df.tail()
 Display Data Types: df.dtypes
 Summary Statistics: df.describe()
 Concise Summary (index, columns, dtypes, non-null counts): df.info()

3. Data Cleaning
 Check for Missing Values: df.isnull().sum()
 Fill Missing Values: df.fillna(value)
 Drop Missing Values: df.dropna()
 Rename Columns: df.rename(columns={'old_name': 'new_name'})
 Drop Columns: df.drop(columns=['column_name'])
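
A minimal sketch that chains these cleaning steps on a made-up DataFrame (all names are illustrative):

    import pandas as pd

    # Hypothetical raw data with a missing value and an unneeded column
    df = pd.DataFrame({'old_name': [1.0, None, 3.0], 'extra': ['a', 'b', 'c']})

    print(df.isnull().sum())                       # missing values per column
    df = df.rename(columns={'old_name': 'value'})  # rename returns a new frame
    df['value'] = df['value'].fillna(0)            # fill missing values with 0
    df = df.drop(columns=['extra'])                # drop the unneeded column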

4. Data Transformation
 Apply Function: df['column'].apply(lambda x: function(x))
 Group By and Aggregate: df.groupby('column').agg({'column': 'sum'})
 Pivot Tables: df.pivot_table(index='column1', values='column2', aggfunc='mean')
 Merge DataFrames: pd.merge(df1, df2, on='column')
 Concatenate DataFrames: pd.concat([df1, df2])
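
The group-aggregate and merge steps often appear together; a sketch with invented frames and column names:

    import pandas as pd

    sales = pd.DataFrame({'store': ['A', 'A', 'B'], 'revenue': [10, 20, 5]})
    stores = pd.DataFrame({'store': ['A', 'B'], 'region': ['east', 'west']})

    totals = sales.groupby('store').agg({'revenue': 'sum'}).reset_index()
    merged = pd.merge(totals, stores, on='store')  # attach each store's region
    print(merged)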

5. Data Visualization Integration


 Histogram: df['column'].hist()
 Boxplot: df.boxplot(column=['column1', 'column2'])
 Scatter Plot: df.plot.scatter(x='col1', y='col2')
 Line Plot: df.plot.line()
 Bar Chart: df['column'].value_counts().plot.bar()
6. Statistical Analysis
 Correlation Matrix (numeric columns): df.corr(numeric_only=True)
 Covariance Matrix (numeric columns): df.cov(numeric_only=True)
 Value Counts: df['column'].value_counts()
 Unique Values in Column: df['column'].unique()
 Number of Unique Values: df['column'].nunique()

7. Indexing and Selection


 Select Column: df['column']
 Select Multiple Columns: df[['col1', 'col2']]
 Select Rows by Position: df.iloc[0:5]
 Select Rows by Label: df.loc[0:5]
 Conditional Selection: df[df['column'] > value]
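
Note that .iloc is position-based and excludes the stop value, while .loc is label-based and includes it. A quick sketch with a default integer index:

    import pandas as pd

    df = pd.DataFrame({'x': range(10)})

    print(df.iloc[0:5])      # positions 0-4 (stop excluded)
    print(df.loc[0:5])       # labels 0-5 (stop included)
    print(df[df['x'] > 7])   # conditional selection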

8. Data Formatting and Conversion


 Convert Data Types: df['column'].astype('type')
 String Operations: df['column'].str.lower()
 Datetime Conversion: pd.to_datetime(df['column'])
 Setting Index: df.set_index('column')

9. Advanced Data Transformation


 Lambda Functions: df.apply(lambda x: x + 1)
 Pivot Longer/Wider Format: df.melt(id_vars=['col1'])
 Stack/Unstack: df.stack(), df.unstack()
 Cross Tabulations: pd.crosstab(df['col1'], df['col2'])

10. Handling Time Series Data


 Set Datetime Index: df.set_index(pd.to_datetime(df['date']))
 Resampling Data: df.resample('M').mean()
 Rolling Window Operations: df.rolling(window=5).mean()
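
A small end-to-end sketch on synthetic daily data (recent pandas versions prefer the 'ME' alias over 'M' for month-end):

    import numpy as np
    import pandas as pd

    idx = pd.date_range('2023-01-01', periods=90, freq='D')
    df = pd.DataFrame({'value': np.arange(90.0)}, index=idx)

    monthly = df.resample('M').mean()       # downsample daily rows to month-ends
    smoothed = df.rolling(window=5).mean()  # 5-day rolling mean; first 4 rows are NaN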

11. File Export


 Write to CSV: df.to_csv('filename.csv')
 Write to Excel: df.to_excel('filename.xlsx')
 Write to SQL Database: df.to_sql('table_name', connection)

12. Data Exploration Techniques


 Profile Report (with pandas-profiling): from pandas_profiling import ProfileReport; ProfileReport(df)
 Pairplot (with seaborn): import seaborn as sns; sns.pairplot(df)
 Heatmap for Correlation (with seaborn): sns.heatmap(df.corr(), annot=True)

13. Advanced Data Queries


 Query Function: df.query('column > value')
 Filtering with isin: df[df['column'].isin([value1, value2])]

14. Memory Optimization


 Inspect Memory Usage: df.memory_usage(deep=True)
 Change Data Types to Save Memory: df['column'].astype('category')

15. Multi-Index Operations


 Creating MultiIndex: df.set_index(['col1', 'col2'])
 Slicing on MultiIndex: df.loc[(slice('index1_start', 'index1_end'), slice('index2_start', 'index2_end')), :]
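
A sketch of slice-based selection on an illustrative two-level index; slicing like this requires the MultiIndex to be sorted:

    import pandas as pd

    idx = pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['col1', 'col2'])
    df = pd.DataFrame({'val': [1, 2, 3, 4]}, index=idx).sort_index()

    # Rows with first level 'a'..'b' and second level 1..1, all columns
    print(df.loc[(slice('a', 'b'), slice(1, 1)), :])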

16. Data Merging Techniques


 Outer Join: pd.merge(df1, df2, on='column', how='outer')
 Inner Join: pd.merge(df1, df2, on='column', how='inner')
 Left Join: pd.merge(df1, df2, on='column', how='left')
 Right Join: pd.merge(df1, df2, on='column', how='right')

17. Dealing with Duplicates


 Finding Duplicates: df.duplicated()
 Removing Duplicates: df.drop_duplicates()

18. Custom Operations with Apply


 Custom Apply Functions: df.apply(lambda row: custom_func(row['col1'], row['col2']), axis=1)

19. Handling Large Datasets


 Chunking Large Files: pd.read_csv('large_file.csv', chunksize=1000)
 Iterating Through Data Chunks: for chunk in pd.read_csv('file.csv', chunksize=500): process(chunk)
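
A sketch of a chunked aggregation; the file name and the 'amount' column are placeholders for your own data:

    import pandas as pd

    total = 0.0
    for chunk in pd.read_csv('large_file.csv', chunksize=1000):  # assumes this file exists
        total += chunk['amount'].sum()  # 'amount' is a hypothetical numeric column
    print(total)
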
20. Integration with Matplotlib for Custom Plots
 Custom Plotting: import matplotlib.pyplot as plt; df.plot(); plt.show()

21. Specialized Data Types Handling


 Working with Categorical Data: df['column'].astype('category')
 Dealing with Sparse Data: pd.arrays.SparseArray(df['column'])

22. Performance Tuning


 Using Swifter for Faster Apply: import swifter; df['column'].swifter.apply(lambda x: func(x))
 Parallel Processing with Dask: import dask.dataframe as dd; ddf = dd.from_pandas(df, npartitions=10)

23. Visualization Enhancement


 Customize Plot Style: plt.style.use('ggplot')
 Histogram with Bins Specification: df['column'].hist(bins=20)
 Boxplot Grouped by Category: df.boxplot(column='num_column', by='cat_column')

24. Advanced Grouping and Aggregation


 Group by Multiple Columns: df.groupby(['col1', 'col2']).mean()
 Aggregate with Multiple Functions: df.groupby('col').agg(['mean', 'sum'])
 Transform Function: df.groupby('col').transform(lambda x: x - x.mean())
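
Unlike agg, transform returns a result aligned to the original rows, which makes within-group centering a one-liner. A sketch:

    import pandas as pd

    df = pd.DataFrame({'col': ['a', 'a', 'b'], 'x': [1.0, 3.0, 10.0]})

    # Subtract each group's mean from its members; output keeps the original length
    df['x_centered'] = df.groupby('col')['x'].transform(lambda x: x - x.mean())
    print(df)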

25. Time Series Specific Operations


 Time-Based Grouping: df.groupby(pd.Grouper(key='date_col', freq='M')).sum()
 Shifting Series for Lag Analysis: df['column'].shift(1)
 Resample Time Series Data: df.resample('M', on='date_col').mean()

26. Text Data Specific Operations


 String Contains: df[df['column'].str.contains('substring')]
 String Split: df['column'].str.split(' ', expand=True)
 Regular Expression Extraction: df['column'].str.extract(r'(regex)')
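
A sketch on invented strings showing all three string accessors:

    import pandas as pd

    df = pd.DataFrame({'column': ['order-123', 'order-456', 'refund-789']})

    print(df[df['column'].str.contains('order')])     # rows whose text contains 'order'
    parts = df['column'].str.split('-', expand=True)  # split into separate columns
    ids = df['column'].str.extract(r'(\d+)')          # capture the digits with a regex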

27. Data Normalization and Standardization


 Min-Max Normalization: (df['column'] - df['column'].min()) / (df['column'].max() - df['column'].min())
 Z-Score Standardization: (df['column'] - df['column'].mean()) / df['column'].std()
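
Both rescalings are plain column arithmetic; a sketch on a hypothetical numeric column:

    import pandas as pd

    df = pd.DataFrame({'column': [10.0, 20.0, 30.0]})

    col = df['column']
    df['minmax'] = (col - col.min()) / (col.max() - col.min())  # maps to [0, 1]
    df['zscore'] = (col - col.mean()) / col.std()               # mean 0, std 1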

28. Working with JSON and XML


 Reading JSON: df = pd.read_json('filename.json')
 Reading XML: df = pd.read_xml('filename.xml')

29. Advanced File Handling


 Read CSV with Specific Delimiter: df = pd.read_csv('filename.csv', delimiter=';')
 Writing to JSON: df.to_json('filename.json')

30. Dealing with Missing Data


 Interpolate Missing Values: df['column'].interpolate()
 Forward Fill Missing Values: df['column'].ffill()
 Backward Fill Missing Values: df['column'].bfill()
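
A sketch comparing the three strategies on one short series:

    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, np.nan, 3.0, np.nan])

    print(s.interpolate())  # 1.0, 2.0, 3.0, 3.0 (linear fill between known points)
    print(s.ffill())        # 1.0, 1.0, 3.0, 3.0 (carry last valid value forward)
    print(s.bfill())        # 1.0, 3.0, 3.0, NaN (pull next valid value backward)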

31. Data Reshaping


 Wide to Long Format: pd.wide_to_long(df, ['col'], i='id_col', j='year')
 Long to Wide Format: df.pivot(index='id_col', columns='year', values='col')

32. Categorical Data Operations


 Convert Column to Categorical: df['column'] = df['column'].astype('category')
 Order Categories: df['column'].cat.set_categories(['cat1', 'cat2'], ordered=True)
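
Making categories ordered enables sorting and comparisons; note that set_categories returns a new Series rather than modifying in place. A sketch:

    import pandas as pd

    df = pd.DataFrame({'column': ['low', 'high', 'low']})
    df['column'] = df['column'].astype('category')
    df['column'] = df['column'].cat.set_categories(['low', 'high'], ordered=True)

    print(df['column'] > 'low')  # comparisons now respect the declared order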

33. Advanced Indexing


 Reset Index: df.reset_index(drop=True)
 Set Multiple Indexes: df.set_index(['col1', 'col2'])
 MultiIndex Slicing: df.xs(key='value', level='level_name')

34. Efficient Computations


 Use of eval() for Efficient Operations: df.eval('col1 + col2')
 Query Method for Filtering: df.query('col1 < col2')
35. Integration with SciPy and StatsModels
 Linear Regression (with statsmodels): import statsmodels.api as sm; sm.OLS(y, X).fit()
 Kurtosis and Skewness (with SciPy): from scipy.stats import kurtosis, skew; kurtosis(df['column']), skew(df['column'])
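
A sketch of an OLS fit on synthetic data; note that statsmodels does not add an intercept unless you include a constant column:

    import numpy as np
    import statsmodels.api as sm

    x = np.arange(10, dtype=float)
    y = 2.0 * x + 1.0                 # exact line, so the fit recovers it
    X = sm.add_constant(x)            # prepend the intercept column

    model = sm.OLS(y, X).fit()
    print(model.params)               # approximately [1.0, 2.0]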

36. Handling Large Data Efficiently


 Dask Integration for Large Data: import dask.dataframe as dd; ddf = dd.from_pandas(df, npartitions=10)
 Sampling Data for Quick Insights: df.sample(n=1000)

37. Advanced Data Merging


 SQL-like Joins: pd.merge(df1, df2, how='left', on='col')
 Concatenating Along a Different Axis: pd.concat([df1, df2], axis=1)

38. Profiling Data for Quick Insights


 Using Pandas Profiling for Quick Analysis: from pandas_profiling import ProfileReport; report = ProfileReport(df)

39. Working with External Data Sources


 Reading Data from HTML: dfs = pd.read_html('http://example.com')
 Connecting to a SQL Database: from sqlalchemy import create_engine; engine = create_engine('sqlite:///db.sqlite'); df = pd.read_sql('SELECT * FROM table_name', engine)
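
A sketch of the SQLAlchemy round trip using an in-memory SQLite database; the table name is illustrative:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine('sqlite://')  # in-memory database for the sketch
    pd.DataFrame({'x': [1, 2]}).to_sql('table_name', engine, index=False)
    df = pd.read_sql('SELECT * FROM table_name', engine)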

40. Data Quality Checks


 Assert Statement for Data Validation: assert df.notnull().all().all(), "There are missing values in the dataframe"
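
The same assert pattern extends to other invariants; a sketch with illustrative columns:

    import pandas as pd

    df = pd.DataFrame({'id': [1, 2, 3], 'amount': [10.0, 20.0, 30.0]})

    assert df.notnull().all().all(), "There are missing values in the dataframe"
    assert df['id'].is_unique, "Duplicate ids found"
    assert (df['amount'] >= 0).all(), "Negative amounts found"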
