import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import Accuracy
• Imports TensorFlow and Keras: The code imports TensorFlow and Keras, which are
libraries for building and training neural networks.
• Model Layers: It imports key layers such as Dense (fully connected layer), Activation
(for activation functions like ReLU, Sigmoid), and Dropout (to prevent overfitting).
• Optimizer: The Adam optimizer is imported, which is an efficient gradient descent
optimization algorithm commonly used in training deep learning models.
• Metrics: Accuracy is imported as a performance metric to evaluate the model’s correctness
during training and testing.
• Foundation for Model Building: This setup allows you to easily define, compile, and
train a neural network model using Keras, as sketched below.
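For context, a minimal sketch of how these imports fit together (the layer sizes, input dimension,
and loss shown here are hypothetical placeholders, not the project's actual architecture):
# Hypothetical model sketch: layer sizes and input_dim are placeholders
model = keras.Sequential([
    Dense(50, input_dim=11),
    Activation('relu'),
    Dropout(0.3),
    Dense(2),
    Activation('softmax')
])
model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])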
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score, roc_curve, confusion_matrix
• Import Data Handling Library: pandas is imported for handling and manipulating
datasets, especially in tabular form.
• Visualization Libraries: matplotlib.pyplot and seaborn are used for data
visualization, where matplotlib provides basic plotting, and seaborn enhances the
aesthetics of the plots.
• Numerical Operations: numpy is imported for efficient numerical computations, such as
array operations.
• Machine Learning Metrics: The metrics module from sklearn is imported to evaluate
model performance, including functions like confusion matrix, ROC curve, and classification
reports.
• Data Preprocessing: LabelEncoder from sklearn is imported to convert categorical data
into numerical form, a common step in preprocessing for machine learning models, as
illustrated in the short sketch below.
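As a brief illustration (the example labels here are hypothetical, not taken from the dataset),
LabelEncoder maps category strings to integer codes:
# Hypothetical example: classes are sorted alphabetically, so 'fake' -> 0, 'real' -> 1
le = LabelEncoder()
encoded = le.fit_transform(['real', 'fake', 'fake', 'real'])
# encoded is now array([1, 0, 0, 1])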
pip install jupyterthemes
• Installs Jupyter Themes: The command installs the jupyterthemes package, which
allows customization of the appearance of Jupyter Notebooks.
• Theme Customization: After installation, you can apply various themes, fonts, and color
schemes to personalize your Jupyter environment.
• Improves Visual Aesthetics: It helps enhance the readability and aesthetics of Jupyter
notebooks, making it visually appealing during presentations.
• Easy Application: Once installed, themes can be applied using simple commands like jt
-t <theme-name> in the terminal.
from jupyterthemes import jtplot
jtplot.style(theme='monokai', context='notebook', ticks=True, grid=False)
• Import Jupyter Plot Styling: Imports jtplot from the jupyterthemes package to
customize plot styles in Jupyter Notebooks.
• Apply Monokai Theme: The code sets the plot theme to 'monokai', a dark color scheme
that enhances visual contrast.
• Customize Plot Context: context = 'notebook' sets the visual context of the plots,
optimizing them for use in notebooks.
• Plot Display Settings: ticks = True ensures axis ticks are shown, and grid = False
disables the background grid for a cleaner look.
#Load the training and testing datasets (CSV filenames below are placeholders)
instagram_df_test = pd.read_csv('insta_test.csv')
instagram_df_train = pd.read_csv('insta_train.csv')
• Load Data with Pandas: The code uses the pandas library to read CSV files, which is a
common format for storing tabular data.
• Training Dataset: instagram_df_train is assigned the data from the training CSV file,
which typically contains examples used to train a machine learning model.
• Testing Dataset: instagram_df_test is assigned the data from the testing CSV file, used
to evaluate the model's performance after training.
• DataFrames Creation: Both variables create Pandas DataFrames, allowing for easy
manipulation and analysis of the data.
• Assumption of CSV Structure: The code assumes that both CSV files are structured
correctly with appropriate headers for the features needed in the analysis.
• Preparation for Analysis: Loading the datasets is a crucial step that prepares them for
preprocessing, feature extraction, and model training.
#Getting dataframe info
instagram_df_train.info()
• Display DataFrame Information: The code uses the .info() method of the Pandas
DataFrame to display information about instagram_df_train.
• Overview of DataFrame Structure: It provides a summary of the DataFrame, including
the number of entries (rows) and the number of columns.
• Data Types: The method lists the data types of each column (e.g., integer, float, object),
helping to identify the nature of the data.
• Non-null Count: It shows the count of non-null entries for each column, which is useful
for detecting missing values.
• Memory Usage: The output includes the memory usage of the DataFrame, indicating how
much memory is consumed by the data, which is important for optimizing performance.
• Initial Data Exploration: Using .info() is a fundamental step in data exploration,
providing insights that inform further data cleaning and preprocessing.
#Statistical summary of the dataframe
instagram_df_train.describe()
• Statistical Summary: The code uses the .describe() method of the Pandas DataFrame
to generate a statistical summary of instagram_df_train.
• Numerical Features Analysis: It computes statistics for numerical columns, including
count, mean, standard deviation, minimum, maximum, and quartiles.
• Count of Entries: The output includes the number of non-null entries for each numerical
column, helping identify missing values.
• Understanding Data Distribution: The summary statistics provide insights into the
distribution and central tendencies of the numerical features, useful for understanding the
data.
• Outlier Detection: The minimum and maximum values help in identifying potential
outliers that may need further investigation or preprocessing; a quick check is sketched below.
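A possible follow-up (the '#followers' column name is an assumption about the dataset, not
confirmed by the snippets above) is to flag potential outliers with the interquartile-range rule:
# Sketch: flag rows whose follower count falls outside 1.5 * IQR of the middle 50%
q1 = instagram_df_train['#followers'].quantile(0.25)
q3 = instagram_df_train['#followers'].quantile(0.75)
iqr = q3 - q1
outliers = instagram_df_train[(instagram_df_train['#followers'] < q1 - 1.5 * iqr) |
                              (instagram_df_train['#followers'] > q3 + 1.5 * iqr)]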
#Check if null values exist
instagram_df_train.isnull().sum()
• Check for Null Values: The code uses the .isnull().sum() method on the
instagram_df_train DataFrame to check for missing values in the dataset.
• Returns Null Count: This method returns a Series with the count of null (missing) values
for each column in the DataFrame.
• Data Quality Assessment: By examining the output, you can assess the quality of the data
and identify columns that may require cleaning or imputation due to missing values.
• Critical Preprocessing Step: Identifying null values is a crucial step in data
preprocessing, as it informs decisions on how to handle them, whether by removing, filling,
or flagging them; the two most common remedies are sketched below.
• Understand Impact on Model: Knowing the presence of null values helps evaluate their
potential impact on model training and prediction accuracy.
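If missing values were found, a minimal sketch of the usual remedies (not part of the original
notebook) would be:
# Sketch: drop rows with any missing values, or impute numeric gaps with the column median
instagram_df_train = instagram_df_train.dropna()
# instagram_df_train = instagram_df_train.fillna(instagram_df_train.median(numeric_only=True))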
#Number of unique values in the profile pic column
instagram_df_train['profile pic'].value_counts()
• Count Value Occurrences: The code uses the .value_counts() method on the 'profile
pic' column of the instagram_df_train DataFrame to count how many times each unique value occurs.
• Profile Picture Analysis: This method helps identify how many distinct profile pictures
are present in the dataset, which can be relevant for identifying patterns or behaviors.
• Frequency of Each Value: The output lists each unique profile picture along with the
number of occurrences in the dataset, providing insights into common profile pictures.
• Data Distribution Insight: Understanding the distribution of profile pictures can be useful
for feature engineering, especially if certain images are associated with fake accounts.
• Spotting Anomalies: If there are very few unique values or an unexpected distribution, it
might indicate potential issues or anomalies in the dataset.
• Guidance for Further Analysis: The results can inform further analysis, such as
evaluating the impact of profile pictures on account legitimacy or user engagement.
#Number of accounts having description length over 50
(instagram_df_train['description length'] > 50).sum()
• Check Description Length: The code checks the length of descriptions in the
'description length' column of the instagram_df_train DataFrame to find accounts
with descriptions longer than 50 characters.
• Boolean Condition: The expression (instagram_df_train['description length'] >
50) creates a boolean Series, where each entry is True if the condition is met and False
otherwise.
• Count True Values: The .sum() method counts the number of True values in the boolean
Series, effectively giving the total number of accounts with description lengths greater than
50.
• Insight into Account Profiles: This count provides insights into how many users prefer
longer descriptions, which may correlate with engagement or account type.
• Data Exploration: Analyzing description lengths can help in understanding user behavior
and preferences on the platform, aiding in user classification.
• Guidance for Further Analysis: The result can inform subsequent steps, such as
examining the impact of description length on account legitimacy or user interaction metrics.
#Visualizing the number of fake and real accounts (using seaborn library)
sns.countplot(instagram_df_train['fake'])
• Visualization of Account Types: The code uses Seaborn’s countplot() function to
visualize the number of fake and real accounts in the 'fake' column of the
instagram_df_train DataFrame.
• Countplot Functionality: The countplot() function automatically counts the
occurrences of each category (fake or real) in the specified column and displays them as bars.
• Categorical Data Representation: This visualization provides a clear representation of
the distribution of account types, making it easy to compare the counts of fake and real
accounts.
• Quick Insights: The plot offers immediate visual insights into the dataset's balance
between fake and real accounts, which is crucial for assessing model training needs.
• Identifying Class Imbalance: If one category significantly outweighs the other, it may
indicate class imbalance, which could impact the performance of machine learning models;
a quick numerical check is sketched after this list.
• Enhancing Data Interpretation: Using visualizations like this helps convey findings
more effectively, making it easier for stakeholders to understand the distribution of account
types in the dataset.
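To complement the plot, the class balance can also be quantified directly (a small sketch, not
part of the original notebook):
# Proportion of fake vs. real accounts in the training set
print(instagram_df_train['fake'].value_counts(normalize=True))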
#Visualizing the private column
sns.countplot(instagram_df_train['private'], palette="PuBu")
• Visualization of Privacy Settings: The code uses Seaborn’s countplot() function to
visualize the distribution of accounts based on their privacy settings, represented by the
'private' column in the instagram_df_train DataFrame.
• Categorical Count Representation: The countplot() function counts the occurrences of
each category (private or public) and displays them as bars, allowing for easy comparison.
• Color Palette: The palette = "PuBu" parameter specifies a color palette for the bars,
using shades of blue to enhance visual appeal and clarity in the plot.
• Understanding User Preferences: This visualization provides insights into user
preferences regarding account privacy, which may relate to their engagement and behavior on
the platform.
• Immediate Insights: The plot allows for quick visual assessment of how many accounts
are set to private versus public, aiding in understanding overall account settings.
#Visualizing the profile pic feature
sns.countplot(instagram_df_train['profile pic'], palette="Pastel2")
• Profile Picture Visualization: The code uses Seaborn’s countplot() function to
visualize the distribution of unique profile pictures in the 'profile pic' column of the
instagram_df_train DataFrame.
• Categorical Count Representation: countplot() automatically counts how many times
each profile picture appears and displays these counts as bars, facilitating comparison among
different pictures.
• Color Palette: The palette = "Pastel2" parameter specifies a soft, pastel color scheme
for the bars, enhancing the visual appeal of the plot.
• Identifying Common Profile Pictures: This visualization helps in identifying the most
common profile pictures used by users, which can be relevant for analyzing user behavior or
account authenticity.
• Understanding Data Distribution: By examining the distribution of profile pictures,
insights can be gained about user tendencies, such as whether certain images are more
frequently associated with fake accounts.
#Visualizing the length of usernames (histogram)
plt.figure(figsize=(20, 10))
sns.histplot(instagram_df_train['nums/length username'], kde=True)
• Histogram Visualization: The code visualizes the distribution of username lengths in the
'nums/length username' column of the instagram_df_train DataFrame using a
histogram.
• Figure Size Specification: The plt.figure(figsize=(20, 10)) call sets the size of
the figure to 20 inches wide by 10 inches tall, ensuring the plot is large and easy to read.
• Density Plot Overlay: The kde=True argument adds a kernel density estimate (KDE)
overlay, a smoothed line that represents the distribution of username lengths alongside the
histogram.
• Understanding Username Length Distribution: This visualization helps identify patterns
in username lengths, such as common lengths or outliers, which can be relevant for analyzing
account behavior or legitimacy.
• Assessing Data Characteristics: By examining the distribution, insights can be gained
about the potential impact of username length on user engagement or the likelihood of
accounts being fake.
• Informing Further Analysis: The insights drawn from this histogram can guide further
investigations, such as studying correlations between username length and other account
features.
#Correlation heatmap
plt.figure(figsize=(15, 15))
cm = instagram_df_train.corr()
ax = plt.subplot()
sns.heatmap(cm, annot=True, ax=ax)
• Correlation Matrix Calculation: The code computes the correlation matrix of the
instagram_df_train DataFrame using the .corr() method, which quantifies the
relationships between numerical features.
• Figure Size Specification: plt.figure(figsize=(15, 15)) sets the size of the heatmap
to 15 inches by 15 inches, ensuring clear visibility of the heatmap and annotations.
• Creating the Heatmap: The sns.heatmap() function is used to create a heatmap
visualizing the correlation matrix, where color intensity represents the strength of the
correlation between features.
• Annotations on the Heatmap: The annot=True parameter displays the correlation
coefficients on the heatmap, allowing for easy interpretation of the relationships between
features.
plt.plot(epochs_hist.history['loss'])
plt.plot(epochs_hist.history['val_loss'])
plt.title('Model Loss Progression During Training/Validation')
plt.xlabel('Epoch Number')
plt.ylabel('Training and Validation Losses')
plt.legend(['Training Loss', 'Validation Loss'])
• Plotting Training and Validation Loss: The code plots the training loss and validation
loss from the model's training history, allowing for visual assessment of the model's
performance over epochs.
• Accessing Loss History: epochs_hist.history['loss'] retrieves the training loss
values, while epochs_hist.history['val_loss'] retrieves the validation loss values for
each epoch.
• Title of the Plot: plt.title() sets the title of the plot to "Model Loss Progression During
Training/Validation," providing context for the visualization.
• Axis Labels: plt.xlabel() and plt.ylabel() specify the labels for the x-axis (Epoch
Number) and y-axis (Training and Validation Losses), enhancing the plot's clarity.
• Legend for Clarity: The plt.legend() function adds a legend to differentiate between
the training loss and validation loss, making it easier to interpret the plot.
• Assessing Overfitting or Underfitting: By analyzing the loss curves, you can identify
potential issues like overfitting (when validation loss increases while training loss decreases)
or underfitting (high loss values for both training and validation); a sketch of how
epochs_hist is produced follows this list.
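Note that epochs_hist is the History object returned by model.fit(); the actual training call is
not reproduced in this section, but under assumed hyperparameters (epoch count, batch size,
validation split) and assumed variable names X_train and y_train it would look like:
# Sketch: hypothetical hyperparameters and variable names, not the project's exact call
epochs_hist = model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)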
predicted_value = []
test = []
for i in predicted:
    predicted_value.append(np.argmax(i))
for i in Y_test:
    test.append(np.argmax(i))
• Initialization of Lists: Two empty lists, predicted_value and test, are initialized to
store the predicted class labels and the true class labels from the test set, respectively.
• Iterating Over Predictions: The first for loop iterates through the predicted array
(assumed to be the output of a model), applying np.argmax() to each element. This function
retrieves the index of the maximum value, effectively converting predicted probabilities to
class labels.
• Storing Predicted Class Labels: Each class label obtained from the np.argmax()
function is appended to the predicted_value list, which will hold the final predicted classes
for evaluation.
• Iterating Over True Labels: The second for loop iterates through the Y_test array
(assumed to be the true labels of the test data), similarly using np.argmax() to convert one-
hot encoded labels into class indices.
• Storing True Class Labels: The class labels from Y_test are appended to the test list,
which will be used to compare against the predicted labels for performance evaluation; an
equivalent vectorized form is sketched below.
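As a design note, the same conversion can be written without explicit loops, assuming
predicted and Y_test are 2-D arrays of class probabilities and one-hot labels respectively:
# Vectorized equivalent of the two loops above
predicted_value = np.argmax(predicted, axis=1)
test = np.argmax(Y_test, axis=1)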
plt.figure(figsize=(10, 10))
con_matrix = confusion_matrix(test, predicted_value)
sns.heatmap(con_matrix, annot=True)
• Figure Size Specification: The code sets up a plot with a specified size of 10 inches by 10
inches using plt.figure(figsize=(10, 10)), ensuring the heatmap will be clearly visible.
• Confusion Matrix Calculation: The confusion_matrix() function computes the
confusion matrix using the true labels (test) and the predicted labels (predicted_value),
summarizing the performance of the classification model.
• Heatmap Visualization: The sns.heatmap() function visualizes the confusion matrix as
a heatmap, where color intensity represents the count of true positives, false positives, true
negatives, and false negatives.
• Annotations for Clarity: The annot=True argument in sns.heatmap() displays the
numerical values in each cell of the heatmap, allowing for easy interpretation of the
confusion matrix.
• Assessing Model Performance: The heatmap provides a visual representation of the
classification performance, making it easier to identify which classes are being correctly
predicted and which are being misclassified; a complementary text summary is sketched below.
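Since classification_report and accuracy_score were imported earlier but do not appear in the
snippets above, a short sketch (using the test and predicted_value lists built previously) of a
complementary text summary:
# Per-class precision, recall, and F1, plus overall accuracy
print(classification_report(test, predicted_value))
print('Accuracy:', accuracy_score(test, predicted_value))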