10/16/25, 8:24 PM assignment4
Assignment4 (Score: 3.0 / 3.0)
1. Test cell (Score: 1.0 / 1.0)
2. Test cell (Score: 1.0 / 1.0)
3. Test cell (Score: 1.0 / 1.0)
Assignment 4
In [1]: import networkx as nx
import pandas as pd
import numpy as np
import pickle
Part 1 - Random Graph Identification
For the first part of this assignment you will analyze randomly generated graphs and determine which
algorithm created them.
In [2]: G1 = nx.read_gpickle("assets/A4_P1_G1")
G2 = nx.read_gpickle("assets/A4_P1_G2")
G3 = nx.read_gpickle("assets/A4_P1_G3")
G4 = nx.read_gpickle("assets/A4_P1_G4")
G5 = nx.read_gpickle("assets/A4_P1_G5")
P1_Graphs = [G1, G2, G3, G4, G5]
P1_Graphs is a list containing 5 networkx graphs. Each of these graphs was generated by one of three
possible algorithms:
Preferential Attachment ( 'PA' )
Small World with low probability of rewiring ( 'SW_L' )
Small World with high probability of rewiring ( 'SW_H' )
Analyze each of the 5 graphs using any methodology and determine which of the three algorithms
generated each graph.
The graph_identification function should return a list of length 5 where each element in the list is
either 'PA' , 'SW_L' , or 'SW_H' .
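One way to see what separates the three generators is to build a sample of each and compare a few summary statistics. The sketch below is only illustrative: the graph sizes, the rewiring probabilities (0.05 and 0.5), the seeds, and the helper name summarize are assumptions, not the settings used to create P1_Graphs.

```python
import networkx as nx
import numpy as np


def summarize(g):
    """Return clustering, degree spread, and hub ratio for one graph."""
    degrees = np.array([d for _, d in g.degree()])
    return {
        "clustering": nx.average_clustering(g),
        "degree_cv": degrees.std() / degrees.mean(),   # spread relative to the mean
        "hub_ratio": degrees.max() / degrees.mean(),   # how far the largest hub stands out
    }


# Illustrative parameters (assumptions, not the assignment's settings).
n = 1000
samples = {
    "PA": nx.barabasi_albert_graph(n, 5, seed=0),
    "SW_L": nx.watts_strogatz_graph(n, 10, 0.05, seed=0),
    "SW_H": nx.watts_strogatz_graph(n, 10, 0.5, seed=0),
}

stats = {name: summarize(g) for name, g in samples.items()}
for name, s in stats.items():
    print(name, {k: round(float(v), 3) for k, v in s.items()})
```

In samples like these, the preferential attachment graph tends to stand out through its hubs, while the two small-world graphs have similar degree spreads and differ mainly in clustering.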
https://www.coursera.org/api/rest/v1/executorruns/richfeedback?id=KXJ2DKdOEfCYSQr_4Hq04Q&feedbackType=HTML 1/8
In [3]: Student's answer (Top)
def graph_identification():
    # Analyze graphs to identify which algorithm generated them.
    # PA (Preferential Attachment): low clustering, heavy-tailed (power-law) degree distribution
    # SW_L (Small World, low rewiring): high clustering, fairly short average path
    # SW_H (Small World, high rewiring): low clustering, short average path
    results = []
    for g in P1_Graphs:
        # Calculate key metrics
        avg_clustering = nx.average_clustering(g)
        # Average path length is computed for inspection only;
        # the classification rules below do not use it.
        try:
            avg_path_length = nx.average_shortest_path_length(g)
        except nx.NetworkXError:
            # If the graph is disconnected, use the largest component
            largest_cc = max(nx.connected_components(g), key=len)
            subgraph = g.subgraph(largest_cc)
            avg_path_length = nx.average_shortest_path_length(subgraph)

        # Degree distribution analysis
        degrees = [d for n, d in g.degree()]
        avg_degree = np.mean(degrees)
        std_degree = np.std(degrees)
        max_degree = np.max(degrees)

        # Coefficient of variation of the degrees (std / mean)
        degree_cv = std_degree / avg_degree if avg_degree > 0 else 0

        # PA: very uneven degrees, with a few nodes of very high degree
        if degree_cv > 1.5 or max_degree > 3 * avg_degree:
            results.append('PA')
        # SW_L: high clustering coefficient
        elif avg_clustering > 0.4:
            results.append('SW_L')
        # SW_H: lower clustering
        else:
            results.append('SW_H')
    return results
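The fallback to the largest connected component can be checked on a small graph where the answer is easy to compute by hand. This is a minimal sketch; the helper name avg_path_length_largest_component and the toy graph are made up for illustration.

```python
import networkx as nx


def avg_path_length_largest_component(g):
    """Average shortest path length, restricted to the largest component if needed."""
    if nx.is_connected(g):
        return nx.average_shortest_path_length(g)
    largest_cc = max(nx.connected_components(g), key=len)
    return nx.average_shortest_path_length(g.subgraph(largest_cc))


toy = nx.path_graph(4)    # path 0-1-2-3
toy.add_edge(10, 11)      # a second, smaller component makes the graph disconnected

value = avg_path_length_largest_component(toy)
print(round(value, 4))
```

For the path 0-1-2-3 the six node pairs have distances 1, 2, 3, 1, 2, 1, so the average is 10 / 6.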
In [4]: Grade cell: cell-efb9da7e1c19accf Score: 1.0 / 1.0 (Top)
ans_one = graph_identification()
assert type(ans_one) == list, "You must return a list"
Part 2 - Company Emails
For the second part of this assignment you will be working with a company's email network where each
node corresponds to a person at the company, and each edge indicates that at least one email has been
sent between two people.
The network also contains the node attributes Department and ManagementSalary.
Department indicates the department in the company to which the person belongs, and
ManagementSalary indicates whether that person is receiving a management position salary.
In [5]: G = pickle.load(open('assets/email_prediction_NEW.txt', 'rb'))
print(f"Graph with {len(nx.nodes(G))} nodes and {len(nx.edges(G))} edges")
Graph with 1005 nodes and 16706 edges
Part 2A - Salary Prediction
Using network G, identify the people in the network with missing values for the node attribute
ManagementSalary and predict whether or not these individuals are receiving a management position
salary.
To accomplish this, you will need to create a matrix of node features of your choice using networkx, train a
sklearn classifier on nodes that have ManagementSalary data, and predict a probability of the node
receiving a management salary for nodes where ManagementSalary is missing.
Your predictions will need to be given as the probability that the corresponding employee is receiving a
management position salary.
The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).
Your grade will be based on the AUC score computed for your classifier. A model with an AUC of
0.75 or higher will receive full points.
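As a reminder of what AUC measures, it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below checks this on four made-up labels and scores using sklearn's roc_auc_score; the numbers are illustrative and unrelated to the email network.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)

# The same quantity computed by counting correctly ordered pairs.
pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]
pairwise = sum(p > q for p in pos for q in neg) / (len(pos) * len(neg))

print(auc, pairwise)
```

Three of the four pairs are ordered correctly here (only 0.35 versus 0.4 is not), so both values come out to 0.75.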
Using your trained classifier, return a Pandas series of length 252 with the data being the probability of
receiving management salary, and the index being the node id.
Example:
1 1.0
2 0.0
5 0.8
8 1.0
...
996 0.7
1000 0.5
1001 0.0
Length: 252, dtype: float64
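Because the grader's AUC is only computed on submission, it can help to estimate AUC locally by cross-validating on the labeled nodes. The sketch below shows the idea on a synthetic graph with synthetic labels. The graph, the labels, the helper name node_features, and the choice of logistic regression are all assumptions for illustration, not part of the assignment.

```python
import networkx as nx
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def node_features(graph):
    """One row per node; each column is a networkx node-level measure."""
    return pd.DataFrame({
        "degree": dict(graph.degree()),
        "clustering": nx.clustering(graph),
        "closeness": nx.closeness_centrality(graph),
        "pagerank": nx.pagerank(graph),
    })


# Synthetic stand-in for the email network (assumption).
toy_graph = nx.barabasi_albert_graph(300, 3, seed=0)
X = node_features(toy_graph)

# Synthetic labels: noisy function of degree, top 20% marked as 1 (assumption).
rng = np.random.default_rng(0)
score = X["degree"] + rng.normal(0, 2, size=len(X))
y = (score > score.quantile(0.8)).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
aucs = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(aucs.mean())
```

With the real network, X and y would come from the nodes whose ManagementSalary is known, and the cross-validated mean gives a rough idea of whether the 0.75 threshold is within reach.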
In [6]: list(G.nodes(data=True))[:5] # print the first 5 nodes
Out[6]: [(0, {'Department': 1, 'ManagementSalary': 0.0}),
(1, {'Department': 1, 'ManagementSalary': nan}),
(581, {'Department': 3, 'ManagementSalary': 0.0}),
(6, {'Department': 25, 'ManagementSalary': 1.0}),
(65, {'Department': 4, 'ManagementSalary': nan})]
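Node attributes stored this way can be pulled into a pandas Series with nx.get_node_attributes, which makes it easy to separate labeled nodes from those that need a prediction. The toy graph below simply mirrors four of the nodes printed above; it is not the real network.

```python
import networkx as nx
import pandas as pd

# Toy graph mirroring a few of the nodes shown in the output above.
toy = nx.Graph()
toy.add_node(0, Department=1, ManagementSalary=0.0)
toy.add_node(1, Department=1, ManagementSalary=float("nan"))
toy.add_node(6, Department=25, ManagementSalary=1.0)
toy.add_node(65, Department=4, ManagementSalary=float("nan"))

# Series indexed by node id, with NaN where the label is missing.
salary = pd.Series(nx.get_node_attributes(toy, "ManagementSalary"))

unlabeled = salary[salary.isnull()].index.tolist()   # nodes to predict
labeled = salary.dropna()                            # nodes usable for training
print(unlabeled, labeled.to_dict())
```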
In [7]: Student's answer (Top)
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import networkx as nx

def salary_predictions():
    # One row per node; each column is a networkx node-level feature.
    # Each column is built from a dict-backed Series, so values are aligned
    # by node id rather than by iteration order.
    df = pd.DataFrame(index=G.nodes())
    df['clustering'] = pd.Series(nx.clustering(G))
    df['degree'] = pd.Series(dict(G.degree()))
    df['degree_centrality'] = pd.Series(nx.degree_centrality(G))
    df['closeness'] = pd.Series(nx.closeness_centrality(G))
    df['betweenness'] = pd.Series(nx.betweenness_centrality(G))
    df['pr'] = pd.Series(nx.pagerank(G))

    # Target: 0.0 or 1.0 where known, NaN where it has to be predicted
    df['is_management'] = pd.Series(nx.get_node_attributes(G, 'ManagementSalary'))

    df_train = df[df['is_management'].notnull()]
    df_test = df[df['is_management'].isnull()]

    features = ['clustering', 'degree', 'degree_centrality',
                'closeness', 'betweenness', 'pr']
    X_train = df_train[features]
    Y_train = df_train['is_management']
    X_test = df_test[features]

    # Fit the scaler on the training rows only, then reuse it for the test rows
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    clf = MLPClassifier(hidden_layer_sizes=[10, 5], alpha=5,
                        random_state=0, solver='lbfgs', verbose=0)
    clf.fit(X_train_scaled, Y_train)

    # Column 1 of predict_proba is the probability of the positive class
    test_proba = clf.predict_proba(X_test_scaled)[:, 1]
    return pd.Series(test_proba, X_test.index)
In [8]: Grade cell: cell-bc9c23e7517908ab Score: 1.0 / 1.0 (Top)
ans_salary_preds = salary_predictions()
assert type(ans_salary_preds) == pd.core.series.Series, "You must return a Pandas series"
assert len(ans_salary_preds) == 252, "The series must be of length 252"
Part 2B - New Connections Prediction
For the last part of this assignment, you will predict future connections between employees of the network.
The future connections information has been loaded into the variable future_connections . The index is
a tuple indicating a pair of nodes that currently do not have a connection, and the Future Connection
column indicates if an edge between those two nodes will exist in the future, where a value of 1.0 indicates
a future connection.
In [9]: future_connections = pd.read_csv('assets/Future_Connections.csv', index_col=0, converters={0: eval})
future_connections.head(10)
Out[9]:
Future Connection
(6, 840) 0.0
(4, 197) 0.0
(620, 979) 0.0
(519, 872) 0.0
(382, 423) 0.0
(97, 226) 1.0
(349, 905) 0.0
(429, 860) 0.0
(309, 989) 0.0
(468, 880) 0.0
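The converters argument is what turns the index strings such as "(6, 840)" into tuples of node ids. The sketch below shows the same parsing on a small in-memory CSV, using ast.literal_eval in place of eval. That substitution is an assumption made here because literal_eval only accepts Python literals; the assignment's own loader uses eval, and the three rows are made up.

```python
import ast
import io

import pandas as pd

# Small in-memory stand-in for Future_Connections.csv (illustrative rows).
csv_text = (
    ',Future Connection\n'
    '"(6, 840)",0.0\n'
    '"(97, 226)",1.0\n'
    '"(107, 348)",\n'
)

toy_connections = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    converters={0: ast.literal_eval},   # parse "(6, 840)" into the tuple (6, 840)
)

first_pair = list(toy_connections.index)[0]
n_missing = int(toy_connections["Future Connection"].isnull().sum())
print(first_pair, n_missing)
```

The row with an empty value is read as NaN, which is how the edges that need a prediction are marked.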
Using network G and future_connections , identify the edges in future_connections with missing
values and predict whether or not these edges will have a future connection.
To accomplish this, you will need to:
1. Create a matrix of features of your choice for the edges found in future_connections using
networkx
2. Train a sklearn classifier on those edges in future_connections that have Future Connection
data
3. Predict a probability of the edge being a future connection for those edges in future_connections
where Future Connection is missing.
Your predictions will need to be given as the probability of the corresponding edge being a future
connection.
The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).
Your grade will be based on the AUC score computed for your classifier. A model with an AUC of
0.75 or higher will receive full points.
Using your trained classifier, return a series of length 122112 with the data being the probability of the
edge being a future connection, and the index being the edge as represented by a tuple of nodes.
Example:
(107, 348) 0.35
(542, 751) 0.40
(20, 426) 0.55
(50, 989) 0.35
...
(939, 940) 0.15
(555, 905) 0.35
(75, 101) 0.65
Length: 122112, dtype: float64
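For step 1, networkx provides several link-prediction scores that accept a whole list of node pairs at once, which avoids calling them one pair at a time. The sketch below builds a small feature table on a toy graph. The helper name pair_features, the toy graph, and the particular set of scores are assumptions for illustration, not the features the assignment requires.

```python
import networkx as nx
import pandas as pd


def pair_features(graph, pairs):
    """Build one row of link-prediction scores per node pair."""
    pairs = list(pairs)
    scorers = {
        "pref_attachment": nx.preferential_attachment,
        "jaccard": nx.jaccard_coefficient,
        "resource_allocation": nx.resource_allocation_index,
        "adamic_adar": nx.adamic_adar_index,
    }
    # Each scorer yields (u, v, score) triples in the order of the input pairs.
    data = {
        name: [score for _, _, score in fn(graph, pairs)]
        for name, fn in scorers.items()
    }
    data["common_neighbors"] = [
        len(list(nx.common_neighbors(graph, u, v))) for u, v in pairs
    ]
    # tupleize_cols=False keeps each pair as a single tuple label.
    return pd.DataFrame(data, index=pd.Index(pairs, tupleize_cols=False))


# Toy graph: a triangle 0-1-2 with node 3 attached to node 2.
toy = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])
features = pair_features(toy, [(0, 3), (1, 3)])
print(features)
```

Both pairs share the single common neighbor 2, so their Jaccard coefficient is 1/2 and their preferential attachment score is 2 * 1 = 2.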
In [10]: Student's answer (Top)
def new_connections_predictions():
    from sklearn.ensemble import GradientBoostingClassifier

    # Edge features, computed for every candidate pair in future_connections.
    # Note: this adds two columns to the global future_connections frame.
    future_connections['pref_attachment'] = [
        list(nx.preferential_attachment(G, [node_pair]))[0][2]
        for node_pair in future_connections.index]
    future_connections['comm_neighbors'] = [
        len(list(nx.common_neighbors(G, node_pair[0], node_pair[1])))
        for node_pair in future_connections.index]

    # Rows with a known label are used for training; the rest are predicted
    train_data = future_connections[~future_connections['Future Connection'].isnull()]
    test_data = future_connections[future_connections['Future Connection'].isnull()]

    clf = GradientBoostingClassifier()
    clf.fit(train_data[['pref_attachment', 'comm_neighbors']].values,
            train_data['Future Connection'].values)
    preds = clf.predict_proba(test_data[['pref_attachment', 'comm_neighbors']].values)[:, 1]
    return pd.Series(preds, index=test_data.index)

new_connections_predictions()
Out[10]: (107, 348) 0.031823
(542, 751) 0.012931
(20, 426) 0.543026
(50, 989) 0.013104
(942, 986) 0.013103
...
(165, 923) 0.013183
(673, 755) 0.013103
(939, 940) 0.013103
(555, 905) 0.012931
(75, 101) 0.017730
Length: 122112, dtype: float64
In [11]: Grade cell: cell-979b4a17d794f3d0 Score: 1.0 / 1.0 (Top)
ans_prob_preds = new_connections_predictions()
assert type(ans_prob_preds) == pd.core.series.Series, "You must return a Pandas series"
assert len(ans_prob_preds) == 122112, "The series must be of length 122112"
In [ ]:
This assignment was graded by mooc_adswpy:9154b96e4479, v1.37.030923