

Assignment4 (Score: 3.0 / 3.0)


1. Test cell (Score: 1.0 / 1.0)
2. Test cell (Score: 1.0 / 1.0)
3. Test cell (Score: 1.0 / 1.0)

Assignment 4
In [1]: import networkx as nx
import pandas as pd
import numpy as np
import pickle

Part 1 - Random Graph Identification


For the first part of this assignment you will analyze randomly generated graphs and determine which
algorithm created them.

In [2]: G1 = nx.read_gpickle("assets/A4_P1_G1")
G2 = nx.read_gpickle("assets/A4_P1_G2")
G3 = nx.read_gpickle("assets/A4_P1_G3")
G4 = nx.read_gpickle("assets/A4_P1_G4")
G5 = nx.read_gpickle("assets/A4_P1_G5")
P1_Graphs = [G1, G2, G3, G4, G5]

P1_Graphs is a list containing 5 networkx graphs. Each of these graphs was generated by one of three possible algorithms:

Preferential Attachment ( 'PA' )
Small World with low probability of rewiring ( 'SW_L' )
Small World with high probability of rewiring ( 'SW_H' )

Analyze each of the 5 graphs using any methodology and determine which of the three algorithms generated each graph.

The graph_identification function should return a list of length 5 where each element in the list is either 'PA' , 'SW_L' , or 'SW_H' .
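
For reference, NetworkX can generate all three families directly. The snippet below is only an illustration of what each generator produces; the sizes and parameters are assumptions, not the settings used to create P1_Graphs.

import networkx as nx

n = 1000  # illustrative size only

# Preferential Attachment ('PA'): new nodes attach preferentially to high-degree
# nodes, producing a heavy-tailed degree distribution with a few hubs.
pa_example = nx.barabasi_albert_graph(n, m=4, seed=0)

# Small World ('SW_L' / 'SW_H'): a ring lattice whose edges are rewired with
# probability p; low p preserves high clustering, high p destroys it.
sw_low_example = nx.connected_watts_strogatz_graph(n, k=6, p=0.05, seed=0)   # 'SW_L'
sw_high_example = nx.connected_watts_strogatz_graph(n, k=6, p=0.5, seed=0)   # 'SW_H'

for name, g in [('PA', pa_example), ('SW_L', sw_low_example), ('SW_H', sw_high_example)]:
    print(name,
          round(nx.average_clustering(g), 3),
          round(nx.average_shortest_path_length(g), 3))

Comparing these reference values against the same metrics computed on G1-G5 is one way to decide which label to assign to each graph.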


In [3]: Student's answer (Top)

def graph_identification():
    # Classify each graph by its clustering, path-length, and degree statistics:
    # PA (Preferential Attachment): heavy-tailed (power-law) degree distribution with a few hubs
    # SW_L (Small World, low rewiring): high clustering, relatively low average shortest path
    # SW_H (Small World, high rewiring): low clustering, low average shortest path

    results = []

    for g in P1_Graphs:
        # Calculate key metrics
        avg_clustering = nx.average_clustering(g)
        try:
            avg_path_length = nx.average_shortest_path_length(g)
        except nx.NetworkXError:
            # If the graph is disconnected, use the largest component
            largest_cc = max(nx.connected_components(g), key=len)
            subgraph = g.subgraph(largest_cc)
            avg_path_length = nx.average_shortest_path_length(subgraph)

        # Degree distribution analysis
        degrees = [d for n, d in g.degree()]
        avg_degree = np.mean(degrees)
        std_degree = np.std(degrees)
        max_degree = np.max(degrees)

        # Classification logic:
        # PA: very high degree spread, a few nodes with very high degree
        # SW_L: high clustering coefficient
        # SW_H: lower clustering, low path length
        degree_variance = std_degree / avg_degree if avg_degree > 0 else 0

        # PA has a very high (power-law) spread of degrees
        if degree_variance > 1.5 or max_degree > 3 * avg_degree:
            results.append('PA')
        # SW_L has a high clustering coefficient
        elif avg_clustering > 0.4:
            results.append('SW_L')
        # SW_H has lower clustering
        else:
            results.append('SW_H')

    return results
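
To see how the thresholds above separate the supplied graphs, the same metrics can be printed for each graph; this is only a quick sanity check, and the resulting values are not reproduced here.

for i, g in enumerate(P1_Graphs):
    degrees = [d for n, d in g.degree()]
    print(i,
          round(nx.average_clustering(g), 3),            # clustering coefficient
          round(np.std(degrees) / np.mean(degrees), 2),  # normalized degree spread
          max(degrees))                                  # largest hub degree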


In [4]: Grade cell: cell-efb9da7e1c19accf Score: 1.0 / 1.0 (Top)

ans_one = graph_identification()
assert type(ans_one) == list, "You must return a list"

Part 2 - Company Emails


For the second part of this assignment you will be working with a company's email network, where each node corresponds to a person at the company and each edge indicates that at least one email has been sent between those two people.

The network also contains the node attributes Department and ManagementSalary .

Department indicates the department of the company to which the person belongs, and ManagementSalary indicates whether that person is receiving a management position salary.

In [5]: G = pickle.load(open('assets/email_prediction_NEW.txt', 'rb'))

print(f"Graph with {len(nx.nodes(G))} nodes and {len(nx.edges(G))}


edges")

Graph with 1005 nodes and 16706 edges


Part 2A - Salary Prediction


Using network G , identify the people in the network with missing values for the node attribute ManagementSalary and predict whether or not these individuals are receiving a management position salary.

To accomplish this, you will need to create a matrix of node features of your choice using networkx, train a sklearn classifier on nodes that have ManagementSalary data, and predict a probability of receiving a management salary for nodes where ManagementSalary is missing.

Your predictions will need to be given as the probability that the corresponding employee is receiving a management position salary.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.75 or higher will receive full points.

Using your trained classifier, return a Pandas series of length 252 with the data being the probability of receiving a management salary, and the index being the node id.

Example:

1 1.0
2 0.0
5 0.8
8 1.0
...
996 0.7
1000 0.5
1001 0.0
Length: 252, dtype: float64

In [6]: list(G.nodes(data=True))[:5] # print the first 5 nodes

Out[6]: [(0, {'Department': 1, 'ManagementSalary': 0.0}),
         (1, {'Department': 1, 'ManagementSalary': nan}),
         (581, {'Department': 3, 'ManagementSalary': 0.0}),
         (6, {'Department': 25, 'ManagementSalary': 1.0}),
         (65, {'Department': 4, 'ManagementSalary': nan})]
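
As a quick check before building features (a minimal sketch, assuming G has been loaded as above), one can count the nodes whose ManagementSalary attribute is missing; per the assignment this should match the length of the series to be returned, 252.

import pandas as pd

# Nodes with a missing ManagementSalary label are the ones to predict.
missing_nodes = [n for n, data in G.nodes(data=True)
                 if pd.isnull(data['ManagementSalary'])]
print(len(missing_nodes))  # expected to be 252 per the assignment spec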


In [7]: Student's answer (Top)

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import networkx as nx

def salary_predictions():
    def is_management(node):
        managementSalary = node[1]['ManagementSalary']
        if managementSalary == 0:
            return 0
        elif managementSalary == 1:
            return 1
        else:
            return None

    # Node-level features computed with networkx
    df = pd.DataFrame(index=G.nodes())
    df['clustering'] = pd.Series(nx.clustering(G))
    df['degree'] = [val for node, val in G.degree()]
    df['degree_centrality'] = [val for node, val in nx.degree_centrality(G).items()]
    df['closeness'] = [val for node, val in nx.closeness_centrality(G).items()]
    df['betweenness'] = [val for node, val in nx.betweenness_centrality(G).items()]
    df['pr'] = [val for node, val in nx.pagerank(G).items()]
    df['is_management'] = pd.Series([is_management(node) for node in G.nodes(data=True)],
                                    index=df.index)

    # Train on labelled nodes, predict on nodes with a missing label
    df_train = df[~pd.isnull(df['is_management'])]
    df_test = df[pd.isnull(df['is_management'])]
    features = ['clustering', 'degree', 'degree_centrality',
                'closeness', 'betweenness', 'pr']
    X_train = df_train[features]
    Y_train = df_train['is_management']
    X_test = df_test[features]
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    clf = MLPClassifier(hidden_layer_sizes=[10, 5], alpha=5,
                        random_state=0, solver='lbfgs', verbose=0)
    clf.fit(X_train_scaled, Y_train)
    test_proba = clf.predict_proba(X_test_scaled)[:, 1]
    return pd.Series(test_proba, X_test.index)
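
Because the grade depends on the AUC, it can be worth estimating it locally before submitting. The helper below is a minimal sketch, not part of the assignment: estimate_salary_auc is a hypothetical name, the hold-out fraction is an arbitrary choice, and it assumes the same features and classifier settings as salary_predictions above.

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

def estimate_salary_auc(X, y):
    # Hold out a quarter of the labelled nodes and score the model on it.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
    scaler = MinMaxScaler()
    X_tr_scaled = scaler.fit_transform(X_tr)
    X_val_scaled = scaler.transform(X_val)
    clf = MLPClassifier(hidden_layer_sizes=[10, 5], alpha=5,
                        random_state=0, solver='lbfgs')
    clf.fit(X_tr_scaled, y_tr)
    return roc_auc_score(y_val, clf.predict_proba(X_val_scaled)[:, 1])

# e.g. estimate_salary_auc(df_train[features], df_train['is_management'].astype(int)),
# assuming the feature-building steps above are run at the top level of a cell.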

In [8]: Grade cell: cell-bc9c23e7517908ab Score: 1.0 / 1.0 (Top)

ans_salary_preds = salary_predictions()
assert type(ans_salary_preds) == pd.core.series.Series, "You must return a Pandas series"
assert len(ans_salary_preds) == 252, "The series must be of length 252"


Part 2B - New Connections Prediction


For the last part of this assignment, you will predict future connections between employees of the network.
The future connections information has been loaded into the variable future_connections . The index is
a tuple indicating a pair of nodes that currently do not have a connection, and the Future Connection
column indicates if an edge between those two nodes will exist in the future, where a value of 1.0 indicates
a future connection.

In [9]: future_connections = pd.read_csv('assets/Future_Connections.csv', index_col=0, converters={0: eval})
future_connections.head(10)

Out[9]:
             Future Connection
(6, 840)                   0.0
(4, 197)                   0.0
(620, 979)                 0.0
(519, 872)                 0.0
(382, 423)                 0.0
(97, 226)                  1.0
(349, 905)                 0.0
(429, 860)                 0.0
(309, 989)                 0.0
(468, 880)                 0.0


Using network G and future_connections , identify the edges in future_connections with missing
values and predict whether or not these edges will have a future connection.

To accomplish this, you will need to:

1. Create a matrix of features of your choice for the edges found in future_connections using
Networkx
2. Train a sklearn classifier on those edges in future_connections that have Future Connection
data
3. Predict a probability of the edge being a future connection for those edges in future_connections
where Future Connection is missing.

Your predictions will need to be given as the probability of the corresponding edge being a future
connection.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.75 or higher will receive full points.

Using your trained classifier, return a series of length 122112 with the data being the probability of the
edge being a future connection, and the index being the edge as represented by a tuple of nodes.

Example:

(107, 348) 0.35
(542, 751) 0.40
(20, 426) 0.55
(50, 989) 0.35
...
(939, 940) 0.15
(555, 905) 0.35
(75, 101) 0.65
Length: 122112, dtype: float64
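
NetworkX ships several classic link-prediction scores that can serve as edge features for this task. The snippet below is a sketch of computing a few of them for the node pairs in future_connections; which scores to include is a modelling choice, not something the assignment prescribes.

# The candidate node pairs are the index of future_connections.
pairs = list(future_connections.index)

# Each networkx helper yields (u, v, score) triples for the requested pairs.
future_connections['jaccard'] = [score for u, v, score in nx.jaccard_coefficient(G, pairs)]
future_connections['resource_alloc'] = [score for u, v, score in nx.resource_allocation_index(G, pairs)]
future_connections['pref_attach'] = [score for u, v, score in nx.preferential_attachment(G, pairs)]
future_connections['common_nbrs'] = [len(list(nx.common_neighbors(G, u, v))) for u, v in pairs]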


In [10]: Student's answer (Top)

def new_connections_predictions():

    from sklearn.ensemble import GradientBoostingClassifier

    # Edge features: preferential attachment score and number of common neighbours
    future_connections['pref_attachment'] = [list(nx.preferential_attachment(G, [node_pair]))[0][2]
                                             for node_pair in future_connections.index]
    future_connections['comm_neighbors'] = [len(list(nx.common_neighbors(G, node_pair[0], node_pair[1])))
                                            for node_pair in future_connections.index]

    # Train on labelled pairs, predict on pairs where 'Future Connection' is missing
    train_data = future_connections[~future_connections['Future Connection'].isnull()]
    test_data = future_connections[future_connections['Future Connection'].isnull()]
    clf = GradientBoostingClassifier()
    clf.fit(train_data[['pref_attachment', 'comm_neighbors']].values,
            train_data['Future Connection'].values)
    preds = clf.predict_proba(test_data[['pref_attachment', 'comm_neighbors']].values)[:, 1]
    return pd.Series(preds, index=test_data.index)

new_connections_predictions()

Out[10]: (107, 348) 0.031823
(542, 751) 0.012931
(20, 426) 0.543026
(50, 989) 0.013104
(942, 986) 0.013103
...
(165, 923) 0.013183
(673, 755) 0.013103
(939, 940) 0.013103
(555, 905) 0.012931
(75, 101) 0.017730
Length: 122112, dtype: float64

In [11]: Grade cell: cell-979b4a17d794f3d0 Score: 1.0 / 1.0 (Top)

ans_prob_preds = new_connections_predictions()
assert type(ans_prob_preds) == pd.core.series.Series, "You must return a Pandas series"
assert len(ans_prob_preds) == 122112, "The series must be of length 122112"

In [ ]:

This assignment was graded by mooc_adswpy:9154b96e4479, v1.37.030923

