GraphFrames lets you create a graph whose edges have src and dst ids that do not exist in the id column of the vertices. I want to improve instantiation so that it complains and dies when an INNER join of the edges against the vertices, on both src and dst, does not produce the same number of edges as the raw edge count.
Something like this:
# Sanity test that all edges have valid ids
edge_count = g.edges.count()
valid_edge_count = (
    g.edges.join(g.vertices, on=g.edges.src == g.vertices.id)
    .select("src", "dst", "relationship")
    .join(g.vertices, on=g.edges.dst == g.vertices.id)
    .count()
)

# Just up and die if we have edges that point to non-existent nodes
assert (
    edge_count == valid_edge_count
), f"Edge count {edge_count} != valid edge count {valid_edge_count}"
print(f"Edge count: {edge_count:,} == Valid edge count: {valid_edge_count:,}")
For a valid GraphFrame of the stats.meta Stack Exchange knowledge graph, this prints:
Edge count: 97,104 == Valid edge count: 97,104
Since I will be calculating the edge count anyway, I can take this opportunity to print the node and edge counts upon instantiation as well.
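
One possible variation on the check above, offered as a rough sketch rather than a proposed implementation: instead of comparing counts, a left anti join can return the offending edges themselves, which makes the error message more useful and avoids a false alarm if the vertices happen to contain duplicate ids. The function names `find_dangling_edges` and `validate_edges` are hypothetical, not part of GraphFrames. The sketch only assumes that `vertices` has an `id` column and that `edges` has `src` and `dst` columns, and it uses only standard PySpark DataFrame methods.

```python
def find_dangling_edges(vertices, edges):
    """Return the distinct edges whose src or dst is missing from vertices.id.

    `vertices` and `edges` are expected to be Spark DataFrames, for
    example g.vertices and g.edges of a GraphFrame.
    """
    ids = vertices.select("id").distinct()

    # left_anti keeps only the edges that have NO match on the join column
    bad_src = edges.join(ids.withColumnRenamed("id", "src"), on="src", how="left_anti")
    bad_dst = edges.join(ids.withColumnRenamed("id", "dst"), on="dst", how="left_anti")

    # Joining on a column name moves that column to the front, so the two
    # results have different column orders; unionByName matches columns by
    # name instead of by position.
    return bad_src.unionByName(bad_dst).distinct()


def validate_edges(vertices, edges, sample_size=5):
    """Hypothetical helper: raise ValueError if any edge points at a missing vertex."""
    dangling = find_dangling_edges(vertices, edges)
    dangling_count = dangling.count()
    if dangling_count > 0:
        sample = dangling.limit(sample_size).collect()
        raise ValueError(
            f"{dangling_count:,} edges reference vertex ids that do not exist, "
            f"for example: {sample}"
        )
```

With a GraphFrame `g`, this would be called as `validate_edges(g.vertices, g.edges)`. Note that `distinct()` collapses exact duplicate edges, so the reported number is a count of distinct dangling edges, which may be lower than the difference between the two counts in the original check.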