import random
%matplotlib inline
import networkx as nx
A partition of a graph is a separation of its nodes into disjoint groups. Consider the following graph:
G = nx.Graph()
nx.add_cycle(G, [0, 1, 2, 3])
nx.add_cycle(G, [4, 5, 6, 7])
G.add_edge(0, 7)
nx.draw(G, with_labels=True)
The following is an example of a partition of these nodes:
partition = [
{1, 2, 3},
{4, 5, 6},
{0, 7},
]
Observe that every node in the graph is in exactly one of the sets in the partition. Formally, a partition is a list of sets such that every node is in exactly one set. NetworkX can verify that our partition is valid:
nx.community.is_partition(G, partition)
True
When developing community detection algorithms, we often make use of a partition map, which is a dictionary mapping node names to a partition index. This is useful for quickly comparing if two nodes are in the same cluster in the partition:
partition_map = {}
for idx, cluster_nodes in enumerate(partition):
for node in cluster_nodes:
partition_map[node] = idx
partition_map
{1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 0: 2, 7: 2}
In this dictionary, the keys are the node names and two nodes will have the same value if they are in the same partition:
partition_map[0] == partition_map[7]
True
We can visualize our partition by drawing the graph with nodes colored by their partition membership:
node_colors = [partition_map[n] for n in G.nodes]
nx.draw(G, node_color=node_colors, with_labels=True)
There are two trivial partitions: one consisting of a single set containing every node, and one consisting of N singleton sets, one per node. A valid partition thus contains between 1 and N sets.
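As a quick check, here is a sketch of both trivial partitions for the graph above, verified with the same NetworkX helper:
# The whole-graph partition: every node in a single community.
whole_graph_partition = [set(G.nodes)]
# The singleton partition: every node in its own community.
singleton_partition = [{n} for n in G.nodes]
nx.community.is_partition(G, whole_graph_partition), nx.community.is_partition(G, singleton_partition)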
Feel free to experiment by changing the partition above and running the subsequent cells.
At a high level, network community detection consists of finding a partition that achieves good separation between the groups of nodes. Before we get into how to find good partitions of a graph, we need an objective -- a way to measure how good the partition is. Modularity is one such objective function.
The modularity of a graph partition compares the number of intra-group edges with a random baseline. Higher modularity scores correspond to a higher proportion of intra-group edges, therefore fewer inter-group edges and better separation of groups.
For weighted undirected networks, as described in the text, we have \begin{equation} Q_w=\frac{1}{W}\sum_C \left(W_C-\frac{s_C^2}{4W}\right), \label{eq:wmodul} \end{equation} where $W$ is the total weight of the links in the network, $W_C$ is the total weight of the internal links of community $C$, and $s_C$ is the total strength of the nodes in community $C$.
The total weight $W$ is half the total strength for the same reason that the number of edges $L$ is half the total degree. While this formula may look a bit complicated, it's straightforward to write code to compute the sum:
def modularity(G, partition):
W = sum(G.edges[v, w].get('weight', 1) for v, w in G.edges)
summation = 0
for cluster_nodes in partition:
s_c = sum(G.degree(n, weight='weight') for n in cluster_nodes)
# Use subgraph to count only internal links
C = G.subgraph(cluster_nodes)
W_c = sum(C.edges[v, w].get('weight', 1) for v, w in C.edges)
summation += W_c - s_c ** 2 / (4 * W)
return summation / W
modularity(G, partition)
0.2222222222222222
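To see where this number comes from, we can evaluate the sum by hand (every edge has weight 1, so $W = 9$). The cluster $\{1, 2, 3\}$ has $W_C = 2$ and $s_C = 6$, contributing $2 - 6^2/(4 \cdot 9) = 1$; the cluster $\{4, 5, 6\}$ contributes another $1$; and the cluster $\{0, 7\}$ has $W_C = 1$ and $s_C = 6$, contributing $0$. The total is $2$, so $Q_w = 2/9 \approx 0.22$.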
Let's compare this to a partition we would suspect to have higher modularity:
partition_2 = [
{0, 1, 2, 3},
{4, 5, 6, 7},
]
modularity(G, partition_2)
0.3888888888888889
NetworkX provides a modularity function that is more efficient than ours:
nx.community.quality.modularity(G, partition_2)
0.3888888888888889
When writing and testing community-detection algorithms, it helps to make use of benchmark networks: graphs with a known, "natural" community structure. Perhaps the most famous benchmark graph is Zachary's Karate Club. It contains 34 nodes, representing members of a karate club whose interactions were monitored over a period of three years by researchers. Links in this graph connect individuals interacting outside club activities, a proxy for social ties.
During the course of the study, a conflict between the instructor Mr. Hi (node 0) and the president, or Officer (node 33) led to a split of the club into separate groups led by Mr. Hi and Officer. In this case we know whom each member of the group followed after the split, providing empirical community labels: those members who followed Mr. Hi are said to be one community and those following the Officer make up the other.
For this graph, we assume that the post-split group composition was largely driven by the social ties: members of the same friend groups would want to be part of the same club after the split. We thus expect a good community-detection algorithm to predict the post-split group composition with high accuracy.
Zachary's karate club is such a popular benchmark graph that it has its own function in NetworkX:
K = nx.karate_club_graph()
nx.draw(K, with_labels=True)
Each node in a NetworkX graph has a dictionary of attributes associated with it. This dictionary can hold arbitrary data about a node. We can get the attributes for a single node by giving the node name to the nodes
object.
Each node in this graph has a 'club'
attribute, indicating whether the member followed the instructor or the president after the split:
K.nodes[0]
{'club': 'Mr. Hi'}
K.nodes[9]
{'club': 'Officer'}
We can visualize these labels by coloring each node according to its 'club'
attribute:
K = nx.karate_club_graph()
club_color = {
'Mr. Hi': 'orange',
'Officer': 'lightblue',
}
node_colors = [club_color[K.nodes[n]['club']] for n in K.nodes]
nx.draw(K, node_color=node_colors, with_labels=True)
This separation looks good, in that there are relatively few inter-community links as opposed to intra-community links. Let's create a graph partition based on these labels and measure its modularity.
We can do this by creating a dictionary of two sets, one for each value of the nodes' 'club'
attribute, then assigning the nodes to the corresponding set.
groups = {
'Mr. Hi': set(),
'Officer': set(),
}
for n in K.nodes:
club = K.nodes[n]['club']
groups[club].add(n)
groups
{'Mr. Hi': {0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 16, 17, 19, 21}, 'Officer': {9, 14, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33}}
By using the dictionary's .values()
method, we can get a list of sets that define our partition:
empirical_partition = list(groups.values())
empirical_partition
[{0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 16, 17, 19, 21}, {9, 14, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33}]
nx.community.is_partition(K, empirical_partition)
True
Since this is indeed a valid partition, we can compute its modularity:
nx.community.quality.modularity(K, empirical_partition)
0.39143756676224206
This is a relatively high modularity, which is what we expect.
For the sake of comparison, let's generate a random partition of this network and check its modularity. We would expect a modularity close to zero in this case.
First we generate a sample of 17 nodes, half the total number of nodes, and assign them to one community. Our second community then includes the nodes in the graph not in the first community. We can use some set arithmetic to do this concisely:
random_nodes = random.sample(list(K.nodes), 17)
random_partition = [set(random_nodes),
set(K.nodes) - set(random_nodes)]
random_partition
[{1, 3, 4, 7, 9, 10, 12, 17, 18, 20, 21, 23, 25, 26, 28, 29, 33}, {0, 2, 5, 6, 8, 11, 13, 14, 15, 16, 19, 22, 24, 27, 30, 31, 32}]
We can visualize this partition and observe that the communities are much less natural-looking, as we would expect from a random assignment.
random_node_colors = ['orange' if n in random_nodes else 'lightblue' for n in K.nodes]
nx.draw(K, node_color=random_node_colors)
And finally we can test the modularity of this partition:
nx.community.quality.modularity(K, random_partition)
0.004488296696088884
Since this is a random process the modularity won't be exactly zero, but it should be fairly close. Go ahead and repeat the process of generating a random partition and testing its modularity -- it will fluctuate around its mean value of zero.
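Here is a sketch that automates the experiment: it draws many random partitions and averages their modularity, which should come out close to zero.
n_trials = 100
random_modularities = []
for _ in range(n_trials):
    nodes = set(random.sample(list(K.nodes), 17))
    trial_partition = [nodes, set(K.nodes) - nodes]
    random_modularities.append(nx.community.quality.modularity(K, trial_partition))
# The average over many random 50/50 splits should be close to zero.
sum(random_modularities) / n_trials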
Our task in this part will be to implement the Girvan-Newman clustering algorithm. Since NetworkX can do the heavy lifting for us -- computing betweenness centrality -- the code part of the task is relatively straightforward. Most of our effort here is spent interpreting and explaining intermediate results.
Recall from the text the Girvan-Newman clustering algorithm:
1. Compute the betweenness centrality of every edge in the graph.
2. Remove the edge with the highest betweenness.
3. Recompute the betweenness of the remaining edges and repeat from step 2 until no edges remain.
The connected components of the graph at each step define a partition of the nodes, and the partition in this sequence with the highest modularity is the output of the algorithm.
During this process, the number of connected components in the graph will increase monotonically as clusters are broken up. Since we are removing one link at a time, the number of connected components can increase by at most one between steps in the sequence: a single edge connects at most two nodes, so removing it can split at most one component into two.
We hope that the resulting partition of the graph will approximate its underlying community structure. We'll use the Karate Club graph here because we know the ground-truth community labels and can compare the result obtained from the algorithm.
G = nx.karate_club_graph()
nx.draw(G)
nx.edge_betweenness_centrality(G)
{(0, 1): 0.025252525252525245, (0, 2): 0.0777876807288572, (0, 3): 0.02049910873440285, (0, 4): 0.0522875816993464, (0, 5): 0.07813428401663694, (0, 6): 0.07813428401663695, (0, 7): 0.0228206434088787, (0, 8): 0.07423959482783014, (0, 10): 0.0522875816993464, (0, 11): 0.058823529411764705, (0, 12): 0.04652406417112298, (0, 13): 0.04237189825425121, (0, 17): 0.04012392835922248, (0, 19): 0.045936960642843, (0, 21): 0.040123928359222474, (0, 31): 0.1272599949070537, (1, 2): 0.023232323232323233, (1, 3): 0.0077243018419489, (1, 7): 0.007422969187675069, (1, 13): 0.01240556828792123, (1, 17): 0.01869960105254222, (1, 19): 0.014633732280791102, (1, 21): 0.01869960105254222, (1, 30): 0.032280791104320514, (2, 3): 0.022430184194890075, (2, 7): 0.025214328155504617, (2, 8): 0.009175791528732704, (2, 9): 0.030803836686189627, (2, 13): 0.007630931160342923, (2, 27): 0.04119203236850296, (2, 28): 0.02278244631185807, (2, 32): 0.06898678663384543, (3, 7): 0.003365588659706307, (3, 12): 0.012299465240641705, (3, 13): 0.01492233256939139, (4, 6): 0.0047534165181224, (4, 10): 0.0029708853238265, (5, 6): 0.0029708853238265003, (5, 10): 0.0047534165181224, (5, 16): 0.029411764705882353, (6, 16): 0.029411764705882353, (8, 30): 0.00980392156862745, (8, 32): 0.0304416716181422, (8, 33): 0.04043657867187279, (9, 33): 0.029615482556659026, (13, 33): 0.06782389723566191, (14, 32): 0.024083977025153497, (14, 33): 0.03473955238661121, (15, 32): 0.024083977025153497, (15, 33): 0.03473955238661121, (18, 32): 0.024083977025153497, (18, 33): 0.03473955238661121, (19, 33): 0.05938233879410351, (20, 32): 0.024083977025153497, (20, 33): 0.03473955238661121, (22, 32): 0.024083977025153493, (22, 33): 0.03473955238661121, (23, 25): 0.019776193305605066, (23, 27): 0.010536739948504653, (23, 29): 0.00665478312537136, (23, 32): 0.022341057635175278, (23, 33): 0.03266983561101209, (24, 25): 0.0042186571598336305, (24, 27): 0.018657159833630418, (24, 31): 0.040106951871657755, (25, 31): 0.04205783323430383, (26, 29): 0.004532722179781003, (26, 33): 0.0542908072319837, (27, 33): 0.030477039300568713, (28, 31): 0.0148544266191325, (28, 33): 0.024564977506153975, (29, 32): 0.023328523328523323, (29, 33): 0.029807882749059215, (30, 32): 0.01705288175876411, (30, 33): 0.02681436210847975, (31, 32): 0.04143394731630026, (31, 33): 0.05339388280564752, (32, 33): 0.008225108225108224}
The resulting dictionary has edge tuples as the keys, and each associated value is the betweenness centrality of that edge. The algorithm to compute the edge betweenness of all edges in a graph costs about the same as calculating it for a single edge, so we'll make use of this dictionary with the computed values for every edge.
Once computed for all edges, we can easily get the associated betweenness for a single edge. For example, to get the edge betweenness of the edge between nodes 0 and 1:
my_edge_betweenness = nx.edge_betweenness_centrality(G)
my_edge_betweenness[0, 1]
0.025252525252525245
Recall that dictionaries also have the .get
method. This will be used in the next step.
my_edge_betweenness.get((0, 1))
0.025252525252525245
Given this dictionary of betweenness values for each edge, we can make use of Python's builtin max
function to give us the key in this dictionary with the greatest value. Since there is a key in the dictionary for each edge in the graph, the following two expressions are equivalent, but the second one is probably more explicit as to what we're doing with this statement.
I'm using the name my_edge_betweenness
to make clear that this is a dictionary we've named and not a NetworkX function.
max(my_edge_betweenness, key=my_edge_betweenness.get)
(0, 31)
max(G.edges(), key=my_edge_betweenness.get)
(0, 31)
This is then the edge we want to remove at this step in the process:
my_edge_betweenness = nx.edge_betweenness_centrality(G)
most_valuable_edge = max(G.edges(), key=my_edge_betweenness.get)
G.remove_edge(*most_valuable_edge)
The "splat" in the last statement above G.remove_edge(*most_valuable_edge)
performs tuple unpacking into the arguments of a function. For example, if our most valuable edge is (0, 31)
,
G.remove_edge(*most_valuable_edge)
is the same as
G.remove_edge(most_valuable_edge[0], most_valuable_edge[1])
or
G.remove_edge(0, 31)
This one is almost a gimme because the nx.connected_components()
function gives us almost exactly what we want:
nx.connected_components(G)
<generator object connected_components at 0x7fbd5acd5b30>
We just have to remember to ask for it in a list:
list(nx.connected_components(G))
[{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33}]
Remember: a partition is a list of sets where every node is in exactly one of these sets. This is just what we have here, although it's a bit boring since we've only removed one edge and so there is still one connected component. If you like, you can try running the previous two cells a few times until you have more than one connected component so you can see what that looks like.
Note that this feature whereby the connected components correspond exactly to our putative community labels is particular to the Girvan-Newman algorithm: other clustering algorithms may use different ways of generating their partitions.
This implies that we need a loop to repeat this process $L$ times, once for each edge, and that we should keep track of the partitions generated. Straightforward stuff. We'll start with a fresh Karate Club graph since we removed some edges above:
G = nx.karate_club_graph()
partition_sequence = []
for _ in range(G.number_of_edges()):
my_edge_betweenness = nx.edge_betweenness_centrality(G)
most_valuable_edge = max(G.edges(), key=my_edge_betweenness.get)
G.remove_edge(*most_valuable_edge)
my_partition = list(nx.connected_components(G))
partition_sequence.append(my_partition)
Note the idiomatic construction of this for
loop. Using _
as the name for the loop variable tells the reader that we don't expect to do anything with the loop variable -- we just want to perform a task a specific number of times. One might be tempted to use a while
loop here, but that way lie dragons: a mistake in a while
loop can lead to infinite loops which are a headache.
If we've done this right, we should have a partition for each step of the process, i.e. one for each edge in the graph:
len(partition_sequence), nx.karate_club_graph().number_of_edges()
(78, 78)
Since we started with a connected graph, removing one edge probably doesn't disconnect the graph, so our first partition probably only has one community:
len(partition_sequence[0])
1
...and the last partition should also be trivial, with each node in its own community:
len(partition_sequence[-1]), nx.karate_club_graph().number_of_nodes()
(34, 34)
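We can also verify the earlier claim that the number of communities never decreases and grows by at most one with each edge removal; the following check should evaluate to True.
community_counts = [len(p) for p in partition_sequence]
# Each removal either leaves the components unchanged or splits one component in two.
all(0 <= later - earlier <= 1 for earlier, later in zip(community_counts, community_counts[1:]))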
We now have a sequence of partitions and a function to calculate the modularity of a partition. This is a great time to use a list comprehension!
G = nx.karate_club_graph()
modularity_sequence = [modularity(G, p) for p in partition_sequence]
modularity_sequence
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.34766027623170476, 0.34766027623170476, 0.34766027623170476, 0.3423192968647515, 0.3423192968647515, 0.3423192968647515, 0.3423192968647515, 0.3580611307884035, 0.3580611307884035, 0.3580611307884035, 0.3580611307884035, 0.3580611307884035, 0.3580611307884035, 0.38497217068645645, 0.37578006409175246, 0.37578006409175246, 0.3594760218136842, 0.3594760218136842, 0.3470699574595679, 0.3470699574595679, 0.333249002080171, 0.333249002080171, 0.3134405277262421, 0.3134405277262421, 0.3122598901819681, 0.3122598901819681, 0.3036862127771219, 0.3036862127771219, 0.2942973332583722, 0.2942973332583722, 0.2827158411573995, 0.2827158411573995, 0.27116245947414774, 0.27116245947414774, 0.2544648713479881, 0.2544648713479881, 0.2397537527407657, 0.2397537527407657, 0.22689792170311643, 0.22299057363992422, 0.22299057363992422, 0.22299057363992422, 0.20056783043796034, 0.20056783043796034, 0.18696238826108955, 0.18696238826108955, 0.18696238826108955, 0.16091340117314143, 0.16091340117314143, 0.16091340117314143, 0.14281029216094152, 0.14281029216094152, 0.11768894885778004, 0.11088622776934466, 0.10076647738985403, 0.08837915331421826, 0.08837915331421826, 0.05623957572009519, 0.04398343359382318, 0.04398343359382318, 0.04398343359382318, 0.011515901126290725, 0.011515901126290725, -0.003504432075860656, -0.031052641442251845, -0.031052641442251845, -0.04655085174565696, -0.05110473941642774]
This sequence is then the modularity of the partition at each step in the algorithm. The first several entries in this sequence are effectively zero while there is only one community/component, then it jumps up once there is more than one community. We can use pyplot to visualize this relationship:
import matplotlib.pyplot as plt
plt.plot(modularity_sequence)
plt.ylabel('Modularity')
plt.xlabel('Algorithm step')
Text(0.5, 0, 'Algorithm step')
Visually, we see a peak in the modularity sequence. This is the partition that maximizes modularity, and thus the output of the algorithm. We can use the max
function to get the partition with highest modularity. Ideally we want to write the following:
best_partition = max(partition_sequence, key=nx.community.quality.modularity)
--------------------------------------------------------------------------- TypeError                                 Traceback (most recent call last) ----> 1 best_partition = max(partition_sequence, key=nx.community.quality.modularity) TypeError: modularity() missing 1 required positional argument: 'communities'
...but we receive an error. Recall that a key function must take exactly one argument, the item in the sequence being evaluated, but the modularity function takes two arguments: the graph and the partition. We can fix this by writing a single-argument function to use as the key:
def my_modularity(partition):
return nx.community.quality.modularity(G, partition)
best_partition = max(partition_sequence, key=my_modularity)
Advanced Pythonauts will see a different solution to this using the zip function to align the previously-generated partition and modularity sequences, but the key-function solution above is more explicit.
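For the curious, here is a sketch of that zip-based alternative, reusing the modularity_sequence we computed above (the key compares only the modularity values, so ties never fall through to comparing partitions):
best_modularity, best_partition_by_zip = max(
    zip(modularity_sequence, partition_sequence),
    key=lambda pair: pair[0],  # Compare on modularity only.
)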
So after all that work, what is the best partition?
best_partition
[{0, 1, 3, 7, 11, 12, 13, 17, 19, 21}, {2, 24, 25, 27, 28, 31}, {4, 5, 6, 10, 16}, {8, 14, 15, 18, 20, 22, 23, 26, 29, 30, 32, 33}, {9}]
Interesting! The partition of the karate club graph with the highest modularity actually has five communities! Let's visualize them, using the partition-map code we wrote back at the beginning of this tutorial:
def create_partition_map(partition):
partition_map = {}
for idx, cluster_nodes in enumerate(partition):
for node in cluster_nodes:
partition_map[node] = idx
return partition_map
best_partition_map = create_partition_map(best_partition)
node_colors = [best_partition_map[n] for n in G.nodes()]
nx.draw(G, with_labels=True, node_color=node_colors)
Exactly how good is this five-community clustering?
nx.community.quality.modularity(G, best_partition)
0.3849721706864564
It's higher than the modularity of the "ground truth" communities we evaluated in section 3, which is a good sign, but for the specific problem of trying to predict the post-split community membership, a clustering into five groups is useless to us.
A particularly useful feature of the Girvan-Newman algorithm is that it also works when we want a specific number of clusters. In this case, we know the karate club split into two groups, so let's get the partition in the sequence with two components:
for partition in partition_sequence:
if len(partition) == 2:
two_cluster_partition = partition
break
two_cluster_partition
[{0, 1, 3, 4, 5, 6, 7, 10, 11, 12, 13, 16, 17, 19, 21}, {2, 8, 9, 14, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33}]
two_cluster_partition_map = create_partition_map(two_cluster_partition)
node_colors = [two_cluster_partition_map[n] for n in G.nodes()]
nx.draw(G, with_labels=True, node_color=node_colors)
How good is this partition? We can get its modularity:
nx.community.quality.modularity(G, two_cluster_partition)
0.3476602762317048
Pretty good -- comparable to the ground truth community labels. Let's compare these side-by-side:
import matplotlib.pyplot as plt
pos = nx.layout.spring_layout(G)
fig = plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
two_cluster_partition_map = create_partition_map(two_cluster_partition)
node_colors = [two_cluster_partition_map[n] for n in G.nodes()]
nx.draw(G, with_labels=True, node_color=node_colors, pos=pos)
plt.title('Predicted communities')
plt.subplot(1, 2, 2)
node_colors = [G.nodes[n]['club'] == 'Officer' for n in G.nodes()]
nx.draw(G, with_labels=True, node_color=node_colors, pos=pos)
plt.title('Actual communities')
Text(0.5, 1.0, 'Actual communities')
We can see that the predicted community labels are quite accurate, differing only on a couple of nodes that, visually, seem like they could plausibly belong to either group. Zachary's original paper even explains the practical considerations behind one of these mispredicted nodes: student 8 was very close to receiving his black belt from Mr. Hi and thus did not want to leave the group, even though several of his friends did.
G.nodes[8]
{'club': 'Mr. Hi'}
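As a quick sketch, we can list every mispredicted node by comparing the two-cluster partition with the recorded 'club' attribute, anchoring the cluster labels on node 33 (the Officer):
officer_cluster = two_cluster_partition_map[33]
mispredicted = [
    n for n in G.nodes
    if (two_cluster_partition_map[n] == officer_cluster) != (G.nodes[n]['club'] == 'Officer')
]
mispredicted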
The astute reader might note that there may be several two-cluster partitions in the partition sequence we generated. We assert the following to be true:
1. All partitions in the sequence with the same number of clusters group the nodes identically, because removing an edge either splits one component in two or leaves the components unchanged.
2. The sequence contains at least one partition for every number of clusters between 2 and N.
Proving these is left as an exercise to the reader, but as a consequence of these being true, optimized implementations of Girvan-Newman clustering will only store one partition for each number of clusters. This is how the implementation in NetworkX works, only providing one partition for each number of communities greater than one.
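Before moving on, a quick sketch to check the first assertion against our own partition_sequence: collapsing every two-cluster partition to a canonical form should leave exactly one distinct grouping.
two_cluster_partitions = [p for p in partition_sequence if len(p) == 2]
distinct_groupings = {frozenset(frozenset(c) for c in p) for p in two_cluster_partitions}
len(two_cluster_partitions), len(distinct_groupings)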
nx.community.girvan_newman(G)
will generate a sequence containing one partition of each size greater than one. Here we can see the first several are the same as those we generated:
list(nx.community.girvan_newman(G))[:5]
[({0, 1, 3, 4, 5, 6, 7, 10, 11, 12, 13, 16, 17, 19, 21}, {2, 8, 9, 14, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33}), ({0, 1, 3, 4, 5, 6, 7, 10, 11, 12, 13, 16, 17, 19, 21}, {2, 8, 14, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33}, {9}), ({0, 1, 3, 7, 11, 12, 13, 17, 19, 21}, {2, 8, 14, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33}, {4, 5, 6, 10, 16}, {9}), ({0, 1, 3, 7, 11, 12, 13, 17, 19, 21}, {2, 24, 25, 27, 28, 31}, {4, 5, 6, 10, 16}, {8, 14, 15, 18, 20, 22, 23, 26, 29, 30, 32, 33}, {9}), ({0, 1, 3, 7, 12, 13, 17, 19, 21}, {2, 24, 25, 27, 28, 31}, {4, 5, 6, 10, 16}, {8, 14, 15, 18, 20, 22, 23, 26, 29, 30, 32, 33}, {9}, {11})]
Find communities in the interbank network available from the website by using both Girvan-Newman and Louvain. Discuss the results.
We load the exposure dataset as we did for Tutorial 2.
However, this time we construct an undirected network of total loans/lending between countries, as NetworkX's Louvain clustering cannot handle directed networks. Doing this affects how we normalize the edge weights, which must now consider the strength of both incident nodes. The geometric mean, sqrt(src strength * tar strength), is a suitable choice of denominator: compared to the arithmetic mean, (src strength + tar strength) / 2, the geometric mean is more sensitive to imbalance in node strength. This is desirable because the smaller country is more likely to be the limiting factor of a loan, regardless of which country is the lender.
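A toy illustration with made-up numbers: for a loan between a large and a small country, the geometric mean stays close to the smaller country's strength, while the arithmetic mean is dominated by the larger one.
from math import sqrt
big, small = 1000.0, 10.0   # Hypothetical node strengths.
loan = 5.0                  # Hypothetical loan size.
loan / sqrt(big * small), loan / ((big + small) / 2)
# (0.05, ~0.0099): the geometric-mean normalisation gives this loan much more weight.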
One issue with applying the Girvan-Newman algorithm is that it relies on disconnecting communities by removing edges with high betweenness. Unfortunately, our network is fully connected, and its edge weights should be interpreted as a positive association between the two endpoints rather than as the distances needed to compute betweenness. Our solution here is simply to invert the edge weights when calculating the most valuable edge to remove from the network.
from math import sqrt
import networkx as nx
import networkx.algorithms.community as nx_comm
import pandas as pd
import matplotlib as mpl
# Conversion function that performs edge weight summation:
def sum_directions(G_di: nx.DiGraph) -> nx.Graph:
from collections import defaultdict
G = nx.Graph()
G.add_nodes_from(G_di.nodes.data())
edge_dict = defaultdict(lambda: defaultdict(lambda: 0))
for s, t, data in G_di.edges.data():
for k, v in data.items():
edge_dict[min(s, t), max(s, t)][k] += v
G.add_edges_from([
(s, t, dict(data))
for (s, t), data in edge_dict.items()
])
return G
# Load exposure dataset ignoring zero weights.
exposuredf = pd.read_csv('exposure_dataset.csv')
exposuredf = exposuredf[exposuredf.weight > 0]
# Louvain cannot handle directed graph, so convert to undirected graph.
G_interbank = sum_directions(
nx.from_pandas_edgelist(
exposuredf,
edge_attr = 'weight',
create_using = nx.DiGraph() # Preserve edge directions for `sum_directions`
)
)
# Normalise edges by strength of both incident nodes.
node_strength = G_interbank.degree(weight = 'weight')
for src, tar, wgt in G_interbank.edges.data('weight'):
# Normalised in the range 0-1, with values being positive.
norm_wgt = wgt / sqrt(node_strength[src] * node_strength[tar])
G_interbank.edges[src, tar]['norm_weight'] = norm_wgt
G_interbank.edges[src, tar]['norm_dist'] = 1 / norm_wgt  # Invert so strong links become short distances.
# Girvan-Newman clustering using the computed and normalised distances.
def norm_betweenness(G):
betweenness = nx.edge_betweenness_centrality(G, weight = 'norm_dist')
return lambda e: betweenness[e]
def most_valuable_edge(G):
return max(G.edges, key = norm_betweenness(G))
def get_modularity(part):
return nx_comm.modularity(G_interbank, part, weight = 'norm_weight')
initial_centrality = norm_betweenness(G_interbank)
comm_girvan_newman = max(
nx.community.girvan_newman(G_interbank, most_valuable_edge),
key = get_modularity
)
comm_louvain = nx_comm.louvain_communities(
G_interbank, weight = 'norm_weight'
)
## PLOTTING ##
part_girvan_newman = create_partition_map(comm_girvan_newman)
part_louvain = create_partition_map(comm_louvain)
plt.figure(figsize = (18, 8))
plt.subplot(1, 2, 1)
plt.title(f'Girvan-Newman: {get_modularity(comm_girvan_newman):.3f} modularity')
nx.draw(
G_interbank,
pos = nx.circular_layout(G_interbank),
with_labels = True,
node_color = [part_girvan_newman[n] for n in G_interbank.nodes()],
cmap = mpl.cm.tab10,
edge_color = [wgt for _, _, wgt in G_interbank.edges.data('norm_weight')],
edge_cmap = mpl.cm.Greys)
plt.subplot(1, 2, 2)
plt.title(f'Louvain: {get_modularity(comm_louvain):.3f} modularity')
nx.draw(
G_interbank,
pos = nx.circular_layout(G_interbank),
with_labels = True,
node_color = [part_louvain[n] for n in G_interbank.nodes()],
cmap = mpl.cm.tab10,
edge_color = [wgt for _, _, wgt in G_interbank.edges.data('norm_weight')],
edge_cmap = mpl.cm.Greys)
Comparing the two methods, we can see that the Louvain method has uncovered a better partition, with over double the modularity of the Girvan-Newman result. Its community structure is also more nuanced, containing more (and more equally sized) communities than Girvan-Newman's.
The low modularity of the Girvan-Newman partition is not surprising. Since the network is fully connected, the shortest path between any two nodes is determined solely by the edge weights: strong links correspond to short distances, so they attract many shortest paths and acquire high betweenness. Girvan-Newman therefore removes first precisely the edges that matter most for modularity, and each step divides the network along a strong link, reducing the modularity further. The opposite is seen in the Louvain communities, many of which are centered around a number of particularly strong links.
plt.title('Girvan-Newman Modularity')
plt.plot([
get_modularity(part)
for part in nx_comm.girvan_newman(G_interbank, most_valuable_edge)
])
plt.xlabel('Step')
plt.ylabel('Modularity')
plt.show()
Use the stock price correlation dataset to compute 1) the minimum spanning tree using distances 2) the maximum spanning tree using correlations. Use an algorithm of your choice to find communities for both, and discuss the differences.
We load last year's FTSE 100 adjusted returns and calculate the maximum and minimum spanning trees as we did in Tutorial 1.
import pandas as pd
G_corr = nx.from_pandas_adjacency(
pd.read_csv('ret_adjusted.csv')
.set_index('ret.adjusted.prices.ref.date')
.rename(columns = lambda k: k.split('.')[3])
.corr()
)
tree_min = nx.minimum_spanning_tree(G_corr)
tree_max = nx.maximum_spanning_tree(G_corr)
def plot_communities(tree):
def my_modularity(part):
return nx.community.quality.modularity(tree, part)
part = create_partition_map(max(
nx.community.girvan_newman(tree),
key = my_modularity))
nx.draw(
tree,
pos = nx.spring_layout(tree, k = .1, iterations = 500, weight = None),
with_labels = True,
nodelist = list(tree.nodes()),
node_color = [part[n] for n in tree.nodes()],
cmap = mpl.cm.tab10,
edge_color = '#A0CBE2')
plt.figure(figsize = (20, 10))
plt.subplot(1, 2, 1)
plt.title('Minimum Spanning Tree')
plot_communities(tree_min)
plt.subplot(1, 2, 2)
plt.title('Maximum Spanning Tree')
plot_communities(tree_max)
We chose to use Girvan-Newman because, by definition, trees are constructed from bridges between otherwise disconnected sub-trees. Thus, by removing bridges, we naturally deconstruct the network into suitable communities. Other community-detection algorithms covered in this course rely on the relatively larger number of internal links to define communities. This has a similar effect here, as branches represent sub-graphs with the maximum possible density for their size within a tree.
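A quick sketch confirming the premise: in a tree every edge is a bridge, so each edge removed by Girvan-Newman immediately splits a community in two.
# Both lines should print True: the number of bridges equals the number of edges.
print(len(list(nx.bridges(tree_min))) == tree_min.number_of_edges())
print(len(list(nx.bridges(tree_max))) == tree_max.number_of_edges())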
We can also see that in the minimum spanning tree, these communities are much more likely to cluster around high-degree nodes, such as RKT, JET, RR, OCDO, and BME.
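As a sketch, we can list the highest-degree nodes of each tree to see which tickers the communities cluster around:
def top_degree_nodes(tree, k=5):
    # Sort nodes by their degree within the tree, highest first.
    return sorted(tree.degree, key=lambda pair: pair[1], reverse=True)[:k]
top_degree_nodes(tree_min), top_degree_nodes(tree_max)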
Find communities on the strongly connected component of the bitcoin transaction network. Is there any evidence of competing Cryptocurrency exchange platforms?
We load the bitcoin network and identify the strongly connected component as we did in Tutorial 3:
import zipfile as zp
with zp.ZipFile('2013-06-03_to_2013-06-09.zip') as arc:
with arc.open('2013-06-03_to_2013-06-09.graphml') as file:
G_bitcoin = nx.read_graphml(file)
components = nx.strongly_connected_components(G_bitcoin) # Identify all components.
G_strong = nx.subgraph(G_bitcoin, max(components, key = len)) # Subgraph the largest component.
print(f"Largest strongly connected component contains {len(G_strong.nodes)} of {len(G_bitcoin.nodes)} nodes")
Largest strongly connected component contains 41738 of 103076 nodes
To analyse exchange competition, we aim to identify distinct communities for exchanges to serve and analyse the distribution of exchanges amongst them. The primary function of exchanges is to indirectly connect users who want to buy bitcoin with those who want to sell. Unlike online stores, we expect exchanges to have a similar mix of incoming and outgoing transactions, so we can identify them through their relatively high betweenness.
We see a sharp drop in betweenness at 0.01, which we assume is a sufficient value for a node to be considered an exchange.
score_betweenness = nx.betweenness_centrality(G_strong, k = 10000)
import numpy as np
import matplotlib.pyplot as plt
threshold = 1e-2
plt.figure(figsize = (8, 3))
plt.xlabel('Betweenness score')
plt.hist(score_betweenness.values(), bins = np.geomspace(1e-10, 1, 100))
plt.xscale('log')
plt.yscale('log')
plt.ylabel('Node frequency')
plt.axvline(x = threshold, linestyle = '--', color='r')
plt.show()
exchanges = {
n for n, d in score_betweenness.items()
if d > 2e-3
}
len(exchanges)
330
To identify distinct markets, we choose the Louvain method to partition the network into distinct communities of wallets, which we assume are suitable proxies for the distinct markets that exchanges serve. We avoid using Girvan-Newman, as its top-down approach to identifying communities may interfere with identifying hub nodes.
import numpy as np
from collections import Counter
def plot_comm_sizes(resolution):
part = nx_comm.louvain_communities(
G = sum_directions(G_strong),
weight = 'no_transactions',
resolution = resolution
)
part_map = create_partition_map(part)
sizes = Counter(part_map.values())
plt.figure(figsize = (12, 4))
plt.title(f'Community Sizes: Resolution {resolution}')
xticks = np.arange(len(sizes.keys()))
plt.bar(xticks, [sizes[i] for i in xticks])
plt.xlabel('Community Index')
plt.xticks(xticks, list(sizes.keys()))
plt.ylabel('Size (log)')
plt.yscale('log')
plt.show()
return part
part_bitcoin_default = plot_comm_sizes(0.75)
part_bitcoin_large = plot_comm_sizes(0.25)
We see below that the 330 exchanges we identified earlier are poorly distributed among the communities. Even allowing for the varying sizes of the communities, the exchanges appear concentrated in the largest communities, with little accommodation for the smaller ones.
def exchange_competition(part, exchanges):
sizes = {comm: len(members) for comm, members in enumerate(part)}
n_exchanges = {
comm: len(members & exchanges)
for comm, members in enumerate(part)
}
plt.figure(figsize = (12, 4))
plt.title('Exchange Competition')
xticks = np.arange(len(part))
plt.bar(xticks, [n_exchanges[i] for i in xticks])
plt.xlabel('Community Index')
plt.xticks(xticks, list(sizes.keys()))
plt.ylabel('Number of Exchanges')
plt.show()
exchange_competition(part_bitcoin_default, exchanges)
exchange_competition(part_bitcoin_large, exchanges)