import networkx as nx
%matplotlib inline
In this tutorial we will put into practice what we learned about the different ways to characterise node importance and find important nodes.
nx.read_edgelist assumes node names are strings

Edge lists are a simple, plain-text format for storing graphs. Since this file format doesn't contain information about data types, all node names are assumed to be strings by default. When the node names are integers, we should pass the nodetype=int keyword argument so that they are read as integers rather than as the corresponding strings.
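A minimal, hypothetical example (the filename edges.txt is made up):

# A file with lines such as "0 1" would otherwise yield the string nodes '0' and '1'
G_int = nx.read_edgelist('edges.txt', nodetype=int)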
Load one of the networks used in the previous tutorials
G = nx.Graph()
G.add_edges_from([
    ("Frankfurt", "Stansted"),
    ("Stansted", "Madrid"),
    ("Rome", "Frankfurt"), ("Helsinki", "Rome"), ("Stansted", "Paris"),
    ("Rome", "Paris"), ("Helsinki", "Paris"), ("Paris", "Frankfurt")
])
nx.draw(G, with_labels = True)
We can apply the max function to get the maximum node according to some criterion. In our case, we want to compare the nodes by their degree:
highest_degree_node = max(G.nodes, key=G.degree)
highest_degree_node
'Paris'
G.degree(highest_degree_node)
4
NetworkX provides many functions regarding the computation of different types of centralities, each one with its own peculiarities and functionalities.
The eigenvector centrality computes the centrality for a node based on the centrality of its neighbors. The eigenvector centrality for node $i$ is the $i$-th element of the vector $x$ defined by the equation
$$Ax=\lambda x,$$
where $A$ is the adjacency matrix of the graph $G$ and $\lambda$ is its largest eigenvalue.
nx.eigenvector_centrality(G)
{'Frankfurt': 0.46493880333267035, 'Stansted': 0.3834327635729094, 'Madrid': 0.12862901683739722, 'Rome': 0.45310115951559804, 'Helsinki': 0.3363097865225342, 'Paris': 0.5494186634404209}
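To see the connection to the eigenvalue equation, here is a minimal power-iteration sketch (assuming numpy, and written for clarity): repeatedly applying $A$ and renormalising converges to the leading eigenvector, reproducing the values above.

import numpy as np
A = nx.to_numpy_array(G)       # adjacency matrix, ordered as G.nodes
x = np.ones(len(G))
for _ in range(100):
    x = A @ x                  # apply the adjacency matrix
    x = x / np.linalg.norm(x)  # renormalise to unit length
print(dict(zip(G.nodes, x)))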
The Katz centrality computes the centrality for a node based on the centrality of its neighbors. It is a generalization of the eigenvector centrality. The Katz centrality for node $i$ is
$$x_i=\alpha \sum_{j} A_{ij}x_j+\beta,$$
where $A$ is the adjacency matrix of graph $G$ with largest eigenvalue $\lambda_{max}$.
The parameter $\beta$ controls the initial centrality and $$\alpha<\frac{1}{\lambda_{max}}.$$
The Katz centrality computes the relative influence of a node within a network by measuring the number of immediate neighbors (first-degree nodes) and also all other nodes in the network that connect to the node under consideration through these immediate neighbors. Extra weight can be given to immediate neighbors through the parameter $\beta$. Connections made with distant neighbors are, however, penalized by an attenuation factor $\alpha$, which must be strictly less than the inverse of the largest eigenvalue of the adjacency matrix in order for the Katz centrality to be computed correctly.
The Katz centrality is computationally intensive, and the iterative implementation can fail to converge when dealing with very large networks. If this is the case, the NumPy implementation of the function, nx.katz_centrality_numpy, can be used.
nx.katz_centrality(G)
{'Frankfurt': 0.42403680743356886, 'Stansted': 0.4163193337863001, 'Madrid': 0.336061385349643, 'Rome': 0.420953676972386, 'Helsinki': 0.38240488692433294, 'Paris': 0.45880088037619626}
nx.katz_centrality_numpy(G)
{'Frankfurt': 0.42403681498960366, 'Stansted': 0.41631933079627154, 'Madrid': 0.3360613571460131, 'Rome': 0.42095368327266774, 'Helsinki': 0.38240488190997657, 'Paris': 0.458800895163238}
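For intuition, the defining equation can be rearranged into a linear system, $(I-\alpha A)x=\beta\mathbf{1}$, and solved directly. A minimal sketch, assuming numpy and the NetworkX defaults $\alpha=0.1$, $\beta=1$:

import numpy as np
A = nx.to_numpy_array(G)       # adjacency matrix, ordered as G.nodes
alpha, beta = 0.1, 1.0         # NetworkX defaults
n = len(G)
x = np.linalg.solve(np.eye(n) - alpha * A, beta * np.ones(n))
x = x / np.linalg.norm(x)      # normalise to unit Euclidean norm, matching the values above
print(dict(zip(G.nodes, x)))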
The betweenness centrality of a node $v$ is the sum of the fractions of all-pairs shortest paths that pass through $v$:
$$c_B(v)=\sum_{s,t \in V} \frac{\sigma(s,t|v)}{\sigma(s,t)},$$
where $V$ is the set of nodes, $\sigma(s,t)$ is the number of shortest $(s,t)$-paths, and $\sigma(s,t|v)$ is the number of those paths passing through some node $v$ other than $s$ and $t$. If $s=t$, then $\sigma(s,t)=1$, and if $v \in \{s,t\}$, then $\sigma(s,t|v)=0$. Note that NetworkX normalizes these sums by default, dividing each by the number of node pairs not containing $v$.
nx.betweenness_centrality(G)
{'Frankfurt': 0.1, 'Stansted': 0.4, 'Madrid': 0.0, 'Rome': 0.05, 'Helsinki': 0.0, 'Paris': 0.35000000000000003}
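As a sanity check, here is a brute-force sketch (written for clarity, not efficiency) that recomputes the score of 'Paris' directly from the definition and applies the same normalization:

import itertools
v = 'Paris'
raw = 0
for s, t in itertools.combinations(G.nodes, 2):
    if v in (s, t):
        continue  # pairs containing v contribute nothing
    paths = list(nx.all_shortest_paths(G, s, t))          # all shortest (s, t)-paths
    raw += sum(v in path for path in paths) / len(paths)  # fraction passing through v
n_pairs = (len(G) - 1) * (len(G) - 2) / 2  # number of pairs not containing v
print(raw / n_pairs)  # 0.35, matching the NetworkX value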
An efficient way of illustrating the centrality information for all the nodes in a network is the centrality distribution, which summarizes the most important properties across the graph. First, we have to build a sequence with all the centralities. As the centrality function returns a dictionary, we convert its values (.values()) into a list. Let's start with the eigenvector centrality!
eigenvec_centr = list(nx.eigenvector_centrality(G).values())
import numpy as np
print('Mean centrality:', np.mean(eigenvec_centr))
print('Median centrality:', np.median(eigenvec_centr))
Mean centrality: 0.38597169887025506
Median centrality: 0.4182669615442537
A useful way of graphically representing a sequence of centrality values is a histogram, plotting the centrality values on the $x$-axis and the number of nodes having each value on the $y$-axis.
import matplotlib.pyplot as plt
counts, bins, patches = plt.hist(eigenvec_centr, bins=20)
plt.xlabel("Centrality")
plt.ylabel("Absolute frequency")
plt.show()
By capturing counts and bins, we can investigate them further, e.g. to analyse a given bin and see how many values fall into it.
counts
array([1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 2., 0., 0., 0., 1.])
bins
array([0.12862902, 0.1496685 , 0.17070798, 0.19174746, 0.21278695, 0.23382643, 0.25486591, 0.27590539, 0.29694488, 0.31798436, 0.33902384, 0.36006332, 0.3811028 , 0.40214229, 0.42318177, 0.44422125, 0.46526073, 0.48630022, 0.5073397 , 0.52837918, 0.54941866])
For each of the models presented in this tutorial, we present the algorithm, an example implementation, and the corresponding NetworkX function to generate the model. The implementations provided here are written for clarity, not efficiency; it's best to use the NetworkX functions when doing real analysis work.
The random module

Many network models rely on randomness in their generative algorithms. Python's random module provides four key functions of use when coding network models.
random.random

Often in an algorithm, we need something to happen with some probability $p$. The canonical way to decide whether or not such an event happens is to generate a random number $r$ between 0 and 1, and if $r < p$, then the event occurs. The random.random() function returns just such a random number in the interval [0, 1).
For a simple example, consider an unfair coin that comes up heads 75% of the time. We can write the following code to flip such a coin 10 times, reporting the outcome each time:
import random
p = 0.75
# Do this 10 times
for _ in range(10):
    r = random.random()
    if r < p:
        print('Heads')
    else:
        print('Tails')
Heads Heads Heads Heads Tails Heads Heads Heads Heads Tails
As we would expect from a random process, executing the previous cell again will generate a different sequence of flips -- each one is independently generated.
random.choice

When we have a population of discrete choices and we need to select one at random, we use random.choice(). For example, instead of "eeny, meeny, miny, moe," we can use random.choice to choose a random name:
names = ['Alice', 'Bob', 'Cathy', 'Dan']
random.choice(names)
'Cathy'
random.sample

If we have a collection and we need to select more than one element without replacement, we use random.sample(). For example, to choose two nodes at random from the nodes in a graph, we can use the following:
G = nx.cycle_graph(5)
random.sample(list(G.nodes), 2)  # list() needed: random.sample requires a sequence in Python 3.11+
[1, 4]
random.choices

We use random.choices() when we need to choose an element from a collection when the chances of selecting each element are not identical.
For an example, let's assume Alice, Bob, and Carol are in a raffle drawing. Alice bought one ticket, Bob bought three tickets, and Carol bought four tickets. We can simulate ten different draws of this raffle, replacing the drawn ticket each time, with the following code:
names = ['Alice', 'Bob', 'Carol']
tickets = [1, 3, 4]
for _ in range(10):
    print(random.choices(names, tickets))
['Carol'] ['Carol'] ['Carol'] ['Carol'] ['Alice'] ['Bob'] ['Bob'] ['Carol'] ['Bob'] ['Bob']
Running the above cell should give what we expect: Carol wins the drawing most often, with Bob winning some times, and Alice winning occasionally. Of course this outcome depends on the luck of the draw!
By specifying the keyword argument k, we can choose k items from the collection with replacement:
random.choices(names, tickets, k=10)
['Carol', 'Carol', 'Carol', 'Alice', 'Bob', 'Alice', 'Bob', 'Carol', 'Bob', 'Carol']
The weights provided to random.choices do not have to be integers -- any numeric weights are fine.
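For example, fractional weights produce the same proportions as the integer tickets above:

random.choices(names, [0.5, 1.5, 2.0], k=10)  # weights proportional to [1, 3, 4]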
The random network model, as formulated by Gilbert, has two parameters: the number of nodes $N$ and the link probability $p$. As in the book text, the algorithm for creating this network is as follows:
1. Create $N$ isolated nodes.
2. For each pair of nodes, add a link between them with probability $p$.
We'll need a couple of tools from Python for this task:
We've previously looped over all nodes in a graph, as well as all graph edges, but this algorithm requires us to loop over all pairs of nodes, i.e. all possible edges. The itertools module in Python's standard library gives us the combinations() function, an elegant way to loop over pairs of elements in a sequence:
import itertools
elements = [0, 1, 2, 3, 4]
list(itertools.combinations(elements, 2))
[(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
The second argument to itertools.combinations() is the length of the sequences we want in the output. Since we want pairs, we'll specify 2. Note the nice properties of the output:
- No element is paired with itself, e.g. there is no ('a', 'a').
- Each pair occurs only once: only one of ('a', 'b') and ('b', 'a') appears, since they are the same edge in an undirected graph.

We can thus use this to loop over all pairs of nodes in a graph:
G = nx.Graph()
G.add_nodes_from(elements)
list(itertools.combinations(G.nodes, 2))
[(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
With these tools in our toolbelt, we can code the algorithm for the Gilbert random graph model.
def gnp_random_graph(N, p):
    G = nx.Graph()
    G.add_nodes_from(range(N))
    for i, j in itertools.combinations(G.nodes, 2):
        r = random.random()
        if r < p:
            G.add_edge(i, j)
        # Do nothing if r >= p
    return G
We can use this function to generate a graph. Since this is a random graph, each execution of the following code will generate a different graph.
G = gnp_random_graph(16, 0.15)
nx.draw(G)
print('Graph has', G.number_of_edges(), 'edges.')
Graph has 18 edges.
Run the above cell a few times and note that the number of edges varies slightly among random graphs generated with the same parameters. This is because each pair of nodes has an independent chance of being an edge.
Instead of specifying the link probability $p$, we can also generate a graph of $N$ nodes with exactly $M$ edges, by using random.sample() to choose $M$ of the possible edges generated with itertools.combinations():
def gnm_random_graph(N, M):
    G = nx.Graph()
    G.add_nodes_from(range(N))
    possible_edges = itertools.combinations(G.nodes, 2)
    edges_to_add = random.sample(list(possible_edges), M)
    G.add_edges_from(edges_to_add)
    return G
G = gnm_random_graph(16, 18)
nx.draw(G)
NetworkX has a function for the $G_{n,p}$ random graph specifying number of nodes $N$ and link probability $p$: gnp_random_graph().
In addition, NetworkX provides gnm_random_graph(), which generates a $G_{n,m}$ graph, where we specify the number of nodes $N$ and the desired number of edges $M$.
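For example, these one-liners are equivalent to the hand-rolled generators above:

G = nx.gnp_random_graph(16, 0.15)  # same parameters as our gnp_random_graph
G = nx.gnm_random_graph(16, 18)    # same parameters as our gnm_random_graph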
The algorithm for generating a small-world network is as follows:
1. Create a ring of $N$ nodes.
2. Connect each node $n$ to its $k$ nearest neighbors: $n-k/2, \ldots, n-1, n+1, \ldots, n+k/2$.
3. Rewire each edge with probability $p$.
We'll do these step-by-step first, and combine them into a function last.
N = 10
G = nx.cycle_graph(N)
nx.draw_circular(G, with_labels=True)
Next, we connect each node to its $k$ nearest neighbors, using modular arithmetic. As an example of why, let's say $k$ is 4. Then for node $n$, we want to add edges to $n$'s 4 nearest neighbors: $n-1, n+1, n-2,$ and $n+2$. Since our nodes are in a circle, these can "wrap around", e.g. the 4 nearest neighbors of node 0 are $N-1$, 1, $N-2$, and 2.
Note the use of integer division (//) below. Integer division throws away the fractional part of division, e.g.
5 // 2 = 2
k = 4
for n in G.nodes:
    for i in range(1, k // 2 + 1):
        left = (n - i) % N
        right = (n + i) % N
        G.add_edge(n, left)
        G.add_edge(n, right)
nx.draw_circular(G, with_labels=True)
For each edge $(u, v)$, with probability $p$, replace edge $(u, v)$ with $(u, w)$, where $w$ is not a neighbor of $u$.
For this step, we make use of set arithmetic to generate the set of nodes that are not neighbors of $u$, and random.choice to select $w$ at random from that set of "not neighbors."
p = 0.1
for u, v in list(G.edges):
    if random.random() < p:
        # Exclude u itself as well as its current neighbors
        not_neighbors = set(G.nodes) - set(G.neighbors(u)) - {u}
        w = random.choice(list(not_neighbors))
        G.remove_edge(u, v)
        G.add_edge(u, w)
nx.draw_circular(G, with_labels=True)
We can put this together to write a basic function for the small-world model:
def watts_strogatz_graph(N, k, p):
    # 1. Create a ring of N nodes
    G = nx.cycle_graph(N)
    # 2. Connect each node n to its k nearest neighbors
    #    [n-(k//2), ..., n-1, n+1, ..., n+(k//2)]
    for n in G.nodes:
        for i in range(1, k // 2 + 1):
            left = (n - i) % N
            right = (n + i) % N
            G.add_edge(n, left)
            G.add_edge(n, right)
    # 3. Rewire edges with probability p
    for u, v in list(G.edges):
        if random.random() < p:
            not_neighbors = set(G.nodes) - set(G.neighbors(u)) - {u}
            w = random.choice(list(not_neighbors))
            G.remove_edge(u, v)
            G.add_edge(u, w)
    return G
G = watts_strogatz_graph(16, 4, 0.2)
nx.draw_circular(G, with_labels=True)
NetworkX has a function for this model: watts_strogatz_graph().
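It takes the same parameters as our implementation:

G = nx.watts_strogatz_graph(16, 4, 0.2)
nx.draw_circular(G, with_labels=True)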
The Barabási-Albert preferential attachment model has two parameters: the number of nodes $N$, and the number of links added at each step $m$. Given these parameters, the algorithm is as follows:
1. Start with a clique of $m+1$ nodes.
2. Select $m$ different existing nodes at random, weighted by their degree.
3. Add a new node and link it with the $m$ nodes selected in the previous step.
4. Repeat steps 2-3 until the graph has $N$ nodes.

The code for this is thus straightforward, with one possible exception: for step 2, we need to generate a degree sequence to weight the random selection. If we have a graph G, we can generate such a degree sequence with a list comprehension as follows:
G = nx.star_graph(4)
degrees = [G.degree(n) for n in G.nodes]
print(degrees)
nx.draw(G, with_labels=True)
[4, 1, 1, 1, 1]
Note that the degree sequence is output in the same order as the nodes: the node at index $i$ of the node list has the degree at index $i$ of the degree sequence. With this, we can write a function for the BA preferential attachment model:
def barabasi_albert_graph(N, m):
    # 1. Start with a clique of m+1 nodes
    G = nx.complete_graph(m + 1)
    for i in range(G.number_of_nodes(), N):
        # 2. Select m different nodes at random, weighted by their degree
        new_neighbors = []
        possible_neighbors = list(G.nodes)
        for _ in range(m):
            degrees = [G.degree(n) for n in possible_neighbors]
            j = random.choices(possible_neighbors, degrees)[0]
            new_neighbors.append(j)
            possible_neighbors.remove(j)
        # 3. Add a new node i and link it with the m nodes from the previous step
        for j in new_neighbors:
            G.add_edge(i, j)
    return G
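We can try the function out. As with the other models, NetworkX provides a built-in generator, barabasi_albert_graph(), which is preferable for real analysis work:

G = barabasi_albert_graph(50, 2)
nx.draw(G)
# Built-in equivalent:
# G = nx.barabasi_albert_graph(50, 2)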
Load the weekly Bitcoin transaction dataset, and plot the degree distributions (i.e., in-degrees, out-degrees, total degrees).
HINT: use the function nx.read_graphml() to open the network.
Use the above function to read in the network, then use plt.hist to plot the degree distributions as before. However, this time it is helpful to use logarithmically spaced bins and a logarithmic x-scale, as the degree distribution is scale-free. For this we can use the numpy function np.logspace to create an array of bin thresholds.
We have chosen to ignore the quantity of transactions made between two nodes (the qty attribute) when calculating a node's degree. However, it would also be reasonable to treat the network as a multigraph and count each transaction separately (a sketch of this follows the plots below).
import numpy as np
import zipfile as zp
import matplotlib.pyplot as plt
with zp.ZipFile('2013-06-03_to_2013-06-09.zip') as arc:
    with arc.open('2013-06-03_to_2013-06-09.graphml') as file:
        G = nx.read_graphml(file)
bins = np.logspace(np.log10(1), np.log10(1e5), 25)  # logarithmically spaced bin edges from 1 to 1e5
def plot_hist(degrees):
    plt.hist([deg for _, deg in degrees], bins)
    plt.xscale('log')
    plt.yscale('log')
plt.figure(figsize = (15, 3))
plt.subplot(1, 3, 1)
plt.title('In-degree distribution')
plot_hist(G.in_degree())
plt.subplot(1, 3, 2)
plt.title('Out-degree distribution')
plot_hist(G.out_degree())
plt.subplot(1, 3, 3)
plt.title('Total-degree distribution')
plot_hist(G.degree())
plt.show()
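As noted above, we could instead count each transaction separately. Assuming the qty edge attribute is numeric, NetworkX's degree view supports this directly through its weight= keyword; a minimal sketch:

# Sums the 'qty' attribute over each node's incident edges,
# rather than counting each edge once
tx_degrees = G.degree(weight='qty')
plot_hist(tx_degrees)  # reuses the helper above; the bin range may need widening
plt.show()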
Find the 5 largest hubs based on their in-degree, out-degree, and total degree. Do these three lists overlap? Discuss the results.
As in previous exercises, we can use Python's built-in function sorted to identify the nodes with the largest in-, out-, and total degree.
For convenience, we use a DataFrame to compare the contents of each list:
import pandas as pd
def top_five(deg_dist):
    # Sort (node, degree) pairs by descending degree and keep the first five nodes
    return [node for node, _ in sorted(deg_dist, key=lambda p: -p[1])][:5]
pd.DataFrame({
    'K-in': top_five(G.in_degree()),
    'K-out': top_five(G.out_degree()),
    'K': top_five(G.degree())
}).T
|       | 0     | 1       | 2       | 3       | 4       |
|-------|-------|---------|---------|---------|---------|
| K-in  | 24778 | 1056959 | 3304013 | 1623263 | 46606   |
| K-out | 24778 | 1676323 | 3183164 | 1304872 | 857809  |
| K     | 24778 | 1676323 | 1056959 | 3183164 | 1304872 |
The first node in each list is 24778; it has both the largest in-degree and the largest out-degree, so by definition it must also have the largest total degree. The four other nodes in the highest total-degree list each also appear in one of the other two lists; in all but one case this is the out-degree list. This should be expected: as our degree distributions in exercise 1 show, the out-degree distribution has a longer tail.
Compute the centrality values of the nodes in the bitcoin network, using betweenness and closeness. Do you see any differences between the two distributions?
As both closeness and betweenness are path-based measures, some preprocessing is required before we compute the centralities. We must eliminate disconnected components: filtering the network to the giant strongly connected component (GSCC) ensures there is a route to and from each pair of nodes.
NetworkX has functions for calculating both closeness and betweenness centrality. We ignore edge weights entirely here, but we could use both the number and the size of transactions as a measure of edge importance. To use these for closeness or betweenness, we would need to convert the values into a distance measure.
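A minimal sketch of such a conversion, assuming the numeric qty attribute as the importance measure: take its reciprocal, so that heavily used links count as short distances.

# Hypothetical conversion: larger transaction count => shorter distance
for u, v, data in G.edges(data=True):
    data['dist'] = 1 / data['qty']  # assumes every edge carries a positive 'qty'
# The attribute could then be passed to the centrality functions below, e.g.
# nx.closeness_centrality(G_strong, distance='dist')
# nx.betweenness_centrality(G_strong, k=10000, weight='dist')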
components = nx.strongly_connected_components(G)     # Identify all strongly connected components.
G_strong = nx.subgraph(G, max(components, key=len))  # Subgraph of the largest component.
score_closeness = nx.closeness_centrality(G_strong)
score_betweenness = nx.betweenness_centrality(G_strong, k=10000)  # k samples source nodes to approximate.
With the closeness and betweenness calculated for each node, we can extract the values and compare the distributions, summarising the major statistics in a pandas DataFrame and plotting histograms on a log-log scale.
import pandas as pd
pd.DataFrame({
    'Closeness': list(score_closeness.values()),
    'Betweenness': list(score_betweenness.values())
}).describe().T
|             | count   | mean     | std      | min      | 25%      | 50%      | 75%      | max      |
|-------------|---------|----------|----------|----------|----------|----------|----------|----------|
| Closeness   | 41738.0 | 0.229600 | 0.075960 | 0.003701 | 0.165312 | 0.245558 | 0.325475 | 0.482520 |
| Betweenness | 41738.0 | 0.000124 | 0.004867 | 0.000000 | 0.000000 | 0.000024 | 0.000083 | 0.988976 |
import numpy as np
import matplotlib.pyplot as plt
def logloghist(dist, xmin, xmax, bins=25):
    plt.hist(list(dist), bins=np.geomspace(xmin, xmax, bins))
    plt.xscale('log')
    plt.yscale('log')
plt.figure(figsize = (10, 3))
plt.suptitle('Centrality Distributions')
plt.subplot(1, 2, 1)
logloghist(score_closeness.values(), 1e-3, 1)
plt.xlabel('Closeness score')
plt.ylabel('Node frequency')
plt.subplot(1, 2, 2)
logloghist(score_betweenness.values(), 1e-10, 1)
plt.xlabel('Betweenness score')
plt.ylabel('Node frequency')
plt.show()
def count_zeros(name, dist):
    n_zero = sum(1 for i in dist if i <= 0)
    print(f"# nodes with zero {name}: {n_zero} ({n_zero/len(dist):.1%})")
count_zeros('closeness', score_closeness.values())
count_zeros('betweenness', score_betweenness.values())
# nodes with zero closeness: 0 (0.0%)
# nodes with zero betweenness: 12000 (28.8%)
We can see that both distributions vary wildly in scale, necessitating the log-log plots. While the standard deviation of the betweenness scores is smaller, this is only because the majority of its values are smaller; the betweenness distribution actually spans more orders of magnitude.
Another difference is that the peak of the closeness distribution is relatively larger and higher than the peak of the betweenness distribution. Betweenness is more sensitive to hubs than closeness: to accrue betweenness, a node must lie on the shortest path between two other nodes, which is never the case for 12,000 nodes in the Bitcoin network and rarely the case for many other non-hubs. By contrast, a non-hub may accrue a modest closeness merely through proximity to a good hub.
Even in these plots we can pick out the hub 24778, which forms its own bin in both graphs. Its high betweenness score indicates that it lies on the shortest paths between a large fraction of node pairs; this also explains its singularly high closeness, since it is the only node whose shortest paths need not pass through it.
pd.DataFrame({
    'K': top_five(G.degree()),
    'Closeness': top_five(score_closeness.items()),
    'Betweenness': top_five(score_betweenness.items())
}).T
|             | 0     | 1       | 2       | 3       | 4       |
|-------------|-------|---------|---------|---------|---------|
| K           | 24778 | 1676323 | 1056959 | 3183164 | 1304872 |
| Closeness   | 24778 | 1056959 | 1885981 | 3327900 | 3276457 |
| Betweenness | 24778 | 1056959 | 3024702 | 3315476 | 3338162 |
print(f"{G.degree('24778') / len(G.nodes):.3%} are immediate neighbours of node 24778.")
49.293% are immediate neighbours of node 24778.
Looking at the top 5 nodes for each centrality, we see that all three lists include both 24778 and 1056959, but otherwise there is no overlap between them. 24778's high centrality may be explained partly by its neighbourhood, which on its own accounts for almost half of all nodes in the GSCC.