{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import networkx as nx\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data-driven Business and Behaviour Analytics (DBBA)\n",
    "# Tutorial: 3 - Hubs, centrality, network models + bipartite projections, and backboning\n",
    "## PART I\n",
    "\n",
    "In this tutorial we will put in practice what we learned about the different ways to characterise node importance and find important nodes.\n",
    "\n",
    "### pro tip: by default, `nx.read_edgelist` assumes node names are strings\n",
    "\n",
    "Edge lists are a simple, plain text format for storing graphs. Since this simple file format doesn't contain information about data types, all node names are assumed to be strings by default. When the node names are given by integers, as they are in this example, we should specify the `nodetype=int` keyword argument to avoid confusion with the node names."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load one of the networks used in the previous tutorials"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "G = nx.read_edgelist('PATHTONETWORK/network', nodetype=int)\n",
    "print(nx.info(G))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nx.draw(G)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Finding the node with max degree\n",
    "\n",
    "We can apply the `max` function to get the maximum node according to some criterion. In our case, we want to compare the nodes by their degree:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "highest_degree_node = max(G.nodes, key=G.degree)\n",
    "highest_degree_node"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "G.degree(highest_degree_node)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finding hubs\n",
    "\n",
    "NetworkX provides many functions regarding the computation of different types of centralities, each one with its own peculiarities and functionalities. \n",
    "\n",
    "The __eigenvector centrality__ computes the centrality for a node based on the centrality of its neighbors. The eigenvector centrality for node $i$ is the $i$-th element of the vector $x$ defined by the equation\n",
    "\n",
    "$$Ax=λx$$\n",
    "\n",
    "where $A$ is the adjacency matrix of the graph $G$ with eigenvalue $λ$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nx.eigenvector_centrality(G)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The __Katz centrality__ computes the centrality for a node based on the centrality of its neighbors. It is a generalization of the eigenvector centrality. The Katz centrality for node $i$ is\n",
    "\n",
    "$$x_i=\\alpha \\sum_{j} A_{ij}x_j+\\beta,$$\n",
    "\n",
    "where $A$ is the adjacency matrix of graph $G$ with eigenvalues $λ.$\n",
    "\n",
    "The parameter $\\beta$ controls the initial centrality and $$\\alpha<\\frac{1}{\\lambda_{max}}.$$\n",
    "\n",
    "The __Katz centrality__ computes the relative influence of a node within a network by measuring the number of the immediate neighbors (first degree nodes) and also all other nodes in the network that connect to the node under consideration through these immediate neighbors. Extra weight can be provided to immediate neighbors through the parameter $\\beta$. Connections made with distant neighbors are, however, penalized by an attenuation factor $\\alpha$ which should be strictly less than the inverse largest eigenvalue of the adjacency matrix in order for the Katz centrality to be computed correctly.\n",
    "\n",
    "The __Katz centrality__ is a computationally-intense function and that is why it can fail when dealing with very large network data. If this is the case, the [Numpy](https://numpy.org/doc/stable/) implementation of the original function can be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nx.katz_centrality(G)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nx.katz_centrality_numpy(G)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The __betweenness centrality__ of a node $v$ is the sum of the fraction of all-pairs shortest paths that pass through $v$\n",
    "\n",
    "$$c_B(v)=\\sum_{s,t \\in V} \\frac{\\sigma(s,t|v)}{\\sigma(s,t)}$$\n",
    "\n",
    "where $V$ is the set of nodes, $\\sigma(s,t)$ is the number of shortest $(s,t)$-paths, and $\\sigma(s,t|v)$ is the number of those paths passing through some node $v$ other than $s,t$. If $s=t$, $\\sigma(s,t)=1$, and if $v \\in s,t$ $\\sigma(s,t|v)=0$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nx.betweenness_centrality(G)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Centrality distributions\n",
    "\n",
    "An efficient way of illustrating the centrality information regarding all the nodes in a network is a centrality distribution, it summarizes the most important properties across the graph. First, we have to build a sequence with all the centralities. As the function returns a dictionary, we use a list comprehension with the dictionary values i.e., `.values()`. Let's start from the __eigenvector centrality__!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eigenvec_centr = list(nx.eigenvector_centrality(G).values())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "print('Mean centrality:', np.mean(eigenvec_centr))\n",
    "print('Median centrality:', np.median(eigenvec_centr))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An useful way of graphically representing the sequences of centrality values is by a _histogram_, plotting on the $x$-axis the centrality values and on the $y$-axis the number of nodes having that centrality."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "counts, bins, patches = plt.hist(eigenvec_centr, bins=20)\n",
    "plt.xlabel(\"Centrality\")\n",
    "plt.ylabel(\"Absolute frequency\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Returning `counts` and `bins`, we can investigate further them or analyse a given bin and how many values fall into."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "counts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "bins"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## PART II"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Contents:\n",
    "1. [Python's random module](#1.-Python's-random-module)\n",
    "2. [Random network model](#2.-Random-Network-Model)\n",
    "3. [Small-World model](#3.-Small-World-Model)\n",
    "4. [Preferential attachment model](#4.-Preferential-Attachment-Model)\n",
    "\n",
    "For each of the models presented in this tutorial, we present the algorithm, an example implementation, and the corresponding NetworkX code function to generate the model. The implementations provided in this tutorial are written for clarity, not for efficiency; it's best to use the NetworkX functions when doing real analysis work."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Python's `random` module\n",
    "\n",
    "Many network models rely on randomness in their generative algorithms. Python's [random module](https://docs.python.org/3.7/library/random.html) provides four key functions of use when coding network models.\n",
    "\n",
    "### `random.random`\n",
    "\n",
    "Often in an algorithm, we need something to happen with some probability $p$. The canonical way to decide whether or not such an event happens is to generate a random number $r$ between 0 and 1, and if $r < p$, then the event occurs. The [random.random()](https://docs.python.org/3.7/library/random.html#random.random) function returns just such a random number in the interval [0, 1).\n",
    "\n",
    "For a simple example, consider an unfair coin that comes up heads 75% of the time. We can write the following code to flip such a coin 10 times, reporting the outcome each time:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "p = 0.75\n",
    "\n",
    "# Do this 10 times\n",
    "for _ in range(10):\n",
    "    r = random.random()\n",
    "    if r < p:\n",
    "        print('Heads')\n",
    "    else:\n",
    "        print('Tails')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we would expect from a random process, executing the previous cell again will generate a different sequence of flips -- each one is independently generated.\n",
    "\n",
    "###  `random.choice`\n",
    "\n",
    "When we have a population of discrete choices and we need to select one at random, we use [random.choice()](https://docs.python.org/3.7/library/random.html#random.choice). For example, instead of \"[eeny, meeny, miny, moe](https://en.wikipedia.org/wiki/Eeny,_meeny,_miny,_moe),\" we can use random.choice to choose a random name:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "names = ['Alice', 'Bob', 'Cathy', 'Dan']\n",
    "random.choice(names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `random.sample`\n",
    "\n",
    "If we have a collection and we need to select more than one element without replacement, we use [random.sample()](https://docs.python.org/3.7/library/random.html#random.sample). For example, to choose two nodes at random from the nodes in a graph, we can use the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "G = nx.cycle_graph(5)\n",
    "random.sample(G.nodes, 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `random.choices`\n",
    "\n",
    "We use [random.choices()](https://docs.python.org/3.7/library/random.html#random.choices) when we need to choose an element from a collection when the chances of selecting each element are not identical.\n",
    "\n",
    "For an example, let's assume Alice, Bob, and Carol are in a raffle drawing. Alice bought one ticket, Bob bought three tickets, and Carol bought four tickets. We can simulate ten different draws of this raffle, replacing the drawn ticket each time, with the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "names = ['Alice', 'Bob', 'Carol']\n",
    "tickets = [1, 3, 4]\n",
    "\n",
    "for _ in range(10):\n",
    "    print(random.choices(names, tickets))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Running the above cell should give what we expect: Carol wins the drawing most often, with Bob winning some times, and Alice winning occasionally. Of course this outcome depends on the luck of the draw!\n",
    "\n",
    "By specifying the keyword argument `k=`, we can choose *k* items from the collection *with replacement*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "random.choices(names, tickets, k=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The weights provided to `random.choices` do not have to be integers -- any numeric weights are fine."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Random Network Model\n",
    "\n",
    "The random network model, as formulated by Gilbert, has two parameters: the number of nodes $N$, and the link probability $p$. As in the book text, the algorithm for creating this network is as follows:\n",
    "\n",
    "1. Select a pair of nodes, say i and j.\n",
    "2. Generate a random number r between 0 and 1. If r < p, then add a link between i and j.\n",
    "3. Repeat (1) and (2) for all pairs of nodes.\n",
    "\n",
    "We'll need a couple of tools from Python for this task:\n",
    "\n",
    "### Generating combinations\n",
    "\n",
    "We've previously looped over all nodes in a graph, as well as all graph edges, but this algorithm requires us to loop over all *pairs* of nodes, *i.e.* all possible edges. The `itertools` module in Python's standard library gives us the [combinations()](https://docs.python.org/3.7/library/itertools.html#itertools.combinations) function, an elegant way to loop over pairs of elements in a sequence:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "elements = [0, 1, 2, 3, 4]\n",
    "list(itertools.combinations(elements, 2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The second argument to `itertools.combinations()` is the length of the sequences we want in the output. Since we want pairs, we'll specify 2. Note the nice properties of the output:\n",
    "\n",
    "1. No repeat elements -- we don't want to consider self-loops like `('a', 'a')`.\n",
    "2. Pairs are in sorted order -- `('a', 'b')` and `('b', 'a')` are the same edge in an undirected graph.\n",
    "\n",
    "We can thus use this to loop over all pairs of nodes in a graph:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "G = nx.Graph()\n",
    "G.add_nodes_from(elements)\n",
    "\n",
    "list(itertools.combinations(G.nodes, 2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Gilbert random graph model\n",
    "\n",
    "With these tools in our toolbelt, we can code the algorithm for the Gilbert random graph model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def gnp_random_graph(N, p):\n",
    "    G = nx.Graph()\n",
    "    G.add_nodes_from(range(N))\n",
    "    \n",
    "    for i, j in itertools.combinations(G.nodes, 2):\n",
    "        r = random.random()\n",
    "        if r < p:\n",
    "            G.add_edge(i, j)\n",
    "        # Do nothing if r >= p\n",
    "        \n",
    "    return G"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use this function to generate a graph. Since this is a random graph, each execution of the following code will generate a different graph. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "G = gnp_random_graph(16, 0.15)\n",
    "nx.draw(G)\n",
    "print('Graph has', G.number_of_edges(), 'edges.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the above cell a few times and note that the number of edges varies slightly among random graphs generated with the same parameters. This is because each pair of nodes has an independent chance of being an edge."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Erdös-Rényi random graph model\n",
    "\n",
    "Instead of specifying the link probability $p$, we can also generate a graph of $N$ nodes with exactly $M$ edges by using `random.sample()` to choose M of the possible edges generated with `itertools.combinations()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "def gnm_random_graph(N, M):\n",
    "    G = nx.Graph()\n",
    "    G.add_nodes_from(range(N))\n",
    "    \n",
    "    possible_edges = itertools.combinations(G.nodes, 2)\n",
    "    edges_to_add = random.sample(list(possible_edges), M)\n",
    "    G.add_edges_from(edges_to_add)\n",
    "    \n",
    "    return G"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "G = gnm_random_graph(16, 18)\n",
    "nx.draw(G)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### NetworkX functions\n",
    "\n",
    "NetworkX has a function for the $G_{n,p}$ random graph specifying number of nodes $N$ and link probability $p$:  [gnp_random_graph()](https://networkx.github.io/documentation/stable/reference/generated/networkx.generators.random_graphs.gnp_random_graph.html).\n",
    "\n",
    "In addition, NetworkX provides [gnm_random_graph()](https://networkx.github.io/documentation/stable/reference/generated/networkx.generators.random_graphs.gnm_random_graph.html), which generates a $G_{n,m}$ graph, where we specify the number of nodes $N$ and the desired number of edges $M$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Small-World Model\n",
    "\n",
    "The algorithm for generating a small-world network is as such:\n",
    "\n",
    "1. Begin with a ring of $N$ nodes\n",
    "2. Connect each node to its $k$ nearest neighbors (or $k-1$ if k is odd).\n",
    "3. For each edge $(u, v)$, with probability $p$, replace edge $(u, v)$ with $(u, w)$ where $w$ is not a neighbor of $u$.\n",
    "\n",
    "We'll do these step-by-step first, and combine them into a function last.\n",
    "\n",
    "### Create a ring of N nodes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = 10\n",
    "G = nx.cycle_graph(N)\n",
    "nx.draw_circular(G, with_labels=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Connect each node to its $k$ nearest neighbors\n",
    "\n",
    "We'll use modular arithmetic in order to do this. As an example of why, let's say $k$ is 4. So for node $n$, we want to add edges to $n$'s 4 nearest neighbors: $n-1, n+1, n-2,$ and $n+2.$ Since our nodes are in a circle, these can \"wrap around\", *e.g.* the 4 nearest neighbors of node 0 are N-1, 1, N-2, and 2.\n",
    "\n",
    "Note the use of integer division (//) below. Integer division throws away the fractional part of division, e.g.\n",
    "\n",
    "    5 // 2 = 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "k = 4\n",
    "\n",
    "for n in G.nodes:\n",
    "    for i in range(1, k // 2 + 1):\n",
    "        left  = (n-i) % N\n",
    "        right = (n+i) % N \n",
    "        G.add_edge(n, left)\n",
    "        G.add_edge(n, right)\n",
    "\n",
    "nx.draw_circular(G, with_labels=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Rewire some edges\n",
    "\n",
    "> For each edge $(u, v)$, with probability $p$, replace edge $(u, v)$ with (u, w) where $w$ is not a neighbor of $u$.\n",
    "\n",
    "For this step, we make use of set arithmetic in order to generate a list of nodes that are not neighbors of $u$, and `random.choice` to select $w$ at random from that set of \"not neighbors.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "p = 0.1\n",
    "\n",
    "for u, v in list(G.edges):\n",
    "    if random.random() < p:\n",
    "        not_neighbors = set(G.nodes) - set(G.neighbors(u))\n",
    "        w = random.choice(list(not_neighbors))\n",
    "        G.remove_edge(u, v)\n",
    "        G.add_edge(u, w)\n",
    "\n",
    "nx.draw_circular(G, with_labels=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Watts-Strogatz Small-World Model\n",
    "\n",
    "We can put this together to write a basic function for the small-world model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def watts_strogatz_graph(N, k, p):\n",
    "    # 1. Create a ring of N nodes\n",
    "    G = nx.cycle_graph(N)\n",
    "\n",
    "    # 2. Connect each node n to k nearest neighbors\n",
    "    #    [n-(k//2), ... , n-1, n+1, ... , n+(k//2)]\n",
    "    for n in G.nodes:\n",
    "        for i in range(1, k // 2 + 1):\n",
    "            left  = (n-i) % N\n",
    "            right = (n+i) % N \n",
    "            G.add_edge(n, left)\n",
    "            G.add_edge(n, right)\n",
    "    \n",
    "    # 3. Rewire edges with probability p\n",
    "    for u, v in list(G.edges):\n",
    "        if random.random() < p:\n",
    "            not_neighbors = set(G.nodes) - set(G.neighbors(u)) - {u}\n",
    "            w = random.choice(list(not_neighbors))\n",
    "            G.remove_edge(u, v)\n",
    "            G.add_edge(u, w)\n",
    "\n",
    "    return G"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "G = watts_strogatz_graph(16, 4, 0.2)\n",
    "nx.draw_circular(G, with_labels=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### NetworkX function\n",
    "\n",
    "NetworkX has a function for this model: [watts_strogatz_graph()](https://networkx.github.io/documentation/stable/reference/generated/networkx.generators.random_graphs.watts_strogatz_graph.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Preferential Attachment Model\n",
    "\n",
    "The Barabási-Albert preferential attachment model has two parameters: the number of nodes $N$, and the number of links added at each step $m$. Given these parameters, the algorithm is as follows:\n",
    "\n",
    "1. Start with a clique of $m + 1$ nodes.\n",
    "2. Select $m$ different nodes at random, weighted by their degree.\n",
    "3. Add a new node $i$ and link it with the $m$ nodes from the previous step.\n",
    "4. Repeat 2-3 until there are N nodes in the graph.\n",
    "\n",
    "The code for this is thus straightforward, with one possible exception: for step 3, we need to generate a *degree sequence* to weight the random selection. If we have a graph `G`, we can generate such a degree sequence with a list comprehension as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "G = nx.star_graph(4)\n",
    "degrees = [G.degree(n) for n in G.nodes]\n",
    "\n",
    "print(degrees)\n",
    "nx.draw(G, with_labels=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the degree sequence is output in the same order as the nodes, such that the node at index $i$ in that list has the degree at index $i$ of the corresponding degree sequence. With this, we can write a function for the BA preferential attachment model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def barabasi_albert_graph(N, m):\n",
    "    # 1. Start with a clique of m+1 nodes\n",
    "    G = nx.complete_graph(m + 1)\n",
    "    for i in range(G.number_of_nodes(), N):\n",
    "        # 2. Select m different nodes at random, weighted by their degree.\n",
    "        new_neighbors = []\n",
    "        possible_neighbors = list(G.nodes)\n",
    "        for _ in range(m):\n",
    "            degrees = [G.degree(n) for n in possible_neighbors]\n",
    "            j = random.choices(possible_neighbors, degrees)[0]\n",
    "            new_neighbors.append(j)\n",
    "            possible_neighbors.remove(j)\n",
    "        \n",
    "        # 3. Add a new node i and link it with the m nodes from the previous step.\n",
    "        for j in new_neighbors:\n",
    "            G.add_edge(i, j)\n",
    "\n",
    "    return G"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PART III"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This part of the tutorial demonstrates how to create a unipartite projection from a bipartite graph and apply backboning techniques using Python libraries. We will use the network_map2 library for projections and the backboning library for edge filtering."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prerequisites"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You will need to download and install the required libraries manually:\n",
    "\n",
    "Download network_map2 from https://www.michelecoscia.com/?page_id=734.\n",
    "\n",
    "Download backboning from https://www.michelecoscia.com/?page_id=287.\n",
    "\n",
    "After downloading, make sure the libraries are accessible in your Python environment.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Imoport Packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import networkx as nx\n",
    "import network_map2 as nm2\n",
    "import backboning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Imporrt Bipartite Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the edge list and create a bipartite graph\n",
    "B = nx.from_pandas_edgelist(pd.read_csv(\"company_investment_edges.csv\"), 'Company', 'Investment', ['Weight'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a Unipartite Projection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a unipartite projection using cosine similarity\n",
    "company_projection = nm2.cosine(B, list(nx.algorithms.bipartite.sets(B)[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\textbf{Alternative Projection Methods}$\n",
    "\n",
    "In addition to cosine similarity, the network_map2 library provides several other projection methods you can explore, such as Jaccard similarity, overlap, and more. These methods may be better suited for specific datasets or analysis goals.\n",
    "\n",
    "For a comprehensive overview of the available projection methods and their applications, please refer to Michele Coscia's Projection Website."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert Projection to DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert the projection to a DataFrame and save as a text file\n",
    "projection_df = pd.DataFrame(company_projection.edges(data=True), columns=['src', 'trg', 'Weight'])\n",
    "projection_df['Weight'] = projection_df['Weight'].apply(lambda x: x['weight'])\n",
    "projection_df.to_csv('company_projection.txt', sep='\\t', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Apply Backboning -- Reduce the Number of Edges"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Backboning is a technique used to filter and enhance the structure of complex networks by identifying and retaining the most significant connections while removing weaker or less important edges. This process helps to reveal the core relationships within a network, allowing for a clearer understanding of its underlying structure. By applying various methods, such as disparity filtering, backboning emphasizes strong interactions and eliminates noise, making the resulting network easier to analyze and interpret. This is particularly useful in various fields, including social network analysis, biology, and finance, where understanding the critical connections can provide valuable insights into system dynamics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of nodes before backboning: 100\n",
      "Number of edges before backboning: 4771\n"
     ]
    }
   ],
   "source": [
    "print(f\"Number of nodes before backboning: {company_projection.number_of_nodes()}\")\n",
    "print(f\"Number of edges before backboning: {company_projection.number_of_edges()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Calculating DF score...\n"
     ]
    }
   ],
   "source": [
    "# Read edges and apply backboning\n",
    "table, _, _ = backboning.read(\"company_projection.txt\", column_of_interest='Weight', undirected=True)\n",
    "bb_df = backboning.thresholding(backboning.disparity_filter(table), 0.65)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The backboning library offers several methods for edge filtering, allowing you to choose the approach that best fits your data and research objectives. In this example, we use disparity filtering, but other methods may also be applicable.\n",
    "\n",
    "Additionally, the threshold value (0.65 in this case) can be adjusted based on your specific analysis needs. A higher threshold will result in a sparser backbone, while a lower threshold may retain more edges, potentially capturing more relationships but also including more noise.\n",
    "\n",
    "For a complete overview of the available backboning methods and options for threshold settings, please refer to Michele Coscia's Backboning Page."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save Backboned DataFrame and Create NetworkX Graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save backboned DataFrame and convert to NetworkX graph\n",
    "bb_df.to_csv(\"backboned_dataframe_companies.csv\", index=False)\n",
    "backbone_graph = nx.from_pandas_edgelist(bb_df, source='src', target='trg', edge_attr=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save and Summarize the Backboned Graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of nodes after backboning: 100\n",
      "Number of edges after backboning: 2197\n"
     ]
    }
   ],
   "source": [
    "# Save and summarize the backboned graph\n",
    "nx.write_edgelist(backbone_graph, \"backboned_network_companies.edgelist\")\n",
    "print(f\"Number of nodes after backboning: {backbone_graph.number_of_nodes()}\")\n",
    "print(f\"Number of edges after backboning: {backbone_graph.number_of_edges()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Experiment with different approaches"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You are encouraged to experiment with different backboning methods and threshold values to examine how many edges are retained in the backbone. Observing these changes can provide valuable insights into the nature of the relationships in your network."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# EXERCISE 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the weekly Bitcoin transaction dataset, and plot the degree distributions (i.e., in-degrees, out-degrees, total degrees).\n",
    "\n",
    "HINT: use the function <a href="https://opencourse.inf.ed.ac.uk/sites/default/files/https/opencourse.inf.ed.ac.uk/dbba/2024/\&quot;https://networkx.org/documentation/stable/reference/readwrite/generated/networkx.readwrite.graphml.read_graphml.html#networkx.readwrite.graphml.read_graphml\"">nx.read_graphml()</a> to open the network. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# EXERCISE 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find the 5 largest hubs based on their in-degree, out-degree, and total degree. Do these three lists overlap? Discuss the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# EXERCISE 3 (Optional)\n",
    "\n",
    "Compute the centrality values of the nodes in the bitcoin network, using betweenness and closeness. Do you see any differences between the two distributions?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}