Financial Networks (FINET)¶

Tutorial 7: Applications of Financial Networks¶

In this tutorial, we will analyse real-world expenditure data from a UK-based online retail store. The dataset records all transactions made between 01/12/2010 and 09/12/2011 for a UK-registered online retailer. Our focus is on grouping customers with similar spending behaviour so that the retailer can design more efficient mass marketing strategies.

We assume that the retailer operates with a relatively small marketing department, where you are responsible for developing customer newsletters. The aim of these newsletters is to highlight news and products that align with customers’ past purchasing behaviour. However, given the large customer base, it is impractical to create individualised newsletters for everyone. At the same time, the retailer does not want to send generic, one-size-fits-all campaigns.

Your task, therefore, is to design an algorithm that groups customers with similar spending categories together. This way, the retailer can produce tailored newsletters for each group, striking a balance between efficiency and personalisation in customer engagement.

The dataset was originally analysed in the following paper:

Chen, Daqing, Sai Laing Sain, and Kun Guo. "Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining." Journal of Database Marketing & Customer Strategy Management 19.3 (2012): 197-208.

Please have a look at the paper to familiarise yourself with the data and the methods they applied to analyse it.

In the original study, the authors applied k-means clustering and decision tree induction to identify the main characteristics of consumers in each segment. In this tutorial, we will build on that analysis with the tools from this course, using network-based approaches to gain additional insights into customer behaviour, product relationships, and the influence of geography on purchasing patterns.

In [1]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

PART I: Understanding the Data¶

Before we perform any network analysis, it is important to get a basic understanding of the dataset. In this section, we will perform some common data description and inspection steps to explore the structure, variables, and types of values contained in the dataset.

In [2]:
data = pd.read_csv("./Online_Retail.csv")
print(data.shape)
data.head()
(541909, 8)
Out[2]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/10 08:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/10 08:26 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/10 08:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/10 08:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/10 08:26 3.39 17850.0 United Kingdom

Column descriptions¶

Variable name Data type Description
Invoice Nominal Invoice number; a 6-digit integral number uniquely assigned to each transaction
StockCode Nominal Product (item) code; an integral number uniquely assigned to each distinct product
Description Nominal Product (item) name
Quantity Numeric The quantities of each product (item) per transaction
UnitPrice Numeric Product price per unit in sterling
InvoiceDate DateTime The day and time when each transaction was generated
CustomerID Nominal Unique code to identify each customer
Country Nominal The delivery address country of the customer

We can use this initial inspection to identify missing values, unusual entries, or inconsistencies, which will inform the cleaning steps.

Cleaning the Data¶

Before constructing networks, we need to clean the dataset. First, we will remove all rows where Quantity or UnitPrice is less than or equal to zero, as these typically correspond to refunds or corrections, which are not of interest in this analysis.

Next, we will drop any rows where CustomerID, StockCode, or Country are missing (NaN). These variables are essential for our network construction, as they define the nodes and edges in the bipartite network.

Note that, in some studies, missing values in this dataset have been imputed, but given the large size of the dataset, removing these rows will not significantly affect our analysis and is sufficient for our purposes.

In [3]:
# Remove rows with Quantity or UnitPrice <= 0
df_clean = data[(data['Quantity'] > 0) & (data['UnitPrice'] > 0)]

# Drop rows with missing CustomerID, StockCode, or Country
df_clean = df_clean.dropna(subset=['CustomerID', 'StockCode', 'Country'])
df_clean = df_clean.reset_index(drop=True)

# Summary of cleaned dataset
print("Number of rows after cleaning:", len(df_clean))
print(df_clean.head())
Number of rows after cleaning: 397884
  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

     InvoiceDate  UnitPrice  CustomerID         Country  
0  12/1/10 08:26       2.55     17850.0  United Kingdom  
1  12/1/10 08:26       3.39     17850.0  United Kingdom  
2  12/1/10 08:26       2.75     17850.0  United Kingdom  
3  12/1/10 08:26       3.39     17850.0  United Kingdom  
4  12/1/10 08:26       3.39     17850.0  United Kingdom  

Top Products by Customer Share¶

We will now look at the top 10 products in terms of the number of unique customers who purchased them and calculate the percentage of customers that bought each of these products.

In [4]:
# select unique customers and count products
unique_customer_products = df_clean[['CustomerID', 'Description']].drop_duplicates()
product_counts = unique_customer_products['Description'].value_counts()

# top 10 products
top10_products = product_counts.head(10)

# calculate percentage of customers who bought each product
total_customers = df_clean['CustomerID'].nunique()
top10_percent = (top10_products / total_customers) * 100

# results
top10_df = pd.DataFrame({
    'Product': top10_products.index,
    'Number of Customers': top10_products.values,
    'Percentage of Customers': top10_percent.values
})

print(top10_df)
                              Product  Number of Customers  \
0            REGENCY CAKESTAND 3 TIER                  881   
1  WHITE HANGING HEART T-LIGHT HOLDER                  856   
2                       PARTY BUNTING                  708   
3       ASSORTED COLOUR BIRD ORNAMENT                  678   
4   SET OF 3 CAKE TINS PANTRY DESIGN                   640   
5     PACK OF 72 RETROSPOT CAKE CASES                  635   
6             JUMBO BAG RED RETROSPOT                  635   
7     PAPER CHAIN KIT 50'S CHRISTMAS                   613   
8     NATURAL SLATE HEART CHALKBOARD                   587   
9       BAKING SET 9 PIECE RETROSPOT                   581   

   Percentage of Customers  
0                20.308898  
1                19.732596  
2                16.320885  
3                15.629322  
4                14.753343  
5                14.638082  
6                14.638082  
7                14.130936  
8                13.531581  
9                13.393269  

Top Products by Total Revenue¶

We now look at the products that generated the highest overall revenue, calculated as Quantity × UnitPrice.

In [5]:
# Calculate revenue per product
df_clean['Revenue'] = df_clean['Quantity'] * df_clean['UnitPrice']
product_revenue = df_clean.groupby('Description')['Revenue'].sum().sort_values(ascending=False)

# Top 10 products by revenue
top10_revenue = product_revenue.head(10)
print(top10_revenue)
Description
PAPER CRAFT , LITTLE BIRDIE           168469.60
REGENCY CAKESTAND 3 TIER              142592.95
WHITE HANGING HEART T-LIGHT HOLDER    100448.15
JUMBO BAG RED RETROSPOT                85220.78
MEDIUM CERAMIC TOP STORAGE JAR         81416.73
POSTAGE                                77803.96
PARTY BUNTING                          68844.33
ASSORTED COLOUR BIRD ORNAMENT          56580.34
Manual                                 53779.93
RABBIT NIGHT LIGHT                     51346.20
Name: Revenue, dtype: float64

Product Diversity per Customer¶

We also examine the average number of unique products purchased per customer, giving us insight into how diverse customer baskets are.

In [6]:
customer_diversity = df_clean.groupby('CustomerID')['Description'].nunique()
print(customer_diversity.describe())

# top 10 most diverse customers
print(customer_diversity.sort_values(ascending=False).head(10))
count    4338.000000
mean       61.845320
std        86.223641
min         1.000000
25%        16.000000
50%        35.500000
75%        78.000000
max      1816.000000
Name: Description, dtype: float64
CustomerID
14911.0    1816
12748.0    1778
17841.0    1345
14096.0    1129
14298.0     891
14606.0     826
14156.0     730
14769.0     724
14646.0     718
13089.0     662
Name: Description, dtype: int64

These three perspectives provide very different insights into product and customer behaviour. The Top Products by Customer Share highlights items that are most broadly appealing across the customer base, such as the Regency Cakestand 3 Tier and White Hanging Heart T-Light Holder, which were purchased by more than 800 customers each. These items represent products with wide market penetration, making them strong candidates for inclusion in broad-based marketing materials. In contrast, the Top Products by Total Revenue points to products that generate the most income overall. Here, the leaders include Paper Craft, Little Birdie and Regency Cakestand 3 Tier, which contributed disproportionately high revenue despite not necessarily being bought by the most customers. This suggests that revenue-leading products may rely more on high-value purchases or bulk orders, rather than wide customer uptake.

The Product Diversity per Customer metric adds yet another dimension by showing how varied individual customers’ baskets are. While the average customer purchased around 62 unique products, some bought well over 1,000, indicating a small group of highly engaged, diverse shoppers. Together, these three perspectives demonstrate that customer behaviour cannot be fully understood by looking at only one measure. Other useful metrics could include seasonality of purchases, repeat buying rates for products, or the longevity of customer engagement over time. Exploring these dimensions is important because a richer understanding of the dataset helps avoid misleading conclusions and helps us design subsequent analyses, such as community detection or marketing segmentation, grounded in a realistic view of customer and product dynamics.

Other Characteristics of the Dataset¶

This is a rich dataset that offers many possibilities for analysis. Beyond the number of customers and transactions, we can explore product characteristics, unit prices, and the quantities purchased. We can also study temporal patterns, such as the time of day or seasonality of purchases, as well as time series of purchase patterns for specific products. In this tutorial, we will focus on customers and the amount they spend on products. However, there is considerable scope to extend the analysis to other aspects of the dataset for more insights.

PART II: Constructing a Suitable Network and Identifying Communities of Common Spending¶

Construct the network¶

In this exercise, we will explore customer spending behaviour by constructing a bipartite network. On one side of the network, we will have all unique customers in the data, and on the other side, all the unique products they have purchased. A link will connect a customer to a product if they have bought it, and the weight of the link will be the total amount spent on that product (calculated as Quantity × UnitPrice). In the space below, you need to construct this network. You can use resources from the previous tutorials, where we have created bipartite graphs and assiged edge weights based on data.

In [7]:
customers = df_clean['CustomerID'].unique()
products = df_clean['StockCode'].unique()

B = nx.Graph()

# Add customer nodes (bipartite=0)
B.add_nodes_from(customers, bipartite=0)

# Add product nodes (bipartite=1)
B.add_nodes_from(products, bipartite=1)

# Add edges with weights = Quantity * UnitPrice
edges = [(row['CustomerID'], row['StockCode'], row['Quantity'] * row['UnitPrice']) 
         for idx, row in df_clean.iterrows()]

B.add_weighted_edges_from(edges)

print("Number of customer nodes:", len(customers))
print("Number of product nodes:", len(products))
print("Number of edges:", B.number_of_edges())
Number of customer nodes: 4338
Number of product nodes: 3665
Number of edges: 266792

Project the network¶

Since our main focus is on understanding consumer behaviour, we need to create a network projection onto the customer side. This projection will connect customers who have purchased the same products, with edge weights reflecting the strength of their shared purchasing behaviour. You should choose a suitable projection algorithm, as introduced in Tutorial 6, and complete the projection to obtain a unipartite network of customers.

In [8]:
import network_map2 as nm2

Gp_cosine = nm2.cosine(B, customers)

# Save projection edge list for future use
edges = nx.to_pandas_edgelist(Gp_cosine)
edges = edges.rename(columns={'source': 'src', 'target': 'trg'})
edges.to_csv("customer_cosine_projection_edges.csv", index=False)

Check basic network statistics¶

Before proceeding with further analysis, it is important to examine some key network statistics for the projected customer network. This will help you decide whether the network contains too much noise or if it is suitable for downstream analysis. Consider the size of the network (number of nodes and edges) and density of the resulting projection. In this step, you should compute these key metrics and create a summary of the distributins of edge weights. These statistics will give you an initial sense of the network structure and help inform whether any filtering might be necessary.

In [9]:
# Load edge list (or use the data from before)
edges = pd.read_csv("customer_cosine_projection_edges.csv")

# Build NetworkX graph
Gp_cosine = nx.from_pandas_edgelist(
    edges,
    source='src',
    target='trg',
    edge_attr='weight'  # use 'weight' from CSV
)

num_nodes = Gp_cosine.number_of_nodes()
num_edges = Gp_cosine.number_of_edges()
density = nx.density(Gp_cosine)

# Edge weights
weights = [d['weight'] for _, _, d in Gp_cosine.edges(data=True)]
min_weight = min(weights)
max_weight = max(weights)
mean_weight = sum(weights) / len(weights)

# Print metrics
print(f"Number of nodes: {num_nodes}")
print(f"Number of edges: {num_edges}")
print(f"Density: {density:.4f}")
print(f"Edge weight - min: {min_weight}, max: {max_weight}, mean: {mean_weight:.2f}")

# distribution of edge weights
plt.figure(figsize=(8, 5))
plt.hist(weights, bins=50, color='skyblue', edgecolor='black')
plt.xlabel("Edge Weight (Total Spending)")
plt.ylabel("Frequency")
plt.title("Distribution of Edge Weights in Customer Projection")
plt.yscale('log')  # optional for skewed distribution
plt.show()
Number of nodes: 4338
Number of edges: 5048234
Density: 0.5366
Edge weight - min: 6.4596423765550526e-09, max: 1.0, mean: 0.06
No description has been provided for this image

The projected customer network is very dense, with 4,338 nodes and over 5 million edges, giving a density of 0.54. The edge weights vary widely, from extremely small values around 6.46×10⁻⁹ up to 1.0, with a mean of 0.06. This indicates that many of the connections are very weak and likely reflect noise rather than meaningful co-purchasing behaviour. Because of this high density and prevalence of weak links, directly applying a community detection algorithm could produce misleading results or obscure significant patterns. To address this, we decide to filter the network using the Noise-Corrected backboning method, which preserves the statistically significant edges while removing weaker, less informative connections.

Decide whether to filter¶

Based on your analysis of the key network metrics, you should decide whether to filter the projected network before performing further analysis. This decision depends on whether you believe the network contains noise, such as many weak or statistically insignificant links, that could obscure meaningful patterns or hinder community detection. If the network appears dense or contains a large number of low-weight edges, filtering may be necessary to highlight the most relevant relationships between customers. If you decide to filter the network, choose an appropriate algorithm to retain only the most significant connections while preserving the overall network structure.

In [10]:
import backboning

# Read in the table from Gp_cosine
table, nnodes, nnedges = backboning.read("customer_cosine_projection_edges.csv", "weight", sep=",")

# Apply Noise-Corrected Backboning
nc_table = backboning.noise_corrected(table, undirected = True)

# Apply thresholding
threshold_value = 0.2 #we can define different threshold values
nc_backbone = backboning.thresholding(nc_table, threshold_value) 

G_backbone = nx.from_pandas_edgelist(
    nc_backbone,
    source='src',
    target='trg',
    edge_attr='nij'
)

# Write the backbone to file
backboning.write(nc_backbone, "cosine_projection", "nc", ".")
Calculating NC score...
In [11]:
import matplotlib.pyplot as plt
num_nodes = G_backbone.number_of_nodes()
num_edges = G_backbone.number_of_edges()
density = nx.density(G_backbone)

# Get edge weights
weights = [d['nij'] for u, v, d in G_backbone.edges(data=True)]
min_weight = min(weights)
max_weight = max(weights)
mean_weight = sum(weights) / len(weights)

# Print metrics
print(f"Number of nodes: {num_nodes}")
print(f"Number of edges: {num_edges}")
print(f"Density: {density:.4f}")
print(f"Edge weight - min: {min_weight}, max: {max_weight}, mean: {mean_weight:.2f}")

# --- Plot distribution of edge weights ---
plt.figure(figsize=(8, 5))
plt.hist(weights, bins=50, color='skyblue', edgecolor='black')
plt.xlabel("Edge Weight (Total Spending)")
plt.ylabel("Frequency")
plt.title("Distribution of Edge Weights in Customer Projection")
plt.yscale('log')  # useful for skewed distributions
plt.show()
Number of nodes: 4337
Number of edges: 884497
Density: 0.0941
Edge weight - min: 3.799947783322821e-06, max: 1.0, mean: 0.10
No description has been provided for this image

After applying noise-corrected backboning, the number of nodes only decreased slightly, from 4338 to 4337, meaning we lost just 1 node. In contrast, the number of edges dropped dramatically from 5,048,234 to 884,497. This corresponds to a reduction of approximately 85% in the number of edges.

Similarly, the network density decreased from 0.5366 to 0.0941, reflecting a much sparser graph where only the most significant connections remain.

The distribution of edge weights has also changed noticeably. Before filtering, most edges had very small weights, resulting in a highly skewed distribution with a mean of 0.06. After backboning, the mean weight increased to 0.1, and the weaker, less significant edges were removed.

For brevity, we do not perform a full analysis of how different threshold values affect the backboned network structure, though it is generally recommended to do so, as otherwise the choice of threshold can be somewhat arbitrary. Here, we selected a threshold that leads to only a minor reduction in nodes while removing the majority of weak, noisy edges, preserving the most significant connections for subsequent community analysis.

Community Detection¶

At this stage, choose a community detection algorithm of your choice to identify groups of customers with similar spending patterns in the network, taking the edge weights into account. Keep in mind the size of the network and the computational time required, as some algorithms can be very slow for large networks. The goal is to partition the customers into communities that reflect meaningful patterns in their spending behaviour.

In [18]:
from collections import Counter

# Using Louvain to detect communities with edge weights
communities = nx.community.louvain_communities(G_backbone, weight='nij')

# Number of communities
num_communities = len(communities)
print("Number of detected communities:", num_communities)

# Count number of nodes in each community
community_sizes = [len(c) for c in communities]
print("Number of nodes per community:", community_sizes)

partition_dict = {}
for i, comm in enumerate(partition):  # i is community number, comm is set of nodes
    for node in comm:
        partition_dict[node] = i
Number of detected communities: 8
Number of nodes per community: [821, 1139, 369, 332, 409, 31, 1206, 30]

PART III: Analysing Results and Drawing Insights¶

Now that we have constructed the customer network, decided whether to apply filtering, and detected communities of customers with similar spending patterns, we can begin analysing the results. At this stage, the goal is to interpret what these communities represent and understand the distribution of spending across different products. By examining which products are most popular within each community, we can identify groups of customers with similar interests and purchasing habits. This information can then be used to design targeted marketing strategies, such as tailored newsletters, allowing the retailer to efficiently personalise communications without having to create individual messages for every customer. Through this approach, we can deliver relevant product updates and promotions to customer groups, maximising engagement and marketing impact.

Commonly purchased products in communities¶

Refer back to the original dataset and create a function that takes a community number as input. This function should filter the transactions to include only the customers in the specified community, then produce two histograms: one showing the top 10 products purchased by the community expressed as a percentage of customers who bought each product, and another showing the top 10 products by total revenue generated within the community. These visualisations will allow you to explore both the popularity of products and the revenue contributions of different items for each detected community.

In [19]:
def plot_community_insights(community_number):
    community_customers = [cust for cust, comm in partition_dict.items() if comm == community_number]
    
    # Filter cleaned data for these customers
    community_data = df_clean[df_clean['CustomerID'].isin(community_customers)]
    total_customers = len(set(community_data['CustomerID']))
    
    # histogram of top 10 products by number of customers who bought 
    product_customer_counts = community_data.groupby('Description')['CustomerID'].nunique()
    product_customer_percent = 100 * product_customer_counts / total_customers
    top10_products_customers = product_customer_percent.sort_values(ascending=False).head(10)
    
    plt.figure(figsize=(12,5))
    plt.bar(top10_products_customers.index, top10_products_customers.values, color='skyblue')
    plt.xticks(rotation=45, ha='right')
    plt.ylabel("Percentage of Customers in Community")
    plt.title(f"Top 10 Products Bought by Customers - Community {community_number}")
    plt.show()
    
    # histogram of top 10 products by revenue
    community_data['Revenue'] = community_data['Quantity'] * community_data['UnitPrice']
    product_revenue = community_data.groupby('Description')['Revenue'].sum()
    top10_products_revenue = product_revenue.sort_values(ascending=False).head(10)
    
    plt.figure(figsize=(12,5))
    plt.bar(top10_products_revenue.index, top10_products_revenue.values, color='coral')
    plt.xticks(rotation=45, ha='right')
    plt.ylabel("Total Revenue (£)")
    plt.title(f"Top 10 Products by Revenue - Community {community_number}")
    plt.show()


# Choose diff comm here:
plot_community_insights(3)  # Change 0 to any other comm number
No description has been provided for this image
/var/folders/jt/85zx9l2d6bx7j4s0g26q8t4c0000gn/T/ipykernel_39643/2673238730.py:21: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  community_data['Revenue'] = community_data['Quantity'] * community_data['UnitPrice']
No description has been provided for this image

The results you obtain may differ depending on the choice of projection algorithm, whether you apply filtering or not, the parameters used for filtering, and the community detection algorithm you select. The following questions are therefore somewhat open-ended, aiming to capture insights across different scenarios. There is no single "correct" solution we are looking for, as different choices and approaches may lead to valid but different results.

Questions¶

  1. Which products are most commonly purchased by the largest community, and how does this compare to the products generating the highest revenue?
  2. Are there any communities where the spending patterns are highly concentrated on a few products versus more evenly spread across many products?
  3. How do the sizes of the communities (number of customers) relate to the total revenue contributed by each community?
  4. Do you observe any unexpected patterns, such as smaller communities generating disproportionately high revenue, or products generating disproportinate revenue within one community? What might explain this?
  5. How might the results change if a different projection algorithm or filtering threshold had been used, and what does this tell you about the robustness of your findings?

Optional Additional Analysis (If Time Permits)¶

If you have extra time, consider exploring additional metrics from the dataset that could inform the retailer’s marketing strategy. Some ideas include the country of the customer and how it might influence behaviour, seasonality of purchases, timing between purchases to better schedule newsletters, or patterns in product categories purchased together. We do not provide specific solutions for these, as there are many possible avenues to explore and we want you to think creatively about how to extract insights from the data. More specific ideas:

  • The country of the customer and how it might influence purchasing behaviour.
  • Seasonality of purchases, such as peak months or holidays, which could help plan marketing campaigns.
  • Timing between purchases for individual customers, which could inform the optimal timing for newsletters.
  • Product bundling patterns: which products are frequently bought together, which could help design bundle offers or targeted promotions.

This section is intentionally open-ended. The goal is to encourage you to think creatively and demonstrate how many possibilities exist for extracting actionable insights from real-world data.