Financial Networks (FINET)¶
Tutorial 7: Applications of Financial Networks¶
In this tutorial, we will analyse real-world expenditure data from a UK-registered online retailer, covering all transactions made between 01/12/2010 and 09/12/2011. Our focus is on grouping customers with similar spending behaviour so that the retailer can design more efficient mass marketing strategies.
We assume that the retailer operates with a relatively small marketing department, where you are responsible for developing customer newsletters. The aim of these newsletters is to highlight news and products that align with customers’ past purchasing behaviour. However, given the large customer base, it is impractical to create individualised newsletters for everyone. At the same time, the retailer does not want to send generic, one-size-fits-all campaigns.
Your task, therefore, is to design an algorithm that groups customers with similar spending categories together. This way, the retailer can produce tailored newsletters for each group, striking a balance between efficiency and personalisation in customer engagement.
The dataset was originally analysed in the following paper:
Chen, Daqing, Sai Laing Sain, and Kun Guo. "Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining." Journal of Database Marketing & Customer Strategy Management 19.3 (2012): 197-208.
Please have a look at the paper to familiarise yourself with the data and the methods they applied to analyse it.
In the original study, the authors applied k-means clustering and decision tree induction to identify the main characteristics of consumers in each segment. In this tutorial, we will build on that analysis with the tools from this course, using network-based approaches to gain additional insights into customer behaviour, product relationships, and the influence of geography on purchasing patterns.
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
PART I: Understanding the Data¶
Before we perform any network analysis, it is important to get a basic understanding of the dataset. In this section, we will perform some common data description and inspection steps to explore the structure, variables, and types of values contained in the dataset.
data = pd.read_csv("./Online_Retail.csv")
print(data.shape)
data.head()
(541909, 8)
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/10 08:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/10 08:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/10 08:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/10 08:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/10 08:26 | 3.39 | 17850.0 | United Kingdom |
Column descriptions¶
| Variable name | Data type | Description |
|---|---|---|
| InvoiceNo | Nominal | Invoice number; a 6-digit integral number uniquely assigned to each transaction |
| StockCode | Nominal | Product (item) code; a code uniquely assigned to each distinct product (mostly integral, though some codes carry a letter suffix, e.g. 85123A) |
| Description | Nominal | Product (item) name |
| Quantity | Numeric | The quantities of each product (item) per transaction |
| UnitPrice | Numeric | Product price per unit in sterling |
| InvoiceDate | DateTime | The day and time when each transaction was generated |
| CustomerID | Nominal | Unique code to identify each customer |
| Country | Nominal | The delivery address country of the customer |
We can use this initial inspection to identify missing values, unusual entries, or inconsistencies, which will inform the cleaning steps.
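For instance, a quick missing-value and sign check can flag the problems we will address in the cleaning step. The sketch below runs on a small synthetic frame with the same columns, since the pattern is what matters; in the notebook you would apply the same calls directly to `data`.

```python
import numpy as np
import pandas as pd

# Small synthetic frame mimicking the retail columns (illustration only)
toy = pd.DataFrame({
    'InvoiceNo': ['536365', '536365', 'C536379'],
    'StockCode': ['85123A', '71053', 'D'],
    'Quantity': [6, 6, -1],
    'UnitPrice': [2.55, 3.39, 27.50],
    'CustomerID': [17850.0, np.nan, 14527.0],
})

# Count missing values per column
print(toy.isna().sum())

# Flag suspicious rows: refunds/corrections show up as non-positive quantities
print("Rows with Quantity <= 0:", int((toy['Quantity'] <= 0).sum()))
```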
Cleaning the Data¶
Before constructing networks, we need to clean the dataset. First, we will remove all rows where Quantity or UnitPrice is less than or equal to zero, as these typically correspond to refunds or corrections, which are not of interest in this analysis.
Next, we will drop any rows where CustomerID, StockCode, or Country are missing (NaN). These variables are essential for our network construction, as they define the nodes and edges in the bipartite network.
Note that, in some studies, missing values in this dataset have been imputed, but given the large size of the dataset, removing these rows will not significantly affect our analysis and is sufficient for our purposes.
# Remove rows with Quantity or UnitPrice <= 0
df_clean = data[(data['Quantity'] > 0) & (data['UnitPrice'] > 0)]
# Drop rows with missing CustomerID, StockCode, or Country
df_clean = df_clean.dropna(subset=['CustomerID', 'StockCode', 'Country'])
df_clean = df_clean.reset_index(drop=True)
# Summary of cleaned dataset
print("Number of rows after cleaning:", len(df_clean))
print(df_clean.head())
Number of rows after cleaning: 397884
InvoiceNo StockCode Description Quantity \
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
1 536365 71053 WHITE METAL LANTERN 6
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
InvoiceDate UnitPrice CustomerID Country
0 12/1/10 08:26 2.55 17850.0 United Kingdom
1 12/1/10 08:26 3.39 17850.0 United Kingdom
2 12/1/10 08:26 2.75 17850.0 United Kingdom
3 12/1/10 08:26 3.39 17850.0 United Kingdom
4 12/1/10 08:26 3.39 17850.0 United Kingdom
Top Products by Customer Share¶
We will now look at the top 10 products in terms of the number of unique customers who purchased them and calculate the percentage of customers that bought each of these products.
# select unique customers and count products
unique_customer_products = df_clean[['CustomerID', 'Description']].drop_duplicates()
product_counts = unique_customer_products['Description'].value_counts()
# top 10 products
top10_products = product_counts.head(10)
# calculate percentage of customers who bought each product
total_customers = df_clean['CustomerID'].nunique()
top10_percent = (top10_products / total_customers) * 100
# results
top10_df = pd.DataFrame({
'Product': top10_products.index,
'Number of Customers': top10_products.values,
'Percentage of Customers': top10_percent.values
})
print(top10_df)
                              Product  Number of Customers  \
0            REGENCY CAKESTAND 3 TIER                  881
1  WHITE HANGING HEART T-LIGHT HOLDER                  856
2                       PARTY BUNTING                  708
3       ASSORTED COLOUR BIRD ORNAMENT                  678
4    SET OF 3 CAKE TINS PANTRY DESIGN                  640
5     PACK OF 72 RETROSPOT CAKE CASES                  635
6             JUMBO BAG RED RETROSPOT                  635
7      PAPER CHAIN KIT 50'S CHRISTMAS                  613
8      NATURAL SLATE HEART CHALKBOARD                  587
9        BAKING SET 9 PIECE RETROSPOT                  581

   Percentage of Customers
0                20.308898
1                19.732596
2                16.320885
3                15.629322
4                14.753343
5                14.638082
6                14.638082
7                14.130936
8                13.531581
9                13.393269
Top Products by Total Revenue¶
We now look at the products that generated the highest overall revenue, calculated as Quantity × UnitPrice.
# Calculate revenue per product
df_clean['Revenue'] = df_clean['Quantity'] * df_clean['UnitPrice']
product_revenue = df_clean.groupby('Description')['Revenue'].sum().sort_values(ascending=False)
# Top 10 products by revenue
top10_revenue = product_revenue.head(10)
print(top10_revenue)
Description
PAPER CRAFT , LITTLE BIRDIE           168469.60
REGENCY CAKESTAND 3 TIER              142592.95
WHITE HANGING HEART T-LIGHT HOLDER    100448.15
JUMBO BAG RED RETROSPOT                85220.78
MEDIUM CERAMIC TOP STORAGE JAR         81416.73
POSTAGE                                77803.96
PARTY BUNTING                          68844.33
ASSORTED COLOUR BIRD ORNAMENT          56580.34
Manual                                 53779.93
RABBIT NIGHT LIGHT                     51346.20
Name: Revenue, dtype: float64
Product Diversity per Customer¶
We also examine the average number of unique products purchased per customer, giving us insight into how diverse customer baskets are.
customer_diversity = df_clean.groupby('CustomerID')['Description'].nunique()
print(customer_diversity.describe())
# top 10 most diverse customers
print(customer_diversity.sort_values(ascending=False).head(10))
count    4338.000000
mean       61.845320
std        86.223641
min         1.000000
25%        16.000000
50%        35.500000
75%        78.000000
max      1816.000000
Name: Description, dtype: float64
CustomerID
14911.0    1816
12748.0    1778
17841.0    1345
14096.0    1129
14298.0     891
14606.0     826
14156.0     730
14769.0     724
14646.0     718
13089.0     662
Name: Description, dtype: int64
These three perspectives provide very different insights into product and customer behaviour. The Top Products by Customer Share highlights items that are most broadly appealing across the customer base, such as the Regency Cakestand 3 Tier and White Hanging Heart T-Light Holder, which were purchased by more than 800 customers each. These items represent products with wide market penetration, making them strong candidates for inclusion in broad-based marketing materials. In contrast, the Top Products by Total Revenue points to products that generate the most income overall. Here, the leaders include Paper Craft, Little Birdie and Regency Cakestand 3 Tier, which contributed disproportionately high revenue despite not necessarily being bought by the most customers. This suggests that revenue-leading products may rely more on high-value purchases or bulk orders, rather than wide customer uptake.
The Product Diversity per Customer metric adds yet another dimension by showing how varied individual customers’ baskets are. While the average customer purchased around 62 unique products, some bought well over 1,000, indicating a small group of highly engaged, diverse shoppers. Together, these three perspectives demonstrate that customer behaviour cannot be fully understood by looking at only one measure. Other useful metrics could include seasonality of purchases, repeat buying rates for products, or the longevity of customer engagement over time. Exploring these dimensions is important because a richer understanding of the dataset helps avoid misleading conclusions and helps us design subsequent analyses, such as community detection or marketing segmentation, grounded in a realistic view of customer and product dynamics.
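One of the metrics mentioned above, the repeat buying rate, could be sketched as follows. This is illustrative only, run on a toy frame with the same column names as `df_clean`; in the notebook you would pass `df_clean[['CustomerID', 'StockCode', 'InvoiceNo']]` through the same steps.

```python
import pandas as pd

# Toy transactions (stand-in for df_clean's customer/product/invoice columns)
tx = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2, 2, 3],
    'StockCode':  ['A', 'A', 'B', 'A', 'B', 'B'],
    'InvoiceNo':  ['i1', 'i2', 'i1', 'i3', 'i3', 'i4'],
})

# Number of distinct invoices on which each (customer, product) pair appears
pair_orders = tx.groupby(['CustomerID', 'StockCode'])['InvoiceNo'].nunique()

# Repeat-purchase rate: share of customer-product pairs bought on 2+ invoices
repeat_rate = (pair_orders >= 2).mean()
print(f"Repeat-purchase rate: {repeat_rate:.2f}")  # → 0.20 for this toy data
```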
Other Characteristics of the Dataset¶
This is a rich dataset that offers many possibilities for analysis. Beyond the number of customers and transactions, we can explore product characteristics, unit prices, and the quantities purchased. We can also study temporal patterns, such as the time of day or seasonality of purchases, as well as time series of purchase patterns for specific products. In this tutorial, we will focus on customers and the amount they spend on products. However, there is considerable scope to extend the analysis to other aspects of the dataset for more insights.
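As a pointer for the temporal analyses mentioned above, monthly revenue can be aggregated once `InvoiceDate` is parsed. The sketch below uses a toy frame and assumes the dataset's month/day/two-digit-year date format seen in the `head()` output; in the notebook the same steps would apply to `df_clean`.

```python
import pandas as pd

# Toy frame in the dataset's date format (e.g. '12/1/10 08:26')
toy = pd.DataFrame({
    'InvoiceDate': ['12/1/10 08:26', '12/15/10 10:00', '1/4/11 09:12'],
    'Quantity': [6, 2, 4],
    'UnitPrice': [2.55, 3.39, 1.25],
})
toy['InvoiceDate'] = pd.to_datetime(toy['InvoiceDate'], format='%m/%d/%y %H:%M')
toy['Revenue'] = toy['Quantity'] * toy['UnitPrice']

# Total revenue per calendar month
monthly = toy.groupby(toy['InvoiceDate'].dt.to_period('M'))['Revenue'].sum()
print(monthly)
```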
PART II: Constructing a Suitable Network and Identifying Communities of Common Spending¶
Construct the network¶
In this exercise, we will explore customer spending behaviour by constructing a bipartite network. On one side of the network, we will have all unique customers in the data, and on the other side, all the unique products they have purchased. A link will connect a customer to a product if they have bought it, and the weight of the link will be the total amount spent on that product (calculated as Quantity × UnitPrice). In the space below, you need to construct this network. You can use resources from the previous tutorials, where we have created bipartite graphs and assigned edge weights based on data.
customers = df_clean['CustomerID'].unique()
products = df_clean['StockCode'].unique()
B = nx.Graph()
# Add customer nodes (bipartite=0)
B.add_nodes_from(customers, bipartite=0)
# Add product nodes (bipartite=1)
B.add_nodes_from(products, bipartite=1)
# Add edges weighted by total spending per (customer, product) pair;
# aggregating first ensures repeat purchases sum rather than overwrite
spend = df_clean.groupby(['CustomerID', 'StockCode'])['Revenue'].sum().reset_index()
B.add_weighted_edges_from(spend.itertuples(index=False, name=None))
print("Number of customer nodes:", len(customers))
print("Number of product nodes:", len(products))
print("Number of edges:", B.number_of_edges())
Number of customer nodes: 4338
Number of product nodes: 3665
Number of edges: 266792
Project the network¶
Since our main focus is on understanding consumer behaviour, we need to create a network projection onto the customer side. This projection will connect customers who have purchased the same products, with edge weights reflecting the strength of their shared purchasing behaviour. You should choose a suitable projection algorithm, as introduced in Tutorial 6, and complete the projection to obtain a unipartite network of customers.
import network_map2 as nm2
Gp_cosine = nm2.cosine(B, customers)
# Save projection edge list for future use
edges = nx.to_pandas_edgelist(Gp_cosine)
edges = edges.rename(columns={'source': 'src', 'target': 'trg'})
edges.to_csv("customer_cosine_projection_edges.csv", index=False)
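`network_map2` is a course-provided module. If it is unavailable, the cosine projection can be sketched directly from the weighted biadjacency matrix: each customer is a row vector of spending over products, and the edge weight between two customers is the cosine of the angle between their vectors. The toy example below (hypothetical customers `c1`–`c3` and products `p1`, `p2`) shows the idea.

```python
from itertools import combinations

import networkx as nx
import numpy as np

# Toy biadjacency matrix: 3 customers (rows) x 2 products (columns),
# entries are total spending of each customer on each product
customers = ['c1', 'c2', 'c3']
W = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 0.0]])

# Cosine similarity between customer rows
norms = np.linalg.norm(W, axis=1)
S = (W @ W.T) / np.outer(norms, norms)

# Build the customer projection, one weighted edge per similar pair
Gp = nx.Graph()
for i, j in combinations(range(len(customers)), 2):
    if S[i, j] > 0:
        Gp.add_edge(customers[i], customers[j], weight=S[i, j])
print(Gp.edges(data=True))
```

Here `c1` and `c2` have proportional baskets, so their cosine similarity is exactly 1.0 despite different absolute spending, which is precisely why cosine is a reasonable choice for comparing spending profiles.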
Check basic network statistics¶
Before proceeding with further analysis, it is important to examine some key network statistics for the projected customer network. This will help you decide whether the network contains too much noise or if it is suitable for downstream analysis. Consider the size of the network (number of nodes and edges) and density of the resulting projection. In this step, you should compute these key metrics and create a summary of the distributions of edge weights. These statistics will give you an initial sense of the network structure and help inform whether any filtering might be necessary.
# Load edge list (or use the data from before)
edges = pd.read_csv("customer_cosine_projection_edges.csv")
# Build NetworkX graph
Gp_cosine = nx.from_pandas_edgelist(
edges,
source='src',
target='trg',
edge_attr='weight' # use 'weight' from CSV
)
num_nodes = Gp_cosine.number_of_nodes()
num_edges = Gp_cosine.number_of_edges()
density = nx.density(Gp_cosine)
# Edge weights
weights = [d['weight'] for _, _, d in Gp_cosine.edges(data=True)]
min_weight = min(weights)
max_weight = max(weights)
mean_weight = sum(weights) / len(weights)
# Print metrics
print(f"Number of nodes: {num_nodes}")
print(f"Number of edges: {num_edges}")
print(f"Density: {density:.4f}")
print(f"Edge weight - min: {min_weight}, max: {max_weight}, mean: {mean_weight:.2f}")
# distribution of edge weights
plt.figure(figsize=(8, 5))
plt.hist(weights, bins=50, color='skyblue', edgecolor='black')
plt.xlabel("Edge Weight (Total Spending)")
plt.ylabel("Frequency")
plt.title("Distribution of Edge Weights in Customer Projection")
plt.yscale('log') # optional for skewed distribution
plt.show()
Number of nodes: 4338
Number of edges: 5048234
Density: 0.5366
Edge weight - min: 6.4596423765550526e-09, max: 1.0, mean: 0.06
The projected customer network is very dense, with 4,338 nodes and over 5 million edges, giving a density of 0.54. The edge weights vary widely, from extremely small values around 6.46×10⁻⁹ up to 1.0, with a mean of 0.06. This indicates that many of the connections are very weak and likely reflect noise rather than meaningful co-purchasing behaviour. Because of this high density and prevalence of weak links, directly applying a community detection algorithm could produce misleading results or obscure significant patterns. To address this, we decide to filter the network using the Noise-Corrected backboning method, which preserves the statistically significant edges while removing weaker, less informative connections.
Decide whether to filter¶
Based on your analysis of the key network metrics, you should decide whether to filter the projected network before performing further analysis. This decision depends on whether you believe the network contains noise, such as many weak or statistically insignificant links, that could obscure meaningful patterns or hinder community detection. If the network appears dense or contains a large number of low-weight edges, filtering may be necessary to highlight the most relevant relationships between customers. If you decide to filter the network, choose an appropriate algorithm to retain only the most significant connections while preserving the overall network structure.
import backboning
# Read in the table from Gp_cosine
table, nnodes, nnedges = backboning.read("customer_cosine_projection_edges.csv", "weight", sep=",")
# Apply Noise-Corrected Backboning
nc_table = backboning.noise_corrected(table, undirected = True)
# Apply thresholding
threshold_value = 0.2 #we can define different threshold values
nc_backbone = backboning.thresholding(nc_table, threshold_value)
G_backbone = nx.from_pandas_edgelist(
nc_backbone,
source='src',
target='trg',
edge_attr='nij'
)
# Write the backbone to file
backboning.write(nc_backbone, "cosine_projection", "nc", ".")
Calculating NC score...
num_nodes = G_backbone.number_of_nodes()
num_edges = G_backbone.number_of_edges()
density = nx.density(G_backbone)
# Get edge weights
weights = [d['nij'] for u, v, d in G_backbone.edges(data=True)]
min_weight = min(weights)
max_weight = max(weights)
mean_weight = sum(weights) / len(weights)
# Print metrics
print(f"Number of nodes: {num_nodes}")
print(f"Number of edges: {num_edges}")
print(f"Density: {density:.4f}")
print(f"Edge weight - min: {min_weight}, max: {max_weight}, mean: {mean_weight:.2f}")
# --- Plot distribution of edge weights ---
plt.figure(figsize=(8, 5))
plt.hist(weights, bins=50, color='skyblue', edgecolor='black')
plt.xlabel("Edge Weight (Total Spending)")
plt.ylabel("Frequency")
plt.title("Distribution of Edge Weights in Customer Projection")
plt.yscale('log') # useful for skewed distributions
plt.show()
Number of nodes: 4337
Number of edges: 884497
Density: 0.0941
Edge weight - min: 3.799947783322821e-06, max: 1.0, mean: 0.10
After applying noise-corrected backboning, the number of nodes only decreased slightly, from 4338 to 4337, meaning we lost just 1 node. In contrast, the number of edges dropped dramatically from 5,048,234 to 884,497. This corresponds to a reduction of approximately 85% in the number of edges.
Similarly, the network density decreased from 0.5366 to 0.0941, reflecting a much sparser graph where only the most significant connections remain.
The distribution of edge weights has also changed noticeably. Before filtering, most edges had very small weights, resulting in a highly skewed distribution with a mean of 0.06. After backboning, the mean weight increased to 0.1, and the weaker, less significant edges were removed.
For brevity, we do not perform a full analysis of how different threshold values affect the backboned network structure, though it is generally recommended to do so, as otherwise the choice of threshold can be somewhat arbitrary. Here, we selected a threshold that leads to only a minor reduction in nodes while removing the majority of weak, noisy edges, preserving the most significant connections for subsequent community analysis.
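If you do want to probe threshold sensitivity, a simple sweep over candidate thresholds, recording how many nodes and edges survive each one, is often enough to spot a reasonable cut-off. The sketch below uses a toy edge table with a hypothetical `score` column standing in for the noise-corrected significance scores; in the notebook you would run the same loop over `nc_table` with `backboning.thresholding`.

```python
import pandas as pd

# Toy backbone table: edges with a significance score
edges = pd.DataFrame({
    'src':   [1, 1, 2, 2, 3],
    'trg':   [2, 3, 3, 4, 4],
    'score': [0.9, 0.05, 0.5, 0.3, 0.01],
})

# Sweep thresholds and record how much of the network survives each one
for t in [0.0, 0.1, 0.4, 0.8]:
    kept = edges[edges['score'] >= t]
    nodes = pd.unique(kept[['src', 'trg']].values.ravel())
    print(f"threshold={t}: {len(kept)} edges, {len(nodes)} nodes")
```

A sharp drop in nodes (rather than just edges) as the threshold rises is the usual warning sign that the filter has become too aggressive.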
Community Detection¶
At this stage, choose a community detection algorithm of your choice to identify groups of customers with similar spending patterns in the network, taking the edge weights into account. Keep in mind the size of the network and the computational time required, as some algorithms can be very slow for large networks. The goal is to partition the customers into communities that reflect meaningful patterns in their spending behaviour.
from collections import Counter
# Using Louvain to detect communities with edge weights
communities = nx.community.louvain_communities(G_backbone, weight='nij')
# Number of communities
num_communities = len(communities)
print("Number of detected communities:", num_communities)
# Count number of nodes in each community
community_sizes = [len(c) for c in communities]
print("Number of nodes per community:", community_sizes)
# Map each node to its community number
partition_dict = {}
for i, comm in enumerate(communities):  # i is community number, comm is set of nodes
    for node in comm:
        partition_dict[node] = i
Number of detected communities: 8
Number of nodes per community: [821, 1139, 369, 332, 409, 31, 1206, 30]
PART III: Analysing Results and Drawing Insights¶
Now that we have constructed the customer network, decided whether to apply filtering, and detected communities of customers with similar spending patterns, we can begin analysing the results. At this stage, the goal is to interpret what these communities represent and understand the distribution of spending across different products. By examining which products are most popular within each community, we can identify groups of customers with similar interests and purchasing habits. This information can then be used to design targeted marketing strategies, such as tailored newsletters, allowing the retailer to efficiently personalise communications without having to create individual messages for every customer. Through this approach, we can deliver relevant product updates and promotions to customer groups, maximising engagement and marketing impact.
Commonly purchased products in communities¶
Refer back to the original dataset and create a function that takes a community number as input. This function should filter the transactions to include only the customers in the specified community, then produce two histograms: one showing the top 10 products purchased by the community expressed as a percentage of customers who bought each product, and another showing the top 10 products by total revenue generated within the community. These visualisations will allow you to explore both the popularity of products and the revenue contributions of different items for each detected community.
def plot_community_insights(community_number):
community_customers = [cust for cust, comm in partition_dict.items() if comm == community_number]
    # Filter cleaned data for these customers (copy to avoid SettingWithCopyWarning)
    community_data = df_clean[df_clean['CustomerID'].isin(community_customers)].copy()
total_customers = len(set(community_data['CustomerID']))
# histogram of top 10 products by number of customers who bought
product_customer_counts = community_data.groupby('Description')['CustomerID'].nunique()
product_customer_percent = 100 * product_customer_counts / total_customers
top10_products_customers = product_customer_percent.sort_values(ascending=False).head(10)
plt.figure(figsize=(12,5))
plt.bar(top10_products_customers.index, top10_products_customers.values, color='skyblue')
plt.xticks(rotation=45, ha='right')
plt.ylabel("Percentage of Customers in Community")
plt.title(f"Top 10 Products Bought by Customers - Community {community_number}")
plt.show()
# histogram of top 10 products by revenue
community_data['Revenue'] = community_data['Quantity'] * community_data['UnitPrice']
product_revenue = community_data.groupby('Description')['Revenue'].sum()
top10_products_revenue = product_revenue.sort_values(ascending=False).head(10)
plt.figure(figsize=(12,5))
plt.bar(top10_products_revenue.index, top10_products_revenue.values, color='coral')
plt.xticks(rotation=45, ha='right')
plt.ylabel("Total Revenue (£)")
plt.title(f"Top 10 Products by Revenue - Community {community_number}")
plt.show()
# Choose a different community here:
plot_community_insights(3)  # change 3 to any other community number
The results you obtain may differ depending on the choice of projection algorithm, whether you apply filtering or not, the parameters used for filtering, and the community detection algorithm you select. The following questions are therefore somewhat open-ended, aiming to capture insights across different scenarios. There is no single "correct" solution we are looking for, as different choices and approaches may lead to valid but different results.
Questions¶
- Which products are most commonly purchased by the largest community, and how does this compare to the products generating the highest revenue?
- Are there any communities where the spending patterns are highly concentrated on a few products versus more evenly spread across many products?
- How do the sizes of the communities (number of customers) relate to the total revenue contributed by each community?
- Do you observe any unexpected patterns, such as smaller communities generating disproportionately high revenue, or products generating disproportionate revenue within one community? What might explain this?
- How might the results change if a different projection algorithm or filtering threshold had been used, and what does this tell you about the robustness of your findings?
Optional Additional Analysis (If Time Permits)¶
If you have extra time, consider exploring additional metrics from the dataset that could inform the retailer’s marketing strategy. Some ideas include the country of the customer and how it might influence behaviour, seasonality of purchases, timing between purchases to better schedule newsletters, or patterns in product categories purchased together. We do not provide specific solutions for these, as there are many possible avenues to explore and we want you to think creatively about how to extract insights from the data. More specific ideas:
- The country of the customer and how it might influence purchasing behaviour.
- Seasonality of purchases, such as peak months or holidays, which could help plan marketing campaigns.
- Timing between purchases for individual customers, which could inform the optimal timing for newsletters.
- Product bundling patterns: which products are frequently bought together, which could help design bundle offers or targeted promotions.
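As a starting point for the last idea, co-purchase patterns can be extracted by counting how often each product pair appears on the same invoice. The sketch below runs on a toy frame with the same columns as `df_clean`; for the real data, beware that large invoices produce quadratically many pairs.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Toy invoices (stand-in for df_clean's InvoiceNo/StockCode columns)
tx = pd.DataFrame({
    'InvoiceNo': ['i1', 'i1', 'i1', 'i2', 'i2', 'i3'],
    'StockCode': ['A', 'B', 'C', 'A', 'B', 'B'],
})

# Count how often each product pair appears on the same invoice
pair_counts = Counter()
for _, items in tx.groupby('InvoiceNo')['StockCode']:
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1
print(pair_counts.most_common(3))
```

These pair counts are themselves the weighted edge list of a product co-purchase network, so the backboning and community detection tools used earlier apply here too.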
This section is intentionally open-ended. The goal is to encourage you to think creatively and demonstrate how many possibilities exist for extracting actionable insights from real-world data.