{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "19ce7393",
   "metadata": {},
   "source": [
    "# FNLP Lab 1: Pytorch Tensors and Naive Bayes\n",
    "\n",
    "This lab aims to teach the basics of computation with pytorch tensors, and then apply them to the task of document-classification with a naive bayes classifier.\n",
    "\n",
    "This lab has been tested on two sources: [Notable](https://information-services.ed.ac.uk/learning-technology/noteable/intro) and [Google Colab](https://colab.research.google.com/) (cpu server type).\n",
    "\n",
    "#### Notable - University of Edinburgh's Notebook Server\n",
    "\n",
    "You should be able to access Notable via _Books & Course Tools_ section of FNLP's course page on [Learn](https://www.learn.ed.ac.uk). From there you can upload this file and requirements.txt and begin!\n",
    "\n",
    "#### Google Colab - Hosting on Your Own\n",
    "\n",
    "You can create a new notebook by uploading this file to Google Colab, and then can start a \"CPU\" type server. You will then need to upload the requirements.txt file before starting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "7edbb098",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ! pip install -r requirements.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "eb00e514",
   "metadata": {},
   "outputs": [],
   "source": [
    "# pytorch imports\n",
    "import torch \n",
    "import torch.nn.functional as F\n",
    "\n",
    "# document classification imports\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.datasets import fetch_20newsgroups\n",
    "from typing import List\n",
    "import re\n",
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "087bd2a1",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package stopwords to /home/aag12/nltk_data...\n",
      "[nltk_data]   Package stopwords is already up-to-date!\n"
     ]
    }
   ],
   "source": [
    "# feel free to import these, if you don't want to you can skip this\n",
    "import nltk\n",
    "nltk.download('stopwords')\n",
    "from nltk.corpus import stopwords\n",
    "\n",
    "# # Define stopwords\n",
    "stop_words = set(stopwords.words('english'))\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8cb2addb",
   "metadata": {},
   "source": [
    "# Section 1. Basic Tensor Computation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44ea1dab",
   "metadata": {},
   "source": [
    "## Definition of Tensor: High-dimensional Array\n",
    "\n",
    "You can think of tensors as n-dimensional arrays. Where scalars are 0 dimensional, vectors are 1 dimensional, matrices are 2 dimensional, and tensors can be any dimension.\n",
    "\n",
    "Tensors are the default data structure used for all high-dimensional numeric data in Pytorch, and being able to manipulate them is essential for ML and NLP tasks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "7eef9724",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([1, 3, 5, 7, 9])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# you could imagine this vector as representing token IDs for a sentence, where each number corresponds to a word\n",
    "vector = torch.tensor([1, 3, 5, 7, 9])\n",
    "vector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8fff9551",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[ 1,  3,  5,  7,  9],\n",
       "        [ 2,  4,  6,  8, 10]])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# this could be a batch of two sentences, each being represented by a sentence vector\n",
    "matrix = torch.tensor([[1, 3, 5, 7, 9], \n",
    "                     [2, 4, 6, 8, 10]])\n",
    "matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f4c44d96",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[[0.7025, 0.5529, 0.9438],\n",
       "         [0.8990, 0.3688, 0.7129],\n",
       "         [0.3525, 0.5985, 0.9671],\n",
       "         [0.6245, 0.0152, 0.7574],\n",
       "         [0.5776, 0.8740, 0.1351]],\n",
       "\n",
       "        [[0.4072, 0.5286, 0.2834],\n",
       "         [0.6011, 0.3586, 0.3324],\n",
       "         [0.9777, 0.7810, 0.3995],\n",
       "         [0.1651, 0.9447, 0.9333],\n",
       "         [0.6715, 0.8788, 0.2076]]])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# these could be a batch of embedding matrices, where each embedding is a 5x3 matrix\n",
    "tensor = torch.rand([2, 5, 3])\n",
    "tensor"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "710352c7",
   "metadata": {},
   "source": [
    "## Getting Sizes of Tensors\n",
    "\n",
    "The most important attribute of a tensor is its shape/size. This is the dimensions of a tensor, much like an mxn matrix has the dimensions (m, n)\n",
    "\n",
    "In the example tensor above, the `[2, 5, 3]` term defined the shape of the tensor. We would say that this tensor has a _rank_ of 3, or it is a _rank 3_ tensor, to say that it has 3 dimensions. The size of the tensor, `[2, 5, 3]` tells us how many elements are in each of those dimensions. The example tensor has 2 elements in its first dimension, 5 in its second, and 3 in its third.\n",
    "\n",
    "You get the size/shape of a pytorch tensor with the `.size()` method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "fdbf801c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([2, 5, 3])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor = torch.rand([2, 5, 3])\n",
    "tensor.size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "afe20159",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([1, 4, 2])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor = torch.tensor([[[1, 3], [5, 7], \n",
    "                     [4, 6], [8, 10]]])\n",
    "tensor.size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "20fbf417",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[0., 0., 0.],\n",
       "        [0., 0., 0.]])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# torch.zeros and torch.ones are very helpful initialization tools for creating 'default' tensors of a given size!\n",
    "tensor = torch.zeros([2, 3])\n",
    "tensor"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26cff789",
   "metadata": {},
   "source": [
    "## Getting dtype of Tensors\n",
    "\n",
    "`dtype` or data-type of a tensor defines the types of values a tensor can store. Data types define the precision and memory requirements of your data; for example, a `torch.float16` type will take less space but have less precision than a `torch.float32`. It's good practice to use the smallest data type possible for your data, while keeping in mind the fact that lower precision calculations can lead to unexpected behaviour. \n",
    "\n",
    "Common types to know are `torch.bool`, `torch.int16`, `torch.float32`, `torch.float16`, and `torch.bfloat16` (another 16-bit floating point type that often more efficiently uses those bits for deep-learning purposes)\n",
    "\n",
    "You can convert between types with `tensor.to(dtype)`\n",
    "\n",
    "More details are here: https://pytorch.org/docs/stable/tensor_attributes.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "50d10ff4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([0., 0.], dtype=torch.float16)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor = torch.tensor([1e-6, 1e-8], dtype=torch.float16) # float 16 isn't precise enough to handle such small numbers\n",
    "tensor / 1024"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "6ed8bd02",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([9.7656e-10, 9.7656e-12])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor = torch.tensor([1e-6, 1e-8], dtype=torch.float32) # float32 is precise enough, but takes double the space\n",
    "tensor / 1024"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "d0dd1f35",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([9.7498e-10, 9.7771e-12], dtype=torch.bfloat16)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# bfloat 16 has a better dynamic range, but produces slightly different results\n",
    "tensor = torch.tensor([1e-6, 1e-8], dtype=torch.bfloat16)\n",
    "tensor / 1024"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "d59cd7dc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.bool"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "torch.tensor([[True], [False]]).dtype"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "65037753",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[1.],\n",
       "        [0.]], dtype=torch.float16)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "torch.tensor([[True], [False]]).to(torch.float16)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c1add4c",
   "metadata": {},
   "source": [
    "## Basic Indexing of Tensors\n",
    "\n",
    "Indexing tensors works similarly to python lists, but with the addition of the ability to index across multiple dimensions at once.\n",
    "\n",
    "To get the 2nd element in a rank 1 tensor, you could do `tensor[1]`.\n",
    "\n",
    "In a rank 2 tensor, to get the 5th element of the first dimension (which would be a vector) and the 3rd element of the second, you could do `tensor[4, 2]`\n",
    "\n",
    "This extends to any dimension tensor.\n",
    "\n",
    "One **important note:** the smallest unit you can index to is a 0-dimensional tensor. In contrast to how you would normally index a python list (where you get the value stored in the list), to get the 'raw' underlying value (sometimes useful for comparisons or printing) you need to run `zero_d_tensor.item()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "03ea99c5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(torch.Size([5]), tensor(0.8306))"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor = torch.rand([5])\n",
    "tensor.size(), tensor[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "b8406eb4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(torch.Size([10, 4]), tensor(0.6274))"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor = torch.rand([10, 4])\n",
    "tensor.size(), tensor[4, 2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "b3311ea2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(tensor(0.4287), 0.42868494987487793)"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor[3, 2], tensor[3, 2].item()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80748126",
   "metadata": {},
   "source": [
    "## Slicing\n",
    "\n",
    "What if you wanted to a subset of elements from a tensor? This also works similarly to python lists, but is extended to multiple dimensions.\n",
    "\n",
    "For each dimension in a tensor, you provide the slice in the dimension 'index'. For example: `tensor[0:2, 2:4]` gets the first 2 elements in the first dimension, and the 2-4 elements in the second dimension. This might represent getting the first two elements of a batch, and only their 2-4 values.\n",
    "\n",
    "By default the dimension indices go from left to right, so if you had a tensor with size `[2, 3, 4]` and sliced it with `[:1, 1:3]` you would get a resulting tensor with size `[1, 2, 4]`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "53dd6eb6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(torch.Size([2, 3, 4]), torch.Size([1, 2, 4]))"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor = torch.rand([2, 3, 4])\n",
    "tensor.size(), tensor[:1, 1:3].size()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "076b6f54",
   "metadata": {},
   "source": [
    "To get all elements from a dimension, you can also use the `:` slicing operator on its own:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "a324f916",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([1, 3, 2])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor[1:2, :, :2].size()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c29db983",
   "metadata": {},
   "source": [
    "You can also index while slicing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "a6179ea7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([2, 1])"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor[:, 2, :1].size()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a18bb50",
   "metadata": {},
   "source": [
    "### Ellipses\n",
    "\n",
    "For extremely high dimensional tensors where you only care about dimensions at the beginning or end, you can use the `...` operator in place of using multiple `:,:,`\n",
    "\n",
    "Pytorch will get all of the elements for any dimensions not explicitly indexed/sliced"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "967ec73c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([2, 3, 4, 5, 6, 7, 8])"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor = torch.rand([2, 3, 4, 5, 6, 7, 8])\n",
    "tensor.size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "2dd27534",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([2, 1, 4, 5, 6, 7, 5])"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor[:2, :1, ..., 3:].size()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03e0492e",
   "metadata": {},
   "source": [
    "### Step Size\n",
    "\n",
    "Just like in python, you can also specify a step size while slicing (but of course, done across whatever dimensions you choose)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "883fcbcd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0, 1, 2, 3, 4]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(range(5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "0dc70edf",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([2, 6])"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor = torch.tensor([\n",
    "    [0, 1, 2, 3, 4, 5], \n",
    "    [10, 9, 8, 7, 6, 5], \n",
    "])\n",
    "tensor.size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "46c1871a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([9, 7, 5])"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor[1, 1:6:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e62ff24b",
   "metadata": {},
   "source": [
    "## Masking\n",
    "\n",
    "Another way of getting a useful subset of tensors is with _masking_.\n",
    "\n",
    "First you _create_ a mask by creating a boolean (or 1s and 0s) tensor of the size of the tensor you want to select from"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "56951896",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([ True,  True,  True,  True,  True, False, False, False, False, False])"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor = torch.tensor(list(range(10)))\n",
    "\n",
    "mask = tensor < 5\n",
    "mask"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b50486d8",
   "metadata": {},
   "source": [
    "Then, you _apply_ the mask (or you _mask_ out the tensor) to get the subset. Using the mask is as easy as passing it to the tensor similar to indexing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "237592b6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([0, 1, 2, 3, 4])"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor[mask]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c59af48b",
   "metadata": {},
   "source": [
    "You can combine boolean operations when creating your masks, allowing you to mask for multiple conditions at the same time.\n",
    "\n",
    "How you do this will depend on the types of your data, but `torch.logical_and` and `torch.logical_or` are a good place to start\n",
    "\n",
    "As a brief example, let's say I wanted to just get all values in my tensor to be between two values. I could do this very quickly with the following operation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "48644b5e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[0.8655, 0.8438, 0.1720, 0.3680, 0.0569, 0.8050, 0.0790, 0.6288],\n",
       "        [0.3334, 0.5405, 0.6010, 0.5248, 0.3682, 0.4315, 0.8259, 0.3815],\n",
       "        [0.5243, 0.3710, 0.6476, 0.9206, 0.6128, 0.5941, 0.0310, 0.9251],\n",
       "        [0.8565, 0.2052, 0.6951, 0.0556, 0.6423, 0.9215, 0.8420, 0.1909]])"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor = torch.rand([4, 8])\n",
    "tensor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "1b7cc0b8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([0.5405, 0.5248, 0.4315, 0.5243, 0.5941])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clip_mask = (tensor > 0.4) & (tensor < 0.6)\n",
    "tensor[clip_mask]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0e854f8",
   "metadata": {},
   "source": [
    "You can also use masks to **set** values. Let's say I wanted to set all values outside of my range to 0 for example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "33426237",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],\n",
       "        [0.0000, 0.5405, 0.0000, 0.5248, 0.0000, 0.4315, 0.0000, 0.0000],\n",
       "        [0.5243, 0.0000, 0.0000, 0.0000, 0.0000, 0.5941, 0.0000, 0.0000],\n",
       "        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]])"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clip_mask = (tensor < 0.4) | (tensor > 0.6)\n",
    "tensor[clip_mask] = 0.0\n",
    "tensor"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b9287bb0",
   "metadata": {},
   "source": [
    "# Concatenating Tensors\n",
    "\n",
    "You can concatenate tensors together via the `torch.concat` method, that takes in a list of the tensors you want to concatenate and the dimension at which you want to concatenate them together (`dim=`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "39781855",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(torch.Size([5]), torch.Size([5]))"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vector1 = torch.tensor(list(range(5)))\n",
    "vector2 = torch.tensor(list(range(5, 10)))\n",
    "vector1.size(), vector2.size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "6dabcd0b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "torch.concat([vector1, vector2], dim=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "43b22da4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(torch.Size([1, 5]), torch.Size([1, 5]))"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vector1 = torch.tensor([list(range(5))])\n",
    "vector2 = torch.tensor([list(range(5, 10))])\n",
    "vector1.size(), vector2.size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "5363baba",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[0, 1, 2, 3, 4],\n",
       "        [5, 6, 7, 8, 9]])"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "torch.concat([vector1, vector2], dim=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "234ab786",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "torch.concat([vector1, vector2], dim=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f158d91",
   "metadata": {},
   "source": [
    "# Basic Math with Tensors"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a17f7f4a",
   "metadata": {},
   "source": [
    "## Scalar computations\n",
    "\n",
    "The following scalar computations are done to every element of a tensor\n",
    "\n",
    "Addition: `tensor + scalar`\n",
    "\n",
    "Subtraction: `tensor - scalar`\n",
    "\n",
    "Multiplication: `tensor * scalar`\n",
    "\n",
    "Division: `tensor / scalar`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "99d49291",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[0.1429, 0.4060, 0.2501],\n",
       "        [0.9211, 0.8689, 0.2593]])"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor = torch.rand([2, 3])\n",
    "tensor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "e06b7422",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[10.1429, 10.4060, 10.2501],\n",
       "        [10.9211, 10.8689, 10.2593]])"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor + 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "bcb54d0d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[1.4288, 4.0603, 2.5012],\n",
       "        [9.2109, 8.6893, 2.5926]])"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor * 10"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ac0de74",
   "metadata": {},
   "source": [
    "## Adding Tensors Together\n",
    "\n",
    "If tensors are the same size you can add them together element-wise\n",
    "\n",
    "If they aren't the same size, tensors are aligned when possible (you can think of this like repeating elements until they fit) to do element-wise addition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "cdcf9848",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([3, 5, 7])"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor1 = torch.tensor([1, 2, 3])\n",
    "tensor2 = torch.tensor([2, 3, 4])\n",
    "tensor1 + tensor2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "9fa59f81",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[2, 3, 4],\n",
       "        [3, 4, 5]])"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor1 = torch.tensor([\n",
    "    [1, 2, 3],\n",
    "    [2, 3, 4],\n",
    "])\n",
    "tensor2 = torch.tensor([1, 1, 1])\n",
    "tensor1 + tensor2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec751979",
   "metadata": {},
   "source": [
    "## Dot product and matrix multiplication\n",
    "\n",
    "Multiplying two tensors together depends on the dimensionality and size of the tensor, but if both are the same exact size then the dot product can be computed as:\n",
    "\n",
    "`dot_prod = (tensor1 * tensor2).sum()`\n",
    "\n",
    "where `tensor1 * tensor2` returns a tensor of the same size, filled with the per-item product. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "a07432f9",
   "metadata": {},
   "outputs": [],
   "source": [
    "tensor1 = torch.rand(5)\n",
    "tensor2 = torch.rand(5)\n",
    "\n",
    "prod = (tensor1 * tensor2).sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18a2eaa7",
   "metadata": {},
   "source": [
    "Matrix multiplication can be done with `torch.matmul`, remember that the size of the resulting tensor will change"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "21acdae2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "torch.Size([2, 10])\n"
     ]
    }
   ],
   "source": [
    "M1 = torch.rand(2, 5)\n",
    "M2 = torch.rand(5, 10)\n",
    "\n",
    "prod = torch.matmul(M1, M2)\n",
    "print(prod.size()) # shape changes: [2, 5] * [5, 10] -> [2, 10]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "290c2c91",
   "metadata": {},
   "source": [
    "# Naive Bayes\n",
    "\n",
    "In this section you will implement a document classifier using Multinomial Naive Bayes.\n",
    "\n",
    "You will be provided some helper methods to load the dataset and do some of the easy parts!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "b01d5b5b",
   "metadata": {},
   "outputs": [],
   "source": [
    "def convert_text_to_words(text: str) -> List[str]:\n",
    "    \"\"\"\n",
    "    Utility function to split text to words and punctuation.\n",
    "    \"\"\"\n",
    "    return re.findall(r\"\\w+|[^\\w\\s]\", text, re.UNICODE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "f2c56c03",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Preprocess text data\n",
    "def preprocess_text(corpus):\n",
    "    \"\"\"\n",
    "    Preprocess corpus (list of documents) by tokenizing each document (converting into words and puncts).\n",
    "    \"\"\"\n",
    "    return [convert_text_to_words(doc) for doc in corpus]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "eefed241",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create vocabulary\n",
    "def build_vocabulary(corpus, max_vocab_size=10000, min_freq=10, remove_stop_words=True):\n",
    "    \"\"\"\n",
    "    Build a vocabulary from the corpus. This method takes in a list of lists of words, and builds a mapping\n",
    "    from each word to a unique ID. This method also accounts for maximum vocab sizes (taking only the most popular words)\n",
    "    and can remove stop words (commonly used words that likely don't contribute to the semantic meaning of a document)\n",
    "    \"\"\"\n",
    "    vocab = set()\n",
    "    word_counts = Counter()\n",
    "    for doc in corpus:\n",
    "        vocab.update(doc)\n",
    "        for word in doc:\n",
    "            if remove_stop_words and word in stop_words:\n",
    "                continue\n",
    "            word_counts[word] += 1\n",
    "            \n",
    "    vocab = sorted(list(vocab), key=lambda v:word_counts[v], reverse=True)\n",
    "    word_to_id = {v:i for i, v in enumerate(vocab[:max_vocab_size]) if word_counts[v] >= min_freq or min_freq < 0}\n",
    "    id_to_word = {i:v for v, i in word_to_id.items()}\n",
    "    return word_to_id, id_to_word"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "35a8c2d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load dataset\n",
    "def load_dataset(split=\"train\"):\n",
    "    # load the news-classification dataset\n",
    "    categories = None\n",
    "    data = fetch_20newsgroups(subset=split, categories=categories, remove=('headers', 'footers', 'quotes'))\n",
    "    categories = data.target_names\n",
    "    X_raw = data.data\n",
    "    y = data.target\n",
    "    return X_raw, y, categories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "d23e6493",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_performance(pred, true, labels=list(range(20))):\n",
    "    # returns a dataframe with accuracies per label\n",
    "    overall_accuracy = (pred == true).float().mean().item()\n",
    "    accs = [[-1, \"Overall\", overall_accuracy]]\n",
    "    for l in labels:\n",
    "        mask = true == l\n",
    "        cur_pred = pred[mask]\n",
    "        cur_true = true[mask]\n",
    "        cur_accuracy = (cur_pred == cur_true).float().mean().item() if mask.sum() > 0 else 0.0\n",
    "#         print(f\"Accuracy on {l} ({categories[l]}): \\t{cur_accuracy:0.2f}\")\n",
    "        accs.append([l, categories[l], cur_accuracy])\n",
    "    return pd.DataFrame(data=accs, columns=['Label #', 'Label', \"Accuracy\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4aa681e",
   "metadata": {},
   "source": [
    "## TODO: vectorize(); convert a corpus into tensors\n",
    "\n",
    "This method should convert a corpus (list of list of words) into count-vectors, creating a 2D tensor with size (corpus-length, vocab-size)\n",
    "\n",
    "Indexing this tensor in the first dimension (e.g. `tensor[i, :]`) would give you a 1D vector with the length of your vocab, where each item is the number of times that word appears in `i`-th document of your corpus.\n",
    "\n",
    "Indexing this tensor in the second dimension (e.g. `tensor[:,j]`) would give you a 1D vector with the length of your corpus, where each item is the number of times the `j`-th word (word with token ID `j`) appears in each document in your corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "26bbfe54",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert documents to feature vectors\n",
    "def vectorize(corpus, word_to_id):\n",
    "    \"\"\"\n",
    "    Convert documents to count-feature vectors using the vocabulary. Your output should have size\n",
    "    (len(corpus), len(word_to_id)) and dtype bfloat16\n",
    "    \n",
    "    Detailed instructions:\n",
    "    1. Initialize a tensor with zeros, size ((corpus_length, word_to_id length)), and dtype int8 - should be one line\n",
    "    2. Iterate over every document in the corpus\n",
    "    3. For every word in the current document - check if it is in your word_to_id (is this word in my vocab? if not ignore)\n",
    "    4. If the word is present, get its index and add 1 in the correct place in your overall vectorization\n",
    "    This will produce a tensor whose values at [i, j] are the # of times the word with ID j appeared in document i\n",
    "    \"\"\"\n",
    "    # Here we initialize a tensor to hold the feature vectors.\n",
    "    vectors = torch.zeros((len(corpus), len(word_to_id)), dtype=torch.bfloat16)\n",
    "    for i, doc in enumerate(corpus):\n",
    "        for word in doc:\n",
    "            if word in word_to_id:\n",
    "                word_index = word_to_id[word]\n",
    "                vectors[i, word_index] += 1\n",
    "    return vectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "f9acafb6",
   "metadata": {},
   "outputs": [],
   "source": [
    "fake_corpus = [[\"hello\", \"world\", \"!\"], [\"sports\"]]\n",
    "fake_word_to_id = {'hello':0, 'world':1, \"sports\":2, \"!\":3}\n",
    "your_vectorized = vectorize(fake_corpus, fake_word_to_id)\n",
    "\n",
    "assert your_vectorized.size() == (len(fake_corpus), len(fake_word_to_id))\n",
    "assert your_vectorized.tolist() == [[1, 1, 0, 1], [0, 0, 1, 0]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "d990da4e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[1., 1., 0., 1.],\n",
       "        [0., 0., 1., 0.]], dtype=torch.bfloat16)"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "your_vectorized"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d05606c0",
   "metadata": {},
   "source": [
    "## Learn the Class and Word Probabilities (and a bit of background)\n",
    "\n",
    "This section is recommended to read as a quick reference on where the equations for class and word probabilities come from. You won't be asked to do these proofs yourself.\n",
    "\n",
    "A bit of background on the naive bayes classifier:\n",
    "\n",
    "Given a document, we want to identify which class it belongs to (for this task each class is mutually exclusive - a document can only belong to one class). We can frame this task as finding the class that maximizes the conditional probability `P( y | d)`. That is, we compute the _probability of this class, given the document_ for every class, and find the class `y*` with the highest likelihood.\n",
    "\n",
    "$$\n",
    "  y^* = \\text{argmax}_{y \\in Y} P(y|d)\n",
    "$$\n",
    "\n",
    "Remember bayes rule:\n",
    "$$\n",
    "  P( a | b ) = \\frac{P(b | a) P(a) }{ P(b) }\n",
    "$$\n",
    "\n",
    "Applying this to our conditional probability `P(y|d)`\n",
    "\n",
    "$$\n",
    "  P( y | d ) = \\frac{P(d | y) P(y)}{ P(d) } \n",
    "$$\n",
    "\n",
    "As is common for bayes-type problems, we choose to ignore the denominator as it doesn't depend on the class and our document is given so `P(d)` doesn't mean much.\n",
    "\n",
    "We therefore often rewrite this `P(y|d)` probability as being proportional to the numerator:\n",
    "\n",
    "$$\n",
    "  P( y | d ) \\propto P(d | y) P(y)\n",
    "$$\n",
    "\n",
    "In plain english you can think of this equation as saying\n",
    "\n",
    "$$\n",
    "  \\text{Probability of class } y \\text{ given document } d \\propto (\\text{Probability the document given the class}) \\times (\\text{Probability of class } y)\n",
    "$$\n",
    "\n",
    "The probability of a given class is fairly straightforward - given a corpus of document-class pairs we can compute `P(y)` by taking the number of occurrences of the `y` label and dividing by the length of the dataset: `P(y) = # occurrences of y / len(dataset)`\n",
    "\n",
    "### Traditional vs Multinomial Naive Bayes\n",
    "\n",
    "#### Traditional\n",
    "The probability of the document given the class, `P(d|y)` is a bit more complicated. \n",
    "\n",
    "In traditional naive bayes we would say a document is composed of binary features indicating whether a word is absent or present. We make a key assumption here: **we assume each 'feature' or word in a document is conditionally independent with the others, given the class**. This is the *naive* assumption in naive bayes. \n",
    "\n",
    "We first start by saying a document is composed of 'features' (in our case, words).\n",
    "\n",
    "$$\n",
    "  d = (x_1, x_2, ... x_n)\n",
    "$$\n",
    "\n",
    "\n",
    "In traditional (non-multinomial) naive bayes this assumption allows us to decompose the `P(d | y)` term into the following:\n",
    "\n",
    "$$\n",
    "  \\text{Traditional Naive Bayes Approach: } P( d | y ) = \\prod_{i=1}^n P(x_i | y)\n",
    "$$\n",
    "\n",
    "#### Multinomial Naive Bayes\n",
    "\n",
    "In multinomial Naive Bayes, we account for the fact that word frequencies matter. Instead of treating each word's presence as a binary event (present or not), we use a bag-of-words (i.e. a count-based) model where we are instead trying to find the probability of the documents _histogram_ of word counts given the class. You can now think of a document as list of _counts_ for each word in your vocab size `|V|`: `d = (f(x_1), f(x_2), ... f(x_|V|))`. For convenience we're defining `f(x_i)` to be the # occurrences of `x_i` in your document, and `x_i` is some element in your vocab (which in turn has size `|V|`).\n",
    "\n",
    "The probability mass function (PMF) of a multinomial distribution gives us the following (where `n` is the number of words in the document, `x_i` is a vocab element, `f(x_i)` is the # of occurrences of `x_i` in your document).\n",
    "\n",
    "\n",
    "$$\n",
    "  P(d|y) = \\frac{n!}{f(x_1)!...f(x_{|V|})!} \\prod_{i=1}^{|V|} P(x_i | y)^{f(x_i)}\n",
    "$$\n",
    "\n",
    "Since our goal is to just to compare the probabilities of different classes (our document is set) we can use the same trick as before and ignore the first term giving us:\n",
    "\n",
    "$$\n",
    "  \\text{Multinomial Naive Bayes Approach: } P( d | y ) \\propto \\prod_{i=1}^{|V|} P(x_i | y)^{f(x_i)}\n",
    "$$\n",
    "\n",
    "You'll notice that this equation involves multiplying _a lot_ of very small probabilities together. This often gives us numerical stability problems (i.e. our numbers get too small, and we end up just storing 0).\n",
    "\n",
    "We usually choose instead to operate in the log-space, giving us:\n",
    "\n",
    "$$\n",
    "  \\text{Multinomial Naive Bayes Approach: } \\log P( d | y ) \\propto \\sum_{i=1}^{|V|} f(x_i) \\log P(x_i | y)\n",
    "$$\n",
    "\n",
    "Putting together the entire classification problem in log-space:\n",
    "\n",
    "$$\n",
    "  \\text{Multinomial Naive Bayes Approach: } \\log P( y | d ) \\propto \\log P(y) + \\sum_{i=1}^{|V|} f(x_i) \\log P(x_i | y)\n",
    "$$\n",
    "\n",
    "Congrats, you now know all the math you need to implement a multinomial naive bayes classifier from scratch!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e674e727",
   "metadata": {},
   "source": [
    "## TODO: train_naive_bayes(); learning class and word probabilities\n",
    "\n",
    "The goal of `train_naive_bayes` is to learn the the class and word probability distributions:\n",
    "\n",
    "Given a corpus of document-class labels, we can compute these probabilities as follows:\n",
    "\n",
    "`P(y)` is just the probability of this class appearing in our dataset. This can be calculated as `# occurrences of y / len(dataset)`. You will create a tensor of size `(num_classes)` to contain these probabilities.\n",
    "\n",
    "`P(d | y)` relies on two terms: \n",
    "\n",
    "    `f(x)` which can be easily computed for each document as `# occurrences of x word`\n",
    "    `P(x_i | y)` which is computed by counting the `# occurences of x word (with class y) / # words total found (with class y)\n",
    "    \n",
    "This last term has numerical stability problems (what if this word was never found with this class, should we ignore all other information?) so we usually apply some laplace smoothing:\n",
    "\n",
    "$$\n",
    "P(x_i | y) = \\frac{\\text{count of } x_i \\text{ in class }y + \\alpha}{\\text{total # words in class } y + \\alpha |V|}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "a18c5d94",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Naive Bayes training\n",
    "def train_naive_bayes(X_train, y_train, vocab_size, num_classes):\n",
    "    \"\"\"\n",
    "    Parameters:\n",
    "    -----------\n",
    "    X_train : torch.Tensor\n",
    "        A tensor of shape (num_samples, vocab_size), where each row represents \n",
    "        the word frequencies (counts) for a document. Each entry X_train[i][j] \n",
    "        contains the count of word j in document i. You can think of this like f(x_j); for document i\n",
    "\n",
    "    y_train : torch.Tensor\n",
    "        A tensor of shape (num_samples,) containing the class labels for each \n",
    "        document in X_train. Each label is an integer in the range [0, num_classes - 1].\n",
    "\n",
    "    vocab_size : int\n",
    "        The size of the vocabulary (total number of unique words in the dataset).\n",
    "\n",
    "    num_classes : int\n",
    "        The total number of unique classes in the dataset.\n",
    "\n",
    "    Returns:\n",
    "    --------\n",
    "    class_probs : torch.Tensor\n",
    "        A tensor of shape (num_classes,) containing the prior probabilities \n",
    "        of each class, calculated as the fraction of documents belonging to each class.\n",
    "\n",
    "    word_probs : torch.Tensor\n",
    "        A tensor of shape (num_classes, vocab_size) containing the conditional \n",
    "        probabilities P(word | class) for each word and class, with Laplace smoothing applied (alpha=1).\n",
    "        \n",
    "    Detailed instructions:\n",
    "    1. Initialize tensors to store class counts and word counts.\n",
    "    2. Iterate through the training dataset:\n",
    "    For each document-class pair:\n",
    "        update the count of this class by 1 - should be one line\n",
    "        For this class, add the f(x) values from this document - should be one line\n",
    "    3. Convert counts to probabilities\n",
    "    Class probabilities are the counts / number of training examples - should be one line\n",
    "    Word probabilities are P(x_i | y), you will need to apply smoothing with alpha=1 - should be one line\n",
    "    \"\"\"\n",
    "    class_counts = torch.zeros(num_classes, dtype=torch.bfloat16)\n",
    "    word_counts = torch.zeros((num_classes, vocab_size), dtype=torch.bfloat16)\n",
    "#     class_one_hot = torch.nn.functional.one_hot(y_train, num_classes).to(dtype=torch.bfloat16)\n",
    "#     class_counts = class_one_hot.sum(dim=0)\n",
    "#     word_counts = class_one_hot.T @ X_train # matmul one hot classes, each \n",
    "    for i in range(len(X_train)):\n",
    "        label = y_train[i]\n",
    "        class_counts[label] += 1\n",
    "        word_counts[label] += X_train[i]\n",
    "    \n",
    "    # Adding smoothing to avoid zero probabilities.\n",
    "    class_probs = class_counts / class_counts.sum()\n",
    "    word_probs = (word_counts + 1) / (word_counts.sum(dim=1, keepdim=True) + vocab_size)\n",
    "    return class_probs, word_probs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "dec1fe40",
   "metadata": {},
   "outputs": [],
   "source": [
    "dummy_train_vectors = torch.tensor([[0, 1, 2], [0, 1, 1], [0, 2, 2]])\n",
    "dummy_classes = torch.tensor([0, 1, 0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "018372a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "your_class_probs, your_word_probs = train_naive_bayes(dummy_train_vectors, dummy_classes, dummy_train_vectors.size(1), 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "77f2cb2c",
   "metadata": {},
   "outputs": [],
   "source": [
    "class_probs = torch.tensor([2, 1], dtype=torch.bfloat16) / 3\n",
    "word_probs = torch.tensor([[0.1000, 0.4000, 0.5000],\n",
    "         [0.2000, 0.4000, 0.4000]], dtype=torch.bfloat16)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "3a9d96f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "assert(torch.all(your_class_probs == class_probs))\n",
    "assert(torch.all(your_word_probs == word_probs))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f9e8d6f",
   "metadata": {},
   "source": [
    "## TODO: predict_naive_bayes(); make a prediction from your learned probabilities\n",
    "\n",
    "At this point you are able to make predictions given a new document!\n",
    "\n",
    "Remember that for every class you want to calculate\n",
    "\n",
    "$$\n",
    "\\log P( y | d ) \\propto \\log P(y) + \\sum_{i=1}^{|V|} \\log P(x_i | y) f(x_i)\n",
    "$$\n",
    "\n",
    "You will find the `torch.argmax` function useful here - it gets the index of the largest item in your tensor, so if you fill a tensor with `P(y|d)` and use argmax you will get the correct prediction!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "a0ca0af6",
   "metadata": {},
   "outputs": [],
   "source": [
    "def predict_naive_bayes(X_test, class_probs, word_probs):\n",
    "    \"\"\"\n",
    "    Parameters:\n",
    "    -----------\n",
    "    X_test : torch.Tensor\n",
    "        A tensor of shape (num_samples, vocab_size), where each row represents \n",
    "        the word frequencies (for a document.\n",
    "\n",
    "    class_probs : torch.Tensor\n",
    "        A tensor of shape (num_classes,) containing the prior probabilities \n",
    "        P(class) for each class.\n",
    "\n",
    "    word_probs : torch.Tensor\n",
    "        A tensor of shape (num_classes, vocab_size) containing the conditional \n",
    "        probabilities P(word | class) for each word and class, learned during training.\n",
    "\n",
    "    Returns:\n",
    "    --------\n",
    "    predictions : torch.Tensor\n",
    "        A tensor of shape (num_samples,) where each entry is the predicted class \n",
    "        for the corresponding document in X_test.\n",
    "\n",
    "    Detailed instructions:\n",
    "    1. Precompute logarithms of probabilities for numerical stability:\n",
    "        - Take the logarithm of `class_probs` to get `log_class_probs`.\n",
    "        - Take the logarithm of `word_probs` to get `log_word_probs`.\n",
    "    2. For all test documents simultaneously:\n",
    "        - Compute the log-probabilities for each class by:\n",
    "            (a) Adding the prior `log_class_probs` to the sum of contributions from present words.\n",
    "            (b) For present words, multiply `X_test` with `log_word_probs` and sum over words.\n",
    "    3. Select the class with the highest log-probability for each document.\n",
    "    \n",
    "    For an extra challenge (and efficiency bonus!) you can try doing this entirely with tensor operations, no for loops\n",
    "    The key here will be finding a way to compute the \\sum f(x_i) P(x_i | y) term in one matrix multiplication\n",
    "    Hint: X_test contains f(x_i) and has shape (# docs, |V|) , and \\log word_probs will have shape (|Y|, |V|); \n",
    "    Your goal will be to create a matrix of shape (# docs, |Y|), on which you can use torch.argmax(, dim=1) to get the predicted \n",
    "    class for each document.\n",
    "    \"\"\"\n",
    "    \n",
    "    # Step 1: Precompute logarithms of probabilities\n",
    "    log_class_probs = torch.log(class_probs)  # Log of prior probabilities P(class)\n",
    "    log_word_probs = torch.log(word_probs)  # Log of P(word | class)\n",
    "\n",
    "    # Step 2: Compute log-probabilities for all classes and documents\n",
    "    # Present words contribution: X_test * log_word_probs (broadcasted for all classes)\n",
    "    # \\log P(y) + \\sum_{i=1}^{|V|} \\log P(x_i | y) f(x_i)\n",
    "    log_probs = log_class_probs + torch.matmul(X_test.to(torch.bfloat16), log_word_probs.T)  # Shape: (num_samples, num_classes)\n",
    "\n",
    "    # Step 3: Predict the class with the highest log probability for each document\n",
    "    predictions = torch.argmax(log_probs, dim=1)  # Shape: (num_samples,)\n",
    "    \n",
    "    return predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "d1a68e6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "dummy_test_vectors = torch.tensor([[0, 1, 2], [0, 1, 1], [10, 2, 2]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "6589b7af",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(tensor([0.6680, 0.3340], dtype=torch.bfloat16),\n",
       " tensor([[0.1001, 0.4004, 0.5000],\n",
       "         [0.2002, 0.4004, 0.4004]], dtype=torch.bfloat16))"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "your_class_probs, your_word_probs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "0f900050",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([0, 0, 1])"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "your_predictions = predict_naive_bayes(dummy_test_vectors, your_class_probs, your_word_probs)\n",
    "your_predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "1738747c",
   "metadata": {},
   "outputs": [],
   "source": [
    "assert(torch.all(your_predictions == torch.tensor([0, 0, 1])))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da2c091e",
   "metadata": {},
   "source": [
    "# Run Naive Bayes on Newsgroups Dataset\n",
    "\n",
    "Here you will apply your classifier to a 'real' dataset of newsgroups documents, classified into different topics: http://qwone.com/~jason/20Newsgroups/\n",
    "\n",
    "The code for loading and preprocessing the dataset is given for you, but the rest will use your methods!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5cb8e1f",
   "metadata": {},
   "source": [
    "### Load the Data and Preprocess"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "859a1b36",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load and preprocess data\n",
    "X_raw_train, y_train, categories = load_dataset(split=\"train\")\n",
    "num_classes = len(set(y_train))\n",
    "y_train = torch.tensor(y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "216c2a79",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_processed = preprocess_text(X_raw_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "7192f89a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build vocabulary and vectorize data\n",
    "vocab, id_to_word = build_vocabulary(X_processed, remove_stop_words=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "755aa0b1",
   "metadata": {},
   "source": [
    "### Convert the training data to tensors (using vectorize())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "id": "46837557",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_vectorized = vectorize(X_processed, vocab) # this may take ~20 seconds"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "145ec60e",
   "metadata": {},
   "source": [
    "### 'Train' your classifier with train_naive_bayes()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "id": "662041ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train Naive Bayes classifier\n",
    "vocab_size = len(vocab)\n",
    "class_probs, word_probs = train_naive_bayes(X_vectorized, y_train, vocab_size, num_classes)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65c52e66",
   "metadata": {},
   "source": [
    "## Evaluate Performance!\n",
    "\n",
    "We'll first evaluate performance on the train set to confirm that our classifier has learned something informative from the original distribution.\n",
    "\n",
    "We'll then assess performance on an unseen test set and compare.\n",
    "\n",
    "As our task is multi-class, simple accuracy does not give us much information about how our classifier is doing for each class. Furth"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a851c03",
   "metadata": {},
   "source": [
    "## TODO: get_performance_f1(); get f1 per class\n",
    "\n",
    "For imbalanced data accuracy often doesn't paint the whole picture - over-predicting the majority label can produce high accuracy without actually learning a useful classifier. As a result for our task we would like to measure f1 in addition to accuracy.\n",
    "\n",
    "Modify the below method to compute the f1 score for each class (https://en.wikipedia.org/wiki/F-score) and return pandas Dataframe.\n",
    "\n",
    "$$\n",
    "    F_1 = \\frac{2 \\text{TP}}{(2 \\text{TP}) + \\text{FP} + \\text{FN}} = \\frac{2 (\\text{precision} \\cdot \\text{recall})}{\\text{precision} + \\text{recall}}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "166745c7",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_performance_f1(pred, true, labels=list(range(20))):\n",
    "    \"\"\"\n",
    "    Compute the F1 score for predictions across all labels, with a breakdown \n",
    "    by label. Returns a DataFrame containing precision, recall, and F1 score \n",
    "    for each label. You don't need to calculate an aggregated F1 for all labels\n",
    "\n",
    "    Parameters:\n",
    "    -----------\n",
    "    pred : torch.Tensor\n",
    "        A tensor of shape (num_samples,) containing the predicted labels.\n",
    "\n",
    "    true : torch.Tensor\n",
    "        A tensor of shape (num_samples,) containing the true labels.\n",
    "\n",
    "    labels : list of int, optional\n",
    "        A list of unique label indices. Defaults to `list(range(20))`.\n",
    "\n",
    "    categories : list of str, optional\n",
    "        A list of category names corresponding to the labels. Defaults to `None`.\n",
    "\n",
    "    Returns:\n",
    "    --------\n",
    "    pd.DataFrame\n",
    "        A DataFrame containing precision, recall, and F1 score for each label.\n",
    "\n",
    "    Detailed instructions:\n",
    "    1. Compute precision, recall, and F1 score for each label:\n",
    "        - For each label `l`:\n",
    "            a. Identify the predictions and true values corresponding to the label.\n",
    "            b. Compute true positives, false positives, and false negatives:\n",
    "                - True Positives (TP): Cases where `pred` and `true` both equal `l`.\n",
    "                - False Positives (FP): Cases where `pred == l` but `true != l`.\n",
    "                - False Negatives (FN): Cases where `true == l` but `pred != l`.\n",
    "            c. Calculate precision, recall, and F1 score:\n",
    "                - Precision = TP / (TP + FP) if TP + FP > 0, else 0.\n",
    "                - Recall = TP / (TP + FN) if TP + FN > 0, else 0.\n",
    "                - F1 Score = 2 * (Precision * Recall) / (Precision + Recall) if Precision + Recall > 0, else 0.\n",
    "    2. Compute the overall F1 score across all labels.\n",
    "    3. Return results as a DataFrame.\n",
    "    \"\"\"\n",
    "    # returns a dataframe with accuracies per label\n",
    "    results = []\n",
    "    for l in labels:\n",
    "        true_mask = true == l\n",
    "        pred_mask = pred == l\n",
    "        \n",
    "        # Compute counts for true positives, false positives, and false negatives\n",
    "        true_positives = (pred_mask & true_mask).sum().item()\n",
    "        false_positives = (pred_mask & ~true_mask).sum().item()\n",
    "        false_negatives = (~pred_mask & true_mask).sum().item()\n",
    "        \n",
    "        # Calculate precision, recall, and F1 score\n",
    "        precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0.0\n",
    "        recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0.0\n",
    "        f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0\n",
    "        \n",
    "        results.append([l, categories[l], f1_score])\n",
    "    return pd.DataFrame(data=results, columns=['Label #', 'Label', \"F1\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "97c2d2a2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train Accuracy: 0.69\n"
     ]
    }
   ],
   "source": [
    "# Test Naive Bayes classifier\n",
    "y_pred = predict_naive_bayes(X_vectorized, class_probs, word_probs)\n",
    "accuracy = (y_pred == y_train).float().mean().item()\n",
    "\n",
    "print(f\"Train Accuracy: {accuracy:.2f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "id": "1b86c293",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Label #</th>\n",
       "      <th>Label</th>\n",
       "      <th>F1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>alt.atheism</td>\n",
       "      <td>0.505660</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>comp.graphics</td>\n",
       "      <td>0.623277</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>comp.os.ms-windows.misc</td>\n",
       "      <td>0.074189</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>comp.sys.ibm.pc.hardware</td>\n",
       "      <td>0.653686</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>comp.sys.mac.hardware</td>\n",
       "      <td>0.686676</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>comp.windows.x</td>\n",
       "      <td>0.768670</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>misc.forsale</td>\n",
       "      <td>0.700813</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>rec.autos</td>\n",
       "      <td>0.691964</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>rec.motorcycles</td>\n",
       "      <td>0.758051</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>rec.sport.baseball</td>\n",
       "      <td>0.764082</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>rec.sport.hockey</td>\n",
       "      <td>0.682253</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>11</td>\n",
       "      <td>sci.crypt</td>\n",
       "      <td>0.800363</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>sci.electronics</td>\n",
       "      <td>0.739962</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>sci.med</td>\n",
       "      <td>0.771777</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>sci.space</td>\n",
       "      <td>0.757666</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>soc.religion.christian</td>\n",
       "      <td>0.804992</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>talk.politics.guns</td>\n",
       "      <td>0.778226</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>talk.politics.mideast</td>\n",
       "      <td>0.775946</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>18</td>\n",
       "      <td>talk.politics.misc</td>\n",
       "      <td>0.627624</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>19</td>\n",
       "      <td>talk.religion.misc</td>\n",
       "      <td>0.593007</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Label #                     Label        F1\n",
       "0         0               alt.atheism  0.505660\n",
       "1         1             comp.graphics  0.623277\n",
       "2         2   comp.os.ms-windows.misc  0.074189\n",
       "3         3  comp.sys.ibm.pc.hardware  0.653686\n",
       "4         4     comp.sys.mac.hardware  0.686676\n",
       "5         5            comp.windows.x  0.768670\n",
       "6         6              misc.forsale  0.700813\n",
       "7         7                 rec.autos  0.691964\n",
       "8         8           rec.motorcycles  0.758051\n",
       "9         9        rec.sport.baseball  0.764082\n",
       "10       10          rec.sport.hockey  0.682253\n",
       "11       11                 sci.crypt  0.800363\n",
       "12       12           sci.electronics  0.739962\n",
       "13       13                   sci.med  0.771777\n",
       "14       14                 sci.space  0.757666\n",
       "15       15    soc.religion.christian  0.804992\n",
       "16       16        talk.politics.guns  0.778226\n",
       "17       17     talk.politics.mideast  0.775946\n",
       "18       18        talk.politics.misc  0.627624\n",
       "19       19        talk.religion.misc  0.593007"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_performance_f1(y_pred, y_train, labels=list(range(num_classes)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c784cfd",
   "metadata": {},
   "source": [
    "### Load Test Set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "2a57138f",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_raw_test, y_test, _ = load_dataset(split=\"test\")\n",
    "y_test = torch.tensor(y_test)\n",
    "X_processed_test = preprocess_text(X_raw_test)\n",
    "X_vectorized_test = vectorize(X_processed_test, vocab)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a11b456",
   "metadata": {},
   "source": [
    "### Evaluate Test Set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "id": "a87916b6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test Accuracy: 0.54\n"
     ]
    }
   ],
   "source": [
    "# Test Naive Bayes classifier\n",
    "y_pred = predict_naive_bayes(X_vectorized_test, class_probs, word_probs)\n",
    "accuracy = (y_pred == y_test).float().mean().item()\n",
    "\n",
    "print(f\"Test Accuracy: {accuracy:.2f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "id": "dc57b4b7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Label #</th>\n",
       "      <th>Label</th>\n",
       "      <th>F1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>alt.atheism</td>\n",
       "      <td>0.345955</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>comp.graphics</td>\n",
       "      <td>0.525967</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>comp.os.ms-windows.misc</td>\n",
       "      <td>0.027972</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>comp.sys.ibm.pc.hardware</td>\n",
       "      <td>0.539851</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>comp.sys.mac.hardware</td>\n",
       "      <td>0.516264</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>comp.windows.x</td>\n",
       "      <td>0.646796</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>misc.forsale</td>\n",
       "      <td>0.654676</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>rec.autos</td>\n",
       "      <td>0.576375</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>rec.motorcycles</td>\n",
       "      <td>0.612091</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>rec.sport.baseball</td>\n",
       "      <td>0.660976</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>rec.sport.hockey</td>\n",
       "      <td>0.562069</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>11</td>\n",
       "      <td>sci.crypt</td>\n",
       "      <td>0.678112</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>sci.electronics</td>\n",
       "      <td>0.508475</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>sci.med</td>\n",
       "      <td>0.614743</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>sci.space</td>\n",
       "      <td>0.614198</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>soc.religion.christian</td>\n",
       "      <td>0.638487</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>talk.politics.guns</td>\n",
       "      <td>0.511485</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>talk.politics.mideast</td>\n",
       "      <td>0.640114</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>18</td>\n",
       "      <td>talk.politics.misc</td>\n",
       "      <td>0.362007</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>19</td>\n",
       "      <td>talk.religion.misc</td>\n",
       "      <td>0.215054</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Label #                     Label        F1\n",
       "0         0               alt.atheism  0.345955\n",
       "1         1             comp.graphics  0.525967\n",
       "2         2   comp.os.ms-windows.misc  0.027972\n",
       "3         3  comp.sys.ibm.pc.hardware  0.539851\n",
       "4         4     comp.sys.mac.hardware  0.516264\n",
       "5         5            comp.windows.x  0.646796\n",
       "6         6              misc.forsale  0.654676\n",
       "7         7                 rec.autos  0.576375\n",
       "8         8           rec.motorcycles  0.612091\n",
       "9         9        rec.sport.baseball  0.660976\n",
       "10       10          rec.sport.hockey  0.562069\n",
       "11       11                 sci.crypt  0.678112\n",
       "12       12           sci.electronics  0.508475\n",
       "13       13                   sci.med  0.614743\n",
       "14       14                 sci.space  0.614198\n",
       "15       15    soc.religion.christian  0.638487\n",
       "16       16        talk.politics.guns  0.511485\n",
       "17       17     talk.politics.mideast  0.640114\n",
       "18       18        talk.politics.misc  0.362007\n",
       "19       19        talk.religion.misc  0.215054"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_performance_f1(y_pred, y_test, labels=list(range(num_classes)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63b8ca3e",
   "metadata": {},
   "source": [
    "# K-Fold Cross Validation\n",
    "\n",
    "How do we know if the generalization performance of our model architecture is just due to the random chance of our train/test split? One way to have better estimates of generalization performance is to do _cross validation_, where you test different splits and see how your model performs across all of them.\n",
    "\n",
    "\n",
    "### TODO: K-Fold Cross Validation\n",
    "\n",
    "We will implement **K-Fold Cross Validation** to assess the generalization capabilities of your naive bayes classifier.\n",
    "\n",
    "The general algorithm is as follows:\n",
    "\n",
    "1. Shuffle your dataset (make sure to keep (X, Y) pairs together)\n",
    "2. Split your dataset into K \"folds\" - you will set $k=10$\n",
    "3. For each group:\n",
    "    1. Take this group as the test set and the rest as the train set\n",
    "    2. Train your model\n",
    "    3. Run your model on the test set and record your evaluation metrics (Accuracy and per-class F1)\n",
    "4. Take the average of the evaluation metrics (Accuracy and per-class F1)\n",
    "\n",
    "We have provided the code to load and process the dataset, you just need to implement the above algorithm. Do you notice any differences between your cross-validated metrics and the ones you achieved previously?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "id": "bddb075e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import random"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "id": "5d9163dd",
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm import tqdm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "id": "c87a1653",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load and preprocess data\n",
    "X_raw_train, y_train, categories = load_dataset(split=\"train\")\n",
    "X_raw_test, y_test, categories = load_dataset(split=\"test\")\n",
    "y_test = torch.tensor(y_test)\n",
    "y_train = torch.tensor(y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "id": "548c80f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = X_raw_test + X_raw_train\n",
    "Y = torch.concat([y_test, y_train])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "id": "3321b02f",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_processed = preprocess_text(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "id": "5e3dd073",
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab, id_to_word = build_vocabulary(X_processed, remove_stop_words=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "id": "0955854d",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_vectorized = vectorize(X_processed, vocab) # this may take ~30 seconds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "id": "4f1abf17",
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab_size = len(vocab)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "id": "cc6bc794",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "fold 19 accuracy: 0.5663\n",
      "fold 19 accuracy: 0.6019\n",
      "fold 19 accuracy: 0.5865\n",
      "fold 19 accuracy: 0.5961\n",
      "fold 19 accuracy: 0.5775\n",
      "fold 19 accuracy: 0.6072\n",
      "fold 19 accuracy: 0.5977\n",
      "fold 19 accuracy: 0.5993\n",
      "fold 19 accuracy: 0.5966\n",
      "fold 19 accuracy: 0.5815\n",
      "Average accuracy: 0.5911\n",
      "Average alt.atheism: 0.3971\n",
      "Average comp.graphics: 0.5284\n",
      "Average comp.os.ms-windows.misc: 0.0542\n",
      "Average comp.sys.ibm.pc.hardware: 0.5467\n",
      "Average comp.sys.mac.hardware: 0.5940\n",
      "Average comp.windows.x: 0.6583\n",
      "Average misc.forsale: 0.6778\n",
      "Average rec.autos: 0.6189\n",
      "Average rec.motorcycles: 0.6251\n",
      "Average rec.sport.baseball: 0.6934\n",
      "Average rec.sport.hockey: 0.5906\n",
      "Average sci.crypt: 0.7444\n",
      "Average sci.electronics: 0.6000\n",
      "Average sci.med: 0.7011\n",
      "Average sci.space: 0.6938\n",
      "Average soc.religion.christian: 0.6893\n",
      "Average talk.politics.guns: 0.6516\n",
      "Average talk.politics.mideast: 0.7082\n",
      "Average talk.politics.misc: 0.4394\n",
      "Average talk.religion.misc: 0.3046\n"
     ]
    }
   ],
   "source": [
    "k = 10\n",
    "\n",
    "indices = list(range(X_vectorized.shape[0]))\n",
    "random.shuffle(indices)\n",
    "group_length = int(X_vectorized.shape[0] * (1/k))\n",
    "\n",
    "all_metrics = {}\n",
    "for i in range(k):\n",
    "    start = i * group_length\n",
    "    if i == k - 1:\n",
    "        # lump the remaining into the fold\n",
    "        end = X_vectorized.shape[0]\n",
    "    else:\n",
    "        end = (i+1) * group_length\n",
    "        \n",
    "    cur_indices = indices[start:end]\n",
    "    cur_x_test = X_vectorized[cur_indices]\n",
    "    cur_y_test = Y[cur_indices]\n",
    "\n",
    "    #     cur_indices = set(cur_indices)\n",
    "    train_indices = [i for i in range(X_vectorized.shape[0]) if i not in cur_indices]\n",
    "    cur_x_train = X_vectorized[train_indices]\n",
    "    cur_y_train = Y[train_indices]\n",
    "\n",
    "    class_probs, word_probs = train_naive_bayes(cur_x_train, cur_y_train, vocab_size, num_classes)\n",
    "\n",
    "\n",
    "    y_pred = predict_naive_bayes(cur_x_test, class_probs, word_probs)\n",
    "    accuracy = (y_pred == cur_y_test).float().mean().item()\n",
    "    \n",
    "    f1_df = get_performance_f1(y_pred, cur_y_test, labels=list(range(num_classes)))\n",
    "    cur_metrics = {\n",
    "        'accuracy':accuracy\n",
    "    }\n",
    "    for i in range(num_classes):\n",
    "        row = f1_df.iloc[i]\n",
    "        cur_metrics[row.Label] = row.F1\n",
    "        \n",
    "    print(f\"fold {i} accuracy: {accuracy:.4f}\")\n",
    "    \n",
    "    for key, v in cur_metrics.items():\n",
    "        all_metrics[key] = all_metrics.get(key, []) + [v]\n",
    "\n",
    "        \n",
    "for key, v in all_metrics.items():\n",
    "    avg = sum(v) / len(v)\n",
    "    print(f\"Average {key}: {avg:.4f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "379c2e9a",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f3d7742",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5368d2e",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "explorestorygen",
   "language": "python",
   "name": "explorestorygen"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}