{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Week 4 - Labs Notebook 2 - Multiple Sequence Alignment\n",
    "- October 2023\n",
    "- [https://https://github.com/tisimpson/bioinformatics1](https://github.com/tisimpson/bioinformatics1)\n",
    "- [ian.simpson@ed.ac.uk](mailto:ian.simpson@ed.ac.uk)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we are going to walk through an experiment where we retreive sequences, do BLAST searches and then use the results to format and then execute a multiple sequence alignment using the MUSCLE software package. At the end we even create a basic phylogenetic tree from the alignment and then visualise it.\n",
    "\n",
    "The setup of the MUSCLE aligner is a little technical so we strongly reccommend that if you want to use it you install it using conda using the command ``conda install -c bioconda muscle``"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 1 : Settting Up"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install biopython\n",
    "\n",
    "# import required Biophython functions \n",
    "from Bio import Entrez\n",
    "from Bio.Blast import NCBIXML\n",
    "from Bio.Blast import NCBIWWW\n",
    "from Bio import SeqIO\n",
    "from Bio.Seq import Seq\n",
    "from Bio.SeqRecord import SeqRecord\n",
    "from Bio.Align.Applications import MuscleCommandline\n",
    "from Bio import AlignIO\n",
    "from Bio.Align import AlignInfo\n",
    "from matplotlib import pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 2 : Obtaining Sequences from NCBI\n",
    "\n",
    "We're going to do this with Protein sequences, best to go to NCBI on the web to make sure you call for the correct sequence. The sequence accession should strt with \"NP_\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Entrez.email = 'ian.simpson@ed.ac.uk'\n",
    "\n",
    "# beta-globin, human\n",
    "my_protein = 'NP_000509.1' \n",
    "\n",
    "handle = Entrez.efetch(db=\"protein\", id=my_protein, rettype=\"gb\", retmode=\"text\")\n",
    "record = SeqIO.read(handle, \"genbank\")\n",
    "handle.close()\n",
    "\n",
    "# show the sequence record\n",
    "print(record)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 3 : Find Related Sequences Using BLAST"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_handle = NCBIWWW.qblast('blastp', 'swissprot', record.seq)\n",
    "# This may take some time to run\n",
    "\n",
    "# parse the results\n",
    "result_handle.seek(0)\n",
    "blast_record = NCBIXML.read(result_handle)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the data structure for the results, go [here](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc94)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#print out the key information for the hits\n",
    "\n",
    "print('Gene name\\te-value\\tscore')\n",
    "for a in blast_record.alignments:\n",
    "    print(a.title.split('|')[2].split('Full=')[1].split(';')[0]+'\\t'+str(a.hsps[0].expect)+'\\t'+str( a.hsps[0].score))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# show the species and alignment scores\n",
    "a=blast_record.alignments[0]\n",
    "sp_ids = []\n",
    "for a in blast_record.alignments:\n",
    "    sp_ids.append(a.title.split('|')[1])\n",
    "# print(\",\".join(sp_ids))\n",
    "handle = Entrez.efetch(db=\"protein\", id=\",\".join(sp_ids), retmode=\"xml\")#, rettype='gb')\n",
    "data = Entrez.read(handle)\n",
    "species = []\n",
    "print('Alignment score\\tSpecies')\n",
    "for i,d in enumerate(data):\n",
    "    species.append(d['GBSeq_source'])\n",
    "    print(str(blast_record.alignments[i].hsps[0].score)+'\\t'+d['GBSeq_source'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part Four : Select Sequences for Multiple Sequence Alignment\n",
    "\n",
    "Select sequences based on an e-value threshold then for each selected sequence print out:-\n",
    "- name of alignment\n",
    "- length of alignment\n",
    "- e-value\n",
    "- Query sequence\n",
    "- Matching sequence\n",
    "- Alignment info\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "E_VALUE_THRESH = 1e-6\n",
    "\n",
    "for i,alignment in enumerate(blast_record.alignments):\n",
    "    for hsp in alignment.hsps:\n",
    "        if hsp.expect < E_VALUE_THRESH:\n",
    "            print('****Alignment****')\n",
    "            print('sequence: ', alignment.title)\n",
    "            print('species: '+species[i])\n",
    "            print('length: ', alignment.length)\n",
    "            print('e value: ', hsp.expect)\n",
    "            print(hsp.query[0:75] + '...')\n",
    "            print(hsp.sbjct[0:75] + '...')\n",
    "            print(hsp.match[0:75] + '...')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part Five : Create FASTA File of Sequences for Alignment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# now work with all results with e-value below this value:\n",
    "E_VALUE_THRESH = 1e-6\n",
    "\n",
    "# the following will write all results into a FASTA file for the MSA \n",
    "\n",
    "def get_seqrecs(alignments, threshold):\n",
    "    # a little helper function to get the sequence records\n",
    "    for i,aln in enumerate(alignments):\n",
    "        for hsp in aln.hsps:\n",
    "            if hsp.expect < threshold:\n",
    "                id = species[i]\n",
    "                # id = aln.title.split('|')[1].split(' ')[0].split('_')[0]+'_'+species[i].replace(' ','_')\n",
    "                print(id)\n",
    "                yield SeqRecord(Seq(hsp.sbjct), id=id,description=str(aln.title.split('|')[1]))\n",
    "                break\n",
    " \n",
    "best_seqs = get_seqrecs(blast_record.alignments, E_VALUE_THRESH)\n",
    "# write out to a fasta file\n",
    "SeqIO.write(best_seqs, 'blast_selected_globins.fa', 'fasta')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part Six : Run MUSCLE Alignment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# run Muscle MSA\n",
    "cmdLine = 'muscle -align blast_selected_globins.fa -output blast_selected_globins_alignment.aln'\n",
    "os.popen(cmdLine)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#read in and then print out alignment\n",
    "alignment = AlignIO.read('blast_selected_globins_alignment.aln','fasta')\n",
    "print(alignment)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "summary_align = AlignInfo.SummaryInfo(alignment)\n",
    "\n",
    "# compute a consensus sequence by taking the most frequent letter\n",
    "# positions below a thresold similarity are shown as 'X'\n",
    "\n",
    "# the threshold can be adjusted by adding e.g. threshold=0.5\n",
    "\n",
    "print('Consensus sequence without gaps:')\n",
    "print(summary_align.dumb_consensus())\n",
    "print('Consensus sequence with gaps:')\n",
    "print(summary_align.gap_consensus())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print a Position Specific Score Matrix (PSSM)\n",
    "# this shows the number of letters counted at each locationa./mu    \n",
    "# in the sequence, which is shown in vertical along the left\n",
    "pssm = summary_align.pos_specific_score_matrix(summary_align.dumb_consensus(), chars_to_ignore = ['X'])\n",
    "print(pssm)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part Seven : Distantly Related Globins"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# run Muscle MSA\n",
    "import os\n",
    "\n",
    "# run Muscle MSA\n",
    "cmdLine = 'muscle -align globins.fa -output distant_globins.aln'\n",
    "os.popen(cmdLine)\n",
    "\n",
    "#read in and then print out alignment\n",
    "alignment = AlignIO.read('distant_globins.aln','fasta')\n",
    "\n",
    "#print out the alignment, note I am printing it using the CLUSTAL X file format, this is still a MUSCLE alignment\n",
    "print(format(alignment,'clustal'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 8 : Build a UPGMA tree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from Bio.Phylo.TreeConstruction import DistanceCalculator\n",
    "#from TreeConstruction import DistanceCalculator\n",
    "calculator = DistanceCalculator('identity')\n",
    "dm = calculator.get_distance(alignment)\n",
    "print(dm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from Bio.Phylo.TreeConstruction import DistanceTreeConstructor\n",
    "#from TreeConstruction import DistanceTreeConstructor\n",
    "# here supply the keyword upgma or nj\n",
    "# compare the trees you get from both methods\n",
    "constructor = DistanceTreeConstructor(calculator, 'upgma')\n",
    "tree = constructor.build_tree(alignment)\n",
    "print(tree)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "from Bio import Phylo\n",
    "# now draw the tree, try out these three methods:\n",
    "Phylo.draw_ascii(tree)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# or a nicer looking one\n",
    "plt.figure(figsize=(12,12))\n",
    "ax=plt.subplot(111)\n",
    "Phylo.draw(tree,axes=ax)"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "3780209f5c547b790e301ddcf82da1791fd2d3a86603b11e6f996468095ce25a"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}