{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Week 4 - Labs Notebook 1 - Querying NCBI, Manipulating Sequences & Using BLAST @NCBI\n",
    "- October 2023\n",
    "- [https://https://github.com/tisimpson/bioinformatics1](https://github.com/tisimpson/bioinformatics1)\n",
    "- [ian.simpson@ed.ac.uk](mailto:ian.simpson@ed.ac.uk)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will go through the week3 notebook in the week 4 computing lab"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setting up the Environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check that the essential libraries are installed\n",
    "%pip install biopython\n",
    "%pip install PrettyTable\n",
    "\n",
    "# Import the Biopython functions we need.\n",
    "# Remember that you can look up BioPython functions here - https://biopython.org/docs/latest/api/Bio.html\n",
    "from Bio import SeqIO\n",
    "from Bio.Blast import NCBIWWW\n",
    "from Bio.Blast import NCBIXML\n",
    "from Bio import Entrez\n",
    "\n",
    "# PrettyTable is a nice package for formatting tables - https://pypi.org/project/prettytable/\n",
    "import prettytable\n",
    "from prettytable import PrettyTable\n",
    "\n",
    "#PLEASE DONT FORGET TO SET YOUR EMAIL THIS IS GOOD ETIQUETTE FOR A SERVICE USER\n",
    "EMAIL = 'some.user@anywhere.com'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 1A : Searching for Sequences at NCBI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# In this section we're going to be using BioPython to query the NCBI nucleotide database\n",
    "\n",
    "# Perform a search to obtain a list of gene ID's related to LHX2 Homo Sapiens gene\n",
    "gene_name = 'LHX2'\n",
    "species_name = 'Homo sapiens'\n",
    "\n",
    "Entrez.email = EMAIL\n",
    "\n",
    "# Here we use the contol keywords to target our search more accurately.\n",
    "# These are very nicely detailed here so you should use this as a reference - https://www.ncbi.nlm.nih.gov/books/NBK49540/\n",
    "search_string = gene_name+\"[Gene] AND \"+species_name+\"[Organism] AND mRNA[Filter] AND RefSeq[Filter]\"\n",
    "\n",
    "# You can replicate this search on the NCBI website\n",
    "# Go to https://www.ncbi.nlm.nih.gov/nucleotide/ and enter the search string:\n",
    "# \"LHX2\"[Gene] AND \"Homo sapiens\"[Organism] AND mRNA[Filter] AND RefSeq[Filter]\"\n",
    "\n",
    "#Now we have a search string to seach for ids\n",
    "handle = Entrez.esearch(db=\"nucleotide\", term=search_string)\n",
    "\n",
    "# This record variable holds the return result(s) from the search, note this can be a list of multiple entries\n",
    "record = Entrez.read(handle)\n",
    "ids = record['IdList']\n",
    "\n",
    "# Check how many results were returned\n",
    "print('Your search ('+search_string+') returned '+str(len(ids))+' hits')\n",
    "\n",
    "# NB this returns the IDs of the sequences not the sequences themselves, so we need to actually pull the sequences using the eFetch function.\n",
    "\n",
    "#TASK ONE - Practice searching for these genes PAX6, SIX3, SHH.\n",
    "#TASK TWO - Try searching for these same genes in Mouse [Mus musculus], Rat [Rattus norvegicus], and Chicken [Gallus galus]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 1B : Retreiving Sequences from NCBI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check the descriptions for each gene ID and load the DNA sequence of the closest match\n",
    "\n",
    "Entrez.email = EMAIL\n",
    "\n",
    "# For each ID returned above pull the database entry and print out the ID, the description, and any official accession IDs from the entry.\n",
    "# This search specifies to look in the NCBI nucleotides database, with the IDs and to return the results in GenBank file format as plain text.\n",
    "# The GenBank file format is very nicely explained here - https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html\n",
    "for gene_id in ids :\n",
    "    handle = Entrez.efetch(db=\"nucleotide\", id=gene_id, rettype=\"gb\", retmode=\"text\")\n",
    "    gene = SeqIO.read(handle, \"genbank\")\n",
    "    handle.close()\n",
    "\n",
    "    print(gene_id,gene.description,gene.annotations['accessions'])\n",
    "\n",
    "# There are lots of annotations to sequence records, but we're interested in accessions only here\n",
    "# TASK THREE - Experiment by just calling gene.annotations without the ['accessions'] selector and see what you get\n",
    "# TASK FOUR - Compare this to the sequence entry for these records on the NCBI website.    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 1C : Selecting the Correct Sequences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the sequence we want for the experiment.\n",
    "# If you remember we searched for RefSeq sequences only by using the RefSeq[Filter] term.\n",
    "# Look at the following table - https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly\n",
    "# This shows that some of the RefSeq entries are high-quality stil, but computationally predicted. You can see that what we really want here\n",
    "# are sequences with accessions beginning \"NM_\" because these are the highest quality (normally) curated protein-coding sequences\n",
    "# we can find these with a simple string search of our result annotations\n",
    "\n",
    "import re\n",
    "\n",
    "for gene_id in ids:\n",
    "    handle = Entrez.efetch(db=\"nucleotide\", id=gene_id, rettype=\"gb\", retmode=\"text\")\n",
    "    gene = SeqIO.read(handle, \"genbank\")\n",
    "    handle.close()\n",
    "    # Here we use the re (regular expression) package to perform a search in the list looking for accession matches to \"NM_\"\n",
    "    r = re.compile(\"NM_\")\n",
    "    hits = list(filter(r.match, gene.annotations['accessions']))\n",
    "    if (hits):\n",
    "        print(gene_id, gene.description, hits[0])\n",
    "\n",
    "# TASK FIVE - Here for simplicity I only take one hit to a RefSeq protein-coding transcript hit, but there could be several. Modfify the code \n",
    "# so that it could store multiple high quality (NM_) sequence IDs if they exist (HINT - make a new list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 2A : Exploring Sequence Objects"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We're now going to take the hard-coded accession NM_004789 and fetch that explictly\n",
    "\n",
    "handle = Entrez.efetch(db=\"nucleotide\", id=\"NM_004789\", rettype=\"gb\", retmode=\"text\")\n",
    "gene = SeqIO.read(handle, \"genbank\")\n",
    "handle.close()\n",
    "\n",
    "# Display the sequence details\n",
    "print(gene.id,gene.description,gene.annotations['accessions'])\n",
    "\n",
    "# BioPython is able to access the features of the GenBank format file object\n",
    "\n",
    "# # List all of the sequence feature (this is long) - commented out\n",
    "# for f in gene.features:\n",
    "#     print(f)\n",
    "\n",
    "# Find the coding sequence and translate it\n",
    "for f in gene.features:\n",
    "    if f.type=='CDS':\n",
    "        # show the details of the CDS elements\n",
    "        print('Coding sequence at:',f.location)\n",
    "        # convert the nucleotide coding sequence (remember this is an mRNA sequence record) into the protein sequence\n",
    "        print(f.translate(gene).seq)\n",
    "\n",
    "# This works by first iterating through all of the features in the BioSeq object and looking for ones that have type CDS\n",
    "# In this case there is one CDS entry for the whole coding region\n",
    "# We use the feature object function \"translate\" to convert the nucleotide sequence of the coding segment into a protein sequence.\n",
    "# The way this works is that it operates on the whole BioSeq object and translated the region that corresponds to the feature only.\n",
    "# The final .seq makes it return just the protein sequence as a new clean BioSeq object.\n",
    "\n",
    "# TASK SIX - Try extracting the protein sequences from tasks 1 and 2 above."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 2B : Working with Other Parts of Sequences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can export other parts of the sequence too.\n",
    "\n",
    "# Simply print the whole mRNA sequence\n",
    "print ('The whole mRNA molecule is '+str(len(gene.seq))+' nucleotides long')\n",
    "print(gene.seq)\n",
    "\n",
    "# Find the coding sequence location again\n",
    "for f in gene.features:\n",
    "    if f.type=='CDS':\n",
    "        # find the location of the coding sequence\n",
    "        coding_location = f.location\n",
    "\n",
    "# Because our sequence is an mRNA molecule anything that is downstream of the coding sequence is called the 3'-UTR (un-translated region)\n",
    "# This region can have interesting and sometimes highly conserved sequences in that can do things like target an mRNA molecule to a particular \n",
    "# sub-cellular compartment.\n",
    "# We can extract the 3-UTR by taking all the sequence after the coding sequence\n",
    "\n",
    "seq3utr = gene.seq[coding_location.end:]\n",
    "\n",
    "print('The 3-UTR is '+str(len(seq3utr))+' nucleotides long')\n",
    "print(seq3utr)\n",
    "\n",
    "# TASK SEVEN - Try extracting the 5'-UTR (this is the part of the mRNA molecule that is upstream of the coding sequence), how long is it?\n",
    "# TASK EIGHT - Try finding the lengths of the 5' and 3- UTRs of the sequences from tasks 1 and 2, above."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 3A : Runing BLAST and Parsing Output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now we're familiar with how to extract various bits from sequence objects that we download we can begin to use them as tools\n",
    "# We're going to run a remote BLAST job at NCBI using BioPython\n",
    "\n",
    "# To continue the example we are going to use the 3'-UTR sequence from LHX2 that we isolated above as the query sequence.\n",
    "\n",
    "# We have many choices for the target database:\n",
    "# nt - the full database\n",
    "# refseq_rna - Curated (NM_, NR_) plus predicted (XM_, XR_) sequences from NCBI Reference Sequence Project\n",
    "# refseq_genomic - Genomic sequences from NCBI Reference Sequence Project\n",
    "# for available databases, see:\n",
    "# ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_BLASTGuide.pdf\n",
    "\n",
    "# Set the database\n",
    "database = 'refseq_rna'\n",
    "# Specify our query, this is the simplest type - a blastn algorithm search of the refseq_rna database using the LHX2 3-UTR sequence\n",
    "result_handle = NCBIWWW.qblast(\"blastn\", database, seq3utr)\n",
    "blast_record = NCBIXML.read(result_handle)\n",
    "\n",
    "# NB These searches can take a while depending on how busy the servers are so you may need to wait for c.5 minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# NB the search above is simple, you can pass all the parameters to the search by modifying the qblast query string, for example:-\n",
    "\n",
    "# > blast_handle = NCBIWWW.qblast(\"blastp\", \"nr\",\n",
    "#                               peptide_seq,\n",
    "#                               expect=200000.0,\n",
    "#                               filter=False,\n",
    "#                               word_size=2,\n",
    "#                               composition_based_statistics=False,\n",
    "#                               matrix_name=\"PAM30\",\n",
    "#                               gapcosts=\"9 1\",\n",
    "#                               hitlist_size=1000)\n",
    "\n",
    "# You can choose from pretty much all of the BLAST command line parameters which are listed here - https://www.ncbi.nlm.nih.gov/books/NBK279684/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Here we provide a small function that will take the results returned from your BLAST queries and formats it into a nice table\n",
    "\n",
    "def result_handle_table(blast_record , query_len) :\n",
    "    \n",
    "    t = PrettyTable(['Description', 'Max Score' , 'Total Score' , 'Query Cover' , 'E Value' ,'Per Ident' , 'Acc Len' , 'Accession' ])\n",
    "    t._max_width = {\"Description\" : 30}\n",
    "    for alignment in blast_record.alignments: \n",
    "        score = 0\n",
    "        max_score = 0\n",
    "        query_cover = 0\n",
    "        perc_ident = 100\n",
    "        for hsp in alignment.hsps :\n",
    "            score = score + hsp.bits\n",
    "            max_score = max(max_score , hsp.bits)\n",
    "            query_cover += ((hsp.query_end - hsp.query_start) / query_len)\n",
    "            perc_ident = min(perc_ident , hsp.identities/hsp.align_length)\n",
    "        if hsp.expect < 0.05 : \n",
    "            t.add_row([alignment.hit_def.split('>')[0] , round(score) , round(max_score) , '{:3.0f}%'.format(query_cover*100) , '{:1.2e}'.format(hsp.expect) , '{:3.2f}%'.format(perc_ident*100)  , alignment.length  , alignment.accession])\n",
    "    \n",
    "    return print(t[0:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You can use the function like this (below) but make sure you've already run the cell above first or it won't know the function.\n",
    "result_handle_table(blast_record , len(seq3utr))\n",
    "\n",
    "# TASK NINE - Practice a variety of different search strategies with remote BLAST including adding additional parameters to control the search\n",
    "# TASK TEN - For at least some of the searches compare the tables produced from the function above to those from the search returned at the NCBI website."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 3B : Taking Advantage of NCBI Taxonomy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# NCBI have a tremendous Taxonomy system that allows you to both specificlaly search based on species and to determine the species\n",
    "# of sequences/results you have returned.\n",
    "\n",
    "# Perform a search to obtain the taxid (taxonomic id - the equivalent of an accession number but for a species) for homo sapiens\n",
    "organism = 'Homo sapiens' \n",
    "search_string = organism\n",
    "\n",
    "#Now we have a search string to seach for ids\n",
    "handle = Entrez.esearch(db=\"taxonomy\", term=search_string)\n",
    "record = Entrez.read(handle)\n",
    "ids = record['IdList']\n",
    "print(ids)\n",
    "\n",
    "# TASK ELEVEN - Try retrieving taxonomic ids for several other species, check them on the NCBI Taxonomy website - https://www.ncbi.nlm.nih.gov/taxonomy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We create a BLAST query in just the same way as above, but this time we are trying a more complicated strategy\n",
    "# Here we try a tblastx strtegy starting with the 3'-UTR of LHX2 (non-coding remember)\n",
    "# The system will perform a 6-frame translation of the nucleotide query and of the nucletoide RefSeq RNA target database\n",
    "# It will try to find \"protein\" matches between the two.\n",
    "\n",
    "# Database is set to RefSeq_RNA\n",
    "database = 'refseq_rna'\n",
    "\n",
    "# We next stitch together a query for a particular species in this case:\n",
    "# \"txid9606 [ORGN]' using the taxid we found for humans in the cell above.\n",
    "# NB the slightly odd composition of this query term the organism specification is of the form \"txidXXXX[ORGN]\" where XXXX is the taxonommic ID number\n",
    "# So for humans this would be 'txid9606[ORGN]'\n",
    "organism = 'txid' + ids[0] + '[ORGN]'\n",
    "\n",
    "# We now perform the BLAST search speficially limiting it to a given species. This would be very useful if you wanted to to species \n",
    "# specific BLAST searches.\n",
    "result_handle = NCBIWWW.qblast(\"tblastx\",\n",
    "            database,\n",
    "            seq3utr,\n",
    "            gapcosts = '11 1',\n",
    "            entrez_query = organism,\n",
    "            expect=0.05,\n",
    "            word_size=3,\n",
    "            matrix_name=\"BLOSUM62\",\n",
    "            hitlist_size=100)\n",
    "\n",
    "blast_record = NCBIXML.read(result_handle)\n",
    "\n",
    "# NB This is likely to be a very time-consuming search as it is a tblastx (36-way) query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_handle_table(blast_record , len(seq3utr))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 4 : BLAST Challenges\n",
    "\n",
    "Finally you can try to put together a strategy to address the questions below using the techniques you've been practicing above\n",
    "\n",
    "* Perform a protein blast search to identify another organism with a fully function LHX2 gene\n",
    "* Identify the function of the LHX2 gene\n",
    "* Look for distantly related sequences and confirm in literature if this organism has a functioning LHX2 gene\n",
    "\n",
    "Tips for challenges involve performing a range of different BLAST searches for example:\n",
    "- searching different databases\n",
    "- using different BLAST algorithms\n",
    "- using different search parameters (such as different substitution matrices, remember the purposes of differnt PAM and BLOSUM matrices)\n",
    "- try some of the more exotic BLAST searches at the NCBI webite such as PSI-BLAST and DELTA-BLAST"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.8.8 ('base')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  },
  "vscode": {
   "interpreter": {
    "hash": "c79478e135452d4f8dcea3898ce85a4457be8d06848dc07bbec8d2854f4ceed7"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}