{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 2 - Accessing & Working with DNA, RNA & Protein Sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we will start working with biological sequences by retreiving records, looking at their structure and the information that is associated with them. We will also start manipulating the sequences and performing some basic analysis to become more familiar with the sorts of operations and processes we can perform.\n",
    "\n",
    "We have included web links were appropriate to additional information and web based resrouces that can be used to either replace or complement working in the Python environment. It is absolutely fine to use web based tools to perform Bioinformatic work, but those tools are often limited in their functionality in ways that eventually become problematic in real-life anaysis situations. This is why, if you would like to pursue further study and/or research in Bioinformatics and related disciplines it is a good plan to begin learning the two core programming languages that are in common use, namely [Python](https://www.learnpython.org) and the Statistical programming language [R](https://cran.r-project.org)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# install and/or load BioPython\n",
    "%pip install biopython\n",
    "\n",
    "# replace this with your e-mail address\n",
    "EMAIL = 'A.N.Other@example.com'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we load the Entrez module from BioPython.\n",
    "\n",
    "You can read the description of this module [here](https://biopython.org/DIST/docs/api/Bio.Entrez-module.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from Bio import Entrez\n",
    "\n",
    "Entrez.email = EMAIL\n",
    "\n",
    "# note the egquery function provides Entrez database counts from a global search.\n",
    "handle = Entrez.egquery(term=\"Cypripedioideae\")\n",
    "record = Entrez.read(handle)\n",
    "handle.close()\n",
    "\n",
    "print(type(record))\n",
    "\n",
    "# Look at what is inside the record object\n",
    "print(record.keys())\n",
    "\n",
    "# The first contains the search term\n",
    "print(record['Term'])\n",
    "\n",
    "# The second contains a list of results from different Entrez Databases\n",
    "for row in record['eGQueryResult']:\n",
    "    print(row)\n",
    "\n",
    "# we can iterate through the record and only return the 'nucleotide' result\n",
    "for row in record[\"eGQueryResult\"]:\n",
    "    if row[\"DbName\"]==\"nuccore\":\n",
    "        print('***',row)\n",
    "        # print just how many nucleotide entries there are\n",
    "        print(row[\"Count\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note the number of nucleotide sequences returned and compare it to the result you get if you seach for \"Cypripedioideae\" using the [Entrez Search Webpage](https://www.ncbi.nlm.nih.gov/search/). For interest, these are a sub-family of Orchid (one member is the [Lady's Slipper Orchid](https://en.wikipedia.org/wiki/Cypripedium_calceolus))\n",
    "\n",
    "Lets now select a particular sequence and download it for further analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from Bio import Entrez\n",
    "\n",
    "Entrez.email = EMAIL\n",
    "\n",
    "# we're going to search for up to 1000 sequences and we're going to ask for the accession number for each\n",
    "\n",
    "# note the Entrez esearch function searches and returns a handle to the results.\n",
    "handle = Entrez.esearch(db='nucleotide',term=\"Cypripedioideae\",retmax=1000,idtype='acc')\n",
    "record = Entrez.read(handle)\n",
    "handle.close()\n",
    "\n",
    "#look at the first 10 ids\n",
    "print(record['IdList'][:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "#lets fetch one\n",
    "accession = record['IdList'][500]\n",
    "\n",
    "handle = Entrez.efetch(db=\"nucleotide\", id=accession, retmode=\"xml\")\n",
    "entry = Entrez.read(handle)\n",
    "handle.close()\n",
    "\n",
    "#print the whole entry (this is a GenBank record in XML format)\n",
    "print(entry)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(entry[0]['GBSeq_definition'])\n",
    "print(entry[0]['GBSeq_organism'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can retreive the record in a more user-friendly format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "handle = Entrez.efetch(db=\"nuccore\", id=accession, rettype=\"gb\", retmode=\"text\")\n",
    "print(handle.read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the Bio.SeqIO module which handles groups of records to capture the search and create a Bio.Seq.Seq sequence object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from Bio import SeqIO\n",
    "handle = Entrez.efetch(db=\"nuccore\", id=accession, rettype=\"gb\", retmode=\"text\")\n",
    "records = SeqIO.parse(handle, \"gb\")\n",
    "\n",
    "for entry in records:\n",
    "    sequence = entry.seq\n",
    "    print(sequence)\n",
    "    print(type(sequence))\n",
    "    \n",
    "print('complement',sequence.complement())\n",
    "print('reverse_complement',sequence.reverse_complement())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The real power of this system comes when you want to search and work with a lot of sequences.\n",
    "\n",
    "Lets say we want to search for Gene entries for Pax6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "#search for\n",
    "\n",
    "from Bio import Entrez\n",
    "\n",
    "Entrez.email = EMAIL\n",
    "\n",
    "# we're going to limit this to 100 sequences and we're going to ask for the accession number for each\n",
    "\n",
    "# note the Entrez esearch function searches and returns a handle to the results.\n",
    "handle = Entrez.esearch(db='nucleotide',term=\"Pax6[Gene]\",retmax=100)\n",
    "record = Entrez.read(handle)\n",
    "handle.close()\n",
    "\n",
    "#look at the first 10 ids\n",
    "print(record['IdList'][:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# now lets fetch them all, to do this we extract the accession id list\n",
    "\n",
    "gi_list = record['IdList']\n",
    "\n",
    "#then turn it into a comma-separated string\n",
    "\n",
    "gi_str = \",\".join(gi_list)\n",
    "\n",
    "handle = Entrez.efetch(db=\"nucleotide\", id=gi_str, rettype=\"gb\", retmode=\"text\")\n",
    "records = SeqIO.parse(handle, \"gb\")\n",
    "\n",
    "for record in records:\n",
    "    print(\"%s, length %i, from organism %s\" % (record.name, len(record), record.description))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we're going to pull a full gene entry for human Pax6 from Genbank and look at it, we can also do this online by clicking [here](https://www.ncbi.nlm.nih.gov/nuccore/208879460)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "outputPrepend"
    ]
   },
   "outputs": [],
   "source": [
    "from Bio import Entrez\n",
    "\n",
    "Entrez.email = EMAIL\n",
    "handle = Entrez.efetch(db=\"nucleotide\", id=\"208879460\", rettype=\"gb\", retmode=\"text\")\n",
    "gb_entry = handle.read()\n",
    "handle.close()\n",
    "\n",
    "#NB this is just a straight string at this point (as we just read() it straight into a string object)\n",
    "print(gb_entry)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we're going to extract the coding sequence from this entry and translate it into protein"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from Bio import SeqIO\n",
    "from Bio import Entrez\n",
    "\n",
    "Entrez.email = EMAIL\n",
    "handle = Entrez.efetch(db=\"nucleotide\", id=\"208879460\", rettype=\"gb\", retmode=\"text\")\n",
    "record = SeqIO.read(handle, \"genbank\")\n",
    "\n",
    "if record.features:\n",
    "    for feature in record.features:\n",
    "        #this tag identifies the CoDingSequences from the record\n",
    "        if feature.type == \"CDS\":\n",
    "            print(feature.qualifiers[\"protein_id\"])\n",
    "            print(feature.location,'\\n')\n",
    "            current_sequence = feature.location.extract(record).seq\n",
    "            print('Nucleotide Sequence')\n",
    "            print(current_sequence,'\\n')\n",
    "            #translate the current sequence into protein\n",
    "            print('Protein Sequence')\n",
    "            print(current_sequence.translate(),'\\n')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "outputPrepend"
    ]
   },
   "outputs": [],
   "source": [
    "from Bio import Entrez\n",
    "\n",
    "Entrez.email = EMAIL\n",
    "\n",
    "# note the Entrez esearch function searches and returns a handle to the results.\n",
    "handle = Entrez.esearch(db='gene',term=\"Nrg1[Gene] AND human\",retmax=100)\n",
    "record = Entrez.read(handle)\n",
    "handle.close()\n",
    "\n",
    "#look at the first 10 ids\n",
    "print(record['IdList'][:10])\n",
    "\n",
    "handle = Entrez.efetch(db=\"gene\", id=record['IdList'][:1],retmode='xml')\n",
    "gb_enrty = handle\n",
    "handle.close()\n",
    "\n",
    "#NB this is just a straight string at this point (as we just read() it straight into a string object)\n",
    "print(type(gb_entry))\n",
    "\n",
    "print(gb_entry)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Challenge 1 - Finding Genes with NCBI-Entrez\n",
    "Using either the Entrez website to search and/or using what you've learned about BioPython's abilities to query NCBI services retreive entries for a gene called Nrg1.\n",
    "- How many different gene entries are there for this gene in NCBI databases?\n",
    "- What is the full name of this gene?\n",
    "- What kind of protein does this gene encode?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Challenge 2 - Human and Mouse Nrg1 Genes\n",
    "Using either the Entrez website to search and/or using what you've learned about BioPython's abilities to query NCBI services retreive full-length human and mouse (RefSeq) gene entries for Nrg1.\n",
    "- What are the accession numbers / ids of the Genbank records?\n",
    "- How long are the Human and Mouse NRG1, Nrg1 proteins?\n",
    "- How many nucleotide sequence differences are there between their longest CDs?\n",
    "- How many protein sequence differences are there between their longest proteins?"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.8.3 ('base')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  },
  "vscode": {
   "interpreter": {
    "hash": "c79478e135452d4f8dcea3898ce85a4457be8d06848dc07bbec8d2854f4ceed7"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}