{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Week 4 - Labs Notebook 1 - Querying NCBI, Manipulating Sequences & Using BLAST @NCBI\n", "- October 2023\n", "- [https://https://github.com/tisimpson/bioinformatics1](https://github.com/tisimpson/bioinformatics1)\n", "- [ian.simpson@ed.ac.uk](mailto:ian.simpson@ed.ac.uk)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will go through the week3 notebook in the week 4 computing lab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting up the Environment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check that the essential libraries are installed\n", "%pip install biopython\n", "%pip install PrettyTable\n", "\n", "# Import the Biopython functions we need.\n", "# Remember that you can look up BioPython functions here - https://biopython.org/docs/latest/api/Bio.html\n", "from Bio import SeqIO\n", "from Bio.Blast import NCBIWWW\n", "from Bio.Blast import NCBIXML\n", "from Bio import Entrez\n", "\n", "# PrettyTable is a nice package for formatting tables - https://pypi.org/project/prettytable/\n", "import prettytable\n", "from prettytable import PrettyTable\n", "\n", "#PLEASE DONT FORGET TO SET YOUR EMAIL THIS IS GOOD ETIQUETTE FOR A SERVICE USER\n", "EMAIL = 'some.user@anywhere.com'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 1A : Searching for Sequences at NCBI" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# In this section we're going to be using BioPython to query the NCBI nucleotide database\n", "\n", "# Perform a search to obtain a list of gene ID's related to LHX2 Homo Sapiens gene\n", "gene_name = 'LHX2'\n", "species_name = 'Homo sapiens'\n", "\n", "Entrez.email = EMAIL\n", "\n", "# Here we use the contol keywords to target our search more accurately.\n", "# These are very nicely detailed here so you should use this as a reference - https://www.ncbi.nlm.nih.gov/books/NBK49540/\n", "search_string = gene_name+\"[Gene] AND \"+species_name+\"[Organism] AND mRNA[Filter] AND RefSeq[Filter]\"\n", "\n", "# You can replicate this search on the NCBI website\n", "# Go to https://www.ncbi.nlm.nih.gov/nucleotide/ and enter the search string:\n", "# \"LHX2\"[Gene] AND \"Homo sapiens\"[Organism] AND mRNA[Filter] AND RefSeq[Filter]\"\n", "\n", "#Now we have a search string to seach for ids\n", "handle = Entrez.esearch(db=\"nucleotide\", term=search_string)\n", "\n", "# This record variable holds the return result(s) from the search, note this can be a list of multiple entries\n", "record = Entrez.read(handle)\n", "ids = record['IdList']\n", "\n", "# Check how many results were returned\n", "print('Your search ('+search_string+') returned '+str(len(ids))+' hits')\n", "\n", "# NB this returns the IDs of the sequences not the sequences themselves, so we need to actually pull the sequences using the eFetch function.\n", "\n", "#TASK ONE - Practice searching for these genes PAX6, SIX3, SHH.\n", "#TASK TWO - Try searching for these same genes in Mouse [Mus musculus], Rat [Rattus norvegicus], and Chicken [Gallus galus]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 1B : Retreiving Sequences from NCBI" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check the descriptions for each gene ID and load the DNA sequence of the closest match\n", "\n", "Entrez.email = EMAIL\n", "\n", "# For each ID returned above pull the database entry and print out the ID, the description, and any official accession IDs from the entry.\n", "# This search specifies to look in the NCBI nucleotides database, with the IDs and to return the results in GenBank file format as plain text.\n", "# The GenBank file format is very nicely explained here - https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html\n", "for gene_id in ids :\n", " handle = Entrez.efetch(db=\"nucleotide\", id=gene_id, rettype=\"gb\", retmode=\"text\")\n", " gene = SeqIO.read(handle, \"genbank\")\n", " handle.close()\n", "\n", " print(gene_id,gene.description,gene.annotations['accessions'])\n", "\n", "# There are lots of annotations to sequence records, but we're interested in accessions only here\n", "# TASK THREE - Experiment by just calling gene.annotations without the ['accessions'] selector and see what you get\n", "# TASK FOUR - Compare this to the sequence entry for these records on the NCBI website. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 1C : Selecting the Correct Sequences" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the sequence we want for the experiment.\n", "# If you remember we searched for RefSeq sequences only by using the RefSeq[Filter] term.\n", "# Look at the following table - https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly\n", "# This shows that some of the RefSeq entries are high-quality stil, but computationally predicted. You can see that what we really want here\n", "# are sequences with accessions beginning \"NM_\" because these are the highest quality (normally) curated protein-coding sequences\n", "# we can find these with a simple string search of our result annotations\n", "\n", "import re\n", "\n", "for gene_id in ids:\n", " handle = Entrez.efetch(db=\"nucleotide\", id=gene_id, rettype=\"gb\", retmode=\"text\")\n", " gene = SeqIO.read(handle, \"genbank\")\n", " handle.close()\n", " # Here we use the re (regular expression) package to perform a search in the list looking for accession matches to \"NM_\"\n", " r = re.compile(\"NM_\")\n", " hits = list(filter(r.match, gene.annotations['accessions']))\n", " if (hits):\n", " print(gene_id, gene.description, hits[0])\n", "\n", "# TASK FIVE - Here for simplicity I only take one hit to a RefSeq protein-coding transcript hit, but there could be several. Modfify the code \n", "# so that it could store multiple high quality (NM_) sequence IDs if they exist (HINT - make a new list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 2A : Exploring Sequence Objects" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We're now going to take the hard-coded accession NM_004789 and fetch that explictly\n", "\n", "handle = Entrez.efetch(db=\"nucleotide\", id=\"NM_004789\", rettype=\"gb\", retmode=\"text\")\n", "gene = SeqIO.read(handle, \"genbank\")\n", "handle.close()\n", "\n", "# Display the sequence details\n", "print(gene.id,gene.description,gene.annotations['accessions'])\n", "\n", "# BioPython is able to access the features of the GenBank format file object\n", "\n", "# # List all of the sequence feature (this is long) - commented out\n", "# for f in gene.features:\n", "# print(f)\n", "\n", "# Find the coding sequence and translate it\n", "for f in gene.features:\n", " if f.type=='CDS':\n", " # show the details of the CDS elements\n", " print('Coding sequence at:',f.location)\n", " # convert the nucleotide coding sequence (remember this is an mRNA sequence record) into the protein sequence\n", " print(f.translate(gene).seq)\n", "\n", "# This works by first iterating through all of the features in the BioSeq object and looking for ones that have type CDS\n", "# In this case there is one CDS entry for the whole coding region\n", "# We use the feature object function \"translate\" to convert the nucleotide sequence of the coding segment into a protein sequence.\n", "# The way this works is that it operates on the whole BioSeq object and translated the region that corresponds to the feature only.\n", "# The final .seq makes it return just the protein sequence as a new clean BioSeq object.\n", "\n", "# TASK SIX - Try extracting the protein sequences from tasks 1 and 2 above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 2B : Working with Other Parts of Sequences" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can export other parts of the sequence too.\n", "\n", "# Simply print the whole mRNA sequence\n", "print ('The whole mRNA molecule is '+str(len(gene.seq))+' nucleotides long')\n", "print(gene.seq)\n", "\n", "# Find the coding sequence location again\n", "for f in gene.features:\n", " if f.type=='CDS':\n", " # find the location of the coding sequence\n", " coding_location = f.location\n", "\n", "# Because our sequence is an mRNA molecule anything that is downstream of the coding sequence is called the 3'-UTR (un-translated region)\n", "# This region can have interesting and sometimes highly conserved sequences in that can do things like target an mRNA molecule to a particular \n", "# sub-cellular compartment.\n", "# We can extract the 3-UTR by taking all the sequence after the coding sequence\n", "\n", "seq3utr = gene.seq[coding_location.end:]\n", "\n", "print('The 3-UTR is '+str(len(seq3utr))+' nucleotides long')\n", "print(seq3utr)\n", "\n", "# TASK SEVEN - Try extracting the 5'-UTR (this is the part of the mRNA molecule that is upstream of the coding sequence), how long is it?\n", "# TASK EIGHT - Try finding the lengths of the 5' and 3- UTRs of the sequences from tasks 1 and 2, above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 3A : Runing BLAST and Parsing Output" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Now we're familiar with how to extract various bits from sequence objects that we download we can begin to use them as tools\n", "# We're going to run a remote BLAST job at NCBI using BioPython\n", "\n", "# To continue the example we are going to use the 3'-UTR sequence from LHX2 that we isolated above as the query sequence.\n", "\n", "# We have many choices for the target database:\n", "# nt - the full database\n", "# refseq_rna - Curated (NM_, NR_) plus predicted (XM_, XR_) sequences from NCBI Reference Sequence Project\n", "# refseq_genomic - Genomic sequences from NCBI Reference Sequence Project\n", "# for available databases, see:\n", "# ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_BLASTGuide.pdf\n", "\n", "# Set the database\n", "database = 'refseq_rna'\n", "# Specify our query, this is the simplest type - a blastn algorithm search of the refseq_rna database using the LHX2 3-UTR sequence\n", "result_handle = NCBIWWW.qblast(\"blastn\", database, seq3utr)\n", "blast_record = NCBIXML.read(result_handle)\n", "\n", "# NB These searches can take a while depending on how busy the servers are so you may need to wait for c.5 minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# NB the search above is simple, you can pass all the parameters to the search by modifying the qblast query string, for example:-\n", "\n", "# > blast_handle = NCBIWWW.qblast(\"blastp\", \"nr\",\n", "# peptide_seq,\n", "# expect=200000.0,\n", "# filter=False,\n", "# word_size=2,\n", "# composition_based_statistics=False,\n", "# matrix_name=\"PAM30\",\n", "# gapcosts=\"9 1\",\n", "# hitlist_size=1000)\n", "\n", "# You can choose from pretty much all of the BLAST command line parameters which are listed here - https://www.ncbi.nlm.nih.gov/books/NBK279684/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Here we provide a small function that will take the results returned from your BLAST queries and formats it into a nice table\n", "\n", "def result_handle_table(blast_record , query_len) :\n", " \n", " t = PrettyTable(['Description', 'Max Score' , 'Total Score' , 'Query Cover' , 'E Value' ,'Per Ident' , 'Acc Len' , 'Accession' ])\n", " t._max_width = {\"Description\" : 30}\n", " for alignment in blast_record.alignments: \n", " score = 0\n", " max_score = 0\n", " query_cover = 0\n", " perc_ident = 100\n", " for hsp in alignment.hsps :\n", " score = score + hsp.bits\n", " max_score = max(max_score , hsp.bits)\n", " query_cover += ((hsp.query_end - hsp.query_start) / query_len)\n", " perc_ident = min(perc_ident , hsp.identities/hsp.align_length)\n", " if hsp.expect < 0.05 : \n", " t.add_row([alignment.hit_def.split('>')[0] , round(score) , round(max_score) , '{:3.0f}%'.format(query_cover*100) , '{:1.2e}'.format(hsp.expect) , '{:3.2f}%'.format(perc_ident*100) , alignment.length , alignment.accession])\n", " \n", " return print(t[0:10])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can use the function like this (below) but make sure you've already run the cell above first or it won't know the function.\n", "result_handle_table(blast_record , len(seq3utr))\n", "\n", "# TASK NINE - Practice a variety of different search strategies with remote BLAST including adding additional parameters to control the search\n", "# TASK TEN - For at least some of the searches compare the tables produced from the function above to those from the search returned at the NCBI website." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 3B : Taking Advantage of NCBI Taxonomy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# NCBI have a tremendous Taxonomy system that allows you to both specificlaly search based on species and to determine the species\n", "# of sequences/results you have returned.\n", "\n", "# Perform a search to obtain the taxid (taxonomic id - the equivalent of an accession number but for a species) for homo sapiens\n", "organism = 'Homo sapiens' \n", "search_string = organism\n", "\n", "#Now we have a search string to seach for ids\n", "handle = Entrez.esearch(db=\"taxonomy\", term=search_string)\n", "record = Entrez.read(handle)\n", "ids = record['IdList']\n", "print(ids)\n", "\n", "# TASK ELEVEN - Try retrieving taxonomic ids for several other species, check them on the NCBI Taxonomy website - https://www.ncbi.nlm.nih.gov/taxonomy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We create a BLAST query in just the same way as above, but this time we are trying a more complicated strategy\n", "# Here we try a tblastx strtegy starting with the 3'-UTR of LHX2 (non-coding remember)\n", "# The system will perform a 6-frame translation of the nucleotide query and of the nucletoide RefSeq RNA target database\n", "# It will try to find \"protein\" matches between the two.\n", "\n", "# Database is set to RefSeq_RNA\n", "database = 'refseq_rna'\n", "\n", "# We next stitch together a query for a particular species in this case:\n", "# \"txid9606 [ORGN]' using the taxid we found for humans in the cell above.\n", "# NB the slightly odd composition of this query term the organism specification is of the form \"txidXXXX[ORGN]\" where XXXX is the taxonommic ID number\n", "# So for humans this would be 'txid9606[ORGN]'\n", "organism = 'txid' + ids[0] + '[ORGN]'\n", "\n", "# We now perform the BLAST search speficially limiting it to a given species. This would be very useful if you wanted to to species \n", "# specific BLAST searches.\n", "result_handle = NCBIWWW.qblast(\"tblastx\",\n", " database,\n", " seq3utr,\n", " gapcosts = '11 1',\n", " entrez_query = organism,\n", " expect=0.05,\n", " word_size=3,\n", " matrix_name=\"BLOSUM62\",\n", " hitlist_size=100)\n", "\n", "blast_record = NCBIXML.read(result_handle)\n", "\n", "# NB This is likely to be a very time-consuming search as it is a tblastx (36-way) query." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result_handle_table(blast_record , len(seq3utr))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 4 : BLAST Challenges\n", "\n", "Finally you can try to put together a strategy to address the questions below using the techniques you've been practicing above\n", "\n", "* Perform a protein blast search to identify another organism with a fully function LHX2 gene\n", "* Identify the function of the LHX2 gene\n", "* Look for distantly related sequences and confirm in literature if this organism has a functioning LHX2 gene\n", "\n", "Tips for challenges involve performing a range of different BLAST searches for example:\n", "- searching different databases\n", "- using different BLAST algorithms\n", "- using different search parameters (such as different substitution matrices, remember the purposes of differnt PAM and BLOSUM matrices)\n", "- try some of the more exotic BLAST searches at the NCBI webite such as PSI-BLAST and DELTA-BLAST" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.8 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" }, "vscode": { "interpreter": { "hash": "c79478e135452d4f8dcea3898ce85a4457be8d06848dc07bbec8d2854f4ceed7" } } }, "nbformat": 4, "nbformat_minor": 1 }