{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 3 - Pairwise Alignment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we will briefly show you how to perform pairwise alignments with Biopython and how to work with some alignment files. Note that we're only performing pairwise alignments here, using local and global approaches. Multiple sequence alignments will be covered later in the course." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# install and/or load BioPython\n", "%pip install biopython\n", "\n", "# replace this with your e-mail address\n", "EMAIL = 'A.N.Other@somewhere.com'" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['BENNER22', 'BENNER6', 'BENNER74', 'BLOSUM45', 'BLOSUM50', 'BLOSUM62', 'BLOSUM80', 'BLOSUM90', 'DAYHOFF', 'FENG', 'GENETIC', 'GONNET1992', 'HOXD70', 'JOHNSON', 'JONES', 'LEVIN', 'MCLACHLAN', 'MDM78', 'NUC.4.4', 'PAM250', 'PAM30', 'PAM70', 'RAO', 'RISLER', 'SCHNEIDER', 'STR', 'TRANS']\n" ] } ], "source": [ "from Bio import Entrez\n", "from Bio import SeqIO\n", "from Bio import pairwise2 as pw\n", "from Bio import AlignIO\n", "from Bio import Align as al\n", "\n", "# list the available scoring matrices from the SubsMat module\n", "print(al.substitution_matrices.load())\n", "\n", "Entrez.email = EMAIL\n", "\n", "# the accession ids of human beta-globin and myoglobin proteins respectively\n", "protein_ids = ['NP_000509.1','NP_005359.1']\n", "\n", "handle = Entrez.efetch(db=\"protein\", id=protein_ids, rettype='fasta',retmode=\"text\")\n", "records = list(SeqIO.parse(handle, \"fasta\"))\n", "handle.close()\n", "\n", "# use these as sequence objects\n", "beta_globin = records[0].seq\n", "myoglobin = records[1].seq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [pairwise2 module](https://biopython.org/docs/dev/api/Bio.pairwise2.html) has two main functions for alignment 'local' and 'global' when they are called you add two charcters on to those to define how you want to perform the search, for example globalxx or localxx. Those two letters define the following:-\n", "\n", "```\n", "CODE DESCRIPTION\n", "x No parameters. Identical characters have score of 1, otherwise 0.\n", "m A match score is the score of identical chars, otherwise mismatch\n", " score.\n", "d A dictionary returns the score of any pair of characters.\n", "c A callback function returns scores.\n", "The gap penalty parameters are:\n", "\n", "CODE DESCRIPTION\n", "x No gap penalties.\n", "s Same open and extend gap penalties for both sequences.\n", "d The sequences have different open and extend gap penalties.\n", "c A callback function returns the gap penalties.\n", "```\n", "\n", "Further details can be found [here](https://biopython.org/DIST/docs/api/Bio.pairwise2-module.html).\n", "\n", "As an example though a call of :-\n", "```\n", "pairwise2.align.globalms(\"ACCGT\", \"ACG\", 2, -2, -.5, -.1)\n", "```\n", "\n", "for 'm' you specify match (+2) and mismatch (-2) scores and then 's' you specify gap-open (-0.5) and gap-extend (-0.1) scores" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n", "172.5\n", "3\n", "148\n", "CLUSTAL X (1.81) multiple sequence alignment\n", "\n", "\n", "NP_000509.1 MVHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGD\n", "NP_005359.1 -MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKH\n", "\n", "NP_000509.1 LSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLH\n", "NP_005359.1 LKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHK\n", "\n", "NP_000509.1 VDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH-\n", "NP_005359.1 IPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKE\n", "\n", "NP_000509.1 -----\n", "NP_005359.1 LGFQG\n", "\n", "\n", "\n" ] } ], "source": [ "# perform a pairwise local alignment using the pam250 substitution matrix\n", "mx = al.substitution_matrices.load('PAM250')\n", "\n", "alignments = pw.align.localds(beta_globin,myoglobin,mx, -10, -0.5)\n", "\n", "# this tells us how many alignments have the same optimal score (pretty useful, think of cells with more than \n", "# one backtrace mark in the hand-drawn alignments)\n", "print(len(alignments))\n", "\n", "# in the result we can extract several score features\n", "\n", "# the alignment score\n", "print(alignments[0][2])\n", "\n", "# the start of the alignment\n", "print(alignments[0][3])\n", "\n", "# the end of the alignment\n", "print(alignments[0][4])\n", "\n", "# note here we are using 'd' the pam250 scoring matrix and then 's' gap-open (-10) and gap-extend (-0.5)\n", "\n", "# unfortunately pairwise2 output looks awful (but its a good built in alignment method for you to practice\n", "# with sequence alignment matrices and scoring systems\n", "\n", "# so we're going to make a very basic fasta format file and use AlignIO to convert it into Clustal alignment\n", "# format which is much nicer to look at\n", "\n", "#create the fasta format\n", "# > aligned seq 1\n", "# SEQUENCE\n", "# > aligned seq 2\n", "# SEQUENCE\n", "\n", "alignment_fasta = \\\n", "\">\"+records[0].name+\" \"+records[0].description+\"\\n\"+alignments[0][0] \\\n", "+\"\\n\"+ \\\n", "\">\"+records[1].name+\" \"+records[1].description+\"\\n\"+alignments[0][1]\n", "\n", "# write it to a file\n", "fh = open('data/globin_alignment_pam250.fa','w')\n", "fh.write(alignment_fasta)\n", "fh.close()\n", "\n", "# read in the file using AlignIO\n", "alignment = AlignIO.read(\"data/globin_alignment_pam250.fa\", \"fasta\")\n", "\n", "# convert to clustal\n", "print(format(alignment,'clustal'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# perform a pairwise local alignment using the pam10 substitution matrix\n", "mx = al.substitution_matrices.load('PAM30')\n", "\n", "alignments = pw.align.localds(beta_globin,myoglobin,pam30, -10, -0.5)\n", "\n", "alignment_fasta = \\\n", "\">\"+records[0].name+\" \"+records[0].description+\"\\n\"+alignments[0][0] \\\n", "+\"\\n\"+ \\\n", "\">\"+records[1].name+\" \"+records[1].description+\"\\n\"+alignments[0][1]\n", "\n", "# write it to a file\n", "fh = open('data/globin_alignment_pam30.fa','w')\n", "fh.write(alignment_fasta)\n", "fh.close()\n", "\n", "# read in the file using AlignIO\n", "alignment = AlignIO.read(\"data/globin_alignment_pam30.fa\", \"fasta\")\n", "\n", "# convert to clustal\n", "print(format(alignment,'clustal'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# perform a pairwise global alignment using the pam250 substitution matrix\n", "mx = al.substitution_matrices.load('PAM250')\n", "\n", "alignments = pw.align.globalds(beta_globin,myoglobin,pam250, -10, -0.5)\n", "\n", "# this tells us how many alignments have the same optimal score (pretty useful, think of cells with more than \n", "# one backtrace mark in the hand-drawn alignments)\n", "print(len(alignments))\n", "\n", "# in the result we can extract several score features\n", "\n", "# the alignment score\n", "print(alignments[0][2])\n", "\n", "# the start of the alignment (NB global alignments must always start at 0)\n", "print(alignments[0][3])\n", "\n", "# the end of the alignment\n", "print(alignments[0][4])\n", "\n", "alignment_fasta = \\\n", "\">\"+records[0].name+\" \"+records[0].description+\"\\n\"+alignments[0][0] \\\n", "+\"\\n\"+ \\\n", "\">\"+records[1].name+\" \"+records[1].description+\"\\n\"+alignments[0][1]\n", "\n", "# write it to a file\n", "fh = open('data/globin_alignment_pam250_global.fa','w')\n", "fh.write(alignment_fasta)\n", "fh.close()\n", "\n", "# read in the file using AlignIO\n", "alignment = AlignIO.read(\"data/globin_alignment_pam250_global.fa\", \"fasta\")\n", "\n", "# convert to clustal\n", "print(format(alignment,'clustal'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Challenge - Practice Experiments\n", "By modifying the code above try pulling different kinds of sequences to align using both local and alignment methods. Play with the choice of alignment matrix and different parameters to see what effects they have on the alignments generated." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.3 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "vscode": { "interpreter": { "hash": "c79478e135452d4f8dcea3898ce85a4457be8d06848dc07bbec8d2854f4ceed7" } } }, "nbformat": 4, "nbformat_minor": 2 }