Week 5 - Exploring Biological Databases
For the next two weeks we are going to be learning about a range of databases that we can interrogate to gather data about many of the biological systems and processes that we study. This week we will focus on becoming familiar with what’s available, what it can be used for and, crucially, how to recover that data and begin working with it ourselves. Building on that, next week we will work through case studies in some of the common application areas.
Introduction
Exploring Biological Databases
This week we are going to be learning about a selection of databases that we can interrogate to gather data about many of the biological systems and processes that we study. We will focus on becoming familiar with what’s available, what it can be used for and, crucially, how to recover that data and begin working with it ourselves. Building on that, next week we will work through case studies in some of the common application areas.
Data Handling
There are many ways to obtain and subsequently work with biological data, but typically we can break these down into those that use tools and those that involve coding or scripting. This mirrors the discussion we had in week 1 about the different ways in which researchers carry out their work. Both approaches are equally valid, but each has its own consequences.
Tools-based access
Broadly speaking, tool-based access is more accessible and is often specifically designed to let the user achieve common tasks and requests without too steep a learning curve. Behind the scenes, considerable effort will often have been put in to make sure the tool performs exactly as designed, minimising the risk of imprecise specification in requests and in return formats. It is also easy to share what was done with others. In fact, many such tools offer the ability to save and share searches, which can facilitate data sharing and open research practices. On the downside, not all sources of data provide such tools, and those that don't require fairly low-level direct access to data, often in the form of large and frequently complex flat-files. Sources that do provide tools may be quite restrictive in what parts of the data you can access, how you can build your queries and what format the results can be returned in.
Programmatic access
It is fair to say that most experienced Bioinformaticians will use a programmatic method to access and process data once they get beyond the preliminary exploratory phase. This can be anything from low-level UNIX commands such as 'curl' and 'sftp', or shell-scripts that collate many commands together in simple programs, through to scripting languages such as Perl, Python and R that commonly have packages for accessing specific data sources. All of the major sources of biological data expose services for programmatic access, either through direct FTP-like downloads of raw data or through database or API level access to services. Increasingly, these data sources expose their services through standard web-based interfaces rather than bespoke systems tied to a specific programming language. This means that a Bioinformatician can commonly access and interrogate the data source using a programming method of their choosing. The programmatic route is usually the more versatile, but it does come with risks. It is not uncommon for custom code written by Bioinformaticians to not perform as intended. This could be due to misunderstanding the data structures and services offered by the data source, or simply to poor code.
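To make the idea of API level access a little more concrete, here is a minimal sketch in Python that queries the NCBI E-utilities "esearch" service using only the standard library. The function names (build_esearch_url, parse_id_list, search) and the example query are our own choices for illustration, not part of any NCBI package, and the sketch assumes the JSON reply keeps its identifiers under "esearchresult" and "idlist". Treat it as a starting point rather than a recommended implementation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities "esearch" service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=5):
    """Return an esearch URL that asks for at most `retmax` hits as JSON."""
    params = {"db": db, "term": term, "retmode": "json", "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


def parse_id_list(response_text):
    """Extract the record identifiers from an esearch JSON reply.

    Assumes the identifiers sit under "esearchresult" -> "idlist".
    """
    data = json.loads(response_text)
    return data["esearchresult"]["idlist"]


def search(db, term, retmax=5):
    """Send the query over the network and return the matching identifiers."""
    with urlopen(build_esearch_url(db, term, retmax), timeout=30) as reply:
        return parse_id_list(reply.read().decode("utf-8"))


# Example call (needs an internet connection, so it is left commented out):
# print(search("gene", "BRCA1[Gene Name] AND Homo sapiens[Organism]"))
```

Keeping the URL building and the parsing in separate functions from the network call means each part can be checked on its own, which helps guard against the "code that does not perform as intended" problem described above.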
Databases for Genes - NCBI & Ensembl
Gene Databases (Part One)
Gene Databases (Part Two)
Literature Searching - PubMed
In the video we refer to a few useful pages that contain helpful information for using PubMed.
Searching PubMed
Structuring Data for Biology - BioPortal & EBI-OLS
In this video we introduce ontologies as structures that enable us to annotate and analyse biological data in a consistent and comparable manner and encode prior knowledge about the biological systems we are studying.
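As a rough illustration of how an ontology encodes prior knowledge, the sketch below stores a tiny set of "is_a" links and uses them to find genes annotated with a term or with any more specific term beneath it. The terms, links, gene names and function names are all invented for this example and are not taken from any real ontology or from BioPortal or EBI-OLS.

```python
# Toy ontology: each term maps to its direct parents ("is_a" links).
# Terms and links are invented for illustration only.
IS_A = {
    "hexokinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, is_a):
    """Return every term reachable from `term` by following is_a links."""
    found = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(is_a.get(parent, []))
    return found


def annotated_with(gene_annotations, query, is_a):
    """Return genes annotated with `query` directly or via a more specific term."""
    return sorted(
        gene
        for gene, terms in gene_annotations.items()
        if any(t == query or query in ancestors(t, is_a) for t in terms)
    )


# Hypothetical annotations: geneA carries a specific term, geneB a general one.
ANNOTATIONS = {
    "geneA": ["hexokinase activity"],
    "geneB": ["molecular function"],
}
```

Asking for genes with "kinase activity" returns geneA even though it was never annotated with that exact term, because the hierarchy tells us that a hexokinase is a kind of kinase.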
Protein-Protein Interaction Data - Intact & BioGrid
In this video we introduce two of the largest protein-protein interaction databases, BioGrid & Intact (EBI). We discuss some of the factors to consider when querying and interpreting data from them, including the importance of recognising and evaluating the confidence levels associated with individual interactions. These databases are a nice example of how reporting standardisation (through the HUPO PSI-MI 3.0 data definition) and the incorporation of bespoke ontologies for protein interactions (MI Ontology) have allowed for strong provenance and the development of statistical approaches to evaluate (and infer) interactions.
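As a simple sketch of what evaluating confidence can look like in practice, the Python below reads a small tab-separated table of interactions and keeps only those at or above a chosen score. The column names (protein_a, protein_b, score), the data and the threshold are all made up for illustration. Real exports from these databases have many more columns and their own scoring schemes, so check the documentation of the source you are using before applying a cut-off.

```python
import csv
import io

# Invented example data: three interactions with a confidence score each.
SAMPLE = (
    "protein_a\tprotein_b\tscore\n"
    "P1\tP2\t0.92\n"
    "P1\tP3\t0.35\n"
    "P2\tP4\t0.60\n"
)


def high_confidence(handle, threshold):
    """Return (protein_a, protein_b, score) rows with score >= threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        score = float(row["score"])
        if score >= threshold:
            kept.append((row["protein_a"], row["protein_b"], score))
    return kept


# Keep interactions scoring 0.6 or higher (the threshold is arbitrary here).
filtered = high_confidence(io.StringIO(SAMPLE), 0.6)
```

With a real download you would pass an open file handle instead of io.StringIO, and the right threshold depends on how the database defines its score.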
Biological Pathways - KEGG & Reactome
In this video we introduce KEGG and Reactome (EBI), two pathway database resources that can help us to gain biological insight into the data we are analysing. They essentially allow us to view our data from the perspective of the biological pathways and systems that give rise to higher level properties such as cell-cycling, neurotransmitter signalling, glycolysis etc. These resources are by nature hierarchical, meaning that pathways can be viewed across large biological scales, from very precise molecular mechanistic reactions up to whole tissue or organ systems. They also map across species, which allows researchers to take results from model organisms (for example) and project them into analogous systems in other species such as the human.
Jassal B, Matthews L, Viteri G, Gong C, Lorente P, Fabregat A, Sidiropoulos K, Cook J, Gillespie M, Haw R, Loney F, May B, Milacic M, Rothfels K, Sevilla C, Shamovsky V, Shorser S, Varusai T, Weiser J, Wu G, Stein L, Hermjakob H, D'Eustachio P. The reactome pathway knowledgebase. Nucleic Acids Res. 2020 Jan 8;48(D1):D498-D503. doi: 10.1093/nar/gkz1031. PMID: 31691815; PMCID: PMC7145712.
Gene Expression Data - NCBI-GEO & ArrayExpress
In this video we introduce GEO (NCBI) & ArrayExpress (EBI), two databases that allow researchers to query and download vast quantities of gene expression data for analysis. Both repositories enforce very strict standards on submissions to ensure that data is of suitable quality and accompanied by the essential metadata required to allow for reproducible and properly designed analysis. These are additional examples of initiatives that have established and/or adopted community standards of practice to maintain quality and engender confidence from their scientific user base.
Lecture 5 - Exploring Biological Databases
The lecture slides for Week 5 - "Exploring Biological Databases" are available here. This week I will be demonstrating the databases live on the web during the lecture, so there are only a few slides.
The video of the lecture is available from the GitHub video area here.
Reading Lists & Resources
Each week we will have an accompanying reading list with some articles & web-sites for self study to support the course. You can find the course "Resource List" - here. We will continue to curate the list throughout the course, especially if things pop up in the lectures and practicals that we want to add a reference or link to, so please do check back in on the list from time to time.
We have generally tried to identify resources as "Essential", "Recommended" or "Further Reading" in an attempt to help you prioritise your reading during the course.
Finally, this is a good time to draw your attention to what you can consider the "core text" for the course, which is the excellent "Bioinformatics & Functional Genomics" (BFG) Third Edition by Jonathan Pevsner. You will be pleased to know that this text-book is available free online as part of the University's subscription portfolio. You can find it right at the top of the resource list. If you have any problems accessing or using any of the above please do drop us a comment in the Discussion forum and we will try to get things resolved as soon as possible.
This week you should browse BFG for examples of some of these databases in use, especially the material about biological databases in Chapter 2, but you don't need to read the whole chapter. You might also like to browse some of the useful guides provided by the resources above, such as:
- NCBI Training Tutorials - https://www.ncbi.nlm.nih.gov/guide/training-tutorials/ (very good)
- PubMed User Guide - https://pubmed.ncbi.nlm.nih.gov/help/ (very good)
- Bioportal Help - https://www.bioontology.org/wiki/BioPortal_Help
- Biogrid Help - https://wiki.thebiogrid.org
- Reactome User Guide - https://reactome.org/userguide (very good)
It will take more time than we have in the course for you to become comfortable with all of the resources. The aim here is to introduce you to them and give you some experience of using them (which we will do next week) so that in the future you will know where to start when looking for the different kinds of data.