Google Summer of Code 2020 | Work Report

Aug 28, 2020

Student: Rahul Agrawal
Github: https://github.com/rahul799
Organization: National Resource for Network Biology (NRNB)
Hosting organization: Singapore Institute of Food and Biotechnology Innovation (SIFBI), Agency for Science, Technology and Research (A*STAR)
Project Title: Adding Network Analysis and Visualisation Features
Mentor: Dr. Mohamed Helmy
Co-mentors: Dr. Kumar Selvarajoo, Thuy Tien Bui

Overview

ABioTrans is a biostatistics and bioinformatics tool, developed in R, for gene expression analysis. The tool lets the user directly read RNA-Seq data files deposited in the Gene Expression Omnibus (GEO) database. It provides easy options for running commonly used statistical techniques, namely Pearson and Spearman rank correlations, Principal Component Analysis (PCA), k-means and hierarchical clustering, Shannon entropy, noise (the square of the coefficient of variation), Differential Expression (DE), and Gene Ontology classification.
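To make two of the measures above concrete, here is a minimal Python sketch of noise (squared coefficient of variation) and Shannon entropy for one gene's expression values. It is only an illustration, not the ABioTrans R code: the function names, the use of the sample variance, and the equal-width binning for the entropy are all assumptions.

```python
import math
from collections import Counter


def noise(values):
    """Squared coefficient of variation: variance / mean^2.

    Assumption: the sample variance (n - 1 denominator) is used.
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return variance / mean ** 2


def shannon_entropy(values, bins=4):
    """Entropy in bits after grouping values into equal-width bins.

    Assumption: equal-width binning; the bin count is a free choice.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant input
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


expression = [2.0, 4.0, 4.0, 6.0]  # made-up expression values for one gene
print(noise(expression), shannon_entropy(expression))
```

A gene whose values spread evenly over all bins gets the maximum entropy of log2(bins) bits, while a gene with constant expression gets zero.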

Aim

This project aims to develop the second version of ABioTrans. ABioTrans version 2 will be web-based and will extend the analysis of gene expression results with options such as pathway enrichment analysis and gene function analysis, as well as publication-ready visualization options, including network visualization using Cytoscape.js.

The current version of ABioTrans is not online, so people who want to use the software need to install RStudio, followed by many packages and dependencies from CRAN and Bioconductor. This is a very time-consuming and tedious process, so taking the tool online solves a big problem.

Overview of Work Done

The following are the major features I implemented during the coding period:

1. Random Forest: While generally used for classification and regression problems, here the random forest is used for clustering. The plot can be navigated in a similar manner to the updated heat scatter.

There are 2 parameters to control the random forest cluster plot. No. of trees defines the size of the forest. No. of clusters sets how many clusters the samples are grouped into.

Along with the random forest, I also implemented RAFSIL, a two-step procedure that learns cell-cell similarities from single-cell RNA sequencing data using feature construction followed by random forest-based similarity learning.

Related Pull Request:- Link1 & Link2

2. Self-Organizing Maps (SOM): A self-organizing map is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional, discretized representation of the input space of the training samples, called a map. It is therefore a method for dimensionality reduction.

Five types of SOM plots are available for use:

Property: Uses the values of the codebook vectors (weights of the gene vectors) and outputs them as colored nodes
Count: Shows how many genes are mapped to each node
Codes: Displays the codebook vectors
Distance: Shows how close genes are to each other when they are mapped
Cluster: Uses hierarchical clustering to cluster the SOM

There are 4 parameters to control the SOM plots. Samples used determines which samples are chosen for the plots, either all samples or an individual one. No. of horizontal grids and No. of vertical grids change the number of nodes used in the SOM. No. of clusters sets how many clusters the SOM nodes are grouped into (for the cluster plot).
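To show how the grid parameters and the Count plot relate, here is a minimal pure-Python SOM sketch. It is not the ABioTrans implementation: the function names, the learning-rate and radius schedules, and the toy data are assumptions made for illustration.

```python
import math
import random


def best_matching_unit(codebook, vec):
    """Index of the node whose codebook vector is closest to vec."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((w - v) ** 2 for w, v in zip(codebook[i], vec)),
    )


def train_som(data, n_cols, n_rows, epochs=50, lr0=0.5, seed=0):
    """Train a rectangular SOM with n_cols x n_rows nodes.

    Returns one codebook vector per node (row-major order).
    """
    rng = random.Random(seed)
    nodes = [(x, y) for y in range(n_rows) for x in range(n_cols)]
    codebook = [list(rng.choice(data)) for _ in nodes]  # init from samples
    radius0 = max(n_cols, n_rows) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1 - frac), 0.5)  # neighborhood shrinks
        for vec in rng.sample(data, len(data)):  # shuffled pass over the data
            bx, by = nodes[best_matching_unit(codebook, vec)]
            for i, (x, y) in enumerate(nodes):
                d2 = (x - bx) ** 2 + (y - by) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))  # Gaussian neighborhood
                codebook[i] = [w + lr * h * (v - w)
                               for w, v in zip(codebook[i], vec)]
    return codebook


def count_plot(codebook, data):
    """Number of input vectors mapped to each node (the 'Count' view)."""
    counts = [0] * len(codebook)
    for vec in data:
        counts[best_matching_unit(codebook, vec)] += 1
    return counts


# Made-up 2-D "gene" vectors, 2 horizontal x 2 vertical grid cells.
genes = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0], [0.0, 1.0], [1.0, 0.0]]
codes = train_som(genes, n_cols=2, n_rows=2)
print(count_plot(codes, genes))
```

Increasing the horizontal or vertical grid size adds nodes, so the same genes are spread over more, smaller groups.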

Related Pull Request:- Link

3. Sparse PCA: In the updated PCA, an option has been added to toggle between PCA and sparse PCA. Users can now also visualize the output in a 3D plot.

Related Pull Request:- Link

4. t-SNE: The t-SNE directional plot is used to find the paths of samples across different time points. The script for the t-SNE directional plot uses the preprocessing method from DigitalCellSorter. PCA is applied to the preprocessed data, and t-SNE is then applied to the PCA output.

Related Pull Request:- Link

5. Gene Ontology: The data is now fetched through API calls to the UniProt Knowledgebase. The older version used static databases, which quickly became outdated because the protein data is updated at regular intervals.
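As a rough illustration of what such an API call can look like, the sketch below only builds a query URL for Gene Ontology annotations. It assumes the current public UniProt REST search endpoint and its `query`, `fields`, and `format` parameters; the endpoint and fields that ABioTrans actually used are not given here, and the function name and example accessions are mine.

```python
from urllib.parse import urlencode

# Assumption: the current public UniProt REST search endpoint.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def go_query_url(accessions, fields=("accession", "go")):
    """Build a URL asking for GO annotations of the given accessions.

    The response format is requested as tab-separated values.
    """
    query = " OR ".join(f"accession:{acc}" for acc in accessions)
    params = {"query": query, "fields": ",".join(fields), "format": "tsv"}
    return UNIPROT_SEARCH + "?" + urlencode(params)


# Example accessions chosen only for illustration.
print(go_query_url(["P04637", "P38398"]))
```

Because the annotations are requested at query time, the results reflect the current state of the database rather than a local snapshot.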

Related Pull Request:- Link

6. Protein-Protein Interaction (PPI) with network visualization using Cytoscape.js: PPIs are physical contacts of high specificity established between two or more protein molecules as a result of biochemical events steered by interactions that include electrostatic forces, hydrogen bonding, and the hydrophobic effect.

The output cannot be visualized well using a table or even a 3D plot, so I used the Cytoscape.js API, which displays the data as a graph of nodes and edges and lets the user choose among different layouts from a list.

I also integrated several other options, such as different color schemes, Select First Neighbor, Fit Graph, etc.
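To give an idea of the data behind the network view, here is a small sketch that turns a list of interacting protein pairs into the elements structure Cytoscape.js expects (nodes with a `data.id`, edges with `data.source` and `data.target`) and finds the first neighbors of a protein. The function names and the placeholder protein names are hypothetical, and this is Python for illustration rather than the dashboard's own R and JavaScript code.

```python
import json


def to_cytoscape_elements(edges):
    """Convert (protein_a, protein_b) pairs into Cytoscape.js elements."""
    nodes = sorted({protein for edge in edges for protein in edge})
    elements = [{"group": "nodes", "data": {"id": n}} for n in nodes]
    for a, b in edges:
        elements.append(
            {"group": "edges",
             "data": {"id": f"{a}-{b}", "source": a, "target": b}}
        )
    return elements


def first_neighbors(edges, protein):
    """Proteins that share an edge with the given protein."""
    neighbors = set()
    for a, b in edges:
        if a == protein:
            neighbors.add(b)
        elif b == protein:
            neighbors.add(a)
    return neighbors


# Placeholder interaction pairs, not real PPI data.
interactions = [("A", "B"), ("A", "C"), ("C", "D")]
print(json.dumps(to_cytoscape_elements(interactions), indent=2))
print(first_neighbors(interactions, "A"))
```

The JSON printed here is the kind of list that can be passed as the `elements` option when a Cytoscape.js graph is created.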

Related Pull Request:- Link

7. Pathway Enrichment: Pathway enrichment analysis helps researchers gain mechanistic insight into gene lists generated from genome-scale (omics) experiments. This method identifies biological pathways that are enriched in a gene list more than would be expected by chance.
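One common way to quantify "more than would be expected by chance" is a one-sided hypergeometric test. The sketch below shows that test; it is an assumption on my part that a test of this kind is a fair stand-in, since the exact method used in ABioTrans is not described here, and the function and parameter names are mine.

```python
from math import comb


def enrichment_p_value(universe, pathway, gene_list, overlap):
    """P(X >= overlap) for a hypergeometric X.

    universe:  number of genes in the background set
    pathway:   number of background genes annotated to the pathway
    gene_list: number of genes in the user's list
    overlap:   number of list genes that fall in the pathway
    """
    upper = min(pathway, gene_list)
    total = comb(universe, gene_list)
    tail = sum(
        comb(pathway, k) * comb(universe - pathway, gene_list - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 4 in the pathway,
# a list of 3 genes of which 2 are in the pathway.
print(enrichment_p_value(universe=10, pathway=4, gene_list=3, overlap=2))
```

A small p-value means the overlap is larger than random gene lists of the same size would usually produce. In practice the p-values are also corrected for testing many pathways at once.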

8. Complex Enrichment: Here the user can upload UniProt accession IDs in a CSV file and see the names of the complexes associated with each protein in a table, which can be downloaded as a CSV file.

9. Four more features added: Apart from all the above features, there are four more features inside the Gene Set Analysis, which are:

Protein Function
Protein Expression
Subcellular Expression
Protein Domains

The names are self-explanatory; each one retrieves the corresponding data as a table, similar to Complex Enrichment.

Challenges

The major challenge I faced was web deployment, because the whole dashboard is built with R and the Shiny package, so it needs a special server to be deployed on the web. There were just two options that could solve our problem:

Shinyapps.io: It runs on their own hosted servers and deployment is quite easy, but it is very costly.
Shiny Server: It is open-source software and completely free, but it comes with some limitations. For example, it does not provide proper logs, which makes debugging a bit harder.

I chose the second option, and after some sleepless nights I was able to run it successfully. And yeah, it worked!

Code Contributions for different features

All the merged PRs can be found here: Link

Final Result

The final dashboard can be found here: http://combio-sifbi.org/ABioTrans/
It may take a while to load, as we are using a different server to compile the code written in R.

Student’s Profile — Rahul Agrawal


Written by Rahul Agrawal