Bioinformatics--CougarLab

Modeling the effect of TYW1 knockout on ribosomal frameshift

The modeling of TYW1-knockout effect on ribosomal frameshift was based upon the fact that TYW1 is a critical enzyme involved in the synthesis of wybutosine, so the hypomorphic and null alleles of TYW1 could reduce and even remove the production of wybutosine accordingly. It is known that wybutosine at the 37 position of tRNAPhe near the 3’ base of anticodon can stabilize the codon-anticodon interaction thus drastically reduce the ribosomal frameshift upon a specific UUU codon 42. We, therefore, hypothesized that the density and distribution of UUU codons along a specific mRNA should determine the extent of influence exerted by wybutosine, and in parallel, the relative abundance of wybutosine was another major effector during this process. Our modeling procedure went as follows:

1. Download the UCSC RefSeq (refGene) gene models of human, mouse, and zebrafish from https://genome.ucsc.edu/cgi-bin/hgTables. Extract the coding domain sequences (CDS) of each protein-coding gene. If there exists multiple isoforms, choose the longest ones.

2. Label the location of UUU codons, normalize the location values by the length of CDS. By doing this, we harvested a series of normalized location values, i.e., L(1), L(2), ..., for each gene, where 0 < L(n) < 1, n = 1, 2, ....

3. We made the following two assumptions: (1) for a UUU codon at any location, there is a fixed chance of ribosomal frameshift designated by Pf, given a fixed level of wybutosine; (2) for a given UUU codon located at L(n), after frameshifting, the remaining activity of a protein is proportionate to L(n). Thus, the consequence of translating a mRNA holding multiple UUU codons, i.e., the remaining activity of a protein, designated by R, can be estimated as: R = ∑ (L(n) * Pf * Pr(n)) + Pr(n), where Pr(1) = 1 and Pr(n) = Pr(n - 1) * (1 - Pf), n = 1, 2, ....

4. For the wild-type (WT) and TYW1-knockout mice (KO), we defined the R values for WT and KO to be R(WT) and R(KO), respectively. The R value is determined by the density and distribution of codon UUU in a given mRNA, also by Pf (the chance of ribosomal frameshift at a specific location of mRNA, given a fixed level of wybutosine). We set the values of Pf for R(WT) and R(KO) to be 0.12 and 0.35, respectively 42. Subsequently, we defined the attenuation coefficient to be R(KO) / R(WT). An attenuation coefficient reflected the extent of protein activity reduction with TYW1-knockout, and the expected values of attenuation coefficient ranged from 0 to 1. The lower the value of attenuation coefficient, the more severe the ribosomal frameshift and protein degradation when TYW1 is knocked out.

5. After generating a list of attenuation coefficients for each human / mouse / zebrafish
protein, we focused on proteins meeting with these criteria: (1) a protein with ortholog in all three species; (2) a protein with attenuation coefficient less than 0.34 (i.e., 0.12 / 0.35); (3) a protein with its corresponding gene matching TYW1 in terms of expression profiles in normal brains; (4) a protein known to be disease-related, with records in HGMD or OMIM.

Gene ontology and KEGG pathway analysis

Gene ontology analysis was performed on 114 CP-related genes and 706 NDD-related genes 10 separately using GO enrichment analysis tool of Omicshare (https://www.omicshare.com/tools/home/report/goenrich.html). The top 27 GO terms with p-value < 0.01 were selected and presented. To NDD-related genes, the same GO terms were filtered with p-value < 0.01, and only those with rich factor > 0.09 (for Biological process and cellular component) and 0.07 (for Molecular function) were indicated. KEGG pathway analysis was performed on 114 CP-related genes and 706 NDD-related genes 10 separately using pathway enrichment analysis tool of Omicshare (https://www.omicshare.com/tools/home/report/koenrich.html). The top 27 pathways with p-value < 0.05 were selected and presented. To NDD-related genes, same pathways were filtered with p-value < 0.05, and only those with rich factor (the ratio between number of genes from the present gene-set and all genes enriched in a certain pathway) > 0.1 were indicated.

Expression analysis of CP genes

To visualize the expression pattern of CP-related genes in different brain regions, we extracted the bulk RNA-seq data of CP-related genes from the published data of Li et al.. On the other hand, to investigate the expression profile of CP-related genes in different cell types, we used the R package Seurat to import, filter, normalize and scale (with default settings) the single cell RNA-seq data for human brain obtained from the published data of Li et al. 11 and Lake et al. 12. The expression level of CP-related genes in different brain regions and different cell types were displayed in the form of heatmaps by using TBtools software.

Protein-protein interaction network

Protein-protein interaction network between CP-related genes and NDD-related genes was obtained from the STRING database (https://string-db.org/). The interaction network model was generated with Cytoscape.

Homology modeling

Domains of TYW1, GPAM, PTK7 and TARS were predicted by the InterProScan protein domains identifier (http://www.ebi.ac.uk/interpro/search/sequence-search). For GPAM, PTK7 and TARS, unknown domains were also predicted with the MEME (Multiple EM for Motif Elicitation) motif discovery tool (http://meme-suite.org/tools/meme). Multiple sequence alignments of TYW1, GPAM, PTK7 and TARS were generated by using the ClustalW program. Three-dimensional structural models of TYW1 (TYW1_WT, TYW1_R206C and TYW1_R389Q), GPAM (GPAM_WT, GPAM_G499R and GPAM_P669S), PTK7 (PTK7_WT, PTK7_G476R and PTK7_P698R) and TARS (TARS_WT, TARS_T258S and TARS_Q702R) were predicted by the I-TASSER web tool (http://zhang.bioinformatics.ku.edu/I-TASSER/). Visual representation of models and structural superposition were generated by a software package named visual molecular dynamics (VMD). The mutant stability change (ΔΔG) of variants of TYW1 (TYW1_R206C and TYW1_R389Q), GPAM (GPAM_G499R and GPAM_P669S), PTK7 (PTK7_G476R and PTK7_P698R) and TARS (TARS_T258S and TARS_Q702R) were predicted by using the STRUM server (https://zhanglab.ccmb.med.umich.edu/STRUM/).

Bioinformatic analyses of m⁶A-seq

The reads containing adaptor contamination, low quality bases and undetermined bases were removed using the fastp software with default parameter. Sequence quality of IP and Input samples was verified by Fastp. The reads were then mapped to the reference genome Mus musculus (Version: v99) using HISAT2. The m⁶A peaks with bed or bigwig format that can be adapted for visualization on the IGV software (http://www.igv.org) were identified by the R package exomePeak. MEME and HOMER were used for de novo and known motif finding, followed by localization of the motif with respect to peak summit. Called peaks were annotated using R package ChIPseeker. The expression level for all mRNAs from input libraries was determined using StringTie by calculating FPKM (total exon fragments /mapped reads (millions) × exon length (kB)). Gene ontology analysis was performed on the 1622 m⁶A downregulated genes using GO enrichment analysis tool of Omicshare. KEGG pathway analysis was also performed on the 1622 m⁶A downregulated genes using pathway enrichment analysis tool of Omicshare. GO terms and KEGG pathways with FDR corrected p < 0.05 was considered significant. Protein-protein interaction network was obtained from the STRING database. The interaction network model was generated using Cytoscape 3.6.1. The lastest gene list was downloaded from the BioMart database (https://www.ensembl.org/biomart/martview/) where 447 neural differentiation associated genes were identified according to their functional annotation and related GO terms.

Tandem mass tag (TMT) proteomic analysis

Total proteins were extracted from testis, and Tandem mass tags (TMT) proteomics were conducted as described elsewhere. Gene set enrichment analysis was performed using GSEA software v3.1.0 (www.broadinstitute.org/gsea/index.jsp) separately against different protein sets: Acrosomal vesicle associated proteins, Midpiece associated proteins, principal piece associated proteins; Sperm motility associated proteins; and Intraflagellar transporter proteins.

Domain and conserved motifs analysis

Protein domains of Fsip1 were predicted by the NCBI's Conserved Domain Database (CDD) (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). Conserved motifs were predicted by MEME (Multiple EM for Motif Elicitation) motif discovery tool [28] (http://meme-suite.org/tools/meme). Amino acid sequence of Fsip1 of eight species, Homo sapiens (NP_001311267.1), Mus musculus (NP_001361601.1), Gallus gallus (XP_015143418.1), Pelodiscus sinensis (XP_014429496.1), Salmo salar (XP_014065924.1), Acanthaster planci (XP_022107759.1), Mizuhopecten yessoensis (XP_021347285.1) and Stylophora pistillata (PFX17399.1) were included. The minimum and maximum width of each motif was set to be 30 aa and 100 aa, respectively. Each motif must be recognized in at least 2 sequences. The visual representations of motifs were generated by TBtools.

Homology modeling and protein-protein docking

Three-dimensional structural models of Fsip1 and Ift20 were predicted by the I-TASSER web tool (http://zhang.bioinformatics.ku.edu/I-TASSER/). The models with the highest C-score were selected for docking. Flexible protein-protein docking for Fsip1 and Ift20 was performed using HADDOCK webserver (https://bianca.science.uu.nl/haddock2.4/submit/1). The active and passive residues of Fsip1 and Ift20 used to define the ambiguous interaction restraints (AIRs) in docking were predicted by the CPORT server (http://alcazar.science.uu.nl/services/CPORT/). The 1st complex model with the top cluster (the lowest HADDOCK score) was selected. Visual representation and analysis of models were conducted using software package visual molecular dynamics (VMD).

Preprocessing of single-cell RNA-seq data

The Cell Ranger v3.0.0 mkfastq (10X Genomics) was utilized to demultiplex the cellular barcodes. Adapters, low-quality reads and low-quality bases were removed using Trimmomatic software. Basic statistics on the quality of the clean data reads was performed with FastQC.

The clean reads were aligned to the murine reference genome (mm10) using the Cellranger count command. The R package Seurat (3.1.0) was applied to convert the unique molecular identifier (UMI) count matrix to Seurat objects. The genes expressed in less than 10 cells and cells with less than 200 expressed genes were removed. The expression data was normalized using the NormalizeData function, in which the UMI counts were divided by the total number of UMIs per cell, multiplied by 10000 for normalization and then log-transformed. After normalization, 2000 highly variable genes were identified using the FindVariableGenes function. We then used the FindIntegrationAnchors and Integratedata functions to combine the nine samples from WT, KI and OE mice. The principal component analysis was performed with the RunPCA function. The top 30 PCA components were used for the RunTSNE function and the FindClusters function. Fourteen clusters were identified for the WT, KI and OE mice, respectively. The marker genes were found using the FindConservedMarkers function.

Analysis pipeline of deep sequencing data

The variant prioritization and evaluation of etiologic involvement were performed by two pipelines, namely, MERAP and ANNOVAR. In brief, MERAP pipeline first filters all identified variants through comparison with the disease-associated variants in the Human Gene Mutation Database (HGMD, 2020.2) and the Online Mendelian Inheritance in Man (OMIM) to collect those known disease-causing variants. In order to filter out neutral variants, MERAP uses entries from the dbSNP143 (http://www.ncbi.nlm.nih.gov/projects/SNP/), 1000 Genome (http://www.1000genomes.org/), NHLBI Exome Sequencing Project (ESP, http://evs.gs.washington.edu/EVS/), ExAc (http://exac.broadinstitute.org/), and the Chinese Millionome Database (CMDB) (https://db.cngb.org/cmdb/) as screening databases. In principle, candidate variants causing recessive traits should not occur in healthy controls as homozygotes, and the frequency of respective heterozygotes should not exceed 0.1%. MERAP uses the RefSeq genes (http://www.ncbi.nlm.nih.gov/refseq/) as reference, and nonsynonymous changes are described in terms of gene ID, base change, protein change, genomic coordinate, transcript coordinate, protein coordinate, protein length, affiliated with gene description from the Human Gene Nomenclature Committee (HGNC, http://www.genenames.org/). Changes destroying conventional splice sites or introducing novel splice sites are identified by MERAP’s module called SSFinder. To assess the pathogenicity of missense mutations, MERAP generates a single score integrating the results of seven different algorithms, including the Grantham score, PhyloP, GERP, SIFT, PolyPhen2, Mutation-Taster, and the Conserved Domains Database (CDD, http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml). With empirical false discovery rate cutoffs, this score serves as dichotomized pathogenicity predictions even if any two of the seven algorithms might not coincide, as is often the case. MERAP rules out candidate genes reported to harbor homozygous loss-of-function (LOF) variants in healthy individuals, which applies to >1% of the human genes. Typically, if more than three independent truncating variants are observed in >10 of the exomes listed in the 1000 Genome, Exome Variant Server (ESP6500), NHLBI GO Exome Sequencing Project (ESP) (http://evs.gs.washington.edu/EVS/), and ExAC (http://exac.broadinstitute.org/) databases, the relevant gene is flagged as LOF tolerant. To facilitate the choice between few remaining candidate genes, MERAP also provides a list of ~4,500 known disease genes extracted from OMIM (http://www.ncbi.nlm.nih.gov/omim), ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/), and HGMD (http://www.hgmd.org/), as well as more than 8,000 associated disorders and their symptoms. For novel candidate genes without prior link to disease, MERAP offers information on their interaction with known disease genes, based on data from Biogrid (http://thebiogrid.org/) and IntAct (http://www.ebi.ac.uk/intact/), assuming that genes implicated in clinically similar disorders tend to cluster in gene or protein interaction networks. Variants were considered de novo if neither parent had the variant, and candidate variants were selected by segregation analysis. The pathogenic and likely pathogenic genes/variants were defined according to the standards and guidelines of ACMG.

The ANNOVAR pipeline was also used, with additional considerations from the Residual Variation Intolerance Score (RVIS) and the Combined Annotation-Dependent Depletion (CADD) score. The former ranks genes in terms of intolerance to functional genetic variation and the latter integrates several well-known tools. We empirically set a cutoff RVIS score of 50th percentile for known and novel genes and a cutoff CADD score of 20 for novel candidate genes. When defining likely causal variants/genes, we followed the guidelines designated by ACMG.

All variants of putative clinical relevance were confirmed by the conventional PCR and Sanger sequencing. Parent-child relationships were confirmed by using the PLINK with SNPs drawn from WGS which matched the CytoScan SNP repertoire. The false positive rate of WGS was generated by the comparison between Sanger sequencing and WGS, and we found no false positive variant called by our pipeline. The false negative rate of WGS was evaluated by the SNP genotyping comparison between WGS and CytoScan high-density microarray, which resulted in an average false negative rate of less than 0.1%.