Modes

The Rp3 pipeline is organized into modes, each responsible for one step in the data processing workflow. Run the modes in the proper order, passing the same output directory to each one with --outdir. Some modes are ‘extra’ and provide additional layers of information about the identified microproteins.

Database

  1. Run the RP3 pipeline in its database mode to generate a custom database for mass spectrometry-based proteomics. Run $ rp3.py database -h to print all available options for this mode.

  2. Run the database mode
    • By default, database automatically executes the translation mode.

    • Alternatively, you can run the translation mode separately if you do not need the reference proteome, decoy sequences and contaminants added to the fasta file. You can always run the database mode with the --skip_translation flag later on to generate a database suited for peptide searches against mass spectrometry data.

  3. Parameters

    Run $ rp3.py database -h on the command line to list the available options. This prints the following message:

General Parameters:
  database
  --outdir OUTDIR, -o OUTDIR
                        Inform the output directory (default: foopipe)
  --threads THREADS, -p THREADS
                        Number of threads to be used. (default: 1)

database options:
  --proteome PROTEOME   Reference proteome (default: None)
  --genome GENOME
  --gtf_folder GTF_FOLDER
  --external_database EXTERNAL_DATABASE
  --skip_translation

Example code

4.1. Generating a custom database for proteomics from a transcriptome assembly or custom GTF files

$ rp3.py database --outdir <path/to/output/directory> --threads 8 --genome <path/to/genome.fasta> --gtf_folder <path/to/gtf/folder> --proteome <path/to/reference_proteome.fasta>

This will tell the pipeline to generate databases in the databases directory inside the output directory provided with --outdir.

**Remember to always specify the same output directory in the other modes, as each step is dependent on the previous one.**

4.2. If you already have your own database of putative sequences, you can provide it to the pipeline with the --external_database flag along with the reference proteome. In that case, the pipeline will append the reference proteome to your fasta file along with the decoys and contaminants. It will then tag each sequence for the subsequent steps. This is useful when you are not translating a transcriptome assembly in three reading frames, for example when searching for mass spectrometry evidence for microproteins identified with Ribo-Seq.
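The two alternative ways of calling the database mode described above can be sketched as follows. This is a minimal sketch: it uses only flags listed in the help output, all paths are placeholders, and the exact set of flags required alongside --skip_translation is an assumption. The RP3 variable is a hypothetical dry-run switch added for illustration, so the commands are printed rather than executed unless you set RP3=rp3.py.

```shell
# Dry run by default: prints the command. Set RP3=rp3.py to execute it.
RP3=${RP3:-"echo rp3.py"}

OUTDIR=/path/to/output/directory                # placeholder
PROTEOME=/path/to/reference_proteome.fasta      # placeholder
EXTERNAL_DB=/path/to/putative_sequences.fasta   # placeholder

# Build the search database from an existing fasta of putative sequences.
database_from_external() {
  $RP3 database --outdir "$OUTDIR" --threads 8 \
    --external_database "$EXTERNAL_DB" --proteome "$PROTEOME"
}

# Reuse a translation already produced in the same output directory
# (assumes the translation mode was run there beforehand).
database_skip_translation() {
  $RP3 database --outdir "$OUTDIR" --threads 8 \
    --proteome "$PROTEOME" --skip_translation
}

database_from_external
database_skip_translation
```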

4.3. Notes
  • --gtf_folder expects a folder with one or more GTF files or the path to a GTF file.
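Before launching the pipeline, it can help to confirm what a --gtf_folder value points to. The helper below is a hypothetical pre-flight check, not part of Rp3; it assumes GTF files carry a .gtf extension and sit directly inside the folder.

```shell
# Count the GTF files a --gtf_folder value would resolve to:
# a directory contributes every *.gtf file directly inside it,
# a single existing file counts as one, anything else as zero.
count_gtf_files() {
  path=$1
  if [ -d "$path" ]; then
    find "$path" -maxdepth 1 -type f -name '*.gtf' | wc -l | tr -d ' '
  elif [ -f "$path" ]; then
    echo 1
  else
    echo 0
  fi
}

# Demonstration on a throwaway directory.
demo=$(mktemp -d)
touch "$demo/a.gtf" "$demo/b.gtf" "$demo/notes.txt"
count_gtf_files "$demo"         # folder with two GTF files
count_gtf_files "$demo/a.gtf"   # path to a single GTF file
rm -r "$demo"
```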

  5. Output files

    Database files will be generated inside the output directory in the databases folder. You can check for database metrics at metrics/database_metrics.txt.
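A quick way to confirm that the database mode finished is to look for the two locations named above. The function below is a hypothetical convenience check, and the output directory is a placeholder.

```shell
# Report which of the expected database-mode outputs are missing.
check_database_outputs() {
  outdir=$1
  status=0
  if [ ! -d "$outdir/databases" ]; then
    echo "missing: $outdir/databases"
    status=1
  fi
  if [ ! -f "$outdir/metrics/database_metrics.txt" ]; then
    echo "missing: $outdir/metrics/database_metrics.txt"
    status=1
  fi
  return "$status"
}

OUTDIR=/path/to/output/directory  # placeholder
if check_database_outputs "$OUTDIR"; then
  ls "$OUTDIR/databases"
  cat "$OUTDIR/metrics/database_metrics.txt"
fi
```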

Notes

Unless specified otherwise in the next modes, databases will have annotated proteins excluded based on the --proteome provided. This default is intended to remove any already annotated microprotein and the peptides that match it. If you are interested in annotated proteins as well, there are two ways to prevent this from happening:

  • Specify the parameter --uniprotAnnotation during the database mode. This will use a data frame obtainable from Uniprot containing the annotation level for each protein in the --proteome. Rp3 will then tag each protein in the provided proteome according to its annotation level. As proteins in Uniprot vary in their annotation, it might be wise to include those with very low annotation levels when identifying unannotated proteins, as they are poorly characterized. The annotation level ranges from 1 to 5. To control which proteins are kept, provide the argument --annotationLevel followed by a number from 1 to 5. For example, --annotationLevel 4 will keep only those proteins with an annotation level equal to or lower than 4; this is the default behavior if --uniprotAnnotation is provided. This option requires --includeLowAnnotation to be specified during the search mode.

  • Specify --keepAnnotated in search mode, the next step. This will treat every protein, annotated or unannotated, the same.
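To make the "equal to or lower than" rule of --annotationLevel concrete, the sketch below applies the same threshold to a small table. The two-column, tab-separated layout (accession, annotation level) is a hypothetical stand-in for the Uniprot data frame, and the filter only illustrates the cutoff; it is not code from Rp3.

```shell
# Keep the header plus rows whose annotation level (column 2)
# is at or below the cutoff given as the first argument.
filter_by_annotation_level() {
  awk -F '\t' -v max="$1" 'NR == 1 || $2 <= max'
}

# Example: with a cutoff of 4, the level-5 entry P1 is dropped.
printf 'accession\tannotation\nP1\t5\nP2\t3\nP3\t1\n' |
  filter_by_annotation_level 4
```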

Post-processing

  1. The RP3 pipeline contains a re-scoring mode called rescore. It performs a second round of searches, this time using as the proteomics database the results of the first proteogenomics search (the fasta file generated by the search mode) appended to the reference proteome. This is useful because the FDR assessment from the first search is not very accurate: the database generated from the three-frame translation of the transcriptome contains millions of predicted sequences, and such a bloated database results in false positives and false negatives during FDR assessment. To correct for this, we select the hits at an FDR < 0.01 from the first search and look for them again with a smaller database to obtain more accurate hits. This mode will reduce the final number of novel microproteins.

  2. Re-scoring can also be run as part of the search mode by specifying the --rescore flag.

  3. If the flag --rescore is not specified during the search mode, it is also possible to perform the re-scoring on the previous run. After running the search mode, run the rescore mode in the same output directory:

$ rp3.py rescore --outdir /path/to/output/directory --threads 8 --mzml /path/to/mzml/files --proteome /path/to/reference/proteome --msPattern mzML

Notes

  • The --msPattern parameter specifies the format of the mass spectrometry files (usually mzML or the Bruker .d format).

Output files

Look for output files in fasta and gtf format in the rescore/summarized_results directory inside the output directory.

Ribocov

This mode will check for Ribo-Seq coverage for the microproteins identified with proteogenomics. To do so, it will run featureCounts on a custom GTF file automatically generated by the pipeline. The available parameters are:

General Parameters:
  ribocov
  --outdir OUTDIR, -o OUTDIR
                        Inform the output directory (default: None)
  --threads THREADS, -p THREADS
                        Number of threads to be used. (default: 1)

ribocov options:
  --fastq FASTQ         Provide the path to the folder containing fastq files
                        to be aligned to the genome. If the --aln argument is
                        provided, this is not necessary. (default: None)
  --gtf GTF             Reference gtf file containing coordinates for
                        annotated genes. The novel smORFs sequences from the
                        proteogenomics analysis will be appended to it.
                        (default: None)
  --genome_index GENOME_INDEX
                        Path to the genome STAR index. If not provided, it
                        will use the human hg19 index available at /data/
                        (default: None)
  --cont_index CONT_INDEX
                        STAR index containing the contaminants (tRNA/rRNA
                        sequences). Reads mapped to these will be excluded
                        from the analysis. (default: None)
  --aln ALN             Folder containing bam or sam files with Ribo-Seq reads
                        aligned to the genome. In case this is provided,
                        indexes are not required and the alignment step will
                        be skipped. (default: None)
  --rpkm RPKM           RPKM cutoff to consider whether a smORF is
                        sufficiently covered by RPFs or not. (default: 1)
  --multimappings MULTIMAPPINGS
                        max number of multimappings to be allowed. (default:
                        99)
  --adapter ADAPTER     Provide the adapter sequence to be removed. (default:
                        AGATCGGAAGAGCACACGTCT)
  --plots
  --fastx_clipper_path FASTX_CLIPPER_PATH
  --fastx_trimmer_path FASTX_TRIMMER_PATH

To run the RP3 pipeline in ribocov mode, run:
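A minimal sketch of such an invocation, assuming FASTQ input and no custom indexes, is given here. It uses only flags from the help output above, all paths are placeholders, and the RP3 variable is a hypothetical dry-run switch that prints the command unless you set RP3=rp3.py.

```shell
# Dry run by default: prints the command. Set RP3=rp3.py to execute it.
RP3=${RP3:-"echo rp3.py"}

OUTDIR=/path/to/output/directory    # placeholder; same directory as earlier modes
FASTQ_DIR=/path/to/ribo-seq/fastq   # placeholder
REF_GTF=/path/to/reference.gtf      # placeholder

# Align the Ribo-Seq reads and count coverage over the identified smORFs.
ribocov_from_fastq() {
  $RP3 ribocov --outdir "$OUTDIR" --threads 8 \
    --fastq "$FASTQ_DIR" --gtf "$REF_GTF" --plots
}

ribocov_from_fastq
```

If the reads are already aligned, the help output indicates that --aln can be given in place of --fastq, in which case the alignment step is skipped.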

This will use the provided genome indexes for the human hg19 assembly, located in the STAR_indexes directory inside the rp3 main directory. You can also generate new indexes if needed. In that case, provide the paths to them using the parameters --genome_index and --cont_index. Make sure to change the --adapter parameter to match the adapter sequence used in your Ribo-Seq experiment.

The output files will be located inside the counts directory. They include a heatmap showing the overall Ribo-Seq coverage for the proteogenomics smORFs, as well as a table containing information about the mapping groups. If the --plots argument is specified, a plot showing the number of Ribo-Seq-covered smORFs in each mapping group will be generated at counts/plots.
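The --rpkm cutoff listed in the help output decides whether a smORF counts as covered. As a reminder of what the unit means, RPKM normalizes a read count by the feature length in kilobases and by the number of mapped reads in millions. The sketch below implements that standard definition; it is not code taken from the pipeline, and the function name and example numbers are illustrative.

```shell
# RPKM = reads * 1e9 / (feature length in bp * total mapped reads)
rpkm() {
  awk -v reads="$1" -v length_bp="$2" -v total="$3" \
    'BEGIN { printf "%.2f\n", reads * 1e9 / (length_bp * total) }'
}

# 50 reads on a 150 bp smORF in a library of 20 million mapped reads
rpkm 50 150 20000000   # prints 16.67, above the default cutoff of 1
```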

Annotation mode

Rp3 provides an additional mode, anno, that reports further information on the identified microproteins. Running rp3.py anno --help will return:

 ____       _____
|  _ \ _ __|___ /
| |_) | '_ \ |_ \
|  _ <| |_) |__) |
|_| \_\ .__/____/
      |_|
RP3 v1.1.0
usage: /home/microway/PycharmProjects/rp3/rp3.py anno [-h] [--outdir OUTDIR]
                                                      [--threads THREADS]
                                                      [--overwrite]
                                                      [--signalP]
                                                      [--organism ORGANISM]
                                                      [--conservation]
                                                      [--blast_db BLAST_DB]
                                                      [--rescored]
                                                      [--uniprotTable UNIPROTTABLE]
                                                      [--orfClass]
                                                      [--paralogy] [--mhc]
                                                      [--repeats] [--isoforms]
                                                      [--exclusiveMappingGroups]
                                                      [--affinity AFFINITY]
                                                      [--affinityPercentile AFFINITYPERCENTILE]
                                                      [--filterPipeResults]
                                                      [--genome GENOME]
                                                      [--alignToTranscriptome]
                                                      [--maxMismatches MAXMISMATCHES]
                                                      [--gtf GTF]
                                                      [--repeatsFile REPEATSFILE]
                                                      [--refGTF REFGTF]
                                                      anno

Run pipeline_config in anno mode

options:
  -h, --help            show this help message and exit

General Parameters:
  anno
  --outdir OUTDIR, -o OUTDIR
                        Inform the output directory (default: None)
  --threads THREADS, -p THREADS
                        Number of threads to be used. (default: 1)
  --overwrite

anno options:
  --signalP
  --organism ORGANISM
  --conservation
  --blast_db BLAST_DB
  --rescored            Use this flag if the 'rescore' mode was used to
                        perform a second round of search using the results
                        from the first search. Only the rescored microproteins
                        will be analyzed for conservation in this case.
                        (default: False)
  --uniprotTable UNIPROTTABLE
  --orfClass
  --paralogy
  --mhc
  --repeats
  --isoforms
  --exclusiveMappingGroups

MHC detection parameters.:
  --affinity AFFINITY
  --affinityPercentile AFFINITYPERCENTILE
  --filterPipeResults

Paralogy parameters.:
  --genome GENOME
  --alignToTranscriptome
  --maxMismatches MAXMISMATCHES

ORF Classification parameters.:
  --gtf GTF             reference GTF file. For better accuracy in annotation,
                        this should be a GTF file from Ensembl. They contain
                        more terms that help better classifying the smORF.
                        (default: None)

Repeats parameters.:
  --repeatsFile REPEATSFILE

Isoforms parameters.:
  --refGTF REFGTF

With this mode, it is possible to identify signal peptides (by running SignalP 6.0), conservation, ORF classes, the presence of MHC epitopes, and the presence of paralogs in the genome. To identify signal peptides and annotate smORF classes, run Rp3 as:

$ rp3.py anno --outdir /path/to/outdir/from/previous/modes --signalP --orfClass --gtf /path/to/ensembl/gtf

To define smORF classes in the manuscript, we used the annotation from Ensembl, which we believe to be very comprehensive and which gives better insight into our data. To obtain a GTF file for the human genome assembly hg19, for instance, go to: https://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/ and download the appropriate GTF file.
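The other annotation layers are requested the same way. As a sketch under stated assumptions, the call below asks for a conservation analysis on rescored results. It uses only flags from the help output above; what --blast_db expects is not documented in this section, so the path is a placeholder. The RP3 variable is a hypothetical dry-run switch that prints the command unless you set RP3=rp3.py.

```shell
# Dry run by default: prints the command. Set RP3=rp3.py to execute it.
RP3=${RP3:-"echo rp3.py"}

OUTDIR=/path/to/outdir/from/previous/modes   # placeholder
BLAST_DB=/path/to/blast/db                   # placeholder; expected format is an assumption

# Conservation analysis restricted to microproteins kept by the rescore mode.
annotate_conservation() {
  $RP3 anno --outdir "$OUTDIR" --threads 8 \
    --conservation --blast_db "$BLAST_DB" --rescored
}

annotate_conservation
```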