Usage
SERVE is composed of transcript assembly (SERVE.py), annotation merge (SERVE_merge.py), and gene quantification (SERVE_quant.py + SERVE_quant_QC.py).
Input files
RNA sequencing files: paired-end reads in gzip-compressed FASTQ format (fastq.gz)
ERV reference bed: ERV annotation in BED format (at least 6 fields: chrom, chromStart, chromEnd, name, score, and strand). See Annotation/HERV_noncoding.bed
Reference genome: the reference genome sequence (FASTA) is required. You can either build the genome indexes before running SERVE, or provide a reference annotation (GTF) so that SERVE builds the indexes automatically.
Note: all directories and files should be provided with absolute paths.
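Before launching a run, it can help to check the inputs against the requirements above. The following is a minimal sketch and not part of SERVE: the helper names check_bed_line and find_relative_paths are hypothetical, and accepting "." as a strand value follows the general BED convention, which SERVE may or may not accept.

```python
import os


def check_bed_line(line, min_fields=6):
    """Return a list of problems found in one tab-separated BED record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < min_fields:
        return [f"expected >= {min_fields} fields, got {len(fields)}"]
    problems = []
    try:
        start, end = int(fields[1]), int(fields[2])
    except ValueError:
        problems.append("chromStart and chromEnd must be integers")
    else:
        if start < 0 or start >= end:
            problems.append("need 0 <= chromStart < chromEnd")
    # Assumption: "." (unknown strand) is tolerated, as in generic BED files.
    if fields[5] not in {"+", "-", "."}:
        problems.append(f"unexpected strand {fields[5]!r}")
    return problems


def find_relative_paths(paths):
    """Return the paths that are not absolute (SERVE expects absolute paths)."""
    return [p for p in paths if not os.path.isabs(p)]
```

An empty list from either helper means no problem was detected by these particular checks; it does not guarantee that SERVE will accept the file.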
SERVE.py
SERVE.py is designed for ERV transcript assembly in a single sample.
Example:
1. SERVE.py -fq1 ${Dir}/test_1.fastq.gz -fq2 ${Dir}/test_2.fastq.gz -e ${Dir}/ERV_ref.bed -p test -r ${Dir}/GRCh38.fa -a ${Dir}/GRCh38.gtf -t 16
2. SERVE.py -fq1 ${Dir}/test_1.fastq.gz -fq2 ${Dir}/test_2.fastq.gz -e ${Dir}/ERV_ref.bed -p test -S ${STAR_index_Dir} -G ${GMAP_index_Dir} -g GMAP -t 16
usage: SERVE.py [-h] [-fq1 FASTQ1] [-fq2 FASTQ2] [-e ERV_BED] [-p PREFIX]
[-r REF_GENOME] [-a ANNOTATION]
[--genomeSAindexNbases GENOMESAINDEXNBASES] [-S STAR_INDEX]
[-G GMAP_INDEX] [-g GMAP_INDEX_NAME] [-t NTHREAD]
[--nthreadsort NTHREADSORT] [--nRAMsort NRAMSORT]
[-s STRANDED_TYPE] [-m NRAMASSEM] [--max_intron MAX_INTRON]
[--min_identity MIN_IDENTITY] [--count COUNT] [-o OUTPUT_DIR]
SERVE: pipeline for detecting expressed ERVs
optional arguments:
-h, --help show this help message and exit
-fq1 FASTQ1, --fastq1 FASTQ1
Read1 in FASTQ format (required)
-fq2 FASTQ2, --fastq2 FASTQ2
Read2 in FASTQ format (required)
-e ERV_BED, --erv_bed ERV_BED
ERV position in BED format (required)
-p PREFIX, --prefix PREFIX
Prefix for output file name (default: SERVE)
-r REF_GENOME, --ref_genome REF_GENOME
Reference genome in FASTA format (required)
-a ANNOTATION, --annotation ANNOTATION
Genome annotation in GTF format
--genomeSAindexNbases GENOMESAINDEXNBASES
Length (bases) of the SA pre-indexing string for
creating STAR index. Typically between 10 and 15. For
small genomes, this parameter must be scaled down to
min(14, log2(GenomeLength)/2-1)
-S STAR_INDEX, --STAR_index STAR_INDEX
Path to the directory where STAR index is generated
(default: STAR_index)
-G GMAP_INDEX, --GMAP_index GMAP_INDEX
Path to the directory where GMAP index is generated
(default: GMAP_index)
-g GMAP_INDEX_NAME, --GMAP_index_name GMAP_INDEX_NAME
GMAP index name (default: GRCh38)
-t NTHREAD, --nthread NTHREAD
Number of threads to run SERVE (default: 1)
--nthreadsort NTHREADSORT
Number of threads for BAM sorting
--nRAMsort NRAMSORT Maximum available RAM (bytes) for sorting BAM
-s STRANDED_TYPE, --stranded_type STRANDED_TYPE
Strand-specific RNA-seq read orientation: RF or FR
(default: None)
-m NRAMASSEM, --nRAMassem NRAMASSEM
Maximum available RAM (GB) for assembly (default: 10G)
--max_intron MAX_INTRON
Maximum intron length of ERVs (default: 10000)
--min_identity MIN_IDENTITY
Minimum identity of ERV transcripts (default: 0.96)
--count COUNT Minimum ERV count (default: 5)
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Output directory (default: .)
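For small genomes, the --genomeSAindexNbases value can be derived from the formula in the help text, min(14, log2(GenomeLength)/2-1). The sketch below is illustrative only: the function names are hypothetical, and rounding down to an integer is an assumption, since the help text does not say how to round.

```python
import math


def fasta_length(lines):
    """Total number of bases in FASTA text, ignoring header lines."""
    return sum(len(line.strip()) for line in lines if not line.startswith(">"))


def genome_sa_index_nbases(genome_length):
    """min(14, log2(GenomeLength)/2 - 1), rounded down (rounding is an assumption)."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return min(14, math.floor(math.log2(genome_length) / 2 - 1))
```

With a genome of about 3.1 Gb the formula is capped at 14, while a 1 Mb genome yields a smaller value that would be passed as --genomeSAindexNbases.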
In the output directory, files are organized into three directories:
1_align/ for alignment
2_assem/ for assembly
3_qc/ for quality control
${prefix}_ERV.gtf (in 3_qc/) records ERV information in one sample and can be used by SERVE_merge.py.
SERVE_merge.py
SERVE_merge.py merges ERV annotations from a single sample or from multiple samples in one condition.
Example:
ls *gtf > gtf.list
SERVE_merge.py -i ${Dir}/gtf.list -p test -n 10 -r ${Dir}/GRCh38.fa -t 16
usage: SERVE_merge.py [-h] [-i INPUT_GTF_LIST] [-p PREFIX] [--taco TACO]
[--stringtie STRINGTIE] [-n NSAMPLE] [-r REF_GENOME]
[-G GMAP_INDEX] [-g GMAP_INDEX_NAME] [-t NTHREAD]
[-l LENGTH] [--ratio RATIO] [-o OUTPUT_DIR]
SERVE_merge: merge expressed ERVs
optional arguments:
-h, --help show this help message and exit
-i INPUT_GTF_LIST, --input_gtf_list INPUT_GTF_LIST
A text file with a list of SERVE GTF files (required)
-p PREFIX, --prefix PREFIX
Prefix for output file name (default: SERVE)
--taco TACO Merge ERV transcripts with TACO (default: FALSE)
--stringtie STRINGTIE
Merge ERV transcripts with StringTie (default: FALSE)
-n NSAMPLE, --nsample NSAMPLE
The number of samples included in the input sample list (required)
-r REF_GENOME, --ref_genome REF_GENOME
Reference genome in FASTA format (required)
-G GMAP_INDEX, --GMAP_index GMAP_INDEX
Path to the directory where GMAP index is generated
(default: GMAP_index)
-g GMAP_INDEX_NAME, --GMAP_index_name GMAP_INDEX_NAME
GMAP index name (default: GRCh38)
-t NTHREAD, --nthread NTHREAD
Number of threads to run SERVE_merge (default: 1)
-l LENGTH, --length LENGTH
Minimum ERV length (bp) (default: 200)
--ratio RATIO Minimum sample ratio of ERV identified (default: 0.50)
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Output directory (default: .)
${prefix}_ERV_merge.gtf in the output directory records ERV gene information for a condition and can be used for downstream quantification.
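Because SERVE expects absolute paths and -n must match the number of listed samples, the GTF list can also be built programmatically. This is a minimal sketch under stated assumptions: write_gtf_list is a hypothetical helper, the list is assumed to hold one path per line (as produced by ls redirected to a file), and the *_ERV.gtf pattern is taken from the ${prefix}_ERV.gtf naming described for SERVE.py.

```python
from pathlib import Path


def write_gtf_list(gtf_dir, list_path, pattern="*_ERV.gtf"):
    """Write absolute paths of per-sample GTFs, one per line; return their count."""
    gtfs = sorted(p.resolve() for p in Path(gtf_dir).glob(pattern))
    Path(list_path).write_text("".join(f"{p}\n" for p in gtfs))
    return len(gtfs)
```

The returned count can be passed to -n. The pattern deliberately does not match a previously written ${prefix}_ERV_merge.gtf, which a plain *gtf glob in the same directory would pick up.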
SERVE_quant.py
SERVE_quant.py is designed for ERV quantification in a single sample.
Example:
cat ${Dir}/GRCh38.gtf ${Dir}/${prefix}_ERV_merge.gtf > ${Dir}/GRCh38_ERV.gtf
SERVE_quant.py -fq1 ${Dir}/test_1.fastq.gz -fq2 ${Dir}/test_2.fastq.gz -p test -r ${Dir}/GRCh38.fa -a ${Dir}/GRCh38_ERV.gtf -t 16
usage: SERVE_quant.py [-h] [-fq1 FASTQ1] [-fq2 FASTQ2] [-p PREFIX]
[-r REF_GENOME] [-a ANNOTATION] [-R RSEM_INDEX]
[-t NTHREAD] [-f FORWARD_PROB] [-o OUTPUT_DIR]
SERVE_quant: Quantify expressed ERVs
optional arguments:
-h, --help show this help message and exit
-fq1 FASTQ1, --fastq1 FASTQ1
Read1 in FASTQ format (required)
-fq2 FASTQ2, --fastq2 FASTQ2
Read2 in FASTQ format (required)
-p PREFIX, --prefix PREFIX
Prefix for output file name (default: SERVE)
-r REF_GENOME, --ref_genome REF_GENOME
Reference genome in FASTA format (required)
-a ANNOTATION, --annotation ANNOTATION
Genome annotation in GTF format (required)
-R RSEM_INDEX, --RSEM_index RSEM_INDEX
Path to the directory where RSEM index is generated
(default: RSEM_index)
-t NTHREAD, --nthread NTHREAD
Number of threads to run SERVE (default: 1)
-f FORWARD_PROB, --forward_prob FORWARD_PROB
Probability of generating a read from the forward
strand of a transcript. 0.5 for unstranded-specific, 0
for stranded-specific where upstream reads are all
derived from the forward strand, 1 for stranded-specific
where upstream reads are all derived from the reverse
strand (default: 0.5)
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Output directory (default: .)
${prefix}.genes.results in the output directory records gene quantification results. If you have biological replicates, run SERVE_quant_QC.py for quality control.
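To inspect the quantification results, the table can be loaded with a few lines of Python. The sketch below assumes that ${prefix}.genes.results follows the tab-separated RSEM layout with gene_id, expected_count, and TPM columns; this is inferred from the RSEM index option and is not confirmed here. The function name read_genes_results is hypothetical.

```python
import csv


def read_genes_results(handle):
    """Map gene_id -> (expected_count, TPM) from an RSEM-style genes.results table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["gene_id"]: (float(row["expected_count"]), float(row["TPM"]))
        for row in reader
    }
```

Pass an open file handle, for example the result of open() on the genes.results file. A KeyError on one of the column names indicates that the file does not match the assumed layout.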
SERVE_quant_QC.py
SERVE_quant_QC.py merges gene quantification results (from SERVE_quant.py) and performs quality control.
Example:
ls *genes.results > sample.list
SERVE_quant_QC.py -i ${Dir}/sample.list -p test
usage: SERVE_quant_QC.py [-h] [-i INPUT_SAMPLE_LIST] [-p PREFIX]
[--count COUNT] [--TPM TPM] [--ratio RATIO]
[-o OUTPUT_DIR]
SERVE_quant_QC: Quality control expressed ERVs
optional arguments:
-h, --help show this help message and exit
-i INPUT_SAMPLE_LIST, --input_sample_list INPUT_SAMPLE_LIST
A text file with a list of SERVE quant files (required)
-p PREFIX, --prefix PREFIX
Prefix for output file name (default: SERVE)
--count COUNT Minimum ERV count (default: 5)
--TPM TPM Minimum ERV TPM (default: 0.1)
--ratio RATIO Minimum sample ratio of ERV identified (default: 0.50)
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Output directory (default: .)
${prefix}_ERV_count.txt and ${prefix}_ERV_TPM.txt record gene read counts and TPM, respectively.
Default QC thresholds: count > 5 and TPM >= 0.1 in over 50% of samples.
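The default thresholds can be read as a per-gene rule applied across replicates. The following sketch illustrates that rule as stated above. It is not SERVE_quant_QC.py's actual implementation, and the function name passes_qc and its parameter names are hypothetical.

```python
def passes_qc(counts, tpms, min_count=5, min_tpm=0.1, min_ratio=0.5):
    """True if more than min_ratio of samples have count > min_count and TPM >= min_tpm."""
    if not counts or len(counts) != len(tpms):
        raise ValueError("counts and tpms must be non-empty and of equal length")
    n_pass = sum(
        1 for c, t in zip(counts, tpms) if c > min_count and t >= min_tpm
    )
    # "over 50%" is interpreted as a strict inequality (an assumption).
    return n_pass / len(counts) > min_ratio
```

Under this reading, a gene that meets both thresholds in exactly half of the samples is filtered out, while one that meets them in three of four samples is kept.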