Title: | Miscellaneous tools for genomic sequence analysis |
---|---|
Description: | A miscellaneous toolbox for various genomic sequence analysis tasks. Refer to package vignettes for different topics covered in this package. Part of the y3628 analysis suite. |
Authors: | Ye Yuan [aut, cre] |
Maintainer: | Ye Yuan <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.0.2 |
Built: | 2025-02-13 04:46:31 UTC |
Source: | https://github.com/yeyuan98/gsAnalysis |
System cell of MaxEntScan perl script
.MaxEntScanRun(ps.MES, sequences, ME.dir)
.MaxEntScanRun(ps.MES, sequences, ME.dir)
ps.MES |
What MaxEntScan script to run. |
sequences |
character vector of sequences. |
ME.dir |
Path to the unzipped MaxEntScan directory. |
numeric vector of the MaxEntScan return values.
Count number of overlapping bases
.overlapWidths(query, subject)
.overlapWidths(query, subject)
query |
GenomicRanges::Ganges of query. |
subject |
GenomicRanges::Ganges of subject. |
'overlapping bases' is counted for each query against ALL subject ranges. If a query overlaps with two subject ranges with 5 and 4 bases, the number reported will be 5+4=9. Implementation is:
GenomicRanges::subtract(query, subject)
Count up the width of the resulting GRangesList
(original width of query) - (width of subtracted query)
Integer vector of numbers of overlapping bases. Guaranteed to be the same order as the query ranges.
Stranded version of GenomicRanges::shift
.strandedShift(GRanges.input, shift)
.strandedShift(GRanges.input, shift)
GRanges.input |
GenomicRanges::GRanges |
shift |
how many bp to shift each range. For '+' stranded range, a positive shift will be towards 3' of the genomic coordinate. For '-', a positive shift will be towards 5' of the genome. |
Shifted GenomicRanges::GRanges.
Summarize CIGAR string by addition
bam_summary_cigar(cigar, which, FUNC = sum)
bam_summary_cigar(cigar, which, FUNC = sum)
cigar |
CIGAR strings as character vector |
which |
CIGAR operators to extract |
FUNC |
Summarize function. Accept numeric vector and output length = 1. |
Numeric vector of sums of specified CIGAR operators.
cigar.test <- c( "148H47M1D113M1D34M1D39M460H", "50M" ) bam_summary_cigar( # All Ops that consume reference # Therefore sum = reference lengths cigar.test, which = c("M", "D", "N", "=", "X") )
cigar.test <- c( "148H47M1D113M1D34M1D39M460H", "50M" ) bam_summary_cigar( # All Ops that consume reference # Therefore sum = reference lengths cigar.test, which = c("M", "D", "N", "=", "X") )
Distance from branchpoint to 3' splicing site
BranchPointScan( GRanges.intron, branchpoint.motif, BSgenome, logodds.threshold = 0.5 )
BranchPointScan( GRanges.intron, branchpoint.motif, BSgenome, logodds.threshold = 0.5 )
GRanges.intron |
GenomicRanges::GRanges of the introns. |
branchpoint.motif |
An universalmotif::universalmotif-class object
representing the branch point. Can be loaded using universalmotif::read_*
methods. For example, |
BSgenome |
BSgenome object from Bioconductor. |
logodds.threshold |
logodds threshold used by
|
A numeric vector in order of ranges of GRanges.intron. Ranges that do not have identified branchpoint will take NA values.
vignette("intron-properties")
vignette("intron-properties")
Read IRFinder-S output of a single sample
IRFinderS_read(result.dir, type.samples = "validated")
IRFinderS_read(result.dir, type.samples = "validated")
result.dir |
IRFinder-S output directory |
type.samples |
Which summary type to read in, see details |
Two sample types are output by IRFinder-S: validated
and full
.
validated
: read the "IRFinder-IR-nondir-val.txt"
full
: read the "IRFinder-IR-nondir.txt"
tibble of IRFinder-S results
#TODO
#TODO
Report a "merged" SummarizedExperiment for differential analysis. Refer to details section.
IRFinderS_readSamples( named.result.dirs, score.column = "IR.ratio", join.column = c("chr", "start", "end", "strand", "symbol"), type.samples = "validated", min.samples = 3, wl = 0 )
IRFinderS_readSamples( named.result.dirs, score.column = "IR.ratio", join.column = c("chr", "start", "end", "strand", "symbol"), type.samples = "validated", min.samples = 3, wl = 0 )
named.result.dirs |
Named paths to result directories, see description. |
score.column |
Which one column to read in as score. |
join.column |
Which columns define an intron. |
type.samples |
Forwarded to |
min.samples |
How to filter retained introns, see description. |
wl |
How to filter retained introns, see description. |
The routine is divided into the following steps:
First, read in retained introns from each sample by IRFinderS_read
.
Introns that have warnings defined by wl
is removed from each sample.
wl
definition follows that of the IRFinder Diff
routine.
Unique introns are defined by columns specified in join.column
.
Next, consensus introns are determined. If an intron is found in at least
min.samples
of samples it is considered as a consensus intron.
Finally, intron scores from the 'full' data table are extracted, which will be put as the "scores" assay in the output SE object.
named.result.dirs
must be a named character vector whose:
names = group name
values = full path to the sample result directories
SummarizedExperiment
object.
#TODO
#TODO
Splicing site strength scoring by MaxEntScan
MaxEntScan(path.zip.MES = "burgelab.maxent.zip", BSgenome, GRange.intron)
MaxEntScan(path.zip.MES = "burgelab.maxent.zip", BSgenome, GRange.intron)
path.zip.MES |
Path to the MaxEntScan perl program (zipped). While not provided by the package, you may obtain a copy from the original MaxEntScan author. Alternatively, you may get an archived copy from Github. |
BSgenome |
BSgenome object from Bioconductor. |
GRange.intron |
GenomicRanges::GRanges of the introns. Genome must match that of the BSgenome object. This intron ranges must be stranded (i.e., only contains '+' and '-' strand values.) |
data.frame with the following columns. Rows are guaranteed to match order of the input introns.
MaxEnt.5ss, score for the 5' splicing site
MaxEnt.3ss, score for the 3' splicing site
vignette("intron-properties")
vignette("intron-properties")
Conservation scoring by phastCons
phastCons(GRange.intron, bw.phastCons.path, bed.phastCons.path)
phastCons(GRange.intron, bw.phastCons.path, bed.phastCons.path)
GRange.intron |
GenomicRanges::GRanges of the introns. |
bw.phastCons.path |
Path to a bigwig file of phastCons scores. This
may be retrieved from the UCSC (e.g., dm6). Look for
the |
bed.phastCons.path |
Path to a bed file of conserved regions annotated
by phastCons. This may be retrieved from the UCSC (e.g., dm6). Look for the
|
This function can be used for compute scores for any GRanges of interest.
data.frame. Number of rows is the same as number of ranges of
GRange.intron
. Results have the following columns
Mean phastCons values over each range
Percentage of bases in conserved phastCons elements for each range
vignette("intron-properties")
vignette("intron-properties")
Intron length ratio to mean neighboring exons
RIME( txdb = TxDb.Dmelanogaster.UCSC.dm6.ensGene::TxDb.Dmelanogaster.UCSC.dm6.ensGene, GRange.intron )
RIME( txdb = TxDb.Dmelanogaster.UCSC.dm6.ensGene::TxDb.Dmelanogaster.UCSC.dm6.ensGene, GRange.intron )
txdb |
Bioconductor TxDb. Must match the intron GRanges. |
GRange.intron |
GenomicRanges::GRanges of the introns. |
Numeric vector of the ratios.
vignette("intron-properties")
vignette("intron-properties")
Filtering rMATS output of different AS patterns
rmats_filter(df, supp_reads = c(5, 2), incLvl_limits = c(0.05, 0.95))
rmats_filter(df, supp_reads = c(5, 2), incLvl_limits = c(0.05, 0.95))
df |
Data frame of parsed rMATS output. Use |
supp_reads |
Supporting read filter. See details. |
incLvl_limits |
Inclusion level filter. See details. |
Typically, it is desirable to filter the rMATS output to:
Reject detected AS events that have too few supporting reads.
Remove AS events whose inclusion levels are extreme.
This function applies the following filters (default parameters assumed):
Supporting read count (both EJC and IJC) must be >=5 in >=2 samples.
Inclusion level must be in the range of 0.05-0.95.
Filtered rMATS output. See details.
#TODO
#TODO
Read rMATS output of different AS patterns
rmats_read(outputs.dir, method)
rmats_read(outputs.dir, method)
outputs.dir |
Output directory of rMATS. |
method |
Either "JC" or "JCEC", see details. |
This is a convenience function for reading rMATS output, merging different splicing patterns into a single data frame for easier further analysis. Refer to rMATS Github for details on results generated by rMATS: https://github.com/Xinglab/rmats-turbo/blob/v4.3.0/README.md#output
rMATS normalizes lengths of individual splicing variants. There are two methods used for this normalization: JC and JCEC. Refer to the rMATS paper for more details: https://doi.org/10.1073/pnas.1419161111
rMATS coordinated are 0-based; the exact meaning of start and end varies by splicing pattern type:
A3SS/A5SS: start/end = start/end of the long exon (inclusion form).
SE: start/end = start/end of the skipped exon (inclusion form).
MXE: start/end = start of the first exon / end of the second exon.
RI: start/end = start/end of the retained intron (inclusion form).
Data frame of rMATS output
# TODO
# TODO
Convenience function for converting rMATS data to GRanges.
rmats_toGRange(df, ...)
rmats_toGRange(df, ...)
df |
rMATS data read by |
... |
< Three columns are used to fill in GRanges information: Two columns are by default added to the metadata: Extra metadata columns are specified by the tidyselect ellipsis. |
GRanges object
#TODO
#TODO
Write a XStringSet to file (FASTA/FASTQ) with names as identifiers.
writeXStringSetNamed(x, filepath, append = FALSE, compress = FALSE, ...)
writeXStringSetNamed(x, filepath, append = FALSE, compress = FALSE, ...)
x |
Object XString to write to file |
filepath |
File to write to |
append |
Must be False (does not support append) |
compress |
Must be False (does not support compression) |
... |
Forwarded to |
Somehow the original Biostrings::writeXStringSet()
does not write
identifiers. This function uses names provided by name()
and write ID.
#TODO
#TODO