| Title: | Miscellaneous tools for genomic sequence analysis |
|---|---|
| Description: | A miscellaneous toolbox for various genomic sequence analysis tasks. Refer to package vignettes for different topics covered in this package. Part of the y3628 analysis suite. |
| Authors: | Ye Yuan [aut, cre] (ORCID: <https://orcid.org/0000-0001-9641-9102>) |
| Maintainer: | Ye Yuan <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.0.2 |
| Built: | 2026-06-04 09:01:07 UTC |
| Source: | https://github.com/yeyuan98/gsAnalysis |
System cell of MaxEntScan perl script
.MaxEntScanRun(ps.MES, sequences, ME.dir).MaxEntScanRun(ps.MES, sequences, ME.dir)
ps.MES |
What MaxEntScan script to run. |
sequences |
character vector of sequences. |
ME.dir |
Path to the unzipped MaxEntScan directory. |
numeric vector of the MaxEntScan return values.
Count number of overlapping bases
.overlapWidths(query, subject).overlapWidths(query, subject)
query |
GenomicRanges::Ganges of query. |
subject |
GenomicRanges::Ganges of subject. |
'overlapping bases' is counted for each query against ALL subject ranges. If a query overlaps with two subject ranges with 5 and 4 bases, the number reported will be 5+4=9. Implementation is:
GenomicRanges::subtract(query, subject)
Count up the width of the resulting GRangesList
(original width of query) - (width of subtracted query)
Integer vector of numbers of overlapping bases. Guaranteed to be the same order as the query ranges.
Stranded version of GenomicRanges::shift
.strandedShift(GRanges.input, shift).strandedShift(GRanges.input, shift)
GRanges.input |
GenomicRanges::GRanges |
shift |
how many bp to shift each range. For '+' stranded range, a positive shift will be towards 3' of the genomic coordinate. For '-', a positive shift will be towards 5' of the genome. |
Shifted GenomicRanges::GRanges.
Summarize CIGAR string by addition
bam_summary_cigar(cigar, which, FUNC = sum)bam_summary_cigar(cigar, which, FUNC = sum)
cigar |
CIGAR strings as character vector |
which |
CIGAR operators to extract |
FUNC |
Summarize function. Accept numeric vector and output length = 1. |
Numeric vector of sums of specified CIGAR operators.
cigar.test <- c( "148H47M1D113M1D34M1D39M460H", "50M" ) bam_summary_cigar( # All Ops that consume reference # Therefore sum = reference lengths cigar.test, which = c("M", "D", "N", "=", "X") )cigar.test <- c( "148H47M1D113M1D34M1D39M460H", "50M" ) bam_summary_cigar( # All Ops that consume reference # Therefore sum = reference lengths cigar.test, which = c("M", "D", "N", "=", "X") )
Distance from branchpoint to 3' splicing site
BranchPointScan( GRanges.intron, branchpoint.motif, BSgenome, logodds.threshold = 0.5 )BranchPointScan( GRanges.intron, branchpoint.motif, BSgenome, logodds.threshold = 0.5 )
GRanges.intron |
GenomicRanges::GRanges of the introns. |
branchpoint.motif |
An universalmotif::universalmotif-class object
representing the branch point. Can be loaded using universalmotif::read_*
methods. For example, |
BSgenome |
BSgenome object from Bioconductor. |
logodds.threshold |
logodds threshold used by
|
A numeric vector in order of ranges of GRanges.intron. Ranges that do not have identified branchpoint will take NA values.
vignette("intron-properties")vignette("intron-properties")
Differential analysis with beta regression
diff_breg(se, design, target, n.cores = "auto")diff_breg(se, design, target, n.cores = "auto")
se |
SummarizedExperiment containing data to test. |
design |
Design formula. |
target |
Which target to perform test. |
n.cores |
Multicore by |
data.frame of test results with FDR.
# TODO# TODO
This takes data.frame input.
diff_breg_single(data, design, target, check = TRUE)diff_breg_single(data, design, target, check = TRUE)
data |
Data to use. |
design |
Design formula of the problem. |
target |
Which target to test. |
check |
Whether to perform sanity checks. |
data.frame of test results.
# TODO.# TODO.
Read IRFinder-S output of a single sample
IRFinderS_read(result.dir, type.samples = "validated")IRFinderS_read(result.dir, type.samples = "validated")
result.dir |
IRFinder-S output directory |
type.samples |
Which summary type to read in, see details |
Two sample types are output by IRFinder-S: validated and full.
validated: read the "IRFinder-IR-nondir-val.txt"
full: read the "IRFinder-IR-nondir.txt"
For column names please refer to the function body
tibble of IRFinder-S results
#TODO#TODO
Report a "merged" SummarizedExperiment for differential analysis. Refer to details section.
IRFinderS_readSamples( named.result.dirs, assay.columns = c("IR.ratio", "depth", "n.reads.spliceExact"), join.columns = c("chr", "start", "end", "strand", "symbol"), type.samples = "validated", min.samples = 3, wl = 0 )IRFinderS_readSamples( named.result.dirs, assay.columns = c("IR.ratio", "depth", "n.reads.spliceExact"), join.columns = c("chr", "start", "end", "strand", "symbol"), type.samples = "validated", min.samples = 3, wl = 0 )
named.result.dirs |
Named paths to result directories, see description. |
assay.columns |
Which column(s) to read in as assays of the output. |
join.columns |
Which columns define an intron. |
type.samples |
Forwarded to |
min.samples |
How to filter retained introns, see description. |
wl |
How to filter retained introns, see description. |
The routine is divided into the following steps:
First, read in retained introns from each sample by IRFinderS_read.
named.result.dirs must be a named character vector whose:
names = group name
values = full path to the sample result directories
Introns that have warnings defined by wl is removed from each sample.
wl definition follows that of the IRFinder Diff routine.
Unique introns are defined by columns specified in join.columns.
Next, consensus introns are determined. If an intron is found in at least
min.samples of samples it is considered as a consensus intron.
Finally, intron data from the 'full' data table are extracted,
which will be put as assays in the output SE object. Which data columns are
included is defined by assay.columns.
wl filteringThis filtering determines whether an intron is included in a sample. It looks at the warning flags by IRFinder and possible values are:
0 -> no filter; 1 -> LowCover introns removed; 2 -> Lowcover & LowSplicing; 3 -> LowCover & LowSplicing & MinorIsoform; 4 -> LowCover & LowSplicing & MinorIsoform & NonUniformIntronCover
SummarizedExperiment object of IRFinder results.
#TODO#TODO
Splicing site strength scoring by MaxEntScan
MaxEntScan(path.zip.MES = "burgelab.maxent.zip", BSgenome, GRange.intron)MaxEntScan(path.zip.MES = "burgelab.maxent.zip", BSgenome, GRange.intron)
path.zip.MES |
Path to the MaxEntScan perl program (zipped). While not provided by the package, you may obtain a copy from the original MaxEntScan author. Alternatively, you may get an archived copy from Github. |
BSgenome |
BSgenome object from Bioconductor. |
GRange.intron |
GenomicRanges::GRanges of the introns. Genome must match that of the BSgenome object. This intron ranges must be stranded (i.e., only contains '+' and '-' strand values.) |
data.frame with the following columns. Rows are guaranteed to match order of the input introns.
MaxEnt.5ss, score for the 5' splicing site
MaxEnt.3ss, score for the 3' splicing site
vignette("intron-properties")vignette("intron-properties")
Conservation scoring by phastCons
phastCons(GRange.intron, bw.phastCons.path, bed.phastCons.path)phastCons(GRange.intron, bw.phastCons.path, bed.phastCons.path)
GRange.intron |
GenomicRanges::GRanges of the introns. |
bw.phastCons.path |
Path to a bigwig file of phastCons scores. This
may be retrieved from the UCSC (e.g., dm6). Look for
the |
bed.phastCons.path |
Path to a bed file of conserved regions annotated
by phastCons. This may be retrieved from the UCSC (e.g., dm6). Look for the
|
This function can be used for compute scores for any GRanges of interest.
data.frame. Number of rows is the same as number of ranges of
GRange.intron. Results have the following columns
Mean phastCons values over each range
Percentage of bases in conserved phastCons elements for each range
vignette("intron-properties")vignette("intron-properties")
Read STAR Aligner Gene Counts
readStarGeneCounts(named_paths, clean = FALSE)readStarGeneCounts(named_paths, clean = FALSE)
named_paths |
Named list of count table paths. |
clean |
Boolean, whether to remove entries starting with "N_". |
SummarizedExperiment of counts. Metadata sample will be names.
# TODO# TODO
Intron length ratio to mean neighboring exons
RIME( txdb = TxDb.Dmelanogaster.UCSC.dm6.ensGene::TxDb.Dmelanogaster.UCSC.dm6.ensGene, GRange.intron )RIME( txdb = TxDb.Dmelanogaster.UCSC.dm6.ensGene::TxDb.Dmelanogaster.UCSC.dm6.ensGene, GRange.intron )
txdb |
Bioconductor TxDb. Must match the intron GRanges. |
GRange.intron |
GenomicRanges::GRanges of the introns. |
Numeric vector of the ratios.
vignette("intron-properties")vignette("intron-properties")
Assign filter labels of rMATS output
rmats_filter( df, readCov = 10, maxPSI = 0.95, minPSI = 0.05, sigFDR = 0.1, sigDeltaPSI = 0.05, res_column = "filter" )rmats_filter( df, readCov = 10, maxPSI = 0.95, minPSI = 0.05, sigFDR = 0.1, sigDeltaPSI = 0.05, res_column = "filter" )
df |
rMATS data read by |
readCov |
Minimum read coverage. |
maxPSI |
Maximum percent spliced-in PSI. |
minPSI |
Minimum PSI. |
sigFDR |
FDR threshold for significance. |
sigDeltaPSI |
PSI threshold for significance. |
res_column |
Which column to store filter results |
This function reproduces the data filter demonstrated in Xing Lab 2024 Nature Protocol tutorial. See Github repository Xinglab/rmats-turbo-tutorial and refer to rmats_filtering.py.
AS events that does not satisfy any of the following criteria will be given the remove label:
Average read counts (IJC+SJC) in both Sample 1 and Sample 2 conditions
are at least readCov.
Average PSI in both Sample 1 and Sample 2 conditions are between
minPSI and maxPSI.
If all criteria above are satisfied, events will be labeled by statistical significance:
FDR < sigFDR and IncLevelDifference > sigDeltaPSI: up.
FDR < sigFDR and IncLevelDifference < -sigDeltaPSI: down.
Otherwise: ns.
rMATS data table with filter label column filter.
# TODO# TODO
Read rMATS output of different AS patterns
rmats_read(outputs.dir, method)rmats_read(outputs.dir, method)
outputs.dir |
Output directory of rMATS. |
method |
Either "JC" or "JCEC", see details. |
This is a convenience function for reading rMATS output, merging different splicing patterns into a single data frame for easier further analysis. Refer to rMATS Github for details on results generated by rMATS: https://github.com/Xinglab/rmats-turbo/blob/v4.3.0/README.md#output
rMATS normalizes lengths of individual splicing variants. There are two methods used for this normalization: JC and JCEC. Refer to the rMATS paper for more details: https://doi.org/10.1073/pnas.1419161111
rMATS coordinated are 0-based; the exact meaning of start and end varies by splicing pattern type:
A3SS/A5SS: start/end = start/end of the long exon (inclusion form).
SE: start/end = start/end of the skipped exon (inclusion form).
MXE: start/end = start of the first exon / end of the second exon.
RI: start/end = start/end of the retained intron (inclusion form).
Data frame of rMATS output
# TODO# TODO
Convenience function for converting rMATS data to GRanges.
rmats_toGRange(df, ...)rmats_toGRange(df, ...)
df |
rMATS data read by |
... |
< Three columns are used to fill in GRanges information: Two columns are by default added to the metadata: Extra metadata columns are specified by the tidyselect ellipsis. |
GRanges object
#TODO#TODO
Subset columns using expr_colData.
Evaluate expr_assay on assays() of the subset.
Apply function fn_rowwise to get subset vector.
Use the subset vector to subset rows of the input se.
se_subset(se, expr_assay, expr_colData, fn_rowwise)se_subset(se, expr_assay, expr_colData, fn_rowwise)
se |
SummarizedExperiment object. |
expr_assay |
Expression to evaluate on assays |
expr_colData |
Expression to evaluate on colData |
fn_rowwise |
Predicate to define rowwise filter |
Subset SummarizedExperiment object.
# TODO# TODO
Write a XStringSet to file (FASTA/FASTQ) with names as identifiers.
writeXStringSetNamed(x, filepath, append = FALSE, compress = FALSE, ...)writeXStringSetNamed(x, filepath, append = FALSE, compress = FALSE, ...)
x |
Object XString to write to file |
filepath |
File to write to |
append |
Must be False (does not support append) |
compress |
Must be False (does not support compression) |
... |
Forwarded to |
Somehow the original Biostrings::writeXStringSet() does not write
identifiers. This function uses names provided by name() and write ID.
#TODO#TODO