Package 'gsAnalysis'

Title: Miscellaneous tools for genomic sequence analysis
Description: A miscellaneous toolbox for various genomic sequence analysis tasks. Refer to package vignettes for different topics covered in this package. Part of the y3628 analysis suite.
Authors: Ye Yuan [aut, cre] (ORCID: <https://orcid.org/0000-0001-9641-9102>)
Maintainer: Ye Yuan <[email protected]>
License: GPL (>= 3)
Version: 0.0.2
Built: 2026-06-04 09:01:07 UTC
Source: https://github.com/yeyuan98/gsAnalysis

Help Index


System cell of MaxEntScan perl script

Description

System cell of MaxEntScan perl script

Usage

.MaxEntScanRun(ps.MES, sequences, ME.dir)

Arguments

ps.MES

What MaxEntScan script to run.

sequences

character vector of sequences.

ME.dir

Path to the unzipped MaxEntScan directory.

Value

numeric vector of the MaxEntScan return values.


Count number of overlapping bases

Description

Count number of overlapping bases

Usage

.overlapWidths(query, subject)

Arguments

query

GenomicRanges::Ganges of query.

subject

GenomicRanges::Ganges of subject.

Details

'overlapping bases' is counted for each query against ALL subject ranges. If a query overlaps with two subject ranges with 5 and 4 bases, the number reported will be 5+4=9. Implementation is:

  1. GenomicRanges::subtract(query, subject)

  2. Count up the width of the resulting GRangesList

  3. (original width of query) - (width of subtracted query)

Value

Integer vector of numbers of overlapping bases. Guaranteed to be the same order as the query ranges.


Stranded version of GenomicRanges::shift

Description

Stranded version of GenomicRanges::shift

Usage

.strandedShift(GRanges.input, shift)

Arguments

GRanges.input

GenomicRanges::GRanges

shift

how many bp to shift each range. For '+' stranded range, a positive shift will be towards 3' of the genomic coordinate. For '-', a positive shift will be towards 5' of the genome.

Value

Shifted GenomicRanges::GRanges.


Summarize CIGAR string by addition

Description

Summarize CIGAR string by addition

Usage

bam_summary_cigar(cigar, which, FUNC = sum)

Arguments

cigar

CIGAR strings as character vector

which

CIGAR operators to extract

FUNC

Summarize function. Accept numeric vector and output length = 1.

Value

Numeric vector of sums of specified CIGAR operators.

Examples

cigar.test <- c(
  "148H47M1D113M1D34M1D39M460H",
  "50M"
)
bam_summary_cigar(
  # All Ops that consume reference
  # Therefore sum = reference lengths
  cigar.test, which = c("M", "D", "N", "=", "X")
)

Distance from branchpoint to 3' splicing site

Description

Distance from branchpoint to 3' splicing site

Usage

BranchPointScan(
  GRanges.intron,
  branchpoint.motif,
  BSgenome,
  logodds.threshold = 0.5
)

Arguments

GRanges.intron

GenomicRanges::GRanges of the introns.

branchpoint.motif

An universalmotif::universalmotif-class object representing the branch point. Can be loaded using universalmotif::read_* methods. For example, universalmotif::read_meme().

BSgenome

BSgenome object from Bioconductor.

logodds.threshold

logodds threshold used by universalmotif::scan_sequences().

Value

A numeric vector in order of ranges of GRanges.intron. Ranges that do not have identified branchpoint will take NA values.

Examples

vignette("intron-properties")

Differential analysis with beta regression

Description

Differential analysis with beta regression

Usage

diff_breg(se, design, target, n.cores = "auto")

Arguments

se

SummarizedExperiment containing data to test.

design

Design formula.

target

Which target to perform test.

n.cores

Multicore by parallel. Either "auto", integer, or NA. NA means no parallel processing.

Value

data.frame of test results with FDR.

Examples

# TODO

Differential analysis with beta regression

Description

This takes data.frame input.

Usage

diff_breg_single(data, design, target, check = TRUE)

Arguments

data

Data to use.

design

Design formula of the problem.

target

Which target to test.

check

Whether to perform sanity checks.

Value

data.frame of test results.

Examples

# TODO.

Read IRFinder-S output of a single sample

Description

Read IRFinder-S output of a single sample

Usage

IRFinderS_read(result.dir, type.samples = "validated")

Arguments

result.dir

IRFinder-S output directory

type.samples

Which summary type to read in, see details

Details

Two sample types are output by IRFinder-S: validated and full.

  • validated: read the "IRFinder-IR-nondir-val.txt"

  • full: read the "IRFinder-IR-nondir.txt"

For column names please refer to the function body

Value

tibble of IRFinder-S results

Examples

#TODO

Read IRFinder-S output of multiple samples

Description

Report a "merged" SummarizedExperiment for differential analysis. Refer to details section.

Usage

IRFinderS_readSamples(
  named.result.dirs,
  assay.columns = c("IR.ratio", "depth", "n.reads.spliceExact"),
  join.columns = c("chr", "start", "end", "strand", "symbol"),
  type.samples = "validated",
  min.samples = 3,
  wl = 0
)

Arguments

named.result.dirs

Named paths to result directories, see description.

assay.columns

Which column(s) to read in as assays of the output.

join.columns

Which columns define an intron.

type.samples

Forwarded to IRFinderS_read().

min.samples

How to filter retained introns, see description.

wl

How to filter retained introns, see description.

Details

The routine is divided into the following steps:

First, read in retained introns from each sample by IRFinderS_read.

named.result.dirs must be a named character vector whose:

  • names = group name

  • values = full path to the sample result directories

Introns that have warnings defined by wl is removed from each sample. wl definition follows that of the ⁠IRFinder Diff⁠ routine.

Unique introns are defined by columns specified in join.columns.

Next, consensus introns are determined. If an intron is found in at least min.samples of samples it is considered as a consensus intron.

Finally, intron data from the 'full' data table are extracted, which will be put as assays in the output SE object. Which data columns are included is defined by assay.columns.

wl filtering

This filtering determines whether an intron is included in a sample. It looks at the warning flags by IRFinder and possible values are:

0 -> no filter; 1 -> LowCover introns removed; 2 -> Lowcover & LowSplicing; 3 -> LowCover & LowSplicing & MinorIsoform; 4 -> LowCover & LowSplicing & MinorIsoform & NonUniformIntronCover

Value

SummarizedExperiment object of IRFinder results.

Examples

#TODO

Splicing site strength scoring by MaxEntScan

Description

Splicing site strength scoring by MaxEntScan

Usage

MaxEntScan(path.zip.MES = "burgelab.maxent.zip", BSgenome, GRange.intron)

Arguments

path.zip.MES

Path to the MaxEntScan perl program (zipped). While not provided by the package, you may obtain a copy from the original MaxEntScan author. Alternatively, you may get an archived copy from Github.

BSgenome

BSgenome object from Bioconductor.

GRange.intron

GenomicRanges::GRanges of the introns. Genome must match that of the BSgenome object. This intron ranges must be stranded (i.e., only contains '+' and '-' strand values.)

Value

data.frame with the following columns. Rows are guaranteed to match order of the input introns.

  • MaxEnt.5ss, score for the 5' splicing site

  • MaxEnt.3ss, score for the 3' splicing site

Examples

vignette("intron-properties")

Conservation scoring by phastCons

Description

Conservation scoring by phastCons

Usage

phastCons(GRange.intron, bw.phastCons.path, bed.phastCons.path)

Arguments

GRange.intron

GenomicRanges::GRanges of the introns.

bw.phastCons.path

Path to a bigwig file of phastCons scores. This may be retrieved from the UCSC (e.g., dm6). Look for the dm6.phastCons124way.bw file.

bed.phastCons.path

Path to a bed file of conserved regions annotated by phastCons. This may be retrieved from the UCSC (e.g., dm6). Look for the phastConsElements124way.txt.gz file.

Details

This function can be used for compute scores for any GRanges of interest.

Value

data.frame. Number of rows is the same as number of ranges of GRange.intron. Results have the following columns

mean

Mean phastCons values over each range

perc.in.element

Percentage of bases in conserved phastCons elements for each range

Examples

vignette("intron-properties")

Read STAR Aligner Gene Counts

Description

Read STAR Aligner Gene Counts

Usage

readStarGeneCounts(named_paths, clean = FALSE)

Arguments

named_paths

Named list of count table paths.

clean

Boolean, whether to remove entries starting with "N_".

Value

SummarizedExperiment of counts. Metadata sample will be names.

Examples

# TODO

Intron length ratio to mean neighboring exons

Description

Intron length ratio to mean neighboring exons

Usage

RIME(
  txdb = TxDb.Dmelanogaster.UCSC.dm6.ensGene::TxDb.Dmelanogaster.UCSC.dm6.ensGene,
  GRange.intron
)

Arguments

txdb

Bioconductor TxDb. Must match the intron GRanges.

GRange.intron

GenomicRanges::GRanges of the introns.

Value

Numeric vector of the ratios.

Examples

vignette("intron-properties")

Assign filter labels of rMATS output

Description

Assign filter labels of rMATS output

Usage

rmats_filter(
  df,
  readCov = 10,
  maxPSI = 0.95,
  minPSI = 0.05,
  sigFDR = 0.1,
  sigDeltaPSI = 0.05,
  res_column = "filter"
)

Arguments

df

rMATS data read by rmats_read().

readCov

Minimum read coverage.

maxPSI

Maximum percent spliced-in PSI.

minPSI

Minimum PSI.

sigFDR

FDR threshold for significance.

sigDeltaPSI

PSI threshold for significance.

res_column

Which column to store filter results

Details

This function reproduces the data filter demonstrated in Xing Lab 2024 Nature Protocol tutorial. See Github repository Xinglab/rmats-turbo-tutorial and refer to rmats_filtering.py.

AS events that does not satisfy any of the following criteria will be given the remove label:

  1. Average read counts (IJC+SJC) in both Sample 1 and Sample 2 conditions are at least readCov.

  2. Average PSI in both Sample 1 and Sample 2 conditions are between minPSI and maxPSI.

If all criteria above are satisfied, events will be labeled by statistical significance:

  1. FDR < sigFDR and IncLevelDifference > sigDeltaPSI: up.

  2. FDR < sigFDR and IncLevelDifference < -sigDeltaPSI: down.

  3. Otherwise: ns.

Value

rMATS data table with filter label column filter.

Examples

# TODO

Read rMATS output of different AS patterns

Description

Read rMATS output of different AS patterns

Usage

rmats_read(outputs.dir, method)

Arguments

outputs.dir

Output directory of rMATS.

method

Either "JC" or "JCEC", see details.

Details

This is a convenience function for reading rMATS output, merging different splicing patterns into a single data frame for easier further analysis. Refer to rMATS Github for details on results generated by rMATS: https://github.com/Xinglab/rmats-turbo/blob/v4.3.0/README.md#output

rMATS normalizes lengths of individual splicing variants. There are two methods used for this normalization: JC and JCEC. Refer to the rMATS paper for more details: https://doi.org/10.1073/pnas.1419161111

rMATS coordinated are 0-based; the exact meaning of start and end varies by splicing pattern type:

  • A3SS/A5SS: start/end = start/end of the long exon (inclusion form).

  • SE: start/end = start/end of the skipped exon (inclusion form).

  • MXE: start/end = start of the first exon / end of the second exon.

  • RI: start/end = start/end of the retained intron (inclusion form).

Value

Data frame of rMATS output

Examples

# TODO

Converts rMATS data frame to GenomicRange

Description

Convenience function for converting rMATS data to GRanges.

Usage

rmats_toGRange(df, ...)

Arguments

df

rMATS data read by rmats_read().

...

<tidy-select> See details.

Three columns are used to fill in GRanges information: chr, start_0base, end.

Two columns are by default added to the metadata: geneSymbol, Type.

Extra metadata columns are specified by the tidyselect ellipsis.

Value

GRanges object

Examples

#TODO

Subset SummarizedExperiment

Description

  1. Subset columns using expr_colData.

  2. Evaluate expr_assay on assays() of the subset.

  3. Apply function fn_rowwise to get subset vector.

  4. Use the subset vector to subset rows of the input se.

Usage

se_subset(se, expr_assay, expr_colData, fn_rowwise)

Arguments

se

SummarizedExperiment object.

expr_assay

Expression to evaluate on assays

expr_colData

Expression to evaluate on colData

fn_rowwise

Predicate to define rowwise filter

Value

Subset SummarizedExperiment object.

Examples

# TODO

writeXStringSet with identifiers

Description

Write a XStringSet to file (FASTA/FASTQ) with names as identifiers.

Usage

writeXStringSetNamed(x, filepath, append = FALSE, compress = FALSE, ...)

Arguments

x

Object XString to write to file

filepath

File to write to

append

Must be False (does not support append)

compress

Must be False (does not support compression)

...

Forwarded to Biostrings::writeXStringSet()

Details

Somehow the original Biostrings::writeXStringSet() does not write identifiers. This function uses names provided by name() and write ID.

Examples

#TODO