This vignette demonstrates distribution-based
quantitative trait
locus mapping as implemented by the R
package distQTL
. distQTL-mapping is performed by way of
Fréchet regression for univariate distribution response objects, and
partial \(F\) tests designed for those
settings. The R
package distQTL
utilizes the
R
package fastfrechet
as a Fréchet regression
workhorse, and implements its own partial \(F\) test in the function
Wasserstein_F
. The R
package
fastfrechet
should be automatically installed as part of
distQTL
; its development version is currently available for
download from GitHub here.
disTQTL
implements distQTL-mapping through the eponymous
function distQTL
(i.e. distQTL::distQTL(...)
).
The function expects a collection of data.table
structure
inputs, which contain
This vignette illustrates how these input objects should be structured. Other inputs, such as donor/gene/SNP filtering specifications and desired cell type groupings, are also explained and illustrated.
The main result of distQTL
is a model p-value, testing
the partial effect of a (cis-)SNP on gene expression given a set of
fixed covariates. The full output object is a nested list structure. At
the first list layer, results are split by the desired cell type
groupings, e.g. “B cells” or “Monocytes”. Each of these groups then is a
list of distQTL
results for each gene under consideration.
Each gene’s results consist of a vector of partial \(F\) test p-values (presented as \(\log_{10}(p)\) results by default), one for
each (cis-)SNP tested. As genes can be differentially expressed across
different cell types, some genes might be filtered out for some cell
type groups due to low expression; gene level results generally
correspond to the small set of SNPs located near each respective
gene.
distQTL
The distQTL
function uses data.table
objects to allow easier in-place memory usage. Denote the number of SNPs
in the data set by nSNPs
, the number of genes by
nGenes
, and the number of fixed covariates (except
genotype) by nCovariates
. The SNP genotype, gene
expression, and covariate information required to perform distQTL are
provided in the first three inputs:
genotype
: a data.table
object with one row
per donor, and 1 + nSNPs
columns. One column must be
labeled donorID
, which contains unique donor IDs for the
data set. The remaining columns must contain donor genotype coding per
SNP - 0
for homozygous major/reference allele
A/A, 1
for heterozygous genotype
A/a, and 2
for homozygous
minor/alternative allele a/a. These columns’ names
should match the values given in the geneID
column in the
geneInfo
input object, described below.geneExpression
: a data.table
object with
one row per cell, and 2 + nGenes
columns. One column must
be labeled donorID
, which contains the donor ID label
per-cell, values matching from the donorID
column in the
genotype
object. One column must be labeled
cellType
, which contains cell type label per-cell. The
remaining columns must contain raw gene expression count measurements,
e.g. values in {0, 1, 2, ...}
. These columns’ names should
match the values given in the geneID
column in the
geneInfo
input object, described below.covariates
: a data.table
object with one
row per donor, and 1 + nCovariates
columns, with precisely
as many donors as genotype
; input will be reordered
row-wise to match genotype
. One column must be labeled
donorID
, which contains unique donor IDs for the data set.
The remaining columns should contain desired donor-level covariate
vectors, such as demographic information, genotype PCAs, PEER factors,
batch information, etc. These columns should contain numeric data only,
or data which can be coerced to numeric. Any factor type (or equivalent)
data should be expanded into 0/1
encoding prior to input.
These columns need not have meaningful names.Information about the genes and SNPs being used is provided in the next two input objects. The main purpose of these inputs is to provide a list of gene and SNP names to match to the columns of the previous objects, for the sake of output naming and linking genes to their cis-SNPs.
geneInfo
: a data.table
object with one row
per gene, and 3
columns. One column must be labeled
geneID
, which contains unique gene identifiers (e.g. gene
names, or Ensembl IDs). One column must be labeled
chromosome
, which contains unique chromosome identifiers
for the genes; the values should match those used in the corresponding
chromosome
column of the snpInfo
input object,
described below. One column must be labeled start
, which
gives the starting base pair locations for the genes in the genome; the
reference genome should match the one used in the corresponding
start
column in the snpInfo
input object.snpInfo
: a data.table
object with one row
per SNP, and 3
columns. One column must be labeled
snpID
, which contains unique SNP identifiers. One column
must be labeled chromosome
, which contains unique
chromosome identifiers for the SNPs; the values should match those used
in the corresponding chromosome
column of the
geneInfo
input object. One column must be labeled
start
, which gives the starting base pair locations for the
SNPs in the genome; the reference genome should match the one used in
the corresponding start
column in the geneInfo
input object.Gene and SNP chromosome
labels should match in the below
objects’ columns in order to be tested together. However, these columns,
and the start
columns, do not in principal have to contain
accurate information. For instance, if you would like to test every
gene-SNP combination, then you could do this by setting the
chromosome
columns all to the same number, and the
start
columns all to the same number, respectively. If you
want to test cis-SNPs, as defined by SNPs within a certain base pair
radius of a gene start location within the same chromosome, then you
should ensure the chromosome
and start
labels
are accurate.
Other parameters control the cell, gene, donor, and SNP filtering
steps which distQTL performs. Filtering can be done to remove genes with
low expression, remove donors with low counts of cells, so on. The extra
inputs that distQTL
takes are:
cellTypeGroups
: an optionally named list object, where
each element is a character vector containing cell types desired to be
grouped together. These cell types should match from the
cellType
column in the geneExpression
input,
but need not be exhaustive of those entries. The list entries’ names
will be used to label cell type group level outputs; we recommend the
list entries’ names are short and descriptive of the groups they
correspond to, e.g.
cellTypeGroups = list(B_cells = c("..."), Monocytes = c("..."), ...)
.m
: a positive integer (default 100
)
containing the empirical distribution grid density. For instance, for a
given group of cells y
satisfying a donor-cellType-gene
combination, the response object is given by
quantile(y, seq(from = 0.5 / m, to = 1 - 0.5 / m, length.out = m), type = 1)
.cisRange
: a positive integer (default 1e5
)
specifying the range defining cis-SNPs around each gene’s starting
location. cis-SNP range is defined by absolute difference in
start
values between genes in geneInfo
and
SNPs in snpInfo
, conditioned on same
chromosome
values. If a SNP does not have the same
chromosome
label as a gene, comparing the
snpInfo
and geneInfo
input objects, then a
model will not be fit for that gene-SNP pair.minCells
: a positive integer (default 10
)
specifying the minimum number of cells a donor must have in order to be
included in regression. This is evaluated on a “within cell type group”
basis.minExpr
: a scalar between \((0, 1]\) (default 0.01
) giving
the minimum required proportion of cells, averaged across donors, that
have positive gene expression. For instance, if
minExpr = 0.10
, and there are 2 donors with 7% and 11%
positive gene expression among their cells, respectively, then the
average positive expression across donors is \((7 + 11)/2 = 9 \leq 10\)%. This gene would
then be considered “low expression”, and excluded from distQTL
calculations.In general, increasing m
leads to longer computation
time; asymptotically we approximate computation time to be quadratic in
m
. Increasing cisRange
results in more
gene-SNP pairs being tested; it may be set to +Inf
to test
all gene-SNP pairs within a chromosome, but be warned this will likely
greatly increase computation time and possibly cause memory issues
depending on your machine. Increasing minCells
will remove
donors with lower cell counts. While this leads to a sample with higher
donor-level certainty, this also reduces the overall sample size.
Setting this parameter too high or too low can both hurt power, and we
do not have practical guidelines for tuning it. Finally, increasing
minExpr
implements a more stringent positive expression
threshold for a gene to have any distQTL models performed on it. As in
other regression contexts, genes with low expression can have lower
signal, making detection challenging and making finite-sample results
unstable and unreliable.
distQTL
Part of the OneK1K data set has been included with the
distQTL
package, specifically expression data from a random
selection of 20 genes, donor genotypes for their various cis-SNPs
(defined as within a 200,000 base pair gene radius), and various
supporting information which will be used by the distQTL
function. Loading the package data generates the data objects as exactly
required by distQTL
.
Let’s see the genontype data:
genotype[1:5, 1:5]
#> donorID GSA-rs62119598 GSA-rs76662040 rs35863918 GSA-rs78580830
#> <fctr> <int> <int> <int> <int>
#> 1: 1_1 0 1 0 0
#> 2: 2_2 0 0 0 0
#> 3: 3_3 0 0 0 0
#> 4: 4_4 0 0 0 0
#> 5: 6_6 1 0 0 0
Let’s see the expression data:
geneExpression[1:5, 1:4]
#> donorID cellType ENSG00000079432 ENSG00000100422
#> <fctr> <fctr> <int> <int>
#> 1: 686_687 naive B cell 0 0
#> 2: 692_693 memory B cell 0 0
#> 3: 692_693 transitional stage B cell 0 0
#> 4: 690_691 memory B cell 0 0
#> 5: 682_683 naive B cell 0 0
Finally, let’s see the covariate data:
covariates[1:5, 1:7]
#> donorID age age2 sex PC_1 PC_2 PC_3
#> <char> <num> <num> <num> <num> <num> <num>
#> 1: 1_1 0.002037722 -0.02523903 1 0.8328050 0.05130451 0.4790027
#> 2: 2_2 -0.031252544 -0.02203353 1 -1.4685475 -1.56720938 -0.4987103
#> 3: 3_3 -0.021461289 -0.02887455 1 0.5190437 -1.31829774 -0.6093993
#> 4: 4_4 -0.031252544 -0.02203353 1 0.6246254 1.10665899 -0.6088034
#> 5: 6_6 0.029453235 0.01478500 1 -0.3387446 0.61225647 1.0366904
The gene information geneInfo
and snp information
snpInfo
contain gene and SNP IDs (respectively),
chromosomes, and genome sequence locations (defined as the start
location in the chromosome sequence; for OneK1K, these are with respect
to the hg19 reference genome).
The cellTypeGroups
object is a list containing the names
of the types of cells each individual measurement comes from. The
cellTypeGroups
object included in the distQTL
data is a two-part list object, the first containing B cell names:
and the second containing monocyte cell names:
Finally, we define the other inputs which will be used in this example.
Distribution-based QTL-mapping is performed on this subset of the OneK1K below.
t0 = Sys.time()
output = distQTL(genotype,
geneExpression,
covariates,
geneInfo,
snpInfo,
cellTypeGroups,
m,
cisRange,
minCells,
minExpr)
runtime = difftime(Sys.time(), t0, units = "sec")
This evaluated 3699 models in 194.4 seconds, for a model fit time of approximately 0.053 seconds per model.
Let’s look at results for one gene, in Monocytes:
DF = data.frame("pval" = round(10^(output[[2]][[3]]), 6))
# The gene:
names(output[[2]])[3]
#> [1] "ENSG00000107897"
# Top selection of SNPs:
head(DF[order(DF[ , 1]), , drop = FALSE])
#> pval
#> GSA-rs79028187 0.000001
#> GSA-rs12245098 0.029021
#> GSA-rs2429499 0.031396
#> rs7908050 0.040590
#> rs651134 0.049764
#> GSA-rs612898 0.049764