intro-distQTL

library(distQTL)

Introduction

This vignette demonstrates distribution-based quantitative trait locus mapping as implemented by the R package distQTL. distQTL-mapping is performed by way of Fréchet regression for univariate distribution response objects, and partial \(F\) tests designed for those settings. The R package distQTL utilizes the R package fastfrechet as a Fréchet regression workhorse, and implements its own partial \(F\) test in the function Wasserstein_F. The R package fastfrechet should be automatically installed as part of distQTL; its development version is currently available for download from GitHub here.

disTQTL implements distQTL-mapping through the eponymous function distQTL (i.e. distQTL::distQTL(...)). The function expects a collection of data.table structure inputs, which contain

  1. single nucleotide polymorphism (SNP) genotype information for a set of donors;
  2. single cell RNA-sequence (scRNA-seq) expression data for a set of genes;
  3. fixed donor specific covariate vectors;
  4. cell-specific information, including donor ID label and cell type label; and
  5. gene/SNP information, including names/IDs, chromosome locations, and genotype start loci.

This vignette illustrates how these input objects should be structured. Other inputs, such as donor/gene/SNP filtering specifications and desired cell type groupings, are also explained and illustrated.

The main result of distQTL is a model p-value, testing the partial effect of a (cis-)SNP on gene expression given a set of fixed covariates. The full output object is a nested list structure. At the first list layer, results are split by the desired cell type groupings, e.g. “B cells” or “Monocytes”. Each of these groups then is a list of distQTL results for each gene under consideration. Each gene’s results consist of a vector of partial \(F\) test p-values (presented as \(\log_{10}(p)\) results by default), one for each (cis-)SNP tested. As genes can be differentially expressed across different cell types, some genes might be filtered out for some cell type groups due to low expression; gene level results generally correspond to the small set of SNPs located near each respective gene.

Inputs to distQTL

SNP Genotype, Gene Expression, Covariates, and Other Information

The distQTL function uses data.table objects to allow easier in-place memory usage. Denote the number of SNPs in the data set by nSNPs, the number of genes by nGenes, and the number of fixed covariates (except genotype) by nCovariates. The SNP genotype, gene expression, and covariate information required to perform distQTL are provided in the first three inputs:

  1. genotype: a data.table object with one row per donor, and 1 + nSNPs columns. One column must be labeled donorID, which contains unique donor IDs for the data set. The remaining columns must contain donor genotype coding per SNP - 0 for homozygous major/reference allele A/A, 1 for heterozygous genotype A/a, and 2 for homozygous minor/alternative allele a/a. These columns’ names should match the values given in the geneID column in the geneInfo input object, described below.
  2. geneExpression: a data.table object with one row per cell, and 2 + nGenes columns. One column must be labeled donorID, which contains the donor ID label per-cell, values matching from the donorID column in the genotype object. One column must be labeled cellType, which contains cell type label per-cell. The remaining columns must contain raw gene expression count measurements, e.g. values in {0, 1, 2, ...}. These columns’ names should match the values given in the geneID column in the geneInfo input object, described below.
  3. covariates: a data.table object with one row per donor, and 1 + nCovariates columns, with precisely as many donors as genotype; input will be reordered row-wise to match genotype. One column must be labeled donorID, which contains unique donor IDs for the data set. The remaining columns should contain desired donor-level covariate vectors, such as demographic information, genotype PCAs, PEER factors, batch information, etc. These columns should contain numeric data only, or data which can be coerced to numeric. Any factor type (or equivalent) data should be expanded into 0/1 encoding prior to input. These columns need not have meaningful names.

Information about the genes and SNPs being used is provided in the next two input objects. The main purpose of these inputs is to provide a list of gene and SNP names to match to the columns of the previous objects, for the sake of output naming and linking genes to their cis-SNPs.

  1. geneInfo: a data.table object with one row per gene, and 3 columns. One column must be labeled geneID, which contains unique gene identifiers (e.g. gene names, or Ensembl IDs). One column must be labeled chromosome, which contains unique chromosome identifiers for the genes; the values should match those used in the corresponding chromosome column of the snpInfo input object, described below. One column must be labeled start, which gives the starting base pair locations for the genes in the genome; the reference genome should match the one used in the corresponding start column in the snpInfo input object.
  2. snpInfo: a data.table object with one row per SNP, and 3 columns. One column must be labeled snpID, which contains unique SNP identifiers. One column must be labeled chromosome, which contains unique chromosome identifiers for the SNPs; the values should match those used in the corresponding chromosome column of the geneInfo input object. One column must be labeled start, which gives the starting base pair locations for the SNPs in the genome; the reference genome should match the one used in the corresponding start column in the geneInfo input object.

Gene and SNP chromosome labels should match in the below objects’ columns in order to be tested together. However, these columns, and the start columns, do not in principal have to contain accurate information. For instance, if you would like to test every gene-SNP combination, then you could do this by setting the chromosome columns all to the same number, and the start columns all to the same number, respectively. If you want to test cis-SNPs, as defined by SNPs within a certain base pair radius of a gene start location within the same chromosome, then you should ensure the chromosome and start labels are accurate.

Other Inputs

Other parameters control the cell, gene, donor, and SNP filtering steps which distQTL performs. Filtering can be done to remove genes with low expression, remove donors with low counts of cells, so on. The extra inputs that distQTL takes are:

  1. cellTypeGroups: an optionally named list object, where each element is a character vector containing cell types desired to be grouped together. These cell types should match from the cellType column in the geneExpression input, but need not be exhaustive of those entries. The list entries’ names will be used to label cell type group level outputs; we recommend the list entries’ names are short and descriptive of the groups they correspond to, e.g. cellTypeGroups = list(B_cells = c("..."), Monocytes = c("..."), ...).
  2. m: a positive integer (default 100) containing the empirical distribution grid density. For instance, for a given group of cells y satisfying a donor-cellType-gene combination, the response object is given by quantile(y, seq(from = 0.5 / m, to = 1 - 0.5 / m, length.out = m), type = 1).
  3. cisRange: a positive integer (default 1e5) specifying the range defining cis-SNPs around each gene’s starting location. cis-SNP range is defined by absolute difference in start values between genes in geneInfo and SNPs in snpInfo, conditioned on same chromosome values. If a SNP does not have the same chromosome label as a gene, comparing the snpInfo and geneInfo input objects, then a model will not be fit for that gene-SNP pair.
  4. minCells: a positive integer (default 10) specifying the minimum number of cells a donor must have in order to be included in regression. This is evaluated on a “within cell type group” basis.
  5. minExpr: a scalar between \((0, 1]\) (default 0.01) giving the minimum required proportion of cells, averaged across donors, that have positive gene expression. For instance, if minExpr = 0.10, and there are 2 donors with 7% and 11% positive gene expression among their cells, respectively, then the average positive expression across donors is \((7 + 11)/2 = 9 \leq 10\)%. This gene would then be considered “low expression”, and excluded from distQTL calculations.

In general, increasing m leads to longer computation time; asymptotically we approximate computation time to be quadratic in m. Increasing cisRange results in more gene-SNP pairs being tested; it may be set to +Inf to test all gene-SNP pairs within a chromosome, but be warned this will likely greatly increase computation time and possibly cause memory issues depending on your machine. Increasing minCells will remove donors with lower cell counts. While this leads to a sample with higher donor-level certainty, this also reduces the overall sample size. Setting this parameter too high or too low can both hurt power, and we do not have practical guidelines for tuning it. Finally, increasing minExpr implements a more stringent positive expression threshold for a gene to have any distQTL models performed on it. As in other regression contexts, genes with low expression can have lower signal, making detection challenging and making finite-sample results unstable and unreliable.

Using distQTL

Loading (Part of) the OneK1K Data Set

Part of the OneK1K data set has been included with the distQTL package, specifically expression data from a random selection of 20 genes, donor genotypes for their various cis-SNPs (defined as within a 200,000 base pair gene radius), and various supporting information which will be used by the distQTL function. Loading the package data generates the data objects as exactly required by distQTL.

data(package = "distQTL")

Let’s see the genontype data:

genotype[1:5, 1:5]
#>    donorID GSA-rs62119598 GSA-rs76662040 rs35863918 GSA-rs78580830
#>     <fctr>          <int>          <int>      <int>          <int>
#> 1:     1_1              0              1          0              0
#> 2:     2_2              0              0          0              0
#> 3:     3_3              0              0          0              0
#> 4:     4_4              0              0          0              0
#> 5:     6_6              1              0          0              0

Let’s see the expression data:

geneExpression[1:5, 1:4]
#>    donorID                  cellType ENSG00000079432 ENSG00000100422
#>     <fctr>                    <fctr>           <int>           <int>
#> 1: 686_687              naive B cell               0               0
#> 2: 692_693             memory B cell               0               0
#> 3: 692_693 transitional stage B cell               0               0
#> 4: 690_691             memory B cell               0               0
#> 5: 682_683              naive B cell               0               0

Finally, let’s see the covariate data:

covariates[1:5, 1:7]
#>    donorID          age        age2   sex       PC_1        PC_2       PC_3
#>     <char>        <num>       <num> <num>      <num>       <num>      <num>
#> 1:     1_1  0.002037722 -0.02523903     1  0.8328050  0.05130451  0.4790027
#> 2:     2_2 -0.031252544 -0.02203353     1 -1.4685475 -1.56720938 -0.4987103
#> 3:     3_3 -0.021461289 -0.02887455     1  0.5190437 -1.31829774 -0.6093993
#> 4:     4_4 -0.031252544 -0.02203353     1  0.6246254  1.10665899 -0.6088034
#> 5:     6_6  0.029453235  0.01478500     1 -0.3387446  0.61225647  1.0366904

The gene information geneInfo and snp information snpInfo contain gene and SNP IDs (respectively), chromosomes, and genome sequence locations (defined as the start location in the chromosome sequence; for OneK1K, these are with respect to the hg19 reference genome).

The cellTypeGroups object is a list containing the names of the types of cells each individual measurement comes from. The cellTypeGroups object included in the distQTL data is a two-part list object, the first containing B cell names:

  1. naive B cell,
  2. memory B cell,
  3. transitional stage B cell,

and the second containing monocyte cell names:

  1. CD14-positive monocyte,
  2. CD14-low, CD16-positive monocyte.

Finally, we define the other inputs which will be used in this example.

m = 200
cisRange = 2e5
minCells = 10
minExpr = 0.01

Fitting distQTL Models

Distribution-based QTL-mapping is performed on this subset of the OneK1K below.

t0 = Sys.time()
output = distQTL(genotype,
                 geneExpression,
                 covariates,
                 geneInfo,
                 snpInfo,
                 cellTypeGroups,
                 m,
                 cisRange,
                 minCells,
                 minExpr)
runtime = difftime(Sys.time(), t0, units = "sec")

This evaluated 3699 models in 194.4 seconds, for a model fit time of approximately 0.053 seconds per model.

Let’s look at results for one gene, in Monocytes:

DF = data.frame("pval" = round(10^(output[[2]][[3]]), 6))

# The gene:
names(output[[2]])[3]
#> [1] "ENSG00000107897"

# Top selection of SNPs:
head(DF[order(DF[ , 1]), , drop = FALSE])
#>                    pval
#> GSA-rs79028187 0.000001
#> GSA-rs12245098 0.029021
#> GSA-rs2429499  0.031396
#> rs7908050      0.040590
#> rs651134       0.049764
#> GSA-rs612898   0.049764