This vignette is an explanation on how to use NIPTeR. It starts with
a short introduction, the second chapter explains the two most important
objects in this package, NIPTSample
and
NIPTControlgroup
. The third chapter explains all the
analysis functions, the arguments the function take and their output.
The last chapter shows some workflow examples.
Non-Invasive Prenatal Testing (NIPT) is a method based on analysis of cell-free fetal (cff) DNA in maternal blood using next-generation sequencing. The basic approach is relatively straightforward: cell-free DNA is isolated from the mothers blood plasma and the obtained DNA fragments are sequenced. Subsequently the number of DNA fragments originating from the different chromosomes is counted, since in case of a fetal trisomy a relative increase of the fraction of reads of the trisomic chromosome is expected. Various analysis methods for trisomy prediction, variation reduction and quality control have been described in the literature. These methods can be combined to improve the sensitivity of the analysis. The NIPTeR package makes these methods available and allows for flexibility in workflow design. NIPTeR uses BAM files as input and calculates chromosomal fractions, which are used for trisomy prediction. These fractions are based on read counts of the different chromosomes. The chromosomes are divided in bins to allow for read count correction.
##Package The atomic unit of analysis in this package is a
NIPTSample
object
A NIPTSample
object is a named list of length 5, having
the following fields:
autosomal_chromosome_reads
A list containing the bin
count matrices for the autosomal chromosomes. If forward and reverse
reads are counted separately the list contains two matrices, if the
reads are counted together it contains one. In these matrices the
columns represent bins and the 22 rows represent the chromosomes. For a
small excerpt of such a matrix see Table 1.
correction_status_autosomal_chromosomes
Correction
status of the autosomal chromosomes. Default is
Uncorrected
sex_chromosome_reads
List containing bin count
matrices for the sex chromosomes. If forward and reverse reads are
counted separately the list contains two matrices, if the reads are
counted together it contains one. In these matrices the columns
represent bins and the two represent rows the chromosomes. For a small
excerpt of such a matrix see Table 2.
correction_status_sex_chromosome
Correction status
of the sex chromosomes. Default is Uncorrected
sample_name
The name of the sample. Default is the
filename of the .bam it originates from with the path prefix and .bam
suffix
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
---|---|---|---|---|---|---|---|---|
1 | 308 | 303 | 270 | 157 | 148 | 99 | 146 | 236 |
2 | 243 | 263 | 262 | 269 | 225 | 225 | 285 | 259 |
3 | 0 | 189 | 287 | 245 | 264 | 251 | 258 | 242 |
4 | 624 | 637 | 257 | 240 | 229 | 241 | 251 | 216 |
5 | 448 | 249 | 274 | 262 | 239 | 205 | 241 | 244 |
6 | 0 | 174 | 213 | 142 | 231 | 324 | 355 | 383 |
7 | 247 | 266 | 234 | 217 | 142 | 88 | 264 | 230 |
8 | 163 | 216 | 244 | 209 | 243 | 255 | 259 | 261 |
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
---|---|---|---|---|---|---|---|---|
X | 0 | 160 | 6 | 266 | 127 | 85 | 182 | 235 |
Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The panel of control samples to which the sample of interest can be
compared is also modelled in an object, NIPTControlGroup
. A
control group is a named list and composed of NIPTSample
objects. It has the following fields:
samples
The NIPTSample
objects
correction_status_autosomal_chromosomes
Correction
status of the autosomal chromosomes of the samples. Default is
Uncorrected
correction_status_sex_chromosomes
Correction status
of the sex chromosomes of the samples. Default is
Uncorrected
Description
Description of the control group. This
can either be Generic control group or Fitted to sample
name
To load a bam file, use the bin_bam_sample
function. The
bam file does not need to be sorted, but its recommended to use
pre-sorted bam files
The arguments bin_bam_sample
takes:
bam_filepath
The location and filename on the file
system where the bam file is stored.
do_sort
Sort the bam file? If the bam is unsorted
set to true, but the use of pre-sorted bam files is
recommended.
separate_strands
If set to true, reads from forward
and reverse strands are counted and stored separately. This option
should be used if you are planning on using regression, since this
doubles the number of predictors (F+R) and distributes predictive power
more equally over prediction sets since F and R strand from the same
chromosome cannot be both in one prediction set.
custom_name
The sample name. Default samplename is
the filename of the bam file without the .bam suffix and filepath
prefix.
GC content bias is the correlation between the number of reads mapped to a specific genomic region and the GC content of this region. In NIPTeR, two GC bias correction algorithms have been implemented, the LOESS based method introduced by Chen et al. (2011) and the bin weight based method described by Fan and Quake (2010).
nipt_object
The object that will be corrected. This
can either be a NIPTSample
or a
NIPTControlGroup
object.
method
To select the LOESS based method use
“LOESS”, to select the bin weights based method use
“bin”.
include_XY
Also perform correction on the X and Y
chromosomes?
span
The span for the LOESS fit. Only applicable
when LOESS method is used.
ref_genome
The reference genome used. Either
“hg37” or “hg38” default = “hg37”
The output of the gc_correct function is a similar object as the
input; when the input is a NIPTSample
the output will also
be a (corrected) NIPTSample
, and the same applies for a
NIPTControlGroup
object. . After correction the correction
status of the object is set to GC_corrected.
Code snippit:
#Correct NIPTSample object using LOESS method
loess_corrected_sample <- gc_correct(nipt_object = sample_of_interest, method = "LOESS",
include_XY = F, span = 0.75)
#Correct NIPTControlGroup object using bin method
gc_bin_corrected_control_group <- gc_correct(nipt_object = control_group, method = "bin",
include_XY = T)
The matchcontrolgroup
function determines how well an
NIPTSample
fits within the NIPTControlGroup
and, if needed, makes a subset NIPTControlGroup
of length
n.
matchcontrolgroup(nipt_sample, nipt_control_group, mode, n_of_samples,
include_chromosomes = NULL, exclude_chromosomes = NULL)
Explanation of the arguments matchcontrolgroup
takes:
nipt_sample
The NIPTSample
object that
is the focus of the analysis.
nipt_control_group
The NIPTControlGroup
object used in the analysis.
mode
The function mode. This can either be
“subset” or “report”. Mode “subset” means the
return value will be a new NIPTControlGroup
object
containing n samples. When mode “report” is used the
output is a matrix containing the sum of squares score of the
differences between the chromosomal fractions of the sample and the
control for every control sample, sorted in increasing score.
n_of_samples
The length of the resulting
NIPTControlGroup. Only applicable if mode “subset” is
used.
include_chromosomes
Include potential trisomic
chromosomes into the comparison? Default = NULL, meaning chromosomes 13,
18 and 21 are not included.
exclude_chromosomes
Exclude other autosomal
chromosomes besides chromosomes 13, 18 and 21? Default = NULL.
The output for mode subset is a new
NIPTControlGroup
composed of n samples. The output
for mode report is a matrix with a single column containing the
sum of squares in ascending order.
Sum_of_squares | |
---|---|
Sample 6 | 6.766e-07 |
Sample 8 | 2.159e-06 |
Sample 1 | 3.101e-06 |
Sample 10 | 3.292e-06 |
Sample 9 | 3.588e-06 |
Sample 3 | 3.839e-06 |
Sample 5 | 4.432e-06 |
Sample 2 | 7.892e-06 |
Sample 7 | 9.737e-06 |
Sample 4 | 1.126e-05 |
Code snippit:
#Run matchcontrolgroup in mode "report"
scores_control_group <- matchcontrolgroup(nipt_sample = gc_LOESS_corrected_sample,
nipt_control_group = gc_LOESS_corrected_control_group,
mode = "report", include_chromosomes = c(13,18))
#Run matchcontrolgroup in mode "subset" and select 50 best matching samples
subset_loess_corrected_control_group <- matchcontrolgroup(nipt_sample = gc_LOESS_corrected_sample,
nipt_control_group = loess_corrected_control_group,
mode = "subset", n_of_samples = 50)
The chi-squared based variation reduction identifies overdispersed
bins within the control group and corrects these bins in both the sample
of interest and the control group. The function takes in a
NIPTSample
and a NIPTControlGroup
object, both
to be corrected. For every corresponding bin in the control group a
chi-squared score is calculated and this total score is converted to a
normal distribution. Corresponding bins with a normalized score above
chi_cutoff (default 3.5) are corrected by dividing the number
of reads by the total chi-squared score divided by degrees of
freedom
nipt_sample
The NIPTSample
object that
is the focus of the analysis.
nipt_control_group
The NIPTControlGroup
object used in the analysis.
chi_cutoff
The normalized cut-off
threshold.
include_XY
Also perform correction on the X and Y
chromosomes?
The output of this function is a named list containing two fields,
the first field is the corrected sample ($sample
) and the
second the corrected control group ($control_group
).
Code snippit:
#Apply chi-squared based variation reduction method
chi_corrected_data <- chicorrect(nipt_sample = gc_LOESS_corrected_sample,
nipt_control_group = subset_loess_corrected_control_group)
#Extract sample and control group
loess_chi_corrected_sample <- chi_corrected_data$sample
subset_loess_chi_corrected_control_group <- chi_corrected_data$control_group
In the Z-score approach, introduced by Chiu et al in 2008, the
chromosomal fraction of interest of a sample is compared to the
chromosomal fractions of interest of the reference samples, the
NIPTControlGroup
object.
This function takes three arguments:
nipt_sample
The NIPTSample
object that
is the focus of the analysis.
nipt_control_group
The NIPTControlGroup
object used in the analysis.
chromo_focus
The chromosome of interest. Most
commonly chromosome 13, 18 or 21. However, every autosomal chromosome
can be predicted.
The output of the function is an object of class
ZscoreResult
. It is a named list containing seven
fields:
sample_Zscore
A numeric, The Z score for the sample
of interest for the sample of interest.
control_group_statistics
Named num of length 3, the
first field being the mean (name mean), the second field is the standard
deviation (name SD) and the third field is the P value of the
Shapiro-Wilk test (name Shapiro_P_value).
control_group_Zscores
Matrix containing the Z scores
of the chromosome of interest for all used control samples.
focus_chromosome
The chromosome of interest. Most
commonly chromosome 13, 18 or 21. However, every autosomal chromosome
can be predicted.
control_group_sample_names
The sample names of all
control group samples used in the analysis.
correction_status
The correction status of the
control group.
sample_name
The sample_name of the sample of
interest.
Code snippit:
The regression based Z-score builds n models with m predictors using stepwise regression with forward selection. The models are used to predict the chromosomal fraction of interest, for the sample and for the control group. The observed fractions are then divided by the expected fraction, and Z-scores are calculated over these ratios. The Z-score is calculated by subtracting one from the ratio of the sample and dividing this result by the coefficient of variation. The coefficient of variation (CV) can either be the Practical or Theoretical CV. The Theoretical CV is the standard error multiplied by the overdispersion. Theoretically, the CV cannot be lower than the standard error of the mean. If it is case the CV is lower than Theoretical CV, then the Theoretical CV is used.
perform_regression(nipt_sample, nipt_control_group, chromo_focus,
n_models = 4, n_predictors = 4, exclude_chromosomes = NULL,
include_chromosomes = NULL, use_test_train_set = T,
size_of_train_set = 0.6, overdispersion_rate = 1.15,
force_practical_vc = F)
nipt_sample
The NIPTSample
object that
is the focus of the analysis.
nipt_control_group
The NIPTControlGroup
object used in the analysis.
chromo_focus
The chromosome of interest. Most
commonly chromosome 13, 18 or 21. However, every autosomal chromosome
can be predicted.
n_models
Number of linear models to be made. Default
setting is 4 models.
n_predictors
The number of predictors each model
contains. Default is 4.
exclude_chromosomes
Exclude which autosomal
chromosomes as potential predictors? Default potential trisomic
chromosomes 13, 18 and 21 are exluded.
include_chromosomes
Include potential trisomic
chromosomes? Options are: chromosomes 13, 18 and 21.
use_test_train_set
Use a test and train set to build
the models? Default is TRUE.
size_of_train_set
The size of the train set
expressed in a decimal. Default is 0.6 (60% of the control
samples).
overdispersion_rate
The standard error of the mean
is multiplied by this factor.
force_practical_vc
Ignore the theoretical CV and
always use the practical CV.
The output of this function is an object of type
RegressionResult
, a named list containing:
prediction_statistics
A dataframe with 7 rows and a
column for every model. Table 3 is an example of such a dataframe.Prediction_set_1 | Prediction_set_2 | |
---|---|---|
Z_score_sample | -0.171248675025602 | -0.799303637284865 |
CV | 0.00615122686158584 | 0.00599131304613039 |
cv_types | Practical_CV | Practical_CV |
P_value_shapiro | 0.583158688871931 | 0.503865450973465 |
Predictor_chromosomes | 11 2 5 12 | 10 19 17 16 |
Mean_test_set | 1.00050469190991 | 1.00011356353248 |
CV_train_set | 0.00335713985926308 | 0.00394053718934269 |
control_group_Zscores
A matrix containing the
regression based Z-scores for the control sample
focus_chromosome
The chromosome of interest. Most
commonly chromosome 13, 18 or 21. However, every autosomal chromosome
can be predicted.
correction_status
The correction status of the
control group autosomes.
control_group_sample_names
The sample names of the
test set group.
models
List of the summary.lm() output for every
model.
potential_predictors
The total pool of chromosomes
where the predictors are selected from.
all_control_group_Z_scores
Z-scores for every sample
using theoretical and practical VCs.
additional_statistics
Statistics for both the
practical and theoretical CVs for every prediction set.
The Normalized Chromosome Value or NCV (Sehnert et al., 2011)
method selects a subset of chromosomes to calculate the chromosomal
fractions. The ‘best’ subset is the set which yields the lowest
coefficient of variation for the chromosomal fractions of the chromosome
of interest in the control group. Because a brute force approach is used
to determine the best subset, which can be computationally
intensive,this method is divided into two functions,
prepare_ncv
and calculate_ncv
.
prepare_ncv
returns a template object (NCVTemplate) for a
given chromosome of interest and the control group used. This template
can be used for any number of analyses. If the control group or
chromosome of interest changes, a new template must be made.
calculate_ncv
uses a NCVTemplate and a NIPTSample to
calculate the NCV score for the NIPTSample.
prepare_ncv(nipt_control_group, chr_focus, max_elements,
exclude_chromosomes = NULL, include_chromosomes = NULL,
use_test_train_set = T, size_of_train_set = 0.6)
nipt_control_group
The NIPTControlGroup
object used in the analysis.
chr_focus
The chromosome of interest. Most commonly
chromosome 13, 18 or 21. However, every autosomal chromosome can be
predicted.
max_elements
The maximum number of denominator
chromosomes.
exclude_chromosomes
Exclude which autosomal
chromosomes as potential predictors? Default potential trisomic
chromosomes 13, 18 and 21 are exluded.
include_chromosomes
Include potential trisomic
chromosomes? Options are: chromosomes 13, 18 and 21.
use_test_train_set
Use a test and train set to to
build the models? Default is TRUE.
size_of_train_set
The size of the train set
expressed in a decimal. Default is 0.6 (60% of the control group
samples).
The output of this function is an object of class NCVTemplate, a named list.
denominators
The set of ‘best’
denominators.
focus_chromosome
The chromosome of interest used for
this NCVTemplate
object.
control_group_sample_names
The sample names of the
test set samples.
correction_status
The correction status of the
NIPTControlGroup
used.
control_group_Zscores
The NCV scores for the test
set samples.
potential_denominators
The total pool of
denominators the best denominators are selected from.
control_group_statistics
Named num of length 3, the
first field being the mean (name mean), the second field is the standard
deviation (name SD) and the third field is the P value of the
Shapiro-Wilk test (name Shapiro_P_value).
sample_names_train_set
The names of the samples used
in the train set.
train_set_statistics
Named num of length 3, the
first field being the mean of the train set (name mean), the second
field is the standard deviation of the train set (name SD) and the third
field is the P value of the Shapiro-Wilk test of the train set (name
Shapiro_P_value).
train_set_Zscores
Matrix containing the Z scores for
the train set samples.
nipt_sample
The NIPTSample
object that
is the focus of the analysis.
ncv_template
An object of class NCVTemplate that is
produced by function prepare_ncv
The output of this function is an object of class NCVResult, which is basically the same as a NCVTemplate object, the only difference being the NCV score of the sample appended to the list.
##Control group functions
This functions converts a list of NIPTSample
objects to
a NIPTControlGroup
object and return this object.
1 nipt_samples
A list of NIPTSample
objects.
This functions adds NIPTSample
objects to an existing
control group and returns a new NIPTControlGroup
object.
nipt_control_group
The NIPTControlGroup
to add the samples to.
samples_to_add
A list with sample(s) to add. This
always needs to be a list.
This functions removes duplicate samples from the control group based
on name. It returns a new NIPTControlGroup
object.
nipt_control_group
The NIPTControlGroup
object to remove duplicates from. This function removes a sample from the NIPTControlGroup
object by name. Note that this function uses a regular expression, and
if more sample_names satisfy the regular expression, they will also be
removed. It returns a new NIPTControlGroup
object.
samplename
samplename Regular expression string. All
matching samplenames are removed from the control group.
nipt_control_group
NIPTControlGroup
object where the samples are removed from.
This function computes a regular Z-score for every chromosome of
every sample in the NIPTControlGroup
object. It returns a
named list with diagnostics information.
nipt_control_group
The NIPTControlGroup
to
diagnose.The function returns a named list with 3 fields:
Z_scores
A matrix containing Z-scores for every
sample and every chromosome.
abberant_scores
Dataframe with samplename and
chromosome of Z-scores outside -3 3 range.
control_group_statistics
Matrix with mean, standard
deviation and P value of Shapiro-Wilk test.
#Examples
##Preparation
Since NIPT is based on the comparison of the sample of interest to a
set of samples with no known aneuploidies, the first thing that should
be done before any samples can be analyzed is making this
NIPTControlGroup
. A NIPTControlGroup
is an
object and is composed of NIPTSample
objects. First, all
bam file paths have to be collected in a vector. This tutorial assumes
all .bam files are in the same directory. To collect all bam files the
following command can be used:
bam_filepaths <- list.files(path = "/Path/to/bamfiles/", pattern = ".bam", full.names = T)
Subsequently, the filepaths must be fed to the function
bin_bam_sample
. This function loads bam files and converts
them to NIPTSample
objects. For example:
list_NIPT_samples <- lapply(X = bam_filepaths, bin_bam_sample, do_sort = FALSE,
separate_strands = T, custom_name = NULL)
In this example we count the forward and reverse reads separately and a pre-sorted bam file is used, so the argument do_sort is ommited since the default is F.
Using the above command all samples will be loaded with the same
default settings. To load the files with for instance custom names,
store the filenames in a vector of the same length as the bam filepath
vector and use mapply
. For example:
list_NIPT_samples <- mapply(FUN = bin_bam_sample, bam_filepaths, name = names, SIMPLIFY = F)
To convert the list of NIPTSamples
to a
NIPTControlGroup
object use:
control_group <- as_control_group(nipt_samples = list_NIPT_samples)
of course, the fastest and simplest way is to wrap the
as_control_group
function around the loading of samples
like this:
control_group <- as_control_group(nipt_samples = lapply(X = bam_filepaths, bin_bam_sample,
do_sort = F, separate_strands = T))
or when applying custom names:
control_group <- as_control_group(nipt_samples = mapply(FUN = bin_bam_sample, bam_filepaths,
custom_name = names, SIMPLIFY = FALSE))
To store the control group on disk for later use:
saveRDS(object = control_group, file = "/Path/to/directory/control_group.rds")
So, in conclusion, the data control group preparation script might
look something like this:
library(NIPTeR)
#Retrieve filenames
bam_filepaths <- list.files(path = "/Path/to/bamfiles/", pattern = ".bam", full.names = T)
#Load files and convert to control group
control_group <- as_control_group(nipt_samples = lapply(X = bam_filepaths, bin_bam_sample,
do_sort = F, separate_strands = FALSE))
#Save control group for later
saveRDS(object = control_group, file = "/Path/to/directory/control_group.rds")
Assuming a NIPTControlGroup
object has been prepared and
stored on disk we can analyse our first sample. The script starts with
loading the NIPTeR
package:
library(NIPTeR)
After that, we can load a sample of interest with the
bin_bam_sample
function:
sample_of_interest <- bin_bam_sample(bam_filepath = "/path/to/sample.bam", separate_strands = T)
library(NIPTeR)
#Gather all bam filepaths in a vector. Corresponds to 1a in figure
bam_filepaths <- list.files(path = "/Path/to/bamfiles/", pattern = ".bam", full.names = T)
#Load all bam files using lapply and feed the results to function as_control_group,
#converting the NIPTSamples to a NIPTControlGroup object. Corresponds to 1b in figure
control_group <- as_control_group(nipt_samples = lapply(X = bam_filepaths, bin_bam_sample,
do_sort = F, separate_strands = F))
#apply a gc LOESS correction to all samples. Since this can take up to 30 seconds
#sample, doing this once for a control group and store the results can be rewarding
#in terms of analysis time. Corresponds to 2a in figure
gc_control_group <- gc_correct(nipt_object = control_group, method = "LOESS")
#Retrieve control group diagnostics. Corresponds with 3a in figure
control_group_diagnostics <- diagnose_control_group(nipt_control_group = control_group)
#Retrieve samplenames with an abberant Z score for any chromosome and remove these samples
#from the control group. Corresponds with 3b in figure
abberant_sample_names <- unique(control_group_diagnostics$abberant_scores$Sample_name)
for (i in 1:length(abberant_sample_names)){
control_group <- remove_sample_controlgroup(samplename = abberant_sample_names[i],
nipt_control_group = control_group)
}
#Save the gc_corrected control groups to disk. Corresponds to 4a in figure
saveRDS(object = control_group, file = "/path/to/controlgroup.rds")
library(NIPTeR)
#Load sample. Corresponds with 1a in figure
sample_of_interest <- bin_bam_sample(bam_filepath = "/Path/to/bamfile.bam", separate_strands = T)
#Load control group. Corresponds with 1b in figure
control_group <- readRDS("/Path/to/control_group.rds")
#Perform a chi-square based variation reduction on new trimmed control group and sample and
#extract data. Corresponds with 2a in figure
chi_data <- chi_correct(nipt_sample = sample_of_interest, nipt_control_group = control_group)
sample_of_interest <- chi_data$sample
control_group <- chi_data$control_group
#Perform regression for chromosome 21 with default settings, so:
#Create 4 models with 4 predictors each
#All chromosomes are potential predictors except the potential trisomic chromosomes 13, 18 and 21
#Use a test and train set where the size of the train set is 60% of the control group
#Assume at least 15% overdispersion in the data
#Dont force practical CV, so if the CV is below 1.15 * standard error of the mean use this as VC
#Corresponds with 3a in figure
regression_score_21 <- perform_regression(nipt_sample = sample_of_interest,
nipt_control_group = control_group, chromo_focus = 21)
### Gather relevant data from the objects on the workspace #######
An example of a simple workflow, including a GC correction, selection of a control group fitted to the sample, a chi-squared based variation reduction, automatic control group managing and trisomy prediction using the Z-score, NCV and the regression-based Z-score is shown in figure 3. Firstly, bin counts are made for the analysed sample (1a) and a relevant control group is loaded (1b). Secondly, variability is reduced by LOESS gc correcting the sample of interest and all samples in the control group (2a), followed by a selection of the 80% of control samples that best match the control sample for further analysis of the sample of interest (2b). Using the control group, both the analysed sample and the control samples are than chi_corrected (2c). Thirdly, the resulting control group is checked for samples having aberrant scores (3a) and if present, those samples are removed (3b). Fourthly, a Z-score is calculated for chromosomes 13, 18 and 21 (4a), as well as a regression-based Z-score (4b). Furthermore, the best denominators for NCV calculation of chromosomes 13, 18 and 21 are determined (4c) and the NCV is calculated for those chromosomes (4d). Results can be gathered from the workspace.
library(NIPTeR)
#Load sample. Corresponds with 1a in figure
sample_of_interest <- bin_bam_sample(bam_filepath = "/Path/to/bamfile.bam")
#Load control group. Corresponds with 1b in figure
control_group <- readRDS("/Path/to/control_group.rds")
#Peform a GC correction type bin for the sample and the controlgroup
#Corresponds with 2a in figure
sample_of_interest <- gc_correct(nipt_object = sample_of_interest, method = "bin")
control_group <- gc_correct(nipt_object = control_group, method = "bin")
#Trim control group by only selecting 80% of best matching control samples
#Corresponds with 2b in figure
control_group <- match_control_group(nipt_sample = sample_of_interest, nipt_control_group = control_group,
mode = "subset", n_of_samples = round(length(control_group$samples) *.8,
digits = 0))
#Perform a chi-square based variation reduction on new trimmed control group and sample and
#extract data. Corresponds with 2c in figure
chi_data <- chi_correct(nipt_sample = sample_of_interest, nipt_control_group = control_group)
sample_of_interest <- chi_data$sample
control_group <- chi_data$control_group
#Retrieve control group diagnostics. Corresponds with 3a in figure
control_group_diagnostics <- diagnose_control_group(nipt_control_group = control_group)
#Retrieve samplenames with an abberant Z score for any chromosome and remove these samples
#from the control group. Corresponds with 3b in figure
abberant_sample_names <- unique(control_group_diagnostics$abberant_scores$Sample_name)
for (i in 1:length(abberant_sample_names)){
control_group <- remove_sample_controlgroup(samplename = abberant_sample_names[i],
nipt_control_group = control_group)
}
#Calculate Z score from chromosomes 13, 18 and 21. Corresponds with 4a in figure
z_score_13 <- calculate_z_score(nipt_sample = sample_of_interest,
nipt_control_group = control_group, chromo_focus = 13)
z_score_18 <- calculate_z_score(nipt_sample = sample_of_interest,
nipt_control_group = control_group, chromo_focus = 18)
z_score_21 <- calculate_z_score(nipt_sample = sample_of_interest,
nipt_control_group = control_group, chromo_focus = 21)
#Perform regression for all potential trisomic chromosomes with default settings, so:
#Create 4 models for every potential trisomic chromosome with 4 predictors each
#All chromosomes are potential predictors except the potential trisomic chromosomes 13, 18 and 21
#Use a test and train set where the size of the train set is 60% of the control group
#Assume at least 15% overdispersion in the data
#Dont force practical CV, so if the CV is below 1.15 * standard error of the mean use this as VC
#Corresponds with 4c in figure
regression_score_13 <- perform_regression(nipt_sample = sample_of_interest,
nipt_control_group = control_group, chromo_focus = 13)
regression_score_18 <- perform_regression(nipt_sample = sample_of_interest,
nipt_control_group = control_group, chromo_focus = 18)
regression_score_21 <- perform_regression(nipt_sample = sample_of_interest,
nipt_control_group = control_group, chromo_focus = 21)
#Get NCVTemplates for all potential trisomic chromosomes with max 9 denominators and default settings, so:
#All autosomals chromosomes are potential predictors, except the potential trisomic chromosomes 13, 18 and 21
#Use a test and train set where the size of the train set is 60% of the control group
#Corresponds with 4c in figure
new_ncv_template_13 <- prepare_ncv(nipt_control_group = control_group, chr_focus = 13, max_elements = 9)
new_ncv_template_18 <- prepare_ncv(nipt_control_group = control_group, chr_focus = 18, max_elements = 9)
new_ncv_template_21 <- prepare_ncv(nipt_control_group = control_group, chr_focus = 21, max_elements = 9)
#Use the NCVTemplates to get NCV scores for the sample of interest
#Corresponds with 4d in figure
ncv_score_13 <- calculate_ncv_score(nipt_sample = sample_of_interest, ncv_template = new_ncv_template_13)
ncv_score_18 <- calculate_ncv_score(nipt_sample = sample_of_interest, ncv_template = new_ncv_template_18)
ncv_score_21 <- calculate_ncv_score(nipt_sample = sample_of_interest, ncv_template = new_ncv_template_21)
### Gather relevant data from the objects on the workspace #######
#Contact For information contact: Lennart Johansson: [email protected] Morris Swertz: [email protected]