Tip
With two processors available, the next step takes about 15 minutes to run, depending on your machine.
For a quick demonstration set smaller `mcmc_length` or `iterations`, but expect
results to be unconverged.
``getTimeTreesIterate`` will run BEAST2 on each clone in parallel
(here, ``nproc=2``, so 2 clones at a time).
.. code-block:: r
# edit to your BEAST installation path
beast <- "/Applications/BEAST 2.7.7/bin/"
# estimate clock rate of GC B cells
# if you don't care about convergence, reduce mcmc_length
# ensure you are providing the correct path to the template file downloaded earlier (see Requirements)
gctree = getTimeTreesIterate(gcf,
beast=beast,
template="StrictClock_Standard_EmpFreq.xml",
dir="temp",
id="gc_strict",
time="sample_time",
mcmc_length=1e6,
iterations=10,
nproc=2,
CLOCK_RATE_INIT=0.001,
KAPPA_PRIOR_M=0.67,
KAPPA_PRIOR_S=0.2,
ignore=c("freqParameter"))
gcrate_tree = mean(sapply(gctree$parameters, function(x)filter(x,item=="geneticClockRate")$mean))
print(gcrate_tree)
.. code-block:: r
[1] 0.000363
If it is not feasible to run a strict clock analysis, you can use the slope from a
root-to-tip regression. Here, we estimate the clock rate of germinal center B cells
using a root-to-tip regression.
.. code-block:: r
gcrate_slope = mean(correlationTest(gctrees, time="sample_time")$slope)
print(gcrate_slope)
.. code-block:: r
[1] 0.0003686277
Run getTimeTreesIterate with a TyCHE template
*********************************************
We can now run a type-linked TyCHE model using the estimated GC rate. Here, we
use the ``TypeLinkedExpectedOccupancy`` model, which uses an expected
occupancy method to determine the proportion of each branch in each state.
Features of this template:
- Allows estimation of type-linked clock rates:
- we provide values of the mean (``TRAIT_RATE_MEAN_1``, ``TRAIT_RATE_MEAN_2``)
and sigma (``TRAIT_RATE_SIGMA_1``, ``TRAIT_RATE_SIGMA_2``)
for the prior normal distributions of each clock rate.
- Uses empirical nucleotide frequencies as the equilibrium frequencies.
- Dowser will automatically calculate these frequencies from the input sequences.
- Recommended for most data, especially BCRs.
``getTimeTreesIterate`` is designed to run each analysis iteratively, checking for
convergence after each iteration. If the analyses converge before
reaching the max iterations, it will stop early. It will run each analysis for ``mcmc_length``
MCMC samples (here, ``1e6``), and it will repeat this up to ``iterations`` times (here, ``20``),
so here we have a maximum of ``2e7`` MCMC samples.
The convergence check is based
on the ESS of the parameters reported in the log files. You can exclude parameters
from this ESS check using the ``ignore`` argument (here, we ignore ``freqParameter``, as it
is a fixed value).
.. raw:: html
Tip
This step takes about 30 minutes to run, depending on your machine.
For a quick demonstration set smaller `mcmc_length` or `iterations`, but expect
results to be unconverged.
.. code-block:: r
mixed_trees <- getTimeTreesIterate(
clones,
beast = beast,
template = "TypeLinkedExpectedOccupancy_EstTraitClockRates_EmpFreq.xml",
trait = trait,
time = "sample_time",
dir = "temp",
id = "tyche_eo_est",
log_every = "auto",
nproc = 2,
KAPPA_PRIOR_M = 0.67,
KAPPA_PRIOR_S = 0.2,
TRAIT_RATE_MEAN_1 = gcrate_tree,
TRAIT_RATE_MEAN_2 = 0.000001,
TRAIT_RATE_SIGMA_1 = gcrate_tree * 0.01,
TRAIT_RATE_SIGMA_2 = 0.001,
RATE_INDICATORS = "1 0",
TYPE_SWITCH_INIT = 0.005,
TYPE_SWITCH_ALPHA = 0.001,
TYPE_SWITCH_BETA = 5.0,
TRANSITION_RATE_1_INIT = 1.0,
TRANSITION_RATE_2_INIT = 1.0,
TRANSITION_RATE_ALPHA_1 = 0.1,
TRANSITION_RATE_BETA_1 = 1.0,
TRANSITION_RATE_ALPHA_2 = 0.1,
TRANSITION_RATE_BETA_2 = 1.0,
log_target = 2000,
mcmc_length = 1e6,
ignore = c("freqParameter"),
iterations = 20
)
``getTimeTreesIterate`` will run BEAST2 on each clone in parallel
(here, ``nproc=2``, so 2 at a time).
To capture sufficient information about the posterior distribution while keeping
log files from becoming overly large or unwieldy, we provide the option to set
``log_every="auto"``. This will automatically set the logging frequency based on
the ``mcmc_length`` and ``log_target`` (here, ``2000``, so we aim to have around 2000
samples in the log file). You can also set a fixed logging frequency by providing
an integer value.
The rate indicators (``RATE_INDICATORS``) specify which types can transition to each
other. In a primary immune response we recommend setting this to ``"1 0"``, as GC
B cells can transition to other tissues, but not vice versa. If your data comprises
chronic infections or repeated vaccinations, you may want to allow transitions in both
directions, so you would set this to ``"1 1"``. Note: types are always sorted ASCII alphabetically.
You can specify an initial value for the type switch clock rate (``TYPE_SWITCH_INIT``), which
describes the rate of switching between types. You can also specify the shape alpha
(``TYPE_SWITCH_ALPHA``) and the rate beta (``TYPE_SWITCH_BETA``) of the gamma prior distribution
on the type switch clock rate. Be sure to set the initial value within a reasonable range given
the prior.
You can also specify an initial value for the relative transition rates between types
(``TRANSITION_RATE_1_INIT`` and ``TRANSITION_RATE_2_INIT``), as well as the shape alpha
(``TRANSITION_RATE_ALPHA_1`` and ``TRANSITION_RATE_ALPHA_2``) and the rate beta
(``TRANSITION_RATE_BETA_1`` and ``TRANSITION_RATE_BETA_2``) of the gamma prior distributions of
the relative transition rates. We recommend setting the same prior for each transition
rate except in rare cases. Be sure to set the initial values within a reasonable range given
the prior.
The prior distribution on the transition/transversion rate ratio kappa is a lognormal with
parameters specified by ``KAPPA_PRIOR_M`` and ``KAPPA_PRIOR_S``. Kappa is used by the
nucleotide substitution model, and we recommend these values for BCR analyses.
See ``?getTimeTreesIterate`` and TyCHE and BEAST2 documentation for more details.
Visualize the results
*********************
After the analyses have converged, you can visualize the time trees.
Note: plotTrees sets a default value for the scale bar of 0.01, which is appropriate
for trees with genetic distance branch lengths (mutations per site), but time trees typically
require a larger scale bar. In this case, we know the data spans 200 time units,
so we set ``scale=10`` to make the scale bar more visually interpretable.
.. code-block:: r
plotTrees(mixed_trees, scale=10)[[1]] + geom_point(aes(fill=location), pch=21, size=3)
.. image:: _static/Building-Time-Trees-mixed-trees.png
The ``parameters`` column of ``mixed_trees`` contains a table that collates the output from the BEAST2 analysis. Columns include the parameter (item), the posterior mean, standard error, standard deviation, median, 95% highest posterior density interval, autocorrelation time (ACT), effective sample size, and geometric mean.
The effective sample size (ESS) of a parameter is a measure of how much independent information your MCMC sample contains. Even though MCMC generates many samples, they are typically autocorrelated—each sample depends on the previous one. ESS is the number of independent draws from the target distribution with the same estimation power and can be thought of as the sample size for that parameter.
A higher ESS means your sample more reliably represents the posterior distribution. Low ESS indicates that you need more information, which can indicate the need for longer MCMC runs. We typically recommend an ESS of at least 200 for each estimated parameter.
The autocorrelation time (ACT) of a parameter measures how strongly each sample in the MCMC chain depends on previous samples. The ACT tells you how long the MCMC chain takes to produce a roughly independent sample. ACT is inversely related to ESS.
If we're interested in the estimated tree height, we can filter the parameters table for ``TreeHeight``:
.. code-block:: r
print(mixed_trees$parameters[[2]] %>% filter(item=="TreeHeight"))
.. code-block:: r
item mean stderr stddev median X95.HPDlo X95.HPDup ACT ESS geometric.mean
1 TreeHeight 236.0621 1.595335 19.36631 234.2536 201.9269 275.1676 122150.1 147.3636 235.2881
Our model likely hasn't converged, with multiple parameters having ESS values below 200, particularly the posterior which describes how well the model has converged as a whole. The mean tree height is around 236 time units, with a 95% highest posterior density interval from about 200 to 275 time units. Since we know the data spans 200 time units, this is a high estimate, but this is unsurprising given that the ESS is below 200 and the analysis has likely not converged.
The parameters available will depend on the model you used and what is specified for
logging in the XML template. In this case, we can see all the items that were logged:
.. code-block:: r
print(mixed_trees$parameters[[2]]$item)
.. code-block:: r
[1] "posterior" "likelihood"
[3] "prior" "treeLikelihood.tyche_eo_est_2"
[5] "TreeHeight" "rateIndicator.type.1"
[7] "rateIndicator.type.2" "relativeGeoRates.type.1"
[9] "relativeGeoRates.type.2" "typeSwitchClockRate"
[11] "kappa.tyche_eo_est_2" "BayesianSkyline"
[13] "bPopSizes.1" "bPopSizes.2"
[15] "bPopSizes.3" "bPopSizes.4"
[17] "bPopSizes.5" "bGroupSizes.1"
[19] "bGroupSizes.2" "bGroupSizes.3"
[21] "bGroupSizes.4" "bGroupSizes.5"
[23] "freqParameter.tyche_eo_est_2.1" "freqParameter.tyche_eo_est_2.2"
[25] "freqParameter.tyche_eo_est_2.3" "freqParameter.tyche_eo_est_2.4"
[27] "traitfrequencies.type.1" "traitfrequencies.type.2"
[29] "typeLinkedRates.1" "typeLinkedRates.2"
These include the posterior, likelihood, and prior probabilities of the full model; the tree likelihood; estimated values of the tree height, the clock rates for each type (``typeLinkedRates``), the relative transition rates between types (``relativeGeoRates``), the rate of switching types (``typeSwitchClockRate``); parameters relating to BayesianSkyline (``BayesianSkyline``, ``bPopSizes``, ``bGroupSizes``); the transition/transversion rate ratio kappa for the HKY substitution model; and some fixed parameters that are included in logging for record-keeping convenience (i.e., the empirical frequencies of the nucleotides and the frequencies of the types).
If you want to revisit an analysis and no longer have the ``mixed_trees`` object
in your R environment, you can use ``readBEAST`` to read in the BEAST log and tree
files from the directory (``dir``) you specified in ``getTimeTreesIterate``. Because of this,
it is important to always specify a unique combination of ``dir`` and ``id`` for each analysis.
.. code-block:: r
mixed_trees <- readBEAST(clones, dir="temp", id="tyche_eo_est", beast=beast, trait=trait)
See ``?readBEAST`` for more details.
.. raw:: html
Tip
You can find all of BEAST's output files, including the trees, logs, console logs, and TreeAnnotator outputs, in the `dir` you specified (here, "temp"). You can view these files using BEAST tools such as Tracer.