In this vignette we describe how the `phylosamp`

package can be used in reverse, to calculate the necessary sample size to achieve a pre-specified false discovery rate. For example, this calculation would be useful if a researcher is conducting a study for which specimens have yet to be collected, but they want to ensure at least 75% confidence in identified links. To calculate the proportion of links that represent true transmission pairs, the user needs to provide the sensitivity and specificity of the linkage criteria used, the final size of the outbreak, and the desired minimum true discovery rate (\(1-\text{False Discovery Rate}\)):

Imagine we are interested in collecting enough samples from an outbreak of 100 cases to ensure that the identified links are correct at least 75% of the time. The phylogenetic criteria we are using to identify links has a sensitivity of 99% and a specificity of 99.5%:

`library(phylosamp)`

`samplesize(eta=0.99, chi=0.995, N=100, R=1, phi=0.75)`

`## [1] 10`

Although 10 samples will ensure a false discovery rate of no more than 25%, 10 samples may not provide enough data for analysis. In this case, we can use the optional `min_pairs`

parameter to require that the expected number of links (calculated using the `exp_links`

function) is at least some minimum value:

`samplesize(eta=0.99, chi=0.995, N=100, R=1, phi=0.75, min_pairs=30)`

`## [1] 50`

In another example, it may be crucial that links are identified with high certainty. So we increase the minimum true discovery rate to 95%:

`samplesize(eta=0.99, chi=0.995, N=100, R=1, phi=0.95)`

`## Error in samplesize(eta = 0.99, chi = 0.995, N = 100, R = 1, phi = 0.95): Input values do no produce a viable solution`

This result suggests it is not possible to obtain such high certainty with this linkage criteria. We can confirm this with the `truediscoveryrate`

function, which shows that even with 100% sampling of this outbreak, we will correctly identify transmission links no more than 80% of the time.

`truediscoveryrate(eta=0.99, chi=0.995, rho=1, M=100, R=1)`

`## Calculating true discovery rate assuming multiple-transmission and multiple-linkage`

`## [1] 0.8032454`

Further exploration of the `samplesize`

function reveals that there are limited combinations of sensitivity and specificity that produce reasonable sample sizes, and that specificity in particular affects the minimum false discovery rate that can be obtained. Therefore, correctly estimating sensitivity and specificity are of key importance when using these functions to understand transmission.