synthesis

op <- par(no.readonly = TRUE)
require(zoo)
#> Loading required package: zoo
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
library(synthesis)

1 Introduction

Synthetic data generation is widely used both for privacy-preserving data publishing and for testing new algorithms or methods. The synthesis package generates synthetic time series from commonly used statistical models, including linear, nonlinear and chaotic systems. It currently provides five groups of models: linear, nonlinear, dynamical, classification, and state-space systems. The package can be used to generate synthetic data for variable selection, prediction, classification, and clustering problems.

2 Overview

2.1 Linear models

2.1.1 Random walk model

The random walk with drift model (Shumway and Stoffer 2011) is given by \[\begin{equation} \label{eq:2} {{x}_{t}}=\delta +{{x}_{t-1}}+{{w}_{t}} \end{equation}\] where \(t= 1, 2, ..., n\) and \({{w}_{t}}\) is Gaussian white noise, \({{w}_{t}}\sim N(0,\sigma_w^{2})\). The constant \(\delta\) is known as the drift; when \(\delta =0\), the model reduces to a simple random walk. The term random walk originates from the fact that when \(\delta =0\), the value of the time series at time \(t\) is the value of the series at time \(t-1\) plus a completely random movement determined by \({{w}_{t}}\) (Jiang, Sharma, and Johnson 2019). Taking \(x_0 = 0\), the equation can be written as a cumulative sum of white noise variates: \[\begin{equation} \label{eq:3} {{x}_{t}}=\delta t+\sum\limits_{j=1}^{t}{{{w}_{j}}} \end{equation}\] where the drift \(\delta\) acts as the trend of the time series. This model is therefore a good proxy for simulating trends, for example in global temperature data. The figure below shows two synthetic time series, with and without drift.
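The equivalence between the recursive form and the cumulative-sum form can be checked with a few lines of base R. This is a minimal sketch that does not use the package's own generator; it assumes \(x_0 = 0\), and the names `delta`, `w`, `x_rec` and `x_cum` are illustrative only.

```r
set.seed(1)
n     <- 200                          # number of time steps
delta <- 0.1                          # drift
w     <- rnorm(n, mean = 0, sd = 1)   # Gaussian white noise

# Recursive form: x_t = delta + x_{t-1} + w_t, starting from x_0 = 0 (assumption)
x_rec  <- numeric(n)
x_prev <- 0
for (t in seq_len(n)) {
  x_rec[t] <- delta + x_prev + w[t]
  x_prev   <- x_rec[t]
}

# Cumulative-sum form: x_t = delta * t + sum of w_j for j = 1, ..., t
x_cum <- delta * seq_len(n) + cumsum(w)

# The two forms agree up to floating-point error
all.equal(x_rec, x_cum)
```

Setting `delta <- 0` in the sketch gives the driftless random walk.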

2.1.2 Autoregressive model

An autoregressive (AR) model, as its name suggests, is a regression on lagged values of the variable itself. The general expression of a \(p\)th-order autoregressive process can be written as (Cryer and Chan 2008): \[\begin{equation} {{x}_{t}}=c+{{\phi }_{1}}{{x}_{t-1}}+{{\phi }_{2}}{{x}_{t-2}}+\cdots +{{\phi }_{p}}{{x}_{t-p}}+{{\varepsilon }_{t}} \end{equation}\] where \(c\) is a constant, \({{\phi }_{1}},\ldots ,{{\phi }_{p}}\) are the autoregressive coefficients, and \({{\varepsilon }_{t}}\) is white noise.

The following linear AR models are given by Sharma (2000) and included in this package: \[\begin{align} {{x}_{t}}=0.9{{x}_{t-1}}+0.866{{\varepsilon }_{t}}\\ {{x}_{t}}=0.6{{x}_{t-1}}-0.4{{x}_{t-4}}+{{\varepsilon }_{t}}\\ {{x}_{t}}=0.3{{x}_{t-1}}-0.6{{x}_{t-4}}-0.5{{x}_{t-9}}+{{\varepsilon }_{t}} \end{align}\]

where \({{\varepsilon }_{t}}\) is random Gaussian noise with zero mean and unit standard deviation. For each model, \({{x}_{t}}\) was arbitrarily initialized and a total of \(N+500\) data points were generated. The first 500 points were discarded to reduce any effects of the arbitrary initialization. These AR models are well studied, particularly for validating variable selection algorithms (Galelli et al. 2014; Jiang, Sharma, and Johnson 2019, 2020). Variants of the random walk and autoregressive models introduced here can also be generated with arima.sim() from the stats package (R Core Team 2020).
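The generation procedure described above (arbitrary initialization, \(N+500\) points, first 500 discarded) can be sketched in base R for the second AR model. This is an illustration under stated assumptions, not the package's implementation: the initialization from noise and the names `burnin`, `x_full`, `x` and `x_ar1` are hypothetical. The last line uses arima.sim() from the stats package for the first AR model, passing `sd` through to the default innovation generator `rnorm`.

```r
set.seed(1)
N      <- 500                 # points to keep
burnin <- 500                 # points to discard
total  <- N + burnin
eps    <- rnorm(total)        # zero mean, unit standard deviation

# x_t = 0.6 x_{t-1} - 0.4 x_{t-4} + eps_t
x_full      <- numeric(total)
x_full[1:4] <- eps[1:4]       # arbitrary initialization (assumption)
for (t in 5:total) {
  x_full[t] <- 0.6 * x_full[t - 1] - 0.4 * x_full[t - 4] + eps[t]
}

# Drop the burn-in period so the start-up values do not affect the sample
x <- x_full[(burnin + 1):total]

# x_t = 0.9 x_{t-1} + 0.866 eps_t via stats::arima.sim, with its own burn-in
x_ar1 <- arima.sim(model = list(ar = 0.9), n = N, n.start = burnin, sd = 0.866)
```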

2.1.3 Threshold autoregressive model

Here, we show two examples of the two-regime threshold autoregressive (TAR) process (Cryer and Chan 2008) given by Sharma (2000): \[\begin{equation} {{x}_{t}}= \begin{cases} -0.9{{x}_{t-3}}+0.1{{\varepsilon }_{t}} & \text{if }{{x}_{t-3}}\le 0 \\ 0.4{{x}_{t-3}}+0.1{{\varepsilon }_{t}} & \text{if }{{x}_{t-3}}>0 \end{cases} \end{equation}\]

\[\begin{equation} {{x}_{t}}= \begin{cases} 0.5{{x}_{t-6}}-0.5{{x}_{t-10}}+0.1{{\varepsilon }_{t}} & \text{if }{{x}_{t-6}}\le 0 \\ 0.8{{x}_{t-10}}+0.1{{\varepsilon }_{t}} & \text{if }{{x}_{t-6}}>0 \end{cases} \end{equation}\] where \({{\varepsilon }_{t}}\) is random Gaussian noise with zero mean and unit standard deviation. As with the AR models, \({{x}_{t}}\) was arbitrarily initialized for each model and a total of \(N+500\) data points were generated. The first 500 points were discarded to reduce any effects of the arbitrary initialization.
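The regime switching in the first TAR model can be sketched in base R as follows. This is a minimal illustration and not the package's code: the initialization and the names `burnin`, `lag3`, `phi` and `tar1_sketch` are assumptions made for the example.

```r
set.seed(1)
N      <- 500                 # points to keep
burnin <- 500                 # points to discard
total  <- N + burnin
eps    <- rnorm(total)        # zero mean, unit standard deviation

x      <- numeric(total)
x[1:3] <- 0.1 * eps[1:3]      # arbitrary initialization (assumption)
for (t in 4:total) {
  lag3 <- x[t - 3]
  # The coefficient depends on which side of the threshold 0 the lagged value falls
  phi  <- if (lag3 <= 0) -0.9 else 0.4
  x[t] <- phi * lag3 + 0.1 * eps[t]
}

# Discard the burn-in period
tar1_sketch <- x[(burnin + 1):total]
```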

set.seed(2021)
sample=500

###Synthetic example - RW model
data.rw <- data.gen.rw(nobs=sample,drift=0.1,sd=1)

plot.ts(data.rw$xd, ylim=c(-35,55), main="Random walk", xlab=NA, ylab=NA, cex.axis=1.5)
lines(data.rw$x, col=4); abline(h=0, col=4, lty=2); abline(a=0, b=.1, lty=2)
Example of Random walk model



###Synthetic example - AR models
data.ar1 <- data.gen.ar1(nobs=sample)
data.ar4 <- data.gen.ar4(nobs=sample)
data.ar9 <- data.gen.ar9(nobs=sample)

plot.zoo(cbind(data.ar1$x,data.ar4$x,data.ar9$x), col=c("black","red","blue"),
         ylab=c("AR1","AR4","AR9"),main=NA, xlab=NA, cex.axis=1.5)
Example of Autoregressive models



###Synthetic example - TAR models
# Two TAR models in Sharma (2000)
tar1 <- data.gen.tar1(nobs=1000)$x # first TAR model in Section 2.1.3
tar2 <- data.gen.tar2(nobs=1000)$x # second TAR model in Section 2.1.3

# Generalized TAR, an example in Jiang et al. (2020)
tar <- data.gen.tar(nobs=1000,ndim=9,phi1=c(0.6,-0.1),
                     phi2=c(-1.1,0),theta=0,d=2,p=2,noise=0.1)$x 

plot.zoo(cbind(tar1,tar2,tar), col=c("black","red","blue"), ylab=c("TAR1","TAR2","TAR"),
              main=NA, xlab=NA, cex.axis=1.5)
Example of Threshold autoregressive models


2.1.4 Sinusoidal model

A general form of sinusoidal models can be written as (Shumway and Stoffer 2011):

\[\begin{equation} x_t = A\cos(2\pi ft + \phi) \end{equation}\] where \(A\) is the amplitude, \(\phi\) is the phase, and \(f\) is the frequency.
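The sinusoidal formula can be evaluated directly in base R. The sketch below is illustrative only: the time axis on \([0, 1]\) and the additive Gaussian noise are assumptions for the example, and need not match how the package defines them.

```r
set.seed(1)
n   <- 500
t   <- seq(0, 1, length.out = n)   # assumed time axis
A   <- 2                           # amplitude
f   <- 25                          # frequency (cycles per unit time)
phi <- 0.6 * pi                    # phase

# Noise-free cosine wave: x_t = A cos(2 pi f t + phi)
x <- A * cos(2 * pi * f * t + phi)

# Optional additive Gaussian noise (assumption for illustration)
x_noisy <- x + rnorm(n, mean = 0, sd = 0.1)
```

The noise-free series stays within \([-A, A]\), and its first value equals \(A\cos(\phi)\) because \(t = 0\) there.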

set.seed(2021)
sample=500

sw <- synthesis::data.gen.SW(nobs=sample, freq=25, A=2, phi=0.6*pi, mu=0, sd=0.1)
plot(sw$t,sw$x, type='o', ylab='Cosines', xlab="t")
Example of Sinusoidal model


2.2 Nonlinear systems

2.2.1 Hysteresis loop

The classical hysteresis loop (HL) is specified by \[\begin{equation} \begin{cases} {{x}_{t}}=a\cos (2\pi ft)+s{{\varepsilon }_{t}} \\ {{y}_{t}}=b\cos {{(2\pi ft)}^{m}}-c\sin {{(2\pi ft)}^{n}}+s{{\varepsilon }_{t}} \end{cases} \end{equation}\] where \(a\), \(b\) and \(c\) are parameters, \(m\) and \(n\) are integers, and \(s\) is a scaling factor used to alter the level of noise in the output. The default HL dataset is generated from \(y_t\) with \(f = 25\) Hz, and nine additional candidate predictors are generated with various frequencies. The default values of the system parameters are \(a = 0.8\), \(b = 0.6\), and \(c = 0.2\), which are known to produce a typical hysteresis system. As an example, the synthetic data from Jiang, Sharma, and Johnson (2020) are shown in the figure below.
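A noise-free hysteresis loop can be sketched in base R as below. This rests on two assumptions that the text does not confirm: the exponents \(m\) and \(n\) are read as powers of the cosine and sine (that is, \(\cos^{m}\) and \(\sin^{n}\)), and time runs over \([0, 1]\). The names `nobs`, `cc` and `s` are illustrative, and the package's own generator may differ.

```r
set.seed(1)
nobs <- 1000
t    <- seq(0, 1, length.out = nobs)   # assumed time axis
a    <- 0.8; b <- 0.6; cc <- 0.2       # system parameters (cc stands for c)
m    <- 3;   n <- 5                    # integer exponents
f    <- 25                             # frequency
s    <- 0                              # noise scaling; 0 gives a clean loop

# Assumption: exponents apply to the trigonometric functions themselves
x <- a * cos(2 * pi * f * t) + s * rnorm(nobs)
y <- b * cos(2 * pi * f * t)^m - cc * sin(2 * pi * f * t)^n + s * rnorm(nobs)

# Plotting y against x traces the loop
plot(x, y, type = "p", xlab = "x", ylab = "y")
```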

2.2.2 Friedman

The Friedman function is given by \[\begin{equation} y=10\sin (\pi {{x}_{1}}{{x}_{2}})+20{{({{x}_{3}}-0.5)}^{2}}+10{{x}_{4}}+5{{x}_{5}}+s\varepsilon \end{equation}\] where \(s\) is a scaling factor used to alter the level of noise in the output. Each variate \(x_i\) is sampled from a uniform distribution, \(x_i\sim U(0,1)\), for all \(i = 1, ..., 5\). In the original formulation, a 10-dimensional input \(x\) is generated, while only the first five inputs are relevant to the response. Datasets can be generated with zero or varying degrees of collinearity among the inputs; in the correlated case, the 10 candidate inputs are generated from correlated uniform variates according to the method of Fackler (1999). In each generated dataset, an additional 500 data points were discarded to reduce the effect of an arbitrary initialization.
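The independent-input case can be sketched in base R as follows. The sketch uses the standard Friedman benchmark form, in which the third term is \(20(x_3 - 0.5)^2\); it is not the package's implementation, it omits the burn-in and the correlated case, and the names `X`, `ndim` and `s` are illustrative.

```r
set.seed(1)
n    <- 1000
ndim <- 10                                  # candidate inputs; only the first five matter
X    <- matrix(runif(n * ndim), nrow = n)   # independent U(0, 1) variates
s    <- 0                                   # noise scaling

# Response depends on columns 1 to 5 only; columns 6 to 10 are irrelevant inputs
y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
  10 * X[, 4] + 5 * X[, 5] + s * rnorm(n)
```

With `s = 0` the response lies in \([0, 30]\), since the four terms are bounded by 10, 5, 10 and 5 respectively. A variable selection method applied to `X` and `y` should recover the first five columns.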

sample=1000
###synthetic example - Hysteresis loop
#Frequency, sampled from a given range
fd <- c(3,5,10,15,25,30,55,70,95)
data.HL <- data.gen.HL(nobs=sample,m=3,n=5,fp=25,fd=fd, sd.x=0, sd.y=0)

plot(data.HL$x,data.HL$dp[,data.HL$true.cpy], xlab="x", ylab = "y", type = "p", cex.axis=1.5,cex.lab=1.5)
Example of Hysteresis loop


###synthetic example - Friedman
#Friedman with independent uniform variates
data.fm1 <- data.gen.fm1(nobs=sample, ndim = 9, noise = 0)
#Friedman with correlated uniform variates
data.fm2 <- data.gen.fm2(nobs=sample, ndim = 9, r = 0.6, noise = 0) 

plot.zoo(cbind(data.fm1$x,data.fm2$x), col=c("red","blue"), main=NA, xlab=NA,
              ylab=c("Friedman with \n independent uniform variates",
                     "Friedman with \n correlated uniform variates"))