ghypernet Tutorial

2021-10-15

This is a short tutorial to show how to employ the ghypernet R package for the estimation of gHypEG. In the following, we use Zachary’s Karate Club dataset as baseline. The data is provided within this package and can be simply loaded. It corresponds to a weighted adjacency matrix whose entries report the number of interactions between individuals. The graph specified by the adjacency matrix is undirected and does not have self-loops. For this reason, we specify the two parameters directed=FALSE and selfloops=FALSE.

library(ghypernet)

directed <- FALSE
selfloops <- FALSE
#>          Mr Hi Actor 2 Actor 3 Actor 4 Actor 5 Actor 6 Actor 7 Actor 8 Actor 9
#> Mr Hi        0       4       5       3       3       3       3       2       2
#> Actor 2      4       0       6       3       0       0       0       4       0
#> Actor 3      5       6       0       3       0       0       0       4       5
#> Actor 4      3       3       3       0       0       0       0       3       0
#> Actor 5      3       0       0       0       0       0       2       0       0
#> Actor 6      3       0       0       0       0       0       5       0       0
#> Actor 7      3       0       0       0       2       5       0       0       0
#> Actor 8      2       4       4       3       0       0       0       0       0
#> Actor 9      2       0       5       0       0       0       0       0       0
#> Actor 10     0       0       1       0       0       0       0       0       0
#>          Actor 10
#> Mr Hi           0
#> Actor 2         0
#> Actor 3         1
#> Actor 4         0
#> Actor 5         0
#> Actor 6         0
#> Actor 7         0
#> Actor 8         0
#> Actor 9         0
#> Actor 10        0

Fitting a gHypEG Model

GHypEG Models can be fitted using the general function ghype. However, we provide some default functions to fit specific models. These functions take an adjacency matrix (not sparse) as main argument, or an igraph graph object. They further allow to specify whether the model should be directed or not, and whether it should have selfloops or not.

The function regularm allows to fit the simplest possible model that can be formulated with gHypEGs: gnp-like graph where only the average degree is preserved.

(regularmodel <- regularm(graph = adj_karate, directed = directed, selfloops = selfloops))
#> Call:
#> ghype.matrix(graph = graph, directed = directed, selfloops = selfloops,
#>     unbiased = TRUE, regular = TRUE)
#> ghype undirected , no selfloops
#> 34 vertices, 231 edges
#> Loglikelihood:
#> -584.7139
#> df: 1

The function scm allows to fit the soft-configuration model, where all degrees are preserved.

(confmodel <- scm(graph = adj_karate, directed = directed, selfloops = selfloops))
#> Call:
#> ghype.matrix(graph = graph, directed = directed, selfloops = selfloops,
#>     unbiased = TRUE, regular = FALSE)
#> ghype undirected , no selfloops
#> 34 vertices, 231 edges
#> Loglikelihood:
#> -434.5449
#> df: 34

From the models, we can generate random realisations that can be used, e.g., as null-models to test hypothesis about the data. This can be achieved by means of the function rghype, specifying the number of realisations to take.

random_graphs_rm <- rghype(nsamples = 10, model = regularmodel, seed = 123)
random_graphs_scm <- rghype(nsamples = 10, model = confmodel, seed = 123)

Finally, we can perform model selection and hypothesis testing by comparing statistics of two models. We can follow two different approaches, comparing information scores such as AIC, or performing a likelihood ratio test. One simple question we can ask, for example, is whether the degree sequence of the Karate Club can be simply explained by a regular model, or there is the need to use a configuration model. AIC scores can be obtained using the standard R function AIC. Likelihood-ratio tests are performed using the function lr.test. However, we provide some wrappers for the function lr.test for some commonly done tests like the one above.

Comparing AIC scores for the regular model and the configuration model yields the following result:

AIC(regularmodel)
#> [1] 1171.428
AIC(confmodel)
#> [1] 937.0897

# difference in AICs, high value means more complex model is better
AIC(regularmodel) - AIC(confmodel)
#> [1] 234.338

This shows that the there is strong evidence for the employing the configuration model, because the degree sequence of the graph strongly deviates from the regular one.

To increase confidence in this result, we can compare the result above with that obtained from random data generated from the regular model.

# Generate regular models and configuration models for random graphs
regularmodels <- lapply(X = random_graphs_rm, FUN = regularm, directed=directed, selfloops=selfloops)
confmodels <- lapply(X = random_graphs_rm, FUN = scm, directed=directed, selfloops=selfloops)
# Compute AICs
AIC_regularmodels <- sapply(X = regularmodels,FUN = AIC)
AIC_confmodels <- sapply(X = confmodels,FUN = AIC)
# differences in AIC, high value means more complex model is better
AIC_regularmodels - AIC_confmodels
#>  [1] -46.55258 -32.36789 -20.92125 -28.61791 -34.33781 -39.78901 -31.21541
#>  [8] -30.15379 -18.92063 -41.49301

As can be seen by the experiment above, in the case of random graphs generated from the regular model, the AIC of the configuration model is higher than that of the regular model, confirming that the added complexity is not justified.

The (empirical) likelihood-ratio test gives a similar result, with the benefit of providing a p-value. The function conf.test provides a simple way to perform the test without the need of computing the models.

conf.test(graph = adj_karate, directed = directed, selfloops = selfloops, seed=123)
#>
#>  LR test -- gnp vs CM
#>
#> data:
#> lr = 299.79, df = 33, p-value < 2.2e-16
#> alternative hypothesis: one.sided
#> 95 percent confidence interval:
#>  19.63033 51.83806

Similarly to what done above, we can perform the test using the random graphs. Now we expect large p-values.

tests <- lapply(X = random_graphs_rm, FUN = conf.test, directed = directed, selfloops = selfloops, seed=123)
sapply(X = tests, FUN = function(x) x$p.value) #> [1] 0.97652855 0.47749983 0.09433827 0.30929582 0.57422868 0.82213540 #> [7] 0.42243781 0.37426112 0.06527072 0.87972711 Block-Constrained Configuration Model The next model that can be estimated is the block constrained configuration model. The Karate Club has a well-known partitioning into two communities that can be loaded from the package. We fit a bccm using the bccm function, specifying the vertex labels. data("vertexlabels") (blockmodel <- bccm(adj = adj_karate, labels = vertexlabels, directed = directed, selfloops = selfloops)) #> Call: #> bccm(adj = adj_karate, labels = vertexlabels, directed = directed, #> selfloops = selfloops) #> block ghype undirected , no selfloops #> 34 vertices, 231 edges #> Loglikelihood: #> -336.8385 #> df: 36 #> #> Coefficients: #> Estimate Std.Err t value Pr(>t) #> 1<->1 1.00000000 0.00000000 Inf < 2.2e-16 *** #> 1<->2 0.09044494 0.00021692 416.95 < 2.2e-16 *** #> 2<->2 0.91049902 0.00040434 2251.79 < 2.2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 print(blockmodel$blockOmega)
#>            1          2
#> 1 1.00000000 0.09044494
#> 2 0.09044494 0.91049902

By default, the function fits a ‘full’ block structure, where every parameter for in- out-blocks relations are different. In this case this corresponds to three parameters: one for block 1, one for block 2, one for the edges between the two blocks. However, in some cases we see that the parameters of in-blocks relations are similar to each other, as can be seen above looking at the diagonal entries of the block matrix. In this case, a full block structure may overfit the data, while a simpler model could be better. This can be fitted setting the parameter homophily=TRUE. This corresponds to a model where there are only two parameters, one for in-block edges, one for out-block edges, irrespectively of the number of blocks. The result is shown below.

(blockmodel_2 <- bccm(adj = adj_karate, labels = vertexlabels, directed = directed, selfloops = selfloops, homophily = TRUE))
#> Call:
#>     selfloops = selfloops, homophily = TRUE)
#> block ghype undirected , no selfloops
#> 34 vertices, 231 edges
#> Loglikelihood:
#> -337.0666
#> df: 35
#>
#> Coefficients:
#>             Estimate    Std.Err t value    Pr(>t)
#> homologue 1.00000000 0.00000000     Inf < 2.2e-16 ***
#> hetero    0.09512422 0.00020628  461.14 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The first test we need to perform is whether either of the model just fitted significantly improve the fit to the data. We do so by performing a likelihood ratio test between the configuration model and the bccms.

lr.test(nullmodel = confmodel, altmodel = blockmodel, seed = 123)
#>
#>  LR test
#>
#> data:
#> lr = 195.41, df = 2, p-value < 2.2e-16
#> alternative hypothesis: one.sided
#> 95 percent confidence interval:
#>  1.306644e-08 8.230181e+00
lr.test(nullmodel = confmodel, altmodel = blockmodel_2, seed = 123)
#>
#>  LR test
#>
#> data:
#> lr = 194.96, df = 1, p-value < 2.2e-16
#> alternative hypothesis: one.sided
#> 95 percent confidence interval:
#>  1.474561e-08 8.031860e+00

Unsurprisingly, the test gives low p-values, confirming the presence of a block structure. Again, we can perform a similar analysis using AIC:

AIC(confmodel)
#> [1] 937.0897
AIC(blockmodel_2)
#> [1] 744.1332
AIC(blockmodel)
#> [1] 745.677

From this, we note that while specifying the full bccm gives an improvement in the score compared to the two parameters model, such improvement is relatively small, and appears to not justify the increased complexity. We can verify again this with a likelihood-ratio test:

lr.test(nullmodel = blockmodel_2, altmodel = blockmodel, seed=123)
#>
#>  LR test
#>
#> data:
#> lr = 0.45622, df = 1, p-value = 0.7645
#> alternative hypothesis: one.sided
#> 95 percent confidence interval:
#>  0.03722328 6.94467855

The result shows that the added complexity of the second model is not justified by the data. We investigate this further by comparing the results with what obtained from random variations.

# First generate random sample from blockmodel
random_graphs_bccm2 <- rghype(nsamples = 100, model = blockmodel_2, seed = 123)
# Generate the two models for random graphs
blockmodels <- lapply(X = random_graphs_bccm2, FUN = bccm, labels = vertexlabels, directed=directed, selfloops=selfloops)
blockmodel_2s <- lapply(X = random_graphs_bccm2, FUN = bccm, labels = vertexlabels, directed=directed, selfloops=selfloops, homophily = TRUE)
# Compute AICs
AIC_blockmodels <- sapply(X = blockmodels, FUN = AIC)
AIC_blockmodel_2s <- sapply(X = blockmodel_2s, FUN = AIC)
# mean difference in AIC, high value means more complex model is better
summary(AIC_blockmodel_2s - AIC_blockmodels)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#> -2.0000 -1.5366 -0.6637 -0.1704  0.6268  8.3348

This confirms the fact that the asymmetry in the two blocks can ascribed to random variations.

Goodness of Fit

Finally, our framework allows to evaluate the goodness-of-fit of model to data. Similarly to a multinomial goodness-of-fit, we perform a test of the chosen against the maximally complex model that can be formulated. This model provides the baseline against which comparing simpler models in terms of their goodness-of-fit. The ‘full model’ is specified such that the observed graph is the expected one. It is fitted using the function ghype with the flag unbiased=FALSE. The likelihood ratio test can be performed either manually or using the function gof.test.

fullmodel <- ghype(graph = adj_karate, directed = directed, selfloops = selfloops, unbiased = FALSE)

lr.test(nullmodel = blockmodel_2, altmodel = fullmodel, seed = 123)
#>
#>  LR test
#>
#> data:
#> lr = 454.89, df = 559, p-value = 7.719e-14
#> alternative hypothesis: one.sided
#> 95 percent confidence interval:
#>  270.4443 340.7790
gof.test(model = blockmodel_2, seed = 123)
#>
#>  LR test -- GOF
#>
#> data:
#> lr = 454.89, df = 559, p-value = 7.719e-14
#> alternative hypothesis: one.sided
#> 95 percent confidence interval:
#>  270.4443 340.7790

The reason for this result is the fact that, although the bccm is a good fit for the empirical graph, the latter is characterised by some particular structures that are not encoded by the bccm. For example, the empirical graph is characterised by few ‘bridges’ between the two communities, managed by relatively low degree vertices. Instead, the bccm assumes that all vertices connect weakly the two communities, with an amount of edges proportional to the degree.