Introduction to AzureStor

Hong Ooi

This is a short introduction to using AzureStor.

The Resource Manager interface: creating and deleting storage accounts

AzureStor implements an interface to Azure Resource Manager, which you can use to manage storage accounts: creating them, retrieving them, deleting them, and so forth. This is done via the appropriate methods of the az_resource_group class. For example, the following code shows how you might create a new storage account from scratch.

# create a new resource group for the storage account
rg <- AzureRMR::get_azure_login()$
    get_subscription("{subscription_id}")$
    create_resource_group("myresourcegroup", location="australiaeast")

# create the storage account
stor <- rg$create_storage_account("mystorage")
stor
# <Azure resource Microsoft.Storage/storageAccounts/mystorage>
#   Account type: StorageV2
#   SKU: name=Standard_LRS, tier=Standard 
#   Endpoints:
#     dfs: https://mystorage.dfs.core.windows.net/
#     web: https://mystorage.z26.web.core.windows.net/
#     blob: https://mystorage.blob.core.windows.net/
#     queue: https://mystorage.queue.core.windows.net/
#     table: https://mystorage.table.core.windows.net/
#     file: https://mystorage.file.core.windows.net/ 
# ---
#   id: /subscriptions/35975484-5360-4e67-bf76-14fcb0ab5b9d/resourceGroups/myresourcegroup/providers/Micro ...
#   identity: NULL
#   location: australiaeast
#   managed_by: NULL
#   plan: NULL
#   properties: list(networkAcls, supportsHttpsTrafficOnly, encryption, provisioningState, creationTime,
#     primaryEndpoints, primaryLocation, statusOfPrimary)
#   tags: list()
# ---
#   Methods:
#     check, delete, do_operation, get_account_sas, get_blob_endpoint, get_file_endpoint, get_tags, list_keys,
#     set_api_version, set_tags, sync_fields, update

Without any options, this will create a storage account with the following parameters:

- General purpose account (all storage types supported)
- Locally redundant storage (LRS) replication
- Hot access tier (for blob storage)
- HTTPS connection required for access

You can change these by setting the arguments to create_storage_account(). For example, to create an account with geo-redundant storage replication and the default blob access tier set to “cool”:

stor2 <- rg$create_storage_account("myotherstorage",
    replication="Standard_GRS",
    access_tier="cool")

And to create a blob storage account and allow non-encrypted (HTTP) connections:

blobstor <- rg$create_storage_account("myblobstorage",
    kind="blobStorage",
    https_only=FALSE)

You can verify that these accounts have been created by going to the Azure Portal (https://portal.azure.com/).

One factor to remember is that all storage accounts in Azure share a common namespace. For example, there can only be one storage account named “mystorage” at a time, across all Azure users.

To retrieve an existing storage account, use the get_storage_account() method. Only the storage account name is required.

# retrieve one of the accounts created above
stor2 <- rg$get_storage_account("myotherstorage")

Finally, to delete a storage account, you simply call its delete() method. Alternatively, you can call the delete_storage_account() method of the az_resource_group class, which will do the same thing. In both cases, AzureStor will prompt you for confirmation that you really want to delete the storage account.

# delete the storage accounts created above
stor$delete()
stor2$delete()
blobstor$delete()

# if you don't have a storage account object, use the resource group method:
rg$delete_storage_account("mystorage")
rg$delete_storage_account("myotherstorage")
rg$delete_storage_account("myblobstorage")

The client interface: working with storage

Storage endpoints

Perhaps the more relevant part of AzureStor for most users is its client interface to storage. With this, you can upload and download files and blobs, create containers and shares, list files, and so on. Unlike the ARM interface, the client interface uses S3 classes. This is for a couple of reasons: it is more familiar to most R users, and it is consistent with most other data manipulation packages in R, in particular the tidyverse.

The starting point for client access is the storage_endpoint object, which stores information about the endpoint of a storage account: the URL that you use to access storage, along with any authentication information needed. The easiest way to obtain an endpoint object is via the storage account resource object’s get_blob_endpoint(), get_file_endpoint() and get_adls_endpoint() methods.
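As a minimal sketch of this, assuming the resource group and storage account created in the previous section still exist (the names "myresourcegroup" and "mystorage" and the subscription ID are placeholders):

```r
library(AzureStor)

# retrieve the storage account resource object
# (resource group and account names are placeholders)
rg <- AzureRMR::get_azure_login()$
    get_subscription("{subscription_id}")$
    get_resource_group("myresourcegroup")
stor <- rg$get_storage_account("mystorage")

# obtain one endpoint object per storage type
blob_endp <- stor$get_blob_endpoint()
file_endp <- stor$get_file_endpoint()
adls_endp <- stor$get_adls_endpoint()
```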

More practically, you will usually have to work with a storage endpoint without having access to the resource itself. In this case, you can create the endpoint object directly with the storage_endpoint function. When you create the endpoint this way, you have to provide the access key explicitly (assuming you know what it is).
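A sketch of creating endpoints directly might look like the following. The account URLs and the string "access_key" are placeholders for your own values; storage_endpoint should infer the storage type from the URL.

```r
library(AzureStor)

# blob endpoint, authenticated with the account access key (placeholder value)
blob_endp <- storage_endpoint("https://mystorage.blob.core.windows.net/",
    key="access_key")

# file endpoint for the same (hypothetical) account
file_endp <- storage_endpoint("https://mystorage.file.core.windows.net/",
    key="access_key")
```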

Instead of an access key, you can provide either an authentication token or a shared access signature (SAS) to gain authenticated access. The main difference between using a key and these methods is that a key unlocks access to the entire storage account: a user who has a key can access all containers and files, and can transfer, modify and delete data without restriction. A user with a token or a SAS, on the other hand, can be restricted to specific containers, to read-only access, to a given span of time, and so on. This is usually much better in terms of security.

Usually, these authentication objects will be provided to you by your system administrator. However, if you have the storage account resource object, you can generate a SAS yourself with its get_account_sas() method. Note that generating a SAS requires the storage account’s access key.
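The following is a rough sketch of generating a SAS and using it in place of the key. The resource names are placeholders, and the permission and resource type strings are only one plausible choice; the validity window is left at the package default, so check the get_account_sas documentation for the arguments that control it in your installed version.

```r
library(AzureStor)

# retrieve the storage account resource object (placeholder names)
rg <- AzureRMR::get_azure_login()$
    get_subscription("{subscription_id}")$
    get_resource_group("myresourcegroup")
stor <- rg$get_storage_account("mystorage")

# account SAS: read and list permissions, at the container and object level
sas <- stor$get_account_sas(permissions="rl", resource_types="co")

# create an endpoint object that carries the SAS but not the access key
blob_endp <- stor$get_blob_endpoint(key=NULL, sas=sas)
```

A SAS obtained from someone else can be used in the same way, by passing it as the sas argument when creating the endpoint.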

Storage container access

The client interface for AzureStor supports blob storage, file storage, and Azure Data Lake Storage Gen 2. All of these storage types have a similar structure. In particular, the storage within each type is organised into containers: blob containers, file shares, and ADLSgen2 filesystems.

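Given an endpoint object, AzureStor provides generics for working with containers: storage_container to get an existing container, create_storage_container, delete_storage_container, and list_storage_containers. They will dispatch to the appropriate underlying methods for each storage type.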

Here is some example blob container code showing their use. The file share and ADLSgen2 filesystem code is very similar.
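A sketch along these lines, assuming a blob endpoint authenticated with an access key (the URL, key and container names are placeholders):

```r
library(AzureStor)

blob_endp <- storage_endpoint("https://mystorage.blob.core.windows.net/",
    key="access_key")

# list the containers available at this endpoint
list_storage_containers(blob_endp)

# get an existing container
cont <- storage_container(blob_endp, "mycontainer")

# create a new container and allow unauthenticated public access to its blobs
newcont <- create_storage_container(blob_endp, "mynewcontainer",
    public_access="blob")

# delete the new container again (expect a confirmation prompt)
delete_storage_container(newcont)
```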

As a convenience, instead of providing an endpoint object and a container name, you can also provide the full URL to the container. If you do this, you’ll also have to supply any necessary authentication details such as the access key or SAS.
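For instance, the following sketch retrieves a container directly from its URL; the URL and key are placeholders.

```r
library(AzureStor)

# full container URL plus authentication details, with no separate endpoint object
cont <- storage_container("https://mystorage.blob.core.windows.net/mycontainer",
    key="access_key")
```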

File transfers

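To transfer files and blobs to and from a storage container, use the storage_upload and storage_download generics, or storage_multiupload and storage_multidownload to transfer several files at once. As before, the appropriate method will be called for the type of storage.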
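A minimal sketch of a single upload and download, assuming an existing blob container; the URL, key, container name and all file paths are placeholders.

```r
library(AzureStor)

blob_endp <- storage_endpoint("https://mystorage.blob.core.windows.net/",
    key="access_key")
cont <- storage_container(blob_endp, "mycontainer")

# upload a local file to the container under a new name
storage_upload(cont, "/path/to/file.txt", "newfile.txt")

# download a blob from the container to a local file
storage_download(cont, "myblob.docx", "/path/to/file.docx")
```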

The storage_multiupload and storage_multidownload methods use a pool of background R processes to do the transfers in parallel, which usually results in major speedups when transferring multiple small files. The pool is created the first time a parallel file transfer is performed, and persists for the duration of the R session; this means you don’t have to wait for the pool to be (re-)created each time.
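As an illustration of the parallel transfers, the sketch below uses wildcard patterns to select the files. The paths, patterns and destination directories are hypothetical, and the exact handling of the destination argument may differ between package versions.

```r
library(AzureStor)

blob_endp <- storage_endpoint("https://mystorage.blob.core.windows.net/",
    key="access_key")
cont <- storage_container(blob_endp, "mycontainer")

# upload all zip files in a local directory
storage_multiupload(cont, "/data/logfiles/*.zip", "/uploaded_data")

# download all blobs whose names start with "jan" into a local directory
storage_multidownload(cont, "jan*.*", "/data/january")
```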

AzureStor also provides two convenience functions, download_from_url and upload_to_url, which transfer a single file given its full URL.
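A sketch of transferring files by URL, without creating an endpoint or container object first; the URLs, local paths and key are placeholders.

```r
library(AzureStor)

# download a blob given its full URL
download_from_url(
    "https://mystorage.blob.core.windows.net/mycontainer/myblob.docx",
    "/path/to/file.docx",
    key="access_key")

# upload a local file to a destination URL
upload_to_url(
    "/path/to/file.txt",
    "https://mystorage.blob.core.windows.net/mycontainer/newfile.txt",
    key="access_key")
```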

Managing storage objects

AzureStor provides generics for managing files and blobs within a storage container: list_storage_files, create_storage_dir, delete_storage_dir and delete_storage_file.

As blob storage doesn’t support directories, create_storage_dir and delete_storage_dir will throw an error if called on a blob container.
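The sketch below uses a file share rather than a blob container, so that the directory functions apply. The URL, key, share name and paths are placeholders, and the delete calls can be expected to prompt for confirmation.

```r
library(AzureStor)

file_endp <- storage_endpoint("https://mystorage.file.core.windows.net/",
    key="access_key")
share <- storage_container(file_endp, "myshare")

# create a directory and upload a file into it
create_storage_dir(share, "newdir")
storage_upload(share, "/path/to/file.txt", "newdir/file.txt")

# list the contents of the directory
list_storage_files(share, "newdir")

# clean up: delete the file, then the now-empty directory
delete_storage_file(share, "newdir/file.txt")
delete_storage_dir(share, "newdir")
```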

For more information about the different types of storage, see the Microsoft Docs site. Note that there are other types of storage (queue, table) that do not have a client interface exposed by AzureStor.