Transferring research data using dat - Implementing a campus-wide infrastructure


Transfer of data is fundamental for research. Most of the existing
frameworks are difficult to use, are burdensome to IT departments
and disempower users.

The Dat framework is a peer-to-peer protocol proposed to solve some of
these problems. In theory, due to its peer-to-peer nature, it does not
require intermediaries. In practice, some limitations make its use
quite difficult, especially for less technically inclined users.

Here we discuss some of the major pitfalls of Dat and propose
a campus-wide architecture to mitigate those problems. This
architecture introduces some centralized support, but that support
only supplements the underlying peer-to-peer solution. Centralized
components help, but do not become mandatory.


(#) Major dat Pitfalls - A summary

1. While dat allows users to serve their content, the data becomes
   inaccessible as soon as users shut down or disconnect their personal
   computer
2. Dat has problems with discovery and communication inside a large
   private network. It is easier to communicate with external entities
   than with internal ones
3. The protocol can be problematic on tightly restricted networks (e.g.
   networks that do not allow uncommon protocols)

For more technical details, see a [previous article](/2018/02/24/experiences-with-using-dat-for-research-data-transfer/index.html)

(#) Proposed solutions - overview

Here is a point-by-point proposal of solutions:

1. An always-on local server that will clone dat sources from users
2. An always-on server that is hosted outside the private network
3. A server that exposes HTTPS versions of dats

(#) Understanding use cases

Before we detail the solutions, we discuss two types of users. This
might not have an impact on the proposed solution, but it is important
to be aware of potentially completely different requirements.

(##) The organic user

The organic user is a producer or consumer who rarely needs to
transfer data. Note that, although transfers are rare, the quantity
of data transferred each time might still be considerable.


(##) The industrial user

This is a user who produces a lot of different datasets. For example,
in bioinformatics, a sequencing center would be producing many (large)
datasets for different customers and thus maintaining potentially
hundreds of dats simultaneously.

A solution for this kind of user has to be extremely streamlined and
entail very low maintenance.

!!!
    How can many dats be kept open at once, especially given that
    dat-desktop is problematic?

(#) Detailed implementation proposals

To understand the proposals here, some knowledge of the dat software
ecosystem is needed. If anything below is unclear, please read the
documentation of the dat project first.

(##) Always-on local server

For users to be able to disconnect their personal computers
and "go home", an always-on server needs to be provided.

There are currently two alternatives: hashbase.io, and `hypercored`
with a list of dats. Both are problematic: `hypercored` requires manual
maintenance of the dat list, while hashbase.io does not have enterprise
authentication mechanisms (e.g. LDAP or Windows AD) and makes
all feeds public.

Thus an LDAP-based dat server is suggested:

1. The user logs into a web interface using her enterprise credentials
2. The user specifies the dat URL to be maintained
3. The server clones and serves the dat

Retention policies will be needed, probably with different user profiles.
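
The registration and retention steps above can be sketched as follows. This is only an illustration, not an implementation of the proposed server: the profile names, the retention periods and the `Registration` structure are hypothetical, authentication (LDAP) is left out entirely, and the sketch assumes that a dat URL is a 64-character hexadecimal key with an optional `dat://` prefix.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: a dat URL is a 64-character hex key, optionally prefixed by dat://
DAT_URL = re.compile(r"^(?:dat://)?([0-9a-fA-F]{64})/?$")

# Hypothetical user profiles and retention periods, for illustration only.
RETENTION = {
    "organic": timedelta(days=30),
    "industrial": timedelta(days=365),
}


@dataclass
class Registration:
    key: str
    owner: str
    profile: str
    registered_at: datetime


def parse_dat_key(url):
    """Return the lower-case hex key of a dat URL, or raise ValueError."""
    match = DAT_URL.match(url.strip())
    if match is None:
        raise ValueError(f"not a dat URL: {url!r}")
    return match.group(1).lower()


def register(registry, owner, profile, url, now):
    """Record that `owner` asked the server to keep the dat at `url` alive."""
    if profile not in RETENTION:
        raise ValueError(f"unknown profile: {profile!r}")
    key = parse_dat_key(url)
    registry[key] = Registration(key, owner, profile, now)
    return registry[key]


def expired(registry, now):
    """List the keys whose retention period has elapsed at time `now`."""
    return sorted(
        key
        for key, reg in registry.items()
        if now - reg.registered_at > RETENTION[reg.profile]
    )
```

A periodic job could call `expired` and stop serving the returned keys. Whether expiry should be counted from registration, as here, or from the last download is a policy decision that this sketch does not settle.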

(##) Off-campus server

As discovery and communication inside a private network are not fully
functional in dat, an off-campus server that is *not* constrained
by local network rules and firewalls might be necessary.


In this case, a simple instance of `hypercored` where the feed is
updated from the on-campus server might be sufficient.
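
One way to remove the manual maintenance of the `hypercored` list is to let the on-campus server regenerate it. The sketch below assumes that the list is a plain-text file with one key per line (check the `hypercored` documentation for the exact file name and format it expects). The function names are hypothetical, and how the file reaches the off-campus machine is left open.

```python
import os
import tempfile
from pathlib import Path


def read_feeds(path):
    """Return the set of keys listed in a feed file (one key per line)."""
    feed_file = Path(path)
    if not feed_file.exists():
        return set()
    lines = (line.strip() for line in feed_file.read_text().splitlines())
    # Skip blank lines and lines starting with '#' (treated as comments here).
    return {line for line in lines if line and not line.startswith("#")}


def sync_feeds(path, wanted_keys):
    """Rewrite the feed file so that it lists exactly `wanted_keys`.

    Returns (added, removed) so the caller can log what changed.
    """
    target = Path(path)
    current = read_feeds(target)
    wanted = set(wanted_keys)
    # Write to a temporary file and rename it into place, so that a reader
    # never sees a half-written list.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), prefix=".feeds-")
    with os.fdopen(fd, "w") as tmp:
        for key in sorted(wanted):
            tmp.write(key + "\n")
    os.replace(tmp_name, target)
    return sorted(wanted - current), sorted(current - wanted)
```

The off-campus process would still need to notice the new list, either by watching the file or by being restarted.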

!!!
    Merging the on-campus and off-campus servers is possible, by
    hosting the on-campus server offsite, but
    enterprise authentication outside the local network might not be
    available.


There are two alternatives to this proposal:

- An internal server that is configured so that it can be discovered,
  i.e. one with a public IP that is accessible from the inside
- Changing the firewall rules of the institution to allow internal
  communication via public IPs

The feasibility of these proposals will vary from place to place.

(##) HTTP bridging

Given that dat requires reasonably open networks, its ability to
work in tighter settings might be reduced. Plausible examples here
are many networks of the US Federal government.

While full dat functionality might be impossible in these cases, there
are mitigating strategies, for example making dats available via HTTPS.

Such a service could support:

1. HTTPS download of a dat as a compressed file (zip, tar.gz, ...)
2. Browsing of the dat content and partial download


(#) Needs of industrial users

The workflow of industrial users needs further assessment. The interface
to the always-on local server will probably need to be tailored to
these users.