Experiences with using dat for research data transfer


(#) Major pitfalls: a summary

- Dat lacks a Windows GUI client, which is a problem for most users on that
  platform, as they are mostly not used to CLI tools
- Dat has problems with discovery inside a large private network. The
  workaround is to put a server outside the private network to clone
  dats for internal transfer
- There is a need for some kind of enterprise version of Hashbase, with
  integrated authentication (e.g. LDAP) and non-public dats
- The protocol can be problematic on tightly restricted networks (e.g.
  those that do not allow uncommon protocols)
- A minor issue is that the software ecosystem around dat is not prepared
  to deal with streaming sources (e.g. files that are growing while being
  shared in real time)

(#) Using dat for transfer of research data
  

The dat protocol has the potential to be the building block of an
infrastructure to transfer research data.

Here I will be documenting the effort to set up a campus system for
that purpose. The requirements are:

- Ubiquitous: it should be a single solution, available for all cases
- Simple: usage should be as simple as possible
- Works both for intra- and outside-campus transfers
- User empowering: users should depend as little as possible on others
- It has to work with terabytes of data

(#) Attempts

(##) First experience: outside network with tight firewall

The first experience was a setup to transfer data to an outside
institution with a tightly regulated firewall.

It was a failure: the firewall did not allow the protocol to
work. Discovery was possible, but the peer-to-peer connection could
not be established.

**Suggestions and notes**

- There should be a fallback to HTTPS built into the protocol. It should
  be 100% transparent to the user.
- This was a Windows machine, so the easy-to-use dat client was not
  available (it only supports Linux and Mac)
- The Beaker browser was tried, but it crashed
- `dat doctor` was then used to understand the situation

!!!
   Note to self: check `dat sync --http`
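
To diagnose this kind of failure, the dat CLI has a built-in connectivity
tester. A minimal sketch of the session on the restricted machine (the
`--http` flag is the one from the note above; I still need to confirm
its exact behavior):

```bash
# Run dat's connectivity test: it tries the different transports and
# reports which ones work from this network
dat doctor

# Candidate fallback from the note above: serve the archive's files
# over plain HTTP in addition to the dat protocol
dat sync --http
```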

(##) Second: intra-campus transfer

The dat protocol tries to find the public IP, of a 10.x.x.x machine,
that is available to the nodes and then (through a process
that I do not understand completely and need to study) attempts
peer-to-peer communication using the public IPs.

These are machines that are inside the same institution on a 10.x.x.x
network but are not multicast-accessible.

The network policy does not allow communication among internal machines
via the public addresses, so it did not work.

**Suggestions and workarounds**:

- I suspect this setup is quite common. There should be a discovery
  mechanism for this
- A workaround is to put a dat server *outside* the campus network for
  discoverability. We did this and now have a cloud machine just for
  this purpose (see the sketch below)

The workaround is quite poor: it adds cost and means that the data
needlessly travels in and out.
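
As a concrete sketch of that workaround, this is roughly what runs on the
cloud machine (paths and `<dat-key>` are placeholders for our actual
setup):

```bash
# On the cloud machine (public IP, outside the campus network):
# clone the archive once...
dat clone dat://<dat-key> /srv/dat-mirror/<dat-key>

# ...then stay online, seeding the data to other peers and pulling
# any updates published by the producer
cd /srv/dat-mirror/<dat-key>
dat sync
```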

(##) Third: intra-campus transfer with outside node

With the outside node we can do intra-campus transfers.

**There are a few notes:**

- The dat GUI does not work on Windows, which means that Beaker is the
  only solution for most users (who are not savvy enough, nor should
  they have to be, to use the CLI)
- The Beaker browser is not really designed for this kind of work, so
  it is unfair to think it should be an option here

This **was made to work** with the dat CLI.
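
For reference, a sketch of the flow that worked, with the dat CLI on both
campus machines and the cloud node mirroring the archive as above
(`<key>` is a placeholder):

```bash
# Machine A (campus, data producer): create the archive and share it;
# this prints the dat:// link to hand out
cd ~/experiment-data
dat share

# Machine B (campus, data consumer): clone the archive; the transfer
# effectively goes through the outside mirror, since the two campus
# machines cannot reach each other directly
dat clone dat://<key> ~/incoming-data
```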

(#) Data production: empowering users

In these cases we were the producers of the data, so it was easy for
me to set up the sender side.

But we want users to be able to easily make their data available.

Other than the lack of a Windows GUI, all the pieces seem to be
in place.

That being said, there is a problem: users want to shut down their
machines and take them home (in most cases these are laptops). **But the
data should still be available to third parties**. Hosting
is thus needed.

I am developing a multi-user dat sharer to do this. It is like
Hashbase, but it will be connected to our LDAP server for
authentication, and its purpose is to maintain a feed file for
`hypercored`.
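
If I read the `hypercored` documentation correctly, it watches a
plain-text `feeds` file (one archive key per line) and keeps every listed
archive seeded. The sharer would then only need to append authenticated
users' keys to that file; a rough sketch (`<user-archive-key>` is a
placeholder):

```bash
# Append the key of a user's archive to the feed file
echo "dat://<user-archive-key>" >> feeds

# Run hypercored in the directory containing the feeds file; it joins
# the network for each listed archive and keeps the data available
# even after the user's laptop goes offline
hypercored
```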

(#) Data production: real-time dats

Real-time data production is not well supported; for example, starting
a dat source over a directory that is receiving constant real-time
updates.
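
A minimal sketch of the problematic scenario, assuming `dat share`
watches the directory for changes by default (which, as far as I can
tell, it does):

```bash
# Terminal 1: a process appending data to a file in real time
while true; do date >> ~/telemetry/run.log; sleep 1; done

# Terminal 2: share that same directory; each change triggers a
# re-import of the modified file, which is where the current tooling
# struggles with continuously growing files
cd ~/telemetry
dat share
```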

There are ways around this. (Document)

(#) Data production: having lots of sources

!!!
   Note to self: assess the practicality of using the dat GUI to maintain
   lots of sources