Experiences with using dat for research data transfer
(#) Major Pitfalls - A summary
- Dat lacks a Windows GUI client, which is a problem for most users on that
platform, as they are mostly not used to CLI tools
- Dat has problems with discovery inside a large private network. The
workaround is to put a server outside the private network to clone
dats for internal transfer
- There is a need for some kind of enterprise version of Hashbase: with
integrated authentication (say LDAP) and non-public dats
- The protocol can be problematic on tightly restricted networks (e.g. ones
that do not allow uncommon protocols)
- A minor issue is that the software ecosystem around dat is not prepared to
deal with streaming sources (e.g. files that are growing while being
shared in real time)
(#) Using dat for transfer of research data
The dat protocol has the potential to be the building block of an
infrastructure to transfer research data.
Here I will be documenting the effort to set up a campus system for
that purpose. The requirements are:
- Ubiquitous: should be a single solution, available for all cases
- Simple: usage should be as simple as possible
- Works both for intra- and outside-campus transfers
- User empowering: very little dependency on everyone else
- It has to work with TB of data
(##) First experience: outside network with tight firewall
The first experience was a setup to transfer data to an outside
institution with a tightly regulated firewall.
It was a failure as the firewall did not allow the protocol to
work. Discovery was possible, but peer-to-peer did not work.
**Suggestion and notes**
- There should be a fallback to HTTPS built into the protocol. It should
be 100% transparent to the user.
- This was a Windows machine, so the easy-to-use dat client is not
available (it exists only for Linux and Mac)
- Beaker was tried and crashed
- `dat doctor` was then used to understand the situation
Note to self: check `dat sync --http`
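
When peer-to-peer fails behind a firewall, a quick first check is whether a plain TCP connection to the remote peer can be opened at all from the restricted machine. The sketch below is only an illustration and is not part of dat: `can_connect` is a hypothetical helper, and the host and port you pass in are whatever the remote peer actually uses. It only tests TCP; if dat also relies on UDP-based transports, this check does not cover them.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, `can_connect("peer.example.org", 3282)` would test a placeholder host on port 3282, which is assumed here to be the port the remote dat peer listens on. A `False` result from inside the restricted network, combined with `True` from elsewhere, points at the firewall rather than at dat itself.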
(##) Second: intra-campus transfer
For a machine on a 10.x.x.x address, the dat protocol tries to find the
public IP that is visible to the other nodes and then (through a process
that I do not understand completely and need to study) attempts
peer-to-peer communication using the public IPs.
The machines in this case are inside the same institution, on a 10.x.x.x
network, but cannot reach each other via multicast.
The network policy does not allow communication among internal machines
via the public addresses. So it did not work.
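
The check that would help here can be illustrated with Python's standard `ipaddress` module. This is a sketch of the reasoning only, not of how dat's discovery actually works: `same_campus` is a hypothetical helper, `CAMPUS_NET` is an assumed range (the notes only say the machines are on 10.x.x.x), and the addresses in the usage note are made up.

```python
import ipaddress

# Assumed campus range; adjust to the real internal network.
CAMPUS_NET = ipaddress.ip_network("10.0.0.0/8")


def same_campus(addr_a: str, addr_b: str) -> bool:
    """Return True if both addresses fall inside the campus network."""
    a = ipaddress.ip_address(addr_a)
    b = ipaddress.ip_address(addr_b)
    return a in CAMPUS_NET and b in CAMPUS_NET
```

`same_campus("10.1.2.3", "10.200.0.7")` is true, so those two peers could in principle talk over their internal addresses. A peer pair that only knows each other's public address has no way to make that decision, which matches the failure described above.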
**Suggestions and workarounds**:
- I suspect this setup is quite common. There should be a discovery
mechanism for this
- A workaround is to put a dat server *outside* the campus network for
discoverability. We did this and now have a cloud machine just for this.
The workaround is quite poor: it has costs and means that the data
needlessly travels in and out.
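
To get a feeling for the overhead of relaying through an outside node, here is a back-of-the-envelope sketch. `relay_cost` is a hypothetical helper and the link speed in the usage note is an assumed figure, not a measurement from our network. It treats the two legs (out to the relay, back in) as sequential, which is pessimistic for a full-duplex uplink.

```python
from typing import Tuple


def relay_cost(size_bytes: float, uplink_bits_per_s: float) -> Tuple[float, float]:
    """Return (bytes crossing the campus border, hours spent on the uplink)."""
    # The data leaves campus once to reach the relay and comes back in once.
    border_bytes = 2 * size_bytes
    hours = border_bytes * 8 / uplink_bits_per_s / 3600
    return border_bytes, hours
```

For 1 TB (1e12 bytes) over an assumed 1 Gbit/s uplink, `relay_cost(1e12, 1e9)` gives 2e12 bytes crossing the border and roughly 4.4 hours, twice what a direct internal transfer of the same size would need at that speed.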
(##) Third: intra-campus transfer with outside node
With the outside node we can do intra-campus transfer.
**There are a few notes:**
- The dat GUI does not work on Windows. This means that Beaker is the
only solution for most users (who are not savvy enough to use the CLI,
nor should they have to be)
- Beaker browser is not really designed for this kind of work, so
it is unfair to expect it to be an option here
This **was made to work** with the dat CLI.
# Data production: empowering users
In these cases we were the producers of the data, so it was easy for
me to set up the sender side.
But we want users to be able to easily make their data available.
Other than the lack of a Windows GUI, all the pieces seem to be there.
That being said, there is a problem: users want to shut down their
machines and take them home (most cases are laptops). **But the
data should still be available to third parties**. Hosting
is thus needed.
I am developing a multi-user dat sharer to do this. It is like
Hashbase, but it will be connected to our LDAP server for
authentication. With the purpose to maintain a feed file for
# Data production: real-time dats
Real-time data production is not well supported. For example starting
a dat source over a directory that is getting constant real-time data
There are ways around this. (Document)
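
The workarounds actually used are still to be documented. As one generic possibility, offered purely as an assumption, an acquisition directory can be polled and only files that have stopped growing moved into the shared dat directory. `stable_files` below is a hypothetical helper that does the detection step, and the wait time is an arbitrary default.

```python
import os
import time


def stable_files(directory: str, wait_s: float = 5.0) -> list:
    """Return names of files whose size did not change during wait_s seconds."""

    def sizes() -> dict:
        # Map each regular file in the directory to its current size.
        return {
            entry.name: entry.stat().st_size
            for entry in os.scandir(directory)
            if entry.is_file()
        }

    before = sizes()
    time.sleep(wait_s)
    after = sizes()
    # Keep only files seen in both snapshots with an identical size.
    return sorted(name for name, size in after.items() if before.get(name) == size)
```

An unchanged size is only a heuristic: an instrument that pauses for longer than `wait_s` would be misclassified, so the wait has to be chosen with the data source in mind.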
# Data production: having lots of sources
Note to self: see the practicality of using the dat GUI to maintain
lots of sources