The Medical Data Works Railway service implements a privacy-preserving federated infrastructure as proposed
by the Personal Health Train. The main principle of the Personal Health Train is to bring questions to the data rather than moving data.
This concept is called Federated Learning.
In this section, some Frequently Asked Questions are answered.
The Personal Health Train metaphor defines the following components:
The hardware requirements can be found here: IT requirements.
All Trains are based on Docker. Inside the Train (a Docker image) is an algorithm that needs the data.
The basic requirement is therefore that the Station has to tell the Train where the data is.
This is typically done by providing the Train Docker container with the information like
In general, no. Most projects we support use a dump from the database (e.g. CSV or RDF) rather than setting up a SQL query.
See also the answer here. This does not mean it is impossible, but the Station will need the SQL database connection URI,
so that it can later be shared with the Train.
We have tested such a scenario and it works. The choice is ultimately up to the consortium/project.
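As a minimal sketch of how a Train might find out where the data is, the snippet below reads a location from an environment variable and decides whether it points at a file dump or a SQL database. The variable name `DATABASE_URI` and the function `describe_data_source` are assumptions made for illustration; the actual names depend on how the Station and the project are configured.

```python
import os
from urllib.parse import urlparse

# Hypothetical variable name; the real one is agreed between Station and Train.
DATA_LOCATION_VARIABLE = "DATABASE_URI"

# URI schemes this sketch treats as SQL connections (an illustrative subset).
SQL_SCHEMES = {"postgresql", "mysql", "mssql", "sqlite"}


def describe_data_source(env=os.environ):
    """Return (kind, location) for the data source the Station configured."""
    location = env.get(DATA_LOCATION_VARIABLE, "")
    if not location:
        raise RuntimeError(f"The Station did not set {DATA_LOCATION_VARIABLE}")

    # A connection URI such as postgresql://host/db has a scheme; a path does not.
    scheme = urlparse(location).scheme.split("+")[0].lower()
    if scheme in SQL_SCHEMES:
        return ("sql", location)

    # Otherwise fall back to the file extension of the dump.
    suffix = os.path.splitext(location)[1].lower()
    if suffix == ".csv":
        return ("csv", location)
    if suffix in (".rdf", ".ttl"):
        return ("rdf", location)
    raise ValueError(f"Unrecognised data location: {location}")
```

With a file dump, the Train only needs read access to one path. With a SQL URI, the Train also needs network access to the database and valid credentials, which is part of why a dump is usually the simpler choice.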
The Station will communicate its results back to the server using the HTTPS protocol. The results returned by the algorithm should be JSON serializable in the case of vantage6.
Note that for some applications (such as federated deep learning) the JSON may contain many model parameters and can thus be quite big.
For this reason, a stable and reasonably fast internet connection is desirable.
This is technically possible, but it puts the burden on the Train provider to sort these differences out, so it is not advisable. See also the answer here.
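To illustrate the JSON requirement and the effect of model size on the payload, here is a small sketch using only the Python standard library. The helper `to_payload` and the example results are hypothetical and are not part of vantage6.

```python
import json


def to_payload(result):
    """Serialize a result to UTF-8 JSON bytes.

    json.dumps raises TypeError when the result contains something that
    is not JSON serializable (for example a set or a custom object).
    """
    return json.dumps(result).encode("utf-8")


# A small aggregate result versus a (toy) list of model parameters.
summary = {"n_patients": 120, "mean_age": 63.4}
weights = {"layer_1": [0.0] * 100_000}

summary_bytes = len(to_payload(summary))
weights_bytes = len(to_payload(weights))
print(f"summary: {summary_bytes} bytes, weights: {weights_bytes} bytes")
```

Checking serializability inside the algorithm, before the result leaves the Station, gives a clearer error than a failure during upload.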
Yes. The most important mechanism is trust between the Stations and the Trains. What the Train can and cannot do is laid down in a mandatory project agreement.
This limits the Train provider to the (research) question at hand.
Besides these legal and trust mechanisms, every Station can see in the log which algorithms have been executed. Node administrators are additionally encouraged to set strong policies for their nodes. They can restrict which organizations or users are allowed to send Trains to their Station. More importantly, they can specify, at different levels of granularity, which Trains are allowed to run on their Station by image name, including the registry, repository and tag. A node administrator could, for example, allow only a registry that is uniquely under their own control, in which vetted or otherwise trusted algorithm images are stored. There is also the option to review the code and subsequently make the Train immutable. This ensures that the Train does what it is supposed to do.
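The idea of allowing Trains by registry, repository and tag can be sketched as a simple allow-list check. This is an illustration of the concept only: the patterns, the registry name `registry.example.org` and the function `is_image_allowed` are hypothetical and do not reproduce the actual vantage6 node configuration syntax.

```python
from fnmatch import fnmatchcase

# Hypothetical allow-list, from coarse to fine granularity:
# any image and tag under one vetted namespace, and one exact image with a fixed tag.
ALLOWED_IMAGE_PATTERNS = [
    "registry.example.org/vetted/*:*",
    "registry.example.org/project-x/average:1.0.2",
]


def is_image_allowed(image, patterns=ALLOWED_IMAGE_PATTERNS):
    """Return True if the full image name matches at least one allowed pattern."""
    return any(fnmatchcase(image, pattern) for pattern in patterns)


print(is_image_allowed("registry.example.org/vetted/logreg:2.1"))
print(is_image_allowed("docker.io/someone/logreg:2.1"))
```

The narrower the pattern, the less room there is for an unreviewed Train to run, at the cost of more administration each time a new version is released.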
One might wonder why there are no additional technical measures on the Track, such as a limit on the volume of data being shared across the Track. In earlier versions we had this limit, but it proved unworkable because (1) for some models (e.g. deep learning) the number of parameters is large and thus the file being shared on the Track is large, and (2) it gives a false sense of security, as the data volume of a single patient may be small and fall below this threshold.
Another option to prevent this from happening is to use advanced cryptographic technologies such as homomorphic encryption (HE) or secure multi-party computation (sMPC). With these technologies, the machine learning results are aggregated on encrypted data, or otherwise computed while keeping the partial results private. This means that even if personal data is shared on the Track, it is virtually impossible to decrypt and use for those who lack the key (HE). Some versions of this might be possible to implement as algorithms with vantage6. However, it comes with significant drawbacks in terms of performance and is not feasible except for the simplest of questions. This is an active area of research.
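To give a feel for how a result can be computed while the partial results stay private, the sketch below uses additive secret sharing, one of the simplest building blocks of sMPC. It is a toy illustration run in a single process, not how vantage6 or any particular Train implements it. The modulus and function names are choices made for this example, and it assumes non-negative integer inputs whose sum stays below the modulus.

```python
import secrets

# Public modulus; all arithmetic is done modulo this prime.
PRIME = 2**61 - 1


def split_into_shares(value, n_parties):
    """Split value into n random-looking shares that sum to value modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last_share = (value - sum(shares)) % PRIME
    return shares + [last_share]


def secure_sum(private_values):
    """Sum the Stations' values without any party seeing another's raw value."""
    n = len(private_values)
    # Station i splits its value and sends share j to Station j.
    all_shares = [split_into_shares(value, n) for value in private_values]
    # Station j adds up the shares it received and publishes only that total.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # The published totals add up to the true sum.
    return sum(partial_sums) % PRIME


station_counts = [12, 30, 7]
print(secure_sum(station_counts))
```

Each individual share is uniformly random, so a share on its own reveals nothing about the value it came from. The extra communication between parties is one source of the performance cost mentioned above.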
Typically this is a choice of the project and whether or not the Trains and Stations trust each other.
As said in this answer, trust is still an important condition in federated learning.
If trust is lacking, the best approach is for the Stations to appoint someone to review the code of the Train. After review of the code, it is possible to ensure that the Train uses that particular version of the code.
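One way to think about tying an execution to the reviewed version is to compare a cryptographic fingerprint of the code with the fingerprint recorded at review time. The sketch below shows the idea with SHA-256 from the Python standard library; the function names and the example code are hypothetical. In a Docker setting, the analogous mechanism is referring to an image by its content digest rather than by a mutable tag.

```python
import hashlib
import hmac


def fingerprint(code_bytes):
    """Return the SHA-256 hex digest of the given code."""
    return hashlib.sha256(code_bytes).hexdigest()


def is_reviewed_version(code_bytes, approved_digest):
    """Return True only if the code is byte-for-byte what the reviewer approved."""
    return hmac.compare_digest(fingerprint(code_bytes), approved_digest)


# The reviewer records the digest of the code that was approved.
reviewed_code = b"def run(data):\n    return len(data)\n"
approved_digest = fingerprint(reviewed_code)

# Any later change, however small, produces a different digest.
modified_code = b"def run(data):\n    return data\n"
print(is_reviewed_version(reviewed_code, approved_digest))
print(is_reviewed_version(modified_code, approved_digest))
```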
Someone checking the code should be versed in the data and the question at hand, and should also understand the specific programming language used in the Train. Medical Data Works has no a priori accepted role in this, as promising this code review would mean that Medical Data Works staff have to be experts in every research domain, question and programming language, which is unfeasible.
We are working on a library of trusted and certified Trains. These are Trains that have been vetted by the vantage6 community, including Medical Data Works, and will be certified to be safe. In those Trains, the researcher only needs to configure certain parameters, but cannot change the code inside the Train. This will make it easier for researchers to use federated learning and for node administrators or data owners to trust executions of such tasks on their Station.
The Station is able to read the (intermediate) results that are being shared about their Station by the Train. However, note that for some machine learning algorithms such as deep learning this is a long file of parameters which may not make sense.
The software to be installed can be found here: IT requirements. The free version of Docker is sufficient.