Data is no longer the bottleneck for discovery. Instead, it is our ability and capacity to work with data, to turn data into information and knowledge, that is limiting. The question then becomes: how do we empower researchers and enable discovery by scaling people's capacity to work with data? First, we must recognize the importance of people, investing in training and valuing those who create the tools and software. We need to view software not as a service to research, but as an integral part of the research process. Then we must develop and provide training, provide access to computational resources, and give credit for and support research software development. With this support and training, everyone can have the tools and perspectives to work effectively and reproducibly with data. People are empowered to answer the questions that are important to them, allowing us to address more, and more diverse, questions, and broadening the impact on science and society. We talk about bringing compute to data or data to compute. We also need to bring people to data. It is only by democratizing data that it will reach its potential.
Imagine that you have a wonderful pipeline that finds, rapidly, OTUs (operational taxonomic units). Imagine that this pipeline is meeting with growing success in the metagenomics community.
As engineers working on bioinformatics platforms, you want to package it and distribute it. This pipeline is already wrapped for Galaxy (yes, you can Find, Rapidly, OTUs with a Galaxy Solution, aka FROGS [Escudie et al., Bioinformatics, 2017]), but the point is that it is built upon more than 20 dependencies. Again, you are an enthusiastic engineer, but you are far from an expert in packaging and deployment.
In this talk we want to relate the journey of packaging and distributing a pipeline that carries a significant number of dependencies. We are engineers in charge of setting up production-quality bioinformatics services for the community of users of several French academic research institutes (IFREMER and INRA). On a daily basis, part of our work consists of installing and configuring software, and we want these steps to be as efficient as possible. At the beginning of the journey we had no particular skills in software packaging technologies or their deployment on shared infrastructure.
We will focus on the learning curve for the technologies and tools employed (CONDA and PLANEMO). We will show how (i) the decision of all team members to learn together and (ii) the support and responsiveness of the developer community sped up the learning process. We will also relate early installation experiences by people who were not involved in the packaging and had no particular skills in dependency-resolution technologies.
Based on this experience, we will offer feedback on some easy-to-implement rules that would greatly simplify the tasks of packaging, testing, and deploying complex bioinformatics pipelines right into Galaxy.
Some public Galaxy servers are bound by available resources, including data storage limits applied on a per-user basis. With growing dataset sizes and numbers of biological samples, limited available resources can restrict researchers' capacity to do large-scale research using these public servers.
We are extending Galaxy to allow users to plug in a variety of storage resources in a federated fashion. This federated vision of storage integrates persistence resources that are physically disconnected from a Galaxy instance. We are enhancing Galaxy to support two models of federated storage: instance-wide and user-based. The instance-wide configuration allows all the data for a Galaxy installation to be stored on a cloud computing platform (e.g., Amazon, Google) or other remote storage resources while Galaxy automatically manages data storage, caching, and persistence. The user-based option will allow each individual user of a Galaxy instance to associate external, additional storage resources with their account (e.g., Amazon S3). This storage will be made available in addition to any storage already provided by the public server, and the user will have the option to define rules (e.g., priority and quota) that determine what data gets stored on which remote resource. For example, a researcher can associate an AWS S3 bucket and an Azure Blob container with their account, with a rule specifying that up to 200GB of data is stored in S3 while the rest goes to Azure, or associate individual histories with a specific storage resource. Galaxy will then automatically enforce the specified distribution topology. This allows a user to aggregate resources from multiple sources while respecting resource limits.
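The priority-and-quota rules described above can be sketched in a few lines of Python. This is a minimal illustration only: the function and rule-schema names are hypothetical assumptions for this sketch, not Galaxy's actual configuration format or API.

```python
# Hypothetical sketch of per-user storage-rule resolution.
# The rule schema and names below are illustrative assumptions,
# not Galaxy's actual configuration format.

GB = 1024 ** 3

# Rules ordered by priority: fill the S3 bucket up to a 200GB quota,
# then fall back to the Azure Blob container (quota None = unlimited).
user_rules = [
    {"backend": "aws_s3_bucket", "quota_bytes": 200 * GB},
    {"backend": "azure_blob_container", "quota_bytes": None},
]

def pick_backend(rules, usage_bytes, new_dataset_bytes):
    """Return the first backend whose quota can absorb the new dataset."""
    for rule in rules:
        used = usage_bytes.get(rule["backend"], 0)
        quota = rule["quota_bytes"]
        if quota is None or used + new_dataset_bytes <= quota:
            return rule["backend"]
    raise RuntimeError("no storage backend has room for this dataset")

# With 199GB already in S3, a 5GB dataset would exceed the 200GB quota,
# so the rule engine falls through to the Azure container.
usage = {"aws_s3_bucket": 199 * GB, "azure_blob_container": 0}
print(pick_backend(user_rules, usage, 5 * GB))  # → azure_blob_container
```

The key design point the abstract implies is that rules are ordered: the first resource whose quota is not exhausted wins, so the user controls placement simply by ordering and sizing the rules.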
Among other applications, the proposed federated storage can be advantageous in joint data analysis scenarios. For instance, a Galaxy user can execute an analysis pipeline jointly using private data hosted on Amazon S3 and a collaborator's data persisted on Microsoft Azure Blob, while Galaxy handles the details of moving data for processing and persisting results.
Toward this end, we have extended Galaxy by making it an OpenID Connect relying party, so that users are not required to share their credentials with Galaxy and can revoke or restrict a Galaxy instance's privileges at any time. This is a significant advance that enables Galaxy users to log in using their external accounts (e.g., Google or home institution) and use these external identities to connect Galaxy to cloud computing resources. We have also extended the Galaxy ObjectStore to run on a per-user basis and partially support the above-described usage model. We have integrated the Galaxy ObjectStore with CloudBridge, which enables Galaxy to read/write from multiple cloud-based storage resources, including Amazon, Azure, Google, and OpenStack. At present, we are implementing methods to compute a checksum for each dataset, so that data integrity is guaranteed when transferring to and from attached storage. The talk will demonstrate Galaxy's ability to store datasets on multiple cloud storage providers with minimal configuration.
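The checksum-based integrity check mentioned above can be sketched as follows. This is an illustrative sketch only: the abstract does not specify which digest Galaxy uses, so the choice of SHA-256 and the helper names are assumptions.

```python
# Sketch of chunked checksum computation for dataset integrity checks.
# SHA-256 and the function names are illustrative assumptions; the
# abstract does not specify which digest or API Galaxy will use.
import hashlib

def dataset_checksum(path, chunk_size=8 * 1024 * 1024):
    """Hash a file in fixed-size chunks so that large datasets
    never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_transfer(local_path, recorded_checksum):
    """After a transfer, recompute the checksum on the receiving side
    and compare it with the value recorded before upload."""
    return dataset_checksum(local_path) == recorded_checksum
```

Chunked reading matters here because datasets moved between cloud stores can be far larger than available memory; the digest is updated incrementally as each chunk streams through.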
It is by now well established that software plays an increasingly important role in scientific research. We are gradually recognizing the importance of tools that are robust, well-tested, documented, and that support reproducible research practices. However, the development of such tools is a complex, resource-intensive process for which most scientists are poorly trained, and one that is rarely recognized by the incentive structures of professional research.
In this talk, I will discuss some of the work we have done in Project Jupyter from the perspective of these issues. I will highlight some recent technical developments in Jupyter that may be of interest to the GCC/BOSC audience, and will discuss lessons we have learned over 15 years of building these tools at the intersection of research, education, and industrial partnerships.