Large-Scale Data Management with Data Lakes

Wednesday, June 1, 2022 4:00 PM to 5:00 PM · 1 hr. (Europe/Berlin)
Hall E - 2nd Floor

Information

Large-scale data management is challenging for users and data centers alike. Users struggle to organize the millions of files, and the associated software, involved in scientific workflows. Data centers struggle with the complexity of providing and optimizing storage environments without knowing the exact intent of their users. Creating data management plans and clearly defining the information life cycle and workflows serves as documentation and increases reproducibility and portability. Many workflows integrate user-specific metadata into search engines, allowing users to navigate their data. Concepts such as data lakes and lakehouses are becoming popular as central storage. Data lakes aim to integrate data from diverse sources into a unified management system while retaining the data in its original format. The idea is to deposit scientific data in the lake, organized according to the FAIR principles. In addition, a research data management solution should not only ensure data preservation but also support scientists in complying with good scientific practice. Developing good data management practice is difficult and domain-specific; interaction among users facing similar challenges therefore accelerates solution development.

The aim of the BoF is to support community building around this topic and to engage the audience in discussion in order to identify common problems and their individual solutions. First, several speakers from industry, data centers, and academia give lightning talks on large-scale data management, with a particular focus on data lakes. In the second part, surveys, discussions, and community building take place.

Contributors:

  • Julian Kunkel (Georg-August-Universität Göttingen)
  • Hendrik Nolte (Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen)
  • Stefano Claudio Gorini (CSCS/ETHZ)

Format
On-site