
This PhD project has the potential to save the Square Kilometre Array Observatory millions in storage and processing costs! The SKA will produce up to 1 TB/s of raw, uncompressed data streaming into dedicated processing facilities in both South Africa and Australia. This data will then be processed using a variety of scientific workflows, and the results will finally be stored in multiple facilities around the world.

Standard compression algorithms struggle with the very noisy data the SKA will produce. In experiments with some of the lossless algorithms, the compressed data in the most extreme cases turned out to be larger than the original, while a significant amount of time was still spent on compressing and decompressing (the first code sketch below illustrates this on synthetic data). Far more sophisticated algorithms are now available that support both lossy and lossless compression on multiple scales, adapting to the type of each axis in multi-dimensional data cubes; these include JPEG2000 and MGARD. The latter in particular provides guaranteed error bounds: the decompressed data will not deviate from the original by more than the specified bound, which means the actual loss is fully quantifiable (the second sketch below illustrates the principle). In addition, these techniques can be applied hierarchically, providing very fast access to more strongly compressed levels while still maintaining a lossless layer.

The project aims to apply such lossy compression methods in a systematic way to radio astronomical data along the processing chain and to provide a quantitative assessment of the impact on the science results. More concretely, compression can be applied during the reception stage, when the data is streaming into the processing facilities, on intermediate data products, and/or on the final data products. Each of these use cases will need to be carefully constructed, executed and analyzed as part of this project. To establish the ground truth we will use simulated data sets, but then also move on to hybrid and real data sets. This will require running actual, complex data reduction workflows on these data sets and understanding the mathematical concepts behind the individual steps. It will also require a careful selection of science quality metrics for assessing the impact.

Questions to be answered include: Where in the workflows is compression most efficient while still not causing critical loss of scientific information? How much scientific information is lost, depending on the compression level? Can and should compression be applied in more than one place? What is the impact on data I/O and archive costs? A more advanced question would be: What is the relation, if any, between the compression and some of the core radio astronomy sky reconstruction algorithms?
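
As a minimal, hypothetical illustration of why general-purpose lossless codecs gain little on noise-dominated data, the sketch below compresses synthetic Gaussian float32 samples with zlib. The data is only a stand-in for noise-dominated radio data, not actual SKA visibilities, and zlib stands in for a generic lossless codec.

```python
import zlib
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for noise-dominated radio data: one million float32 Gaussian samples.
noisy = rng.normal(0.0, 1.0, 1_000_000).astype(np.float32)

raw = noisy.tobytes()
compressed = zlib.compress(raw, level=9)

# Expect only a marginal reduction at best: the mantissa bits are essentially
# random, so a general-purpose lossless codec has very little to work with.
print(f"raw: {len(raw):,} bytes  compressed: {len(compressed):,} bytes  "
      f"ratio: {len(raw) / len(compressed):.2f}")
```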
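
The guaranteed-error-bound idea can be illustrated with the simplest possible scheme: uniform quantization with step 2·ε followed by lossless coding of the integer indices, which ensures the reconstruction never deviates from the original by more than ε. This is only a conceptual sketch, not MGARD's far more sophisticated multilevel approach; the function names and parameters are hypothetical.

```python
import zlib
import numpy as np

def compress_bounded(data: np.ndarray, eps: float) -> bytes:
    """Quantize with step 2*eps, then losslessly compress the integer indices.
    Guarantees |original - reconstruction| <= eps for every element
    (up to floating-point rounding)."""
    indices = np.round(data / (2.0 * eps)).astype(np.int32)
    return zlib.compress(indices.tobytes(), level=9)

def decompress_bounded(blob: bytes, eps: float) -> np.ndarray:
    indices = np.frombuffer(zlib.decompress(blob), dtype=np.int32)
    return indices * (2.0 * eps)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 1_000_000).astype(np.float32)  # noisy stand-in data

eps = 1e-2                        # requested absolute error bound
blob = compress_bounded(data, eps)
restored = decompress_bounded(blob, eps)

print("compression ratio:", round(data.nbytes / len(blob), 2))
print("max abs deviation:", float(np.max(np.abs(data - restored))), "<=", eps)
```

In an actual evaluation, the same kind of check (maximum deviation against the requested bound) would be carried out on real visibility or image data and complemented by the science quality metrics discussed above.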