
An artist’s impression of the Murchison Widefield Array (MWA) detecting a pulsar signal (Credit: Dilpreet Kaur, CSIRO). Processing the data collected for SMART to find such pulsars is like finding a needle in a haystack. A robust database linked to our processing workflows is critical for making this a reality, and that is where this project comes in!
The Southern-Sky MWA Rapid Two-Metre (SMART) pulsar survey is an ambitious project that is moving into its next stage of processing, with 16 discoveries so far and many more around the corner. Given the large data volume and the number of linked processing tasks, the Nextflow-based SMART pipeline manages and stores a vast amount of metadata, along with information about the status of each stage of analysis at any given moment. There is also a large number of “false” pulsar candidates that we need to sift through — an ideal task for machine learning (ML). We have developed a basic ML classification scheme and a database that can nominally handle this kind of information. One of the missing links is having the database and the workflows (including, but not limited to, the ML classifier) interact in real time!

In this project, the student will develop and integrate a series of software tools that gather data from, and send data to, the database as the automated SMART workflow crunches through four petabytes of pulsar search data. This will streamline the processing workflow enormously, and thus accelerate the rate of pulsar discoveries and the science that can be extracted from them.
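To give a flavour of the kind of real-time interaction involved, here is a minimal sketch of workflow-facing database helpers. The schema, table, and function names are all hypothetical illustrations (the actual SMART database structure is not shown here), and SQLite stands in for whatever backend is deployed:

```python
import sqlite3

# Hypothetical schema -- a stand-in for the real SMART candidate tables.
SCHEMA = """
CREATE TABLE IF NOT EXISTS candidates (
    id INTEGER PRIMARY KEY,
    obs_id TEXT NOT NULL,
    stage TEXT NOT NULL DEFAULT 'folded',
    ml_score REAL
)
"""

def connect(path=":memory:"):
    """Open the database and make sure the (mock) schema exists."""
    con = sqlite3.connect(path)
    con.execute(SCHEMA)
    return con

def record_stage(con, cand_id, stage):
    """Called by a workflow task to report that a candidate reached a stage."""
    con.execute("UPDATE candidates SET stage = ? WHERE id = ?", (stage, cand_id))
    con.commit()

def pending_classification(con):
    """Fetch candidates that have been folded but not yet ML-scored."""
    rows = con.execute(
        "SELECT id, obs_id FROM candidates "
        "WHERE stage = 'folded' AND ml_score IS NULL"
    )
    return rows.fetchall()
```

A workflow task would call `record_stage` as it finishes each candidate, while the ML classifier polls `pending_classification` for new work — the pattern, not the particular schema, is the point.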
| Student attributes | |
| --- | --- |
| Academic background | Enrolment in any Computing or Data Science course is appropriate. An interest in astrophysics is expected, but enrolment in the astronomy stream is not required. |
| Computing skills | Unix operating systems (required), Python (required), machine-learning techniques (required), relational databases (ideal), workflow management (ideal) |
| Training requirement | Nextflow, using supercomputing systems |
| Project timeline | |
| --- | --- |
| Week 1 | Inductions and project introduction |
| Week 2 | Initial presentation |
| Week 3 | Familiarisation with the current ML scheme and pulsar candidate data |
| Week 4 | Familiarisation with the current database structure + creating a mock database |
| Week 5 | Generating “fake” data (e.g., ML scores, pulsar candidate information) and identifying the required interactions with the mock database |
| Week 6 | Writing/testing Python functions that automatically create and execute database calls |
| Week 7 | Testing/updating Python functions to interact with the “deployed” database |
| Week 8 | Integrating scripts into SMART workflows |
| Week 9 | Final presentation |
| Week 10 | Final report |
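The Week 5–6 tasks above — generating “fake” data and loading it into a mock database — might be sketched as follows. The column names and value ranges are invented for illustration and do not reflect the real candidate tables:

```python
import random
import sqlite3

def make_mock_db():
    """Create an in-memory mock database with a hypothetical candidates table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE candidates (id INTEGER PRIMARY KEY, obs_id TEXT, "
        "period_ms REAL, dm REAL, ml_score REAL)"
    )
    return con

def fake_candidates(n, seed=0):
    """Generate fake pulsar-candidate rows; all values are placeholders."""
    rng = random.Random(seed)  # seeded so test data is reproducible
    return [
        (
            i,
            f"obs_{rng.randrange(10**9, 2 * 10**9)}",  # mock observation ID
            rng.uniform(1.0, 5000.0),  # spin period (ms), arbitrary range
            rng.uniform(1.0, 1000.0),  # dispersion measure, arbitrary range
            rng.random(),              # mock ML classifier score in [0, 1)
        )
        for i in range(n)
    ]

def load_candidates(con, rows):
    """Bulk-insert generated rows; return how many the table now holds."""
    con.executemany("INSERT INTO candidates VALUES (?, ?, ?, ?, ?)", rows)
    con.commit()
    return con.execute("SELECT COUNT(*) FROM candidates").fetchone()[0]
```

Once functions like these work against the mock database, the same calls can be pointed at the “deployed” database in Week 7 with only the connection step changing.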