
GSoC 2025: Final Submission and a Look Ahead

And with that, the Google Summer of Code 2025 coding period comes to a close. I’ve just submitted my final term report for the RIPPLe project, marking the culmination of an intense and incredibly rewarding four-month journey. Looking back to May, I remember the steep learning curve of just setting up the LSST Science Pipelines environment. The initial weeks were a deep dive into the complexities of the Butler, its data repositories, and the unique coordinate systems used by the Rubin Observatory. There were moments of frustration, especially when dealing with environment configurations and performance bottlenecks, but each challenge was a significant learning opportunity. From the excitement of the first successful end-to-end pipeline run to the satisfaction of optimising the workflow and seeing the processing time drop from hours to minutes, this project has been a masterclass in building real-world scientific software. The result of this effort is RIPPLe: a robust, configurable pipeline…

Finalizing the Pipeline: Testing and Refinements

This is the final push. I’ve spent the last few days adding a suite of unit tests to validate the behaviour of individual components and ensure the pipeline’s robustness.
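To give a flavour of these tests, here is a minimal pytest sketch in the spirit of what I’ve been writing. The `make_cutout` and `normalize` functions are illustrative stand-ins for the pipeline’s preprocessing helpers, not RIPPLe’s actual API:

```python
# Minimal pytest sketch; make_cutout and normalize are hypothetical
# stand-ins for the pipeline's real preprocessing helpers.
import numpy as np
import pytest


def make_cutout(image: np.ndarray, center: tuple, size: int) -> np.ndarray:
    """Extract a square cutout of `size` pixels around `center` (illustrative)."""
    y, x = center
    half = size // 2
    return image[y - half:y + half, x - half:x + half]


def normalize(cutout: np.ndarray) -> np.ndarray:
    """Scale pixel values to zero mean and unit variance (illustrative)."""
    return (cutout - cutout.mean()) / cutout.std()


def test_cutout_has_requested_shape():
    image = np.random.default_rng(0).normal(size=(128, 128))
    cutout = make_cutout(image, center=(64, 64), size=32)
    assert cutout.shape == (32, 32)


def test_normalized_cutout_is_standardized():
    image = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(128, 128))
    out = normalize(make_cutout(image, center=(64, 64), size=32))
    assert out.mean() == pytest.approx(0.0, abs=1e-9)
    assert out.std() == pytest.approx(1.0, abs=1e-9)
```

Tests like these are cheap to run on every change, which is exactly what I want before handing the project over.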

The Final Frontier: Documentation and Usability

As the project nears completion, I’m focusing on what is arguably one of the most critical phases: documentation. A tool is only as good as its documentation, and my goal is to ensure that another developer or researcher can get up and running with RIPPLe with minimal friction. I’ve been writing a comprehensive README, adding detailed docstrings to all major classes and functions, and creating a series of Jupyter Notebook tutorials that walk through common use cases (though I’m considering postponing the notebook tutorials). It’s meticulous work, but it’s essential for the long-term success and adoption of the project within the DeepLense and wider LSST communities.
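As an example of the docstring style I’ve settled on, here is a numpydoc-formatted stub. The function name, parameters, and defaults are hypothetical, not RIPPLe’s actual interface:

```python
# Hypothetical helper shown only to illustrate the docstring convention.
def fetch_cutout(butler, ra: float, dec: float, size: int = 64):
    """Fetch a calibrated image cutout centred on a sky position.

    Parameters
    ----------
    butler : lsst.daf.butler.Butler
        Butler instance pointing at the data repository.
    ra, dec : float
        Sky coordinates of the cutout centre, in degrees.
    size : int, optional
        Cutout side length in pixels (default 64).

    Returns
    -------
    numpy.ndarray
        The cutout pixel array, ready for preprocessing.
    """
    ...
```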

Decoupling Configuration from Code with YAML

With the core performance issues addressed, I’ve shifted my focus to improving the pipeline's usability and maintainability. This week, I moved all the configuration parameters—such as cutout sizes, normalization methods, model paths, and detection thresholds—out of the Python code and into external YAML files. This change allows researchers to easily experiment with different settings without having to modify the source code. It also makes the pipeline far more flexible and adaptable to different scientific use cases. It’s a crucial step in transforming the project from a custom script into a reusable and configurable scientific tool.
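To illustrate the idea, here is a small sketch of what the split looks like, assuming PyYAML and some illustrative parameter names (the real config schema is richer, and the YAML would live in its own file rather than inline):

```python
# Sketch of configuration-from-YAML; parameter names are illustrative,
# and the YAML string stands in for an external config.yaml file.
import yaml

CONFIG_YAML = """
cutout:
  size_pixels: 64
  bands: [g, r, i]
preprocessing:
  normalization: zscore   # alternatives might include minmax or asinh
model:
  weights_path: models/lens_classifier.pt
  detection_threshold: 0.85
"""

config = yaml.safe_load(CONFIG_YAML)

# Downstream code reads settings instead of hard-coded constants:
cutout_size = config["cutout"]["size_pixels"]
threshold = config["model"]["detection_threshold"]
print(cutout_size, threshold)  # 64 0.85
```

The payoff is that changing an experiment is now a one-line edit to a YAML file rather than a change to tested source code.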

Implementing a Parallel Processing Workflow

The refactoring work is complete, and the results are promising. By leveraging Python’s multiprocessing module to create a pool of data workers, the pipeline is now able to overlap data I/O and preprocessing with model inference. What previously took hours to run on a large dataset now completes in a matter of minutes. The GPU utilization has increased significantly, and the performance metrics are finally within the targets I set at the beginning of the project. This architectural change was a major undertaking, but it was essential for building a pipeline that can realistically handle the scale of LSST data.
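For anyone curious about the shape of the pattern, here is a stripped-down sketch using `multiprocessing.Pool`. The load/preprocess/infer functions are placeholders rather than the pipeline’s real code; in RIPPLe the worker side talks to the Butler and the main process drives the GPU:

```python
# Producer-consumer sketch: a pool of data workers feeds preprocessed
# inputs to the main process, which stays busy with (mock) inference.
import multiprocessing as mp


def load_and_preprocess(task_id: int):
    """Worker side: stand-in for Butler I/O plus preprocessing."""
    return task_id, [task_id * 0.1] * 4  # fake "tensor"


def run_inference(batch):
    """Main-process side: stand-in for a GPU forward pass on a batch."""
    return [sum(tensor) for _, tensor in batch]


def main():
    tasks = range(1000)
    batch, batch_size = [], 32
    # Workers fetch and prepare data concurrently; results stream back
    # as they finish, so I/O overlaps with inference on the main process.
    with mp.Pool(processes=4) as pool:
        for item in pool.imap_unordered(load_and_preprocess, tasks, chunksize=8):
            batch.append(item)
            if len(batch) == batch_size:
                run_inference(batch)
                batch.clear()
        if batch:
            run_inference(batch)  # flush the final partial batch


if __name__ == "__main__":
    main()
```

The key design choice is that inference never waits on a single slow fetch: as long as any worker has a result ready, the batch keeps filling.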

Profiling the Pipeline and Hunting for Bottlenecks

This week was all about performance analysis. I’ve been using Python’s profiling tools to get a detailed breakdown of where the pipeline is spending its time. The results confirmed my suspicions: a significant amount of time is lost to I/O-bound operations and redundant preprocessing steps that are not being efficiently batched. Based on this analysis, I've started refactoring the core processing loop. The plan is to implement a producer-consumer pattern, where a pool of worker processes is dedicated to fetching and preparing data, feeding a steady stream of tensors to the GPU for inference. This should decouple data preparation from model execution and allow for much higher throughput.
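For reference, this is roughly how I’ve been collecting the numbers, using the standard-library `cProfile` and `pstats` modules. The workload below is a stand-in for the pipeline’s actual entry point:

```python
# Profiling sketch: wrap one pipeline run in cProfile and print the
# top functions by cumulative time. process_dataset is a placeholder.
import cProfile
import io
import pstats


def process_dataset():
    """Stand-in workload; in practice this is the pipeline's main loop."""
    total = 0
    for i in range(100_000):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
process_dataset()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())  # top 10 entries by cumulative time
```

Sorting by cumulative time is what surfaced the I/O and preprocessing hotspots: the slow functions weren’t slow per call, they were simply called far too often on the critical path.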