Jun Xia

GSoC Week 4–5

Summary of the past two weeks

At this stage, the main focus was to finish building the interfaces for the remaining selection algorithms. A list of algorithms supported by our Selector package is available here.

My goal was to finish all the interfaces for the package by the end of the two weeks. Since the most frustrating part (CI/CD) was already set up, development went fairly smoothly during this period, and I was able to make it all happen in the end😁

What I was doing in Week 4

I raised the first pull request at the end of Week 3. It consisted of a basic interface for the MaxMin algorithm and the CI/CD pipeline for deploying our Streamlit application on DockerHub and HuggingFace. This week I met with Fanwang, who serves as my mentor this summer, to discuss the current progress of the project. Per our discussion, I am off to a pretty good start, and we are confident that we can finish the project within the timeframe.

Fanwang left some comments on my first pull request, mostly about code conventions within the organization. I worked on resolving these comments for a day or two. For example, I bumped our Python version from 3.9 to 3.11 to align with industry trends. This caused some problems when building the Docker image, but I was able to resolve them, since they mostly came down to updates in the tooling or in the syntax of our dependencies.

I started working on the interfaces for the MaxSum and DISE algorithms. This is where things got interesting. I figured out that our package is built on inheritance, which keeps the input format consistent across the algorithms. As a result, a lot of the code in our interfaces can be reused as well, especially parts such as upload_matrix, run_algorithm, etc. Therefore, I restructured the code by creating a centralized utils.py that all the interfaces can use, which greatly reduced the amount of duplicated code. After some work, I finished the interfaces for MaxSum and DISE at the end of the week.
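To illustrate why a shared base class makes a common utils.py possible, here is a minimal sketch of the idea. It is not the actual code in our repository: BaseSelector, FirstRows, and LargestRowSum are hypothetical stand-ins for the package's classes, and the bodies of upload_matrix and run_algorithm are my simplified guesses that reuse only the names mentioned above. The real interface also goes through Streamlit, which I left out to keep the example self-contained.

```python
import csv
import io
from abc import ABC, abstractmethod


class BaseSelector(ABC):
    """Hypothetical stand-in for the shared base class of the algorithms."""

    @abstractmethod
    def select(self, matrix, size):
        """Return the indices of `size` selected rows."""


class FirstRows(BaseSelector):
    """Toy algorithm (assumption): pick the first `size` rows."""

    def select(self, matrix, size):
        return list(range(size))


class LargestRowSum(BaseSelector):
    """Toy algorithm (assumption): pick the `size` rows with the largest sums."""

    def select(self, matrix, size):
        order = sorted(range(len(matrix)), key=lambda i: sum(matrix[i]), reverse=True)
        return sorted(order[:size])


def upload_matrix(text):
    """Parse CSV text into a list of float rows; every page can share this."""
    reader = csv.reader(io.StringIO(text))
    return [[float(cell) for cell in row] for row in reader if row]


def run_algorithm(selector, matrix, size):
    """Validate the input once, then call whichever selector was passed in."""
    if not 0 < size <= len(matrix):
        raise ValueError("size must be between 1 and the number of rows")
    return selector.select(matrix, size)


if __name__ == "__main__":
    matrix = upload_matrix("0,1\n1,0\n5,5\n")
    # The same two helpers serve both algorithms without any changes.
    print(run_algorithm(FirstRows(), matrix, 2))
    print(run_algorithm(LargestRowSum(), matrix, 2))
```

Because every algorithm exposes the same select signature, each interface page only has to choose which selector object to pass in, while parsing and validation live in one place.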

What I was doing in Week 5

I started the week by fixing some bugs left over from last week's changes, now that utils.py does the heavy lifting. The refactoring caused some problems for earlier algorithms such as MaxMin and MaxSum, though nothing big. After some modifications, I raised a PR for MaxSum and DISE.

Fanwang also suggested that it might be beneficial to publish the Docker image in GitHub releases so that users don’t need to go to a third-party website like DockerHub. I added this to our roadmap and will work on it once I finish all the interfaces.

With most of the infrastructure set up, I completed the interface for the OptiSim algorithm relatively quickly, which wraps up the last distance-based method of the package.

At the end of Week 5, I finished the last two partition-based methods of the package, GridPartition and Medoid. I will talk with Fanwang next week to see if he has any high-level comments on the current progress. I believe we will focus more on making the interface user-friendly, since it’s just a working prototype right now.

What’s Next

Now the interfaces are mostly complete. I will focus on making them more user-friendly for our scientists so they can do subset selection more efficiently. I’ll also get the publishing of the Docker image on GitHub working in the next two weeks, along with any other changes the project might need. Looking forward to it!