GSoC Week 1–3

17 Jun, 2024

Time flies! I can’t believe it has already been 3 weeks since I started working as a GSoC contributor at QC-Devs this summer. It has been a bumpy start for me since there were a lot of things on my plate. I just wrapped up my third year at UC Irvine, whoo! Now I can fully focus on our GSoC project.

Background

For my GSoC project, “Interactive Web Interface for Selector”, I will use Streamlit to build an interface for Selector, a Python library that provides methods for selecting a diverse subset of a dataset.

Although Python is easier to work with than many other programming languages, it’s still difficult for medicinal chemists with very little programming experience to use our tool to perform subset selection. Therefore, the interface we are building could potentially help thousands of scientists perform data analysis more efficiently. Our goal for the project is to host it on HuggingFace, and we will share the associated Docker image for public use.

Summary of the Past 3 Weeks

At the current stage, I am working on getting very simple working interfaces for one or two algorithms in the package. On top of that, I am setting up the CI/CD pipeline to connect our GitHub repo with Docker Hub and HuggingFace for deployment.

There were a lot of moments when I felt desperate because of all kinds of bugs that came up while setting up the CI/CD pipeline, mostly around Docker and HuggingFace credentials. Luckily, I was still able to make everything work in the end. Earlier this afternoon, I raised the first pull request for our project, which could be merged into the main branch soon. This is a milestone for our project, as it establishes the connection between our package and the public through platforms such as Docker and HuggingFace.

What I was doing in Week 1

I kind of took things easy during the first week since I was still in my school quarter. I had my first meeting with my mentor Fanwang, who is a postdoc at the Massachusetts Institute of Technology. We got the chance to get to know each other, and we also talked about some of the project expectations and goals.

I made a fork of our repo, where I coded the intro interface for our package during the first week. I also spent some time reading our documentation and codebase to better understand the input of the algorithms, which I was very confused about at the beginning. After things settled down, I started to work on the interface for the first algorithm: MaxMin.

What I was doing in Week 2

I coded up the first version of the MaxMin interface at the beginning of the second week. Essentially, our algorithms build on top of a base selection class. It takes three parameters: the input matrix (required), which is either a feature matrix or a distance matrix in csv, xlsx, npz, or npy file format; the number of points to select; and a list of cluster labels (optional) in csv or xlsx format.
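To make the input handling a bit more concrete, here is a minimal sketch of how an uploaded matrix could be parsed based on its file extension. This is not the actual code in our interface: the function names `load_matrix` and `looks_like_distance_matrix` are hypothetical, and I am assuming the csv file is purely numeric with no header row and that an npz archive stores the matrix as its first array.

```python
import io
from pathlib import Path

import numpy as np


def load_matrix(source, filename):
    """Load a 2D matrix from a path or file-like object.

    `filename` is only used to pick a parser from its extension.
    (Hypothetical helper, not part of Selector.)
    """
    suffix = Path(filename).suffix.lower()
    if suffix == ".csv":
        # Assumption: numeric, comma-separated values without a header row.
        matrix = np.loadtxt(source, delimiter=",", ndmin=2)
    elif suffix == ".npy":
        matrix = np.load(source)
    elif suffix == ".npz":
        archive = np.load(source)
        # Assumption: the matrix is the first array stored in the archive.
        matrix = archive[archive.files[0]]
    elif suffix == ".xlsx":
        # Imported lazily; reading xlsx needs pandas and openpyxl installed.
        import pandas as pd

        matrix = pd.read_excel(source, header=None).to_numpy()
    else:
        raise ValueError(f"Unsupported file format: {suffix!r}")
    if matrix.ndim != 2:
        raise ValueError("Expected a 2D feature or distance matrix")
    return matrix


def looks_like_distance_matrix(matrix):
    """Heuristic check: square, symmetric, and zero on the diagonal."""
    return bool(
        matrix.shape[0] == matrix.shape[1]
        and np.allclose(matrix, matrix.T)
        and np.allclose(np.diag(matrix), 0.0)
    )


# Usage with an in-memory csv, standing in for an uploaded file.
example = load_matrix(io.BytesIO(b"0,1\n1,0\n"), "distances.csv")
```

Since Streamlit's uploaded files behave like in-memory byte buffers, a helper along these lines should be able to take them directly, but treat that as an assumption to verify rather than a guarantee.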

The package has very high unit test coverage (96%). Thanks to that, I could just make some modifications to the test file so that it generates input files for me to test the interface. In the end, running the algorithm through the interface gave us the same output as the unit tests.

Now that the output is identical, it’s also important that we make the output format flexible for the user. Therefore, I added an option to export the output in csv or json format, two of the most widely used file formats in the industry due to their portability.
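As a rough illustration of the export step, the sketch below turns a list of selected indices into csv or json text using only the standard library. The function name `export_selection` and the field names `selected_index` / `selected_indices` are made up for this example and are not necessarily what the interface uses.

```python
import csv
import io
import json


def export_selection(indices, fmt):
    """Serialize selected indices as csv or json text (hypothetical helper)."""
    # Cast to plain ints so NumPy integers serialize cleanly.
    indices = [int(i) for i in indices]
    if fmt == "json":
        return json.dumps({"selected_indices": indices})
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(["selected_index"])  # header row
        writer.writerows([i] for i in indices)
        return buffer.getvalue()
    raise ValueError(f"Unsupported export format: {fmt!r}")


json_text = export_selection([3, 0, 7], "json")
csv_text = export_selection([3, 0, 7], "csv")
```

The returned string can then be handed to whatever download widget the interface provides, so the serialization logic stays independent of the UI.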

At the end of Week 2, I also polished up the Dockerfile that will be used to pack our application into a Docker container. This is a topic that I was learning just before our project started this academic quarter. Docker makes dependency packaging a breeze since it wraps everything up and starts the application inside a container, where you can specify the open ports that direct users to different services without needing to worry about different operating systems, dependencies, etc.

One thing I do want to complain about with Docker is its very rigorous login process. I spent quite some time setting up the GitHub Actions workflow so that any change in the repo triggers it to build the Docker image and push it to Docker Hub, which serves as a public repository similar to GitHub. However, I wasn’t very familiar with HuggingFace yet, so I didn’t include continuous deployment to HuggingFace in the workflow file at the time.

What I was doing in Week 3

I wrapped up my third year at UC Irvine during Week 3. It’s unbelievable that time passes so fast. Anyway, I can fully focus on our project now.

This is where the tricky part comes in, which is also where I got stuck for a few days. I did some exploration of how HuggingFace deploys applications. Since our project uses Docker, we can just create a Docker Space on HuggingFace. Essentially, HuggingFace works almost the same as GitHub: each Space has a repo that you can make public or private to host your source code.

For Docker Spaces, HuggingFace looks for the Dockerfile in the root of the repo in order to build and host the application. I also found that HuggingFace relies on the README.md file in the root folder as the configuration file of the Docker Space, holding settings such as the application name, open port, etc. This becomes problematic since our repo already has a README.md. Therefore, I set up a step in our GitHub Actions workflow that replaces our original README.md with the one I prepared for the Docker Space.
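Here is a small sketch of what that swap amounts to, written in Python for illustration (the real step lives in the workflow file and may look different). The Space settings go into a YAML block at the top of README.md; `sdk: docker` and `app_port` are the keys I rely on here. The file name `README_space.md`, the title, and port 8501 are assumptions made up for this example.

```python
import shutil
import tempfile
from pathlib import Path


def build_space_readme(title, app_port):
    """Return README.md text whose YAML front matter configures a Docker Space."""
    lines = [
        "---",
        f"title: {title}",
        "sdk: docker",
        f"app_port: {app_port}",
        "---",
        "",
    ]
    return "\n".join(lines)


def swap_readme(repo_dir, prepared_name):
    """Overwrite README.md in repo_dir with the prepared Space README."""
    repo = Path(repo_dir)
    shutil.copyfile(repo / prepared_name, repo / "README.md")
    return repo / "README.md"


# Demo in a throwaway directory standing in for the checked-out repo.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "README.md").write_text("# Original project README\n", encoding="utf-8")
    # "README_space.md" is a hypothetical name for the prepared file.
    (repo / "README_space.md").write_text(
        build_space_readme("Selector Demo", 8501), encoding="utf-8"
    )
    swapped_text = swap_readme(repo, "README_space.md").read_text(encoding="utf-8")
```

Because the swap only happens inside the workflow's checkout, the README.md that people see on GitHub stays untouched.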

However, then came the frustrating part. HuggingFace rejected my push even though I had the correct command and secret token. It indicated that my repo contained binary files, even though none were present in my repo at the time. Upon investigation, I noticed that they were buried in the git commit history of our repo. At first, I tried the git filter-branch command to clean up the commit history of our repo. This made HuggingFace accept my push, and it finally hosted the application live! However, the downside is that it rewrites nearly everything in the git commit history, which is why there were about 900 changes when I tried to open a pull request💀

I was stuck on this problem for quite a few days after that. I also tried to locate the exact commit and stop Git from tracking those binary files with a git rebase. However, an enormous number of merge conflicts quickly popped up, which discouraged me from proceeding. As I was walking outside today, I suddenly realized that we could run git filter-branch during the GitHub Actions workflow rather than apply it to our repo on GitHub. As I expected, this fixed the problem right away. Now our repo remains clean, and the CI/CD pipeline is complete.
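For anyone curious what such a history cleanup looks like, below is a hedged sketch that only builds the git filter-branch command as a list of arguments. It is not the exact command from our workflow: the helper name `filter_branch_command` and the example path are hypothetical, and the real list of binary files is specific to our repo.

```python
import shlex


def filter_branch_command(paths):
    """Build a git filter-branch command that drops `paths` from every commit.

    Hypothetical helper: it only assembles the arguments and runs nothing.
    """
    # The index filter runs once per commit; --ignore-unmatch keeps it from
    # failing on commits where a path does not exist.
    index_filter = "git rm -r --cached --ignore-unmatch " + " ".join(
        shlex.quote(p) for p in paths
    )
    return [
        "git",
        "filter-branch",
        "--force",
        "--index-filter",
        index_filter,
        "--prune-empty",  # drop commits that become empty after the removal
        "--",
        "--all",  # rewrite every ref in the local checkout
    ]


# Example with a made-up path; substitute the real binary files.
command = filter_branch_command(["notebooks/example_data.npz"])
```

Inside the workflow, the resulting list could be passed to `subprocess.run(command, check=True)` before pushing to HuggingFace. Since the rewrite only touches the temporary checkout created by the workflow run, the history on GitHub is never modified.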

After some polishing of the MaxMin algorithm interface, I raised the first pull request for our project🚀

What’s Next

Now that the basic interfaces and CI/CD pipeline are established, I can focus on implementing the details of the other algorithms. I will also make any necessary changes during code review with my mentor to make sure the first pull request gets merged. Hopefully I can make that happen next week!