DeepConsensus

Contributed to v0.2 release

I reconceptualized the flow of data in DeepConsensus, replacing the big-data Beam pipeline with a much simpler structure that kept subreads grouped together. This simplification accounted for most of the 10X speedup in the v0.2 release. I did this by writing the quick_inference.py script, with a first version in one day that immediately removed a lot of runtime overhead and yielded a large speedup, and then continued to optimize with batching and parallel processing from there. You can see the script from the v0.2 release here: quick_inference.py, though unfortunately the commit history wasn’t preserved when the code was externalized from Google’s internal (non-git) version control system.
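
To give a sense of the restructuring, here is a minimal sketch of the idea: keep all subreads for a molecule together, turn each group into a model example, and run the model in fixed-size batches. This is an illustration only, not the actual DeepConsensus code; all function and field names here (group_subreads_by_molecule, featurize, model.predict, molecule_id) are hypothetical placeholders.

```python
# Sketch of the simplified data flow: group subreads per molecule,
# featurize each group, and run the model batch-by-batch.
from itertools import groupby, islice
from typing import Any, Iterable, Iterator, List


def group_subreads_by_molecule(subreads: Iterable[dict]) -> Iterator[List[dict]]:
    """Yield all subreads sharing a molecule id, assuming sorted input."""
    for _, group in groupby(subreads, key=lambda s: s["molecule_id"]):
        yield list(group)


def batched(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Collect an iterable into fixed-size batches (the last may be smaller)."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def featurize(subread_group: List[dict]) -> Any:
    """Placeholder for converting one molecule's subreads into a model example."""
    return subread_group


def run_inference(subreads: Iterable[dict], model: Any, batch_size: int = 100) -> Iterator[Any]:
    """Group, featurize, and run the model one batch at a time."""
    examples = (featurize(g) for g in group_subreads_by_molecule(subreads))
    for example_batch in batched(examples, batch_size):
        yield from model.predict(example_batch)  # one model call per batch
```

Keeping subreads grouped avoids the shuffle and serialization overhead of a distributed pipeline, and batching amortizes the per-call cost of model inference.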

Led the v0.3 release

I was tech lead for the DeepConsensus v0.3 release. The primary goal was to speed up the model significantly without losing accuracy. The speedup we achieved made it possible to integrate the model into PacBio’s Revio instrument to run automatically as part of sequencing itself, which was especially important because the Revio’s throughput became so high that transferring subreads off the instrument to run DeepConsensus separately would have exceeded the available bandwidth. We hit this goal, achieving a 4.9X runtime speedup over v0.2.

I also changed the key metric we optimized for to “yield at empirical Q30”, and I kept integrating new results from the team into charts like this one, which helped us track our progress in speed and accuracy. The team continues to use these metrics, visualizations, and concepts to track improvements to DeepConsensus to this day (as of release v1.2, three years later): https://github.com/google/deepconsensus/blob/r1.2/docs/yield_metrics.md
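
To illustrate the flavor of the metric, here is a rough sketch of a “yield at empirical Q30”-style calculation, assuming we have each read’s length, its model-predicted quality, and an empirical error count from alignment to a truth set. This is only an illustration of the concept under those assumptions, not the exact procedure documented in yield_metrics.md; the Read fields and threshold sweep are hypothetical.

```python
# Hypothetical sketch: sweep predicted-quality cutoffs and report the yield
# (total bases) of the loosest filter whose reads still reach empirical Q30.
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    length: int            # number of bases in the read
    predicted_q: float     # model-predicted quality (Phred scale)
    truth_errors: int      # errors observed when aligned to the truth set


def empirical_q(reads: List[Read]) -> float:
    """Aggregate empirical quality (Phred) of a set of reads."""
    total_bases = sum(r.length for r in reads)
    total_errors = sum(r.truth_errors for r in reads)
    error_rate = max(total_errors / total_bases, 1e-9)  # avoid log(0)
    return -10 * math.log10(error_rate)


def yield_at_empirical_q30(reads: List[Read]) -> int:
    """Total bases retained at the loosest predicted-quality cutoff whose
    retained reads still have an aggregate empirical quality of at least Q30."""
    for threshold in range(0, 61):
        kept = [r for r in reads if r.predicted_q >= threshold]
        if kept and empirical_q(kept) >= 30:
            return sum(r.length for r in kept)
    return 0
```

The appeal of a yield-style metric is that it captures accuracy and output quantity in one number, so speed and quality trade-offs can be tracked on the same chart.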

Then I made this video to communicate to users and the scientific community what we had been working on: