Google Summer of Code 2010 Proposal: Implementation of algorithm to infer gene duplications in BioRuby
Title:
Implementation of algorithm to infer gene duplications in BioRuby
Student:
Robert Kuo
Abstract:
This project will implement in BioRuby the algorithm for detecting gene duplications described in Zmasek and Eddy, 2001, “A simple algorithm to infer gene duplication and speciation events on a gene tree”, Bioinformatics, 17, 821-828. The project will include full documentation, tests, and examples.
Content:
Name: Bob Kuo
(the following have been removed from here but they are in my official proposal to Google)
Address:
Email Address:
Mobile Phone:
IRC Handle: bubaflub
I am 24, living in Champaign-Urbana, IL, and currently pursuing my master's degree. My undergraduate degree was in Math and Computer Science from the University of Chicago at Illinois, with my coursework focusing on number and coding theory, numerical analysis, and algorithms. I am currently employed part-time as a web developer working in PHP and Ruby on Rails. I participated in last year's Google Summer of Code, working with The Perl Foundation (http://socghop.appspot.com/gsoc/student_project/show/google/gsoc2009/dukeleto/t124022226790) on a project that was successfully completed ahead of schedule.
I am interested in BioRuby because I have always wanted to combine the sciences with programming and would like to learn more about evolutionary biology. I believe I am well-suited for this project not because I am the best Ruby programmer in the world, but because I am willing to learn and work through algorithms step-by-step. There is a reference implementation written in Java against which I can test my Ruby project. My work as a web developer has required me to be multi-lingual, and I am comfortable reading and writing Java, Perl, PHP, and Ruby.
You can see my open source work at http://github.com/bubaflub. I would recommend looking through http://github.com/bubaflub/math--primality, my work from last year's Google Summer of Code with The Perl Foundation, which has significant in-line documentation and many tests.
All code developed for this project would be released on GitHub and available under the same terms as BioRuby itself.
Plan:
Before the start date (April 20th):
- Meet the BioRuby and Open Bioinformatics community
- Familiarize myself with the BioRuby package and the phyloXML format
- Read necessary papers (such as http://bioinformatics.oxfordjournals.org/cgi/content/abstract/17/9/821)
- Familiarize myself with the Java implementation of the algorithm (http://www.phylosoft.org/forester/applications/sdi/)
- Discuss and set expectations of code – dependencies, code style, tests, documentation, etc.
First half (May 23rd to July 13th):
- Decide on which libraries to use
- Spec out all necessary components with documentation and failing unit tests
- Write integration tests that cover the entire algorithm (i.e. if I input “A” I should get “B”)
- Begin implementing the algorithm
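To make the "input A, get B" style of integration test concrete, the core idea of the algorithm can be sketched in a few lines of plain Ruby. This is only an illustration under stated assumptions, not the planned BioRuby implementation: the Node class, the method names (path_to_root, lca, infer_duplications), and the example trees are all hypothetical, and no BioRuby API is used. Each gene-tree node is mapped to the lowest common ancestor of its children's mappings in the species tree, and a node is flagged as a duplication when its mapping equals that of one of its children. The lookup of the lowest common ancestor below is a naive ancestor walk chosen for readability, which is simpler than the numbering scheme used in the paper.

```ruby
# Hypothetical minimal tree node for illustration; not a BioRuby class.
class Node
  attr_reader :name, :children
  attr_accessor :parent, :mapping, :duplication

  def initialize(name, children = [])
    @name = name
    @children = children
    @children.each { |child| child.parent = self }
  end

  def leaf?
    children.empty?
  end
end

# The node itself followed by all of its ancestors up to the root.
def path_to_root(node)
  path = []
  while node
    path << node
    node = node.parent
  end
  path
end

# Lowest common ancestor of two species-tree nodes (naive ancestor walk).
def lca(a, b)
  ancestors = path_to_root(a)
  path_to_root(b).find { |node| ancestors.include?(node) }
end

# Postorder pass over a binary gene tree. Leaves map to their species;
# an internal node maps to the LCA of its children's mappings and is a
# duplication if that mapping equals the mapping of either child.
def infer_duplications(gene_node, species_of)
  if gene_node.leaf?
    gene_node.mapping = species_of.fetch(gene_node.name)
    gene_node.duplication = false
  else
    gene_node.children.each { |child| infer_duplications(child, species_of) }
    left, right = gene_node.children
    gene_node.mapping = lca(left.mapping, right.mapping)
    gene_node.duplication =
      [left.mapping, right.mapping].include?(gene_node.mapping)
  end
  gene_node
end

# Species tree ((A,B),C)
a = Node.new("A")
b = Node.new("B")
c = Node.new("C")
ab = Node.new("AB", [a, b])
species_root = Node.new("ABC", [ab, c])

# Gene tree (((a1,b1),a2),c1), with each gene assigned to its species
species_of = { "a1" => a, "b1" => b, "a2" => a, "c1" => c }
g1 = Node.new("g1", [Node.new("a1"), Node.new("b1")])
g2 = Node.new("g2", [g1, Node.new("a2")])
gene_root = Node.new("root", [g2, Node.new("c1")])

infer_duplications(gene_root, species_of)

[g1, g2, gene_root].each do |node|
  event = node.duplication ? "duplication" : "speciation"
  puts "#{node.name}: #{event} (maps to #{node.mapping.name})"
end
```

In this example only g2 is reported as a duplication, because it and its child g1 both map to the species node AB. A real integration test would load the gene and species trees from files and compare the result against the output of the Java implementation.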
Second half (July 13th to August 10th):
- Finish implementing the algorithm
- Finish documentation and tests
In the event that I finish early:
- Profiling and speeding up existing code
- Extra documentation and tests
- More examples
- Extend the algorithm to use non-binary species and gene trees
Obligations:
I will continue to work part-time during the summer and may take summer classes. Last year these obligations were not problematic and did not affect my performance. During the day, although I will be at work, I will be available via email, IM, and IRC.
I welcome any feedback, comments, and critiques to this proposal.
Thanks,
Bob Kuo