Posted by: cjfields | December 15, 2014

Day 15 – Bioinformatics and the joy of Perl6


Yes, Perl6 is a reality :)

Originally posted on Perl 6 Advent Calendar:

On the 15th day of Christmas Perl6 said let there be life-sciences, and so there was.

Much has been written about Perl’s role in developing dynamic content for the early web, as well as its slow decline. However, there is an equally important story that goes alongside the tale of Perl as the language of CGI: the story of Perl as the scientific glue that saved the Human Genome Project. There is even a published journal article in Perl’s honour.

It’s not an accident that Perl did well and continues to do well in the field of Bioinformatics. Few languages at the time provided first class semantics that matched the problem domain such as: regular expression literals, good string handling, trivial data structure creation, sensible automatic type conversion and XS for native code wrapping when speed was important. That Perl replaced a lot…


Posted by: cjfields | August 30, 2014

Whither Bio::FeatureIO?

This is a brief post regarding why I decided to remove Bio::FeatureIO from the main core distribution.  The fact that it has taken this long for someone to notice is interesting.

tl;dr: In BioPerl, I would be wary of anything using Bio::SeqFeature::Annotated, or relying on having that particular module’s functionality. It’s considered deprecated; any code that uses it really needs to be converted to use Bio::SeqFeature::Generic.

A bit of back story

In BioPerl, the Bio::FeatureIO modules were centered around features, in particular GFF3 (and I believe Chado a bit), using Bio::SeqFeature::Annotated. The idea was that Bio::SeqFeature::Annotated would have pretty much everything type-checked. The idea is sound, but in this implementation everything was converted into different annotation objects (score, primary tag, all tag information, etc), and all features were checked against SO.

I believe this comes from some conflation about how data is stored in a feature: is it a simple key-value pair, or is it annotation? The idea here was to have this all converted to annotation, but checked against SO when needed using Bio::Ontology.

Now, b/c this completely breaks the SeqFeature API (e.g. $sf->score would return a Bio::Annotation::SimpleValue), the class overloaded all its accessors to print strings. This overloading was implemented in all Bio::Annotation and many other modules (Bio::Ontology IIRC was also involved). More code was then written to rely on Bio::FeatureIO.

This of course led to all sorts of hard-to-debug problems (Bio::Annotation was never meant to be overloaded, and lots of modules expecting objects got strings instead), not to mention performance issues.
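
To make the overloading problem concrete, here is a minimal sketch. My::SimpleValue is a hypothetical stand-in I made up for illustration; it is not the actual Bio::Annotation::SimpleValue code. It only shows how an object with overloaded stringification can pass for a plain string until something looks more closely.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an annotation value object; this is NOT the
# real Bio::Annotation::SimpleValue implementation.
package My::SimpleValue;
use overload
    '""'     => sub { $_[0]->{value} },  # stringify to the wrapped value
    fallback => 1;                       # let ==, eq, etc. go through the string

sub new {
    my ($class, %args) = @_;
    return bless { value => $args{value} }, $class;
}

package main;

my $score = My::SimpleValue->new(value => '42.5');

print "score: $score\n";                 # reads like a plain string when printed
print "still an object\n" if ref $score; # ...but it is a blessed reference

# Comparisons work through the overloaded string, which hides the mismatch
# until a caller checks ref(), calls a method, or serializes the value.
print "numeric compare ok\n" if $score == 42.5;
```

Code that only prints or compares the value never notices the difference, which is roughly why these bugs were so hard to track down.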

Sadly enough, at the same time additional code was written that required this behavior. Then… the original developer basically quit working on the code, leaving an unfinished set of modules deeply integrated into BioPerl, changing behavior of several other core bits, and also having other code (Chado, and additional parser modules) reliant on them.

This ended up being the main blocker for a v1.6 ‘stable’ release. There was no way forward with the current implementation, and no replies from the developer in question to address the problems, so I had to rip that stuff out of core to basically put us back on the road to a new release.

My plan is to release Bio::FeatureIO as a separate distribution on CPAN, as it was originally written. This is most of the way there; I will likely release this in the next few days (it will have its own version number, v1.7.0). However, Bio::SeqFeature::Annotated is deprecated, and any code reliant on it should be migrated to another module. This was announced prior to the 1.6 release and reiterated on the mailing list a few times. Bio::FeatureIO is also being rewritten with a simpler seqfeature class in mind, possibly using Rob Buels’ Bio::GFF3 or Barry Moore’s GAL tools.


Just so this is clear, I have no hard feelings today towards the developer involved here. The idea in general made sense, even if the implementation in practice was problematic. I think it’s a great idea to try messing with core code (and I applaud them for trying, just as much as I smack my forehead that this wasn’t done on a branch).

What this has done, for me at least, is push me to promote discussing major code changes on the mail list and to suggest implementing them on a branch.  Strangely, no one ever seemed to work off a branch in BioPerl; maybe this has to do with the nature of the code, or that the project used Subversion at the time, or simply that no one understood how branches worked.  Thankfully, we switched to git and that has made a world of difference :)


Bio::FeatureIO v. 1.6.902 is now on CPAN.


Posted by: cjfields | March 29, 2014

Waking from a long sleep

…so, where was I?

It’s been a really crazy 4 years, a period that saw me

  • have a child with my wonderful wife,
  • switch jobs a few times (moving from post-doc to an academic career is hard),
  • almost leave Chambana,
  • and finally settle into a position where I’m working on the computational side full-time (ignore the black coat, I didn’t know it was picture day)

So, what’s next?

I would really like to return to working on BioPerl a bit more.  The community has quieted down quite a bit (note: this seems to be the case across all the various OBF projects, not just us, though development on the Python and Ruby end is still pretty active, not so much on the Perl end).  I don’t hold out much hope for a BioPerl project during Google Summer of Code this year (yes, OBF has been accepted), and I don’t really feel we’ve actually earned one this go-around.


So, again, what’s next?

We need to start eating our own dog food.  But maybe that can be explained in more depth in the next post.


Posted by: cjfields | August 17, 2011

Broken NCBI GFF3

As Peter rightly points out, the GFF3 data that NCBI generates has more than a few problems, specification-wise.

I know that they have been notified of this many times; from what I understand this was to be addressed a few years ago, but unfortunately nothing has changed as of this morning.  It’s essentially useless.  Please don’t use it, at least until they acknowledge the problem and fix it.

And, if one is unfortunate enough to rely on NCBI’s GFF3 data, apologies in advance, but the Open-Bio GFF3-related parsers very likely won’t work as expected with it.  We can do our best, but GIGO.

Posted by: cjfields | September 19, 2010

Iron Man and Me

Well, I didn’t even realize the request I had sent to be added to the Perl Iron Man had gone through (at least until I saw my name on the list).  Guess this means I need to get my ass in gear and really start on writing a few posts!

I’ll probably mirror my posts from here (or vice versa), but I anticipate my posts here crossing into a few other territories outside of Perl (biology and bioinformatics come to mind).  I should be able to crank out something Perl/BioPerl-related on a fairly regular basis though, considering some of the changes we anticipate in the project to come.

Posted by: cjfields | July 24, 2009

What is this (pt2)?

So what is this?

#!/usr/bin/perl -w

use strict;
use warnings;
use Bio::Tools::Run::Infernal;

my $factory = Bio::Tools::Run::Infernal->new(
    -n            => 10,
    -a            => 1,
    -l            => 1,
    -rna          => 1,
    -tfile        => 'trees.txt',
    -model_file   => '',
    -program      => 'cmemit',
    -outfile_name => 'seqs.stk',
);


Hint: even though it’s obviously Perl, this blog platform apparently doesn’t support Perl syntax highlighting (even though the original JavaScript source now supports it), so I’m using another language’s highlighting in the meantime.

I contacted support, I guess we’ll see how receptive they are (this is a free account after all).  I may have to attempt setting up my own blog server.

Update: Got a very quick response back from the support folks indicating Perl isn’t supported (I gathered that).  However, they indicated they would look into upgrading, so there may be hope…

Update2 (March 2011): Perl is now supported!

Posted by: cjfields | July 20, 2009

What is this?

my $dna = "ttaagg"; sub translate($dna) { "FFLLSSSSYY!!CC!WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG".comb[map { :4($_) }, $dna.trans("tcag" => "0123").comb(/.../)] }; say translate($dna)

Answer (and an explanation) (btw, I’m pyrimidine)

Edit: Forgot to add, thanks to masak for that!

Posted by: cjfields | July 19, 2009

My Problem with Bio::Root::Build

For those who haven’t been keeping up, yes, it looks like we will be finally splitting up BioPerl into lots of smaller maintainable bits.  This will be covered in a separate post (and discussion will continue on the mail list).  I’m hoping both avenues will give us some feedback.

This post is in response to a friendly but ongoing disagreement on the BioPerl mail list.  I want to emphasize this is not a personal frustration I am facing with anyone in particular, nor their coding style.  However, I think the topic has gone beyond the intended restructuring, though it is related, and I don’t think it makes sense for a mail list anymore, primarily b/c I hate interspersing comments (it’s completely unreadable after a point).

So I’m continuing the conversation here.  I’ll make sure comments are open, and any response or suggestion is welcome.  I hope this post answers most of the questions, clarifies the problems I faced with this module when releasing 1.6.0, helps drive the BioPerl restructuring, and results in a reasonable solution.


Within the (currently monolithic) BioPerl core, we have created a common Module::Build subclass called Bio::Root::Build. This was done for a number of reasons, including working in some custom script installation, dealing with location of modules, testing for the presence of requirements, easy prereq module installation, etc.  From what I’ve been led to believe, the module’s intended purpose was to contain such methods so they could be used by any distributions requiring such functionality and requiring BioPerl’s core (currently, BioPerl-run, BioPerl-db, and BioPerl-network).  This also plays into the testing phase, when one may or may not want to run tests requiring an Internet connection.

Problem 1 :  API conflict between Module::Build and Bio::Root::Build

If we want a common BioPerl-specific Module::Build subclass for all BioPerl distributions, we are faced with a dilemma.  We want to keep a single copy somewhere in Subversion for obvious maintenance reasons (hence the reason for renaming to Bio::Root::Build, and moving it to core).  However, for it to be truly useful beyond core in any other BioPerl-requiring distribution we need it available for all of them.  This assumes of course that core is present, which isn’t always the case.

In order to do that, we need to check for it and ensure the proper core VERSION is being used.  If the latter fails for any reason, we have to install core first.  This, in turn, requires falling back to Module::Build, then indicating core is required for both the build and the installation steps (‘bootstrapping’).  If everything goes well, core is installed, the Build.PL script will restart and catch Bio::Root::Build, and then we progress forward.
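
The check-and-fall-back step described above could look roughly like the sketch below. This is only an illustration of the intended behavior, not what our Build.PL scripts currently do; pick_build_class is a name I made up, and the minimum version '1.006' is a placeholder rather than a documented requirement.

```perl
use strict;
use warnings;

# Return $preferred if it can be loaded at $min_version or later,
# otherwise return $fallback. All names here are illustrative only.
sub pick_build_class {
    my ($preferred, $min_version, $fallback) = @_;
    my $ok = eval {
        (my $file = "$preferred.pm") =~ s{::}{/}g;
        require $file;
        $preferred->VERSION($min_version);  # dies if the installed version is too old
        1;
    };
    return $ok ? $preferred : $fallback;
}

# '1.006' is a placeholder minimum version (an assumption for this sketch).
my $build_class = pick_build_class('Bio::Root::Build', '1.006', 'Module::Build');
print "Build.PL would use: $build_class\n";
```

In the fallback case, Build.PL would also have to declare core as a requirement for both the build and install steps, so that the script can be re-run once core is in place.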

Unfortunately, this is not currently possible due to several API conflicts, so at the moment subdist Build.PL’s just bail completely (i.e. die).  Not the optimal situation from an installation point-of-view, but it at least points out the problem.

Possible solutions

  • Only use Module::Build for those subdistributions, or
  • Fix the Bio::Root::Build API to comply with Module::Build’s API.

Going with the first option raises the question of what the purpose of Bio::Root::Build is in the first place (see above about ‘using a common custom Module::Build subclass for all BioPerl distributions’, which again I thought was its intended purpose).  The argument then is, why not use a much simpler Build process all around?

The problems with automated CPAN/CPANPLUS installation

I’m conflicted about this. I like the added functionality of allowing one to preinstall any prerequisites.  But at the same time, I’m not really sure there is a true advantage to placing the burden of preemptively installing non-BioPerl-related ‘required/recommended’ modules on us, beyond making it somewhat easier for users (it’s questionable IMHO whether it’s beneficial at all to the developers).

Here are the key issues:

  • Currently, running Build.PL with core works for the general user.  However, installation of Module::Build under some circumstances appears to create an infinite loop due to CPAN recursion. This appears to be Gbrowse-centric and is related to Module::Build bugs on some systems, but it is crucial as a large number of users are installing BioPerl for Gbrowse via the net-install script.
  • The user is prompted to install such modules even within the CPAN/CPANPLUS shell.  Doing so appears to open up another shell within the currently running one.  This does work, but…
  • The automated installation forces the user to use CPAN alone.  For CPAN users, this isn’t a problem, but this may be causing issues with CPANPLUS installation and automated testing.
  • Related to the API issues above, CPANPLUS installation itself may be broken due to badly formatted META.yml.

Possible solutions

  • Make CPAN installation optional, and not available under conditions when the shell is running (this is detectable using env. variables).
  • Check for recursion and bail if it occurs (I believe this is similarly detectable with modern CPAN versions).
  • When in the shell, push any ‘recommended’ modules the user wants onto the ‘required’ stack and preinstall before running tests, and let CPAN/CPANPLUS take care of it.
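
As a rough sketch of the detection mentioned in the first two points: to my understanding, CPAN.pm sets PERL5_CPAN_IS_RUNNING and tracks nesting in PERL5_CPAN_IS_RUNNING_IN_RECURSION (a comma-separated list of PIDs), and CPANPLUS sets PERL5_CPANPLUS_IS_RUNNING. The helper names below are made up for this example, and the exact behavior should be checked against the installed CPAN/CPANPLUS versions.

```perl
use strict;
use warnings;

# Report which installer shell (if any) appears to be driving this process.
# Takes a hash reference so it can be exercised without touching %ENV.
sub installer_in_control {
    my ($env) = @_;
    return 'CPANPLUS' if $env->{PERL5_CPANPLUS_IS_RUNNING};
    return 'CPAN'     if $env->{PERL5_CPAN_IS_RUNNING};
    return '';
}

# Count how many nested CPAN processes are recorded (0 if none).
sub cpan_recursion_depth {
    my ($env) = @_;
    my $pids = $env->{PERL5_CPAN_IS_RUNNING_IN_RECURSION};
    return 0 unless defined $pids && length $pids;
    my @pids = split /,/, $pids;
    return scalar @pids;
}

my $shell = installer_in_control(\%ENV);
if ($shell) {
    print "Running under $shell; leaving prerequisites to it.\n";
}
else {
    print "No installer shell detected; optional installs could be offered.\n";
}
```

A Build.PL could skip its own prompting whenever installer_in_control returns a name, and bail out if the recursion depth climbs past some small limit.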


I have made two specific proposals in my original email about this problem: we can either revert to Module::Build and simplify things, or we must fix Bio::Root::Build to respect the Module::Build API (yes, even if said API sucks or doesn’t do what you want).  I hope this will push a decision towards one of those two outcomes.

(PS – will add links soon)

Posted by: cjfields | July 1, 2009

NextGen Sequencing in BioPerl (pt 2)

Added in updated FASTQ IO to BioPerl for Illumina and Solexa sequencing.  I’ve also added in a raw data iterator for speedier parsing (as we all know, BioPerl can be slow).  Next on the list:

  • Make sure Phred quality scores are correct (tests tests tests).
  • Indexing: should we try to incorporate a C-based indexer?
  • Did I mention speed?
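
For the quality score item, this is the kind of conversion that needs testing. It is a standalone sketch, not the BioPerl FASTQ parser: decode_quality and solexa_to_phred are names I made up, and 'illumina' here means the older offset-64 encoding.

```perl
use strict;
use warnings;

# ASCII offsets for the common FASTQ variants ('illumina' = older offset-64).
my %OFFSET = (sanger => 33, illumina => 64, solexa => 64);

# Convert a Solexa-scaled score to the Phred scale.
sub solexa_to_phred {
    my ($q) = @_;
    return 10 * log(10 ** ($q / 10) + 1) / log(10);
}

# Decode a FASTQ quality string into a list of (rounded) Phred scores.
sub decode_quality {
    my ($qual, $variant) = @_;
    my $offset = $OFFSET{$variant};
    die "unknown FASTQ variant '$variant'\n" unless defined $offset;
    my @scores = map { ord($_) - $offset } split //, $qual;
    if ($variant eq 'solexa') {
        @scores = map { sprintf '%.0f', solexa_to_phred($_) } @scores;
    }
    return @scores;
}

print join(' ', decode_quality('I5!', 'sanger')), "\n";  # prints: 40 20 0
```

The Solexa case is the one most likely to go wrong, since low scores do not map one-to-one onto Phred values.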

Posted by: cjfields | June 18, 2009

NextGen sequencing and BioPerl

So, I haven’t posted to this in quite a while, primarily b/c I have been busy at the day job on issues non-computational (though that may change soonish).  However, I wanted to point out a recent  flurry of activity on the BioPerl mail list to incorporate NextGen sequencing somehow into BioPerl.  One of the questions that came up was whether this is even feasible within perl/BioPerl.  My own opinion, as well as that of Elia Stupka (the original poster and a fellow BioPerl dev), is that we need to just get everything working first, then optimize from there.

Of course that’s where the current focus should lie, but the long-term challenge will be how we can integrate possibly terabytes worth of data in a way that is accessible, something we need to take into consideration while we are designing. The notorious slowness of BioPerl (more a consequence of perl’s hammered-on OO, with a heavy reliance on inheritance and heavy use of has-a contained objects) could be partially alleviated by using Moose and Roles and lightweight objects, something I have been testing out in biomoose (and, if it proves fruitful, may spur its development more).  As much as I would like to code this up directly in Perl6 via bioperl6, I’m not sure Rakudo is capable of dealing with this at the moment.

Regardless, any thoughts are greatly appreciated!
