To evaluate whether transcriptional noise could be contributing to inaccuracies in our retrotransposed gene predictions, we quantitatively compared our predictions with microarray-based profiles for four distinct RNA classes: mRNA, pre-mRNA, antisense RNA, and rRNA. Our approach to determining whether D. melanogaster Apollo exons present in non-melanogaster species are coding or non-coding is straightforward: if Apollo exons are transcribed, we predict that they are untranslated and therefore non-coding. Examining the distribution of predicted Apollo exon lengths in D. melanogaster, we found that highly transcribed genes tend to produce shorter exons (mean = 45 nucleotides, among >500 exons), whereas genes with no predicted Apollo exons (81% of all genes) tend to produce exons longer than 50 nucleotides (median = 87 nucleotides). This indicated that, in D. melanogaster, the shorter exons were not transcribed. We also used Spearman correlation to compare Apollo exon lengths with their transcript abundances in D. melanogaster, D. simulans (Dmel_apollo_Dsimulans.fa.gz), D. ananassae (adult vs. embryonic, Dmel_apollo_Danae.fa.gz), D. erecta (Dechro.fa.gz), D. mojavensis (Dmoj_Dgenome_wcellanagl.fa.gz), D. melanogaster mod_r3.17 (Dmel_Dgenome.mod.r3.17.fa.gz), and four other Drosophila species. Three of the four RNA classes (antisense RNA, rRNA, and pre-mRNA) showed no significant correlation between exon length and transcript abundance in D. melanogaster. We likewise detected no significant correlation for Apollo exons in the other species for these three RNA classes. Apollo exon lengths, therefore, do not correlate with the abundance of their downstream transcripts, suggesting that Apollo transcripts in non-melanogaster species are not spurious products of transcriptional noise.
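The length-versus-abundance comparison described above relies on Spearman rank correlation. The following is a minimal sketch of how such a coefficient can be computed, not the pipeline used in this study: the function names (`average_ranks`, `pearson`, `spearman_rho`) and the example values are hypothetical, and the sketch returns only the coefficient. Assessing significance would require an additional test, for example the p-value reported by `scipy.stats.spearmanr`.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical exon lengths (nt) and transcript abundances, for illustration only.
    exon_lengths = [45, 87, 52, 120, 60]
    abundances = [8.1, 2.3, 6.4, 1.0, 5.2]
    print(f"rho = {spearman_rho(exon_lengths, abundances):.3f}")
```

Ranking with averaged ties before applying the Pearson formula keeps the coefficient well defined when several exons share the same length, which is common for short exons.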
For each non-melanogaster species, approximately 6000 genes were retained for this study. Approximately 50% of these genes were aligned with the D. melanogaster genome, and the remaining genes could be aligned with orthologs in other arthropods (Table 1). Of the non-melanogaster genes aligned with the D. melanogaster genome, the vast majority (74–86%) exhibited clear orthology with D. melanogaster genes. As with the four genes predicted to predate the evolution of D. melanogaster, we term the remaining genes 'long-branch' non-melanogaster species-specific genes. To determine how many of the non-melanogaster species are sufficiently diverged to permit the assignment of orthology between their genes and those of D. melanogaster, we estimated the minimum number of gene duplications since the studied species diverged from their common ancestor. Assuming a single whole-genome duplication event would imply an implausible number of gene duplications and resulting chimeric gene models (compared with a 4-fold higher rate estimated from the D. melanogaster duplication history), so for this analysis we instead restricted ourselves to canonical single-copy genes (see Core genes, below). After assessing the taxonomic distribution of these genes, we found that, in a subset of six species (D. erecta, D. immigrans, D. littoralis, D. suzukii, D. mojavensis and D. melanogaster), single-copy genes can each be classified as orthologous to D. melanogaster genes. In cases where orthology could not be unambiguously determined, we assessed the quality of the alignments and used the results of our hierarchical testing approach to resolve gene relationships. This procedure allowed us to assign orthology unequivocally for the vast majority of genes with alignments of at least 100 nucleotides, and for the majority of genes with alignments shorter than 100 nucleotides (see Additional file 3). Thirty-five genes/exons were assigned orthologous relationships with D. melanogaster in all five of the species included in this study, while 74 genes/exons were assigned orthologous relationships in at least four of the five species. X chromosome genes from D.