KungFQ: a Simple and Fast Approach to Compress Fastq Files

GRASSI, Elena; MOLINERIS, Ivan
2011-01-01

Abstract

Motivation: storing data derived from deep sequencing experiments has become pivotal, and standard compression algorithms do not exploit the structure of these data satisfactorily. A number of reference-based compression algorithms have been developed, but they are less suitable for new species without fully sequenced genomes or for non-genomic data (RNA sequencing data should be aligned against the entire "theoretically possible" transcriptome, which comprises exon-intron boundaries that have to be recomputed for each different read length). Methods: we developed a tool that takes advantage of fastq characteristics and encodes them in a binary format optimized for further compression with standard tools (such as gzip or lzma). The algorithm is straightforward and does not need any external reference file; it scans the fastq file only once and has a constant memory requirement. We also added the option of lossy compression, which discards some of the original information (IDs and/or qualities) but produces smaller files. When qualities are not stored in the compressed file, it is possible to define a cutoff below which the corresponding base calls are converted to N. Results: we achieve compression ratios of 2.84 to 7.84 on various fastq files without losing information, and of 5.38 to 8.93 when discarding IDs, which are often not used in common analysis pipelines. In this article we compare the algorithm's performance with that of known tools, usually obtaining higher compression levels. This software should be useful for those who are interested in using next-generation sequencing data for further bioinformatics analyses.
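The lossy mode described in the abstract (dropping IDs and qualities, and converting base calls below a quality cutoff to N) can be illustrated with a minimal Python sketch. This is not the KungFQ implementation: the function names `read_fastq`, `mask_low_quality` and `lossy_sequences`, the Phred+33 quality offset, and the assumption of strict four-line fastq records are all hypothetical choices made for the example. The sketch reads one record at a time, so the input is scanned once and memory use does not grow with file size.

```python
import io
from typing import Iterator, TextIO, Tuple

# Assumption: qualities use the Phred+33 (Sanger) ASCII encoding.
PHRED_OFFSET = 33


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, qualities) from four-line fastq records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, not needed here
        qualities = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], sequence, qualities


def mask_low_quality(sequence: str, qualities: str, cutoff: int) -> str:
    """Replace base calls whose Phred score is below `cutoff` with 'N'."""
    return "".join(
        base if ord(qual) - PHRED_OFFSET >= cutoff else "N"
        for base, qual in zip(sequence, qualities)
    )


def lossy_sequences(handle: TextIO, cutoff: int) -> Iterator[str]:
    """Single pass over the file, keeping only the masked sequences."""
    for _read_id, sequence, qualities in read_fastq(handle):
        yield mask_low_quality(sequence, qualities, cutoff)


# Hypothetical one-record input; '#' encodes Phred 2, '5' encodes Phred 20.
example = io.StringIO("@read1\nACGTAC\n+\nII#I5I\n")
masked = list(lossy_sequences(example, cutoff=20))
print(masked)
```

In this sketch only the third base falls below the cutoff of 20, so it is the only one replaced by N. The masked sequences could then be packed into a binary format and passed to a general-purpose compressor such as gzip or lzma, as the abstract describes.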
2011
BITS Annual Meeting 2011
Pisa
20-22/06/2011
E. Grassi; F. Di Gregorio; I. Molineris

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/140783