The alarming explosion of genome sequencing data was recently addressed in PLoS Biology (Big Data: Astronomical or Genomical) and touched on in Nature News (Genome researchers raise alarm over big data). The authors compared sequencing data with three other big data generators: astronomy, YouTube, and Twitter. Each of these demands massive computing resources for data acquisition (astronomy), storage (astronomy, YouTube), analysis (Twitter), or distribution (YouTube). Sequencing data, however, places heavy demands on all four at once.
But I think it is most interesting to look at some of the current statistics:
– Illumina has released its new HiSeq X series (sold in sets of five or ten sequencers): each instrument can sequence 6 billion paired-end reads at 150 bp/read (i.e. 1.8 Tb, or ~15 human genomes at 30x) every 3 days. 3 days!!
– NCBI’s SRA currently holds more than 3.6 petabases of raw data: ~32 000 microbial genomes, ~5 000 animal and plant genomes, and ~250 000 human genomes; meanwhile, current worldwide sequencing capacity is estimated at 35 petabases/year
– the authors of the PLoS Biol paper show that since 2009, our sequencing capacity has doubled every 7 months; compare this with Illumina’s estimate of doubling every 12 months, or with Moore’s law, which doubles every 18 months
– by 2025, the authors estimate we will need between 2 and 40 exabytes of data storage per year (100 M – 2 B human genomes, or 2–40 B Arabidopsis genomes)
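The doubling-time comparison above is easy to sanity-check with a little arithmetic. Here is a minimal Python sketch; note that the starting capacity (~35 petabases/year), the 10-year horizon, and the exponential-growth assumption are taken from the figures quoted above, not from the papers' own models:

```python
def capacity(start_petabases, years, doubling_months):
    """Project capacity after `years`, assuming it doubles every `doubling_months`."""
    return start_petabases * 2 ** (years * 12 / doubling_months)

# Starting from ~35 petabases/year, project 10 years out (roughly to 2025)
# under each of the three doubling times mentioned above:
for label, months in [("sequencing (7 mo)", 7),
                      ("Illumina estimate (12 mo)", 12),
                      ("Moore's law (18 mo)", 18)]:
    print(f"{label}: {capacity(35, 10, months):,.0f} petabases/year")
```

The gap between the curves is the whole point: over a decade, a 7-month doubling time ends up orders of magnitude ahead of Moore's law, which is why storage and compute planning can't simply ride hardware improvements.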