
The pandemic also fuelled a sharp rise in sharing through preprints (articles posted online before peer review), advanced the output of male authors over female authors and affected review times, speeding them up in some topics but slowing them down in others. Scientists published a very large number of articles about the coronavirus pandemic; estimates differ depending on search terms, database coverage and definitions of a scientific article.

At first, COVID papers and preprints focused on the spread of disease, the outcomes for people hospitalized, and diagnostics and testing, according to an analysis of the topics of PubMed-indexed articles by Primer, a company in San Francisco, California, that develops artificial-intelligence (AI) technologies.

Source: J. Inglis, medRxiv.

MedRxiv COVID preprints appeared in peer-reviewed journals after a median review time of 72 days, twice as fast as preprints from the server on other topics, says Inglis. He gives credit to journal editors and publishers for pushing their peer-review systems to work faster, and scientists for agreeing to review many more papers than usual.

He adds that pandemic-related preprints published in the first quarter of appeared in journals more rapidly than those published later, which might be evidence of strain in the system. And as the virus moved to ravage Italy, the number of papers from scientists there swelled.

And the most-cited preprint 5 — a 16 March report from pandemic modellers at Imperial College London that estimated how lockdown and other distancing measures could avert millions of deaths — had a significant effect on UK policy and made worldwide headlines. That preprint is also the article that attracted the most buzz on social media, according to Altmetric, a London-based firm that monitors metrics other than citations.

The pandemic publishing frenzy had winners and losers. This is probably because women shouldered the burden of childcare and home-schooling during lockdowns, says Flaminio Squazzoni, a social scientist at the University of Milan, Italy, who co-authored the preprint analysis. The same effect was not seen in peer review, where men and women received and accepted invitations to evaluate papers at around the same rate.

This is distorting the rewards of science. There were also research-publishing scandals. Given the volume of coronavirus research, that proportion is about the same as for research in general. Typically, it takes three years for editors to retract a paper, but during the pandemic it has taken just months — in part because these papers are facing so much scrutiny.

Clarification, 17 December: This story now notes that preprints were posted on multiple sites, so estimates may represent slight overcounts.

This missing feature has motivated us to develop a new program to improve the compression of the Ion Torrent files for long-term archiving.

Reducing the space consumption of NGS data reduces the cost of storage and data transfer. Developing efficient compression software for clinical NGS data therefore goes beyond computational interest, as it ultimately contributes to reducing the overall cost of the clinical test. The space saving achieved by our tool is a practical step in this direction.

This technology is particularly popular in the medical domain, because it is fast and cost effective. It is mainly used for clinical gene panels and whole exome sequencing. Gene panels are used to read the sequences of selected genes to screen for variations related to some inherited disorders [1, 2, 3, 4, 5] and cancer [6, 7]. Whole exome sequencing covers the whole set of genes and is mostly used to identify novel mutations and genes [8, 9, 10, 11, 12].

The Ion Torrent technology is not favored for whole genome sequencing due to its limited throughput, which would lead to insufficient depth for clinical use. For clinical labs, the NGS data should be retained for a certain period of time [13]. This requirement means that the NGS lab must have high-capacity storage systems, either on site or in the cloud. For either option, the cost of data storage is part of the total cost of provisioning the service per sample.

Therefore, efficient data compression should be implemented to reduce the storage footprint, which in turn reduces the cost of the test. For medical applications, the NGS analytical pipeline starts with the step of base calling, where the physical signals (either images or electrical signals) are translated to sequences of nucleotide bases.

The output of this step is a sequence file composed of a set of reads, either in the fastq format (as in Illumina technology) or in the unaligned BAM format (as in Ion Torrent technology). A read is the sequence of a DNA fragment. The next step of the pipeline is to align the NGS reads to the reference human genome. If the user runs the alignment and variant calling workflow, then the reads are aligned to the reference human genome and the results are kept in an aligned BAM format.

The unaligned reads are kept as well in the aligned BAM file but without mapping information. The final step of the analysis pipeline is the variant calling step, which identifies variants (mutations) compared to the reference human genome. The challenge in this step is to discriminate genuine variants from sequencing errors. The output of this step is a tabular file (VCF format) containing the list of mutations.

Analysis, where the NGS files are accessed to run the alignment and variant calling steps of the variant analysis workflow. This phase requires direct access to the reads from fast storage at very high I/O speed. It is preferred to run this step on SSD-based storage [15].

Interpretation, where no computation is involved and the data can move to moderate-speed (hard-disk based) storage. The interpretation phase terminates by issuing a clinical report to the patient with the findings, and the case is then considered closed.

Long-term archiving, where the data can move to high-capacity slow storage (disk based or tapes) and is kept inert unless needed.

The BAM file is the largest output of this step and this is the one that should be the main target of compression. The gene panel file is in the range of 1 GB to 10 GB, but usually one runs multiple samples in the same run.

The VCF files are relatively small and they are in the range of a few Megabytes. Optimizing the cost of the storage is critical for the third phase including long term archiving, where the data is kept inert for long time and is only decompressed if needed. The recent survey papers [ 16 , 17 , 18 ] include a description and comparison of these software tools.

Broadly, these tools can be categorized into two big groups: (1) non-reference-based compression and (2) reference-based compression. Non-reference-based methods compress the data by making use of its intrinsic characteristics. Reference-based methods work as follows: they first align the reads to a reference sequence. Then they compress the alignment information, which is enough to decompress the reads given the reference sequence.

The reference-based methods achieve a high compression ratio, because the reads are almost identical to the reference except for a few individual variations and sequencing errors. Reference-based and non-reference-based compression tools can have lossy and lossless versions. For medical applications, only the lossless version should be used.
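As a rough illustration of the reference-based idea, the following Python sketch stores a read as its alignment position, its length, and the list of bases that differ from the reference. It is a toy model under stated assumptions: only substitutions are handled (no insertions, deletions, or clipping), and the names `encode_read` and `decode_read` are hypothetical, not part of any tool discussed here.

```python
from typing import List, Tuple

# (offset within the read, base observed in the read)
Mismatch = Tuple[int, str]
Encoded = Tuple[int, int, List[Mismatch]]


def encode_read(reference: str, read: str, pos: int) -> Encoded:
    """Encode a read as (position, length, mismatches) against the reference."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return pos, len(read), mismatches


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the read from the reference and the stored differences."""
    pos, length, mismatches = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGCT"
read = "CGTTAGACGA"  # aligned at position 5, one substitution

encoded = encode_read(reference, read, 5)
print(encoded)                          # (5, 10, [(6, 'A')])
print(decode_read(reference, encoded))  # CGTTAGACGA
```

Because decoding reproduces the read exactly, this toy scheme is lossless, which matches the requirement stated for medical applications.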

For medical applications, where the human genome (hg19 or GRCh38) is used as a reference, reference-based compression would be the method of choice for compressing the NGS data. Fritz et al. The flow signal vectors represent the measurements corresponding to the change in pH during base hybridization. The flow signal data cannot be discarded because it is used by the Torrent Suite to improve the accuracy of the variant calling.

This shows that there is room for improvement and that extra compression can be achieved by targeting the flow signals with a special compression procedure. In this section, we provide information about the flow signals and explain how they are generated and stored in the BAM file. Ion Torrent is a Next Generation Sequencing technology based on the use of CMOS semiconductor chips, where the DNA bases are determined by sensing the release of hydrogen ions during the hybridization process [21, 22].

Each single-stranded fragment is attached to a bead (a particle called an ion sphere), where it undergoes a reaction to produce multiple copies of the same fragment. These copies are referred to as the template. The beads are then moved to the sequencing CMOS chip. The chip is composed of millions of wells and each well includes a sensor to detect the change in pH. Ideally, each ion sphere should reside in one well in the sequencing chip.

The chip is then placed in the sequencer and the sequencing process proceeds as follows: The sequencer introduces the four bases A, T, G, and C one at a time during the run in a cyclic fashion. The order in which the nucleotides are introduced is referred to as the flow cycle. An introduced nucleotide hybridizes to the template base if it is complementary to it, and a change in pH takes place. If the template at one site includes a polymer of identical bases, the change is correspondingly larger. If no change is measured in one round of the cycle, then the base in the template does not match the one in the flow cycle and no hybridization reaction takes place.

A wash step occurs after the introduction of each type of nucleotide to ensure no nucleotide remains in the well before the introduction of the next one in the flow cycle. The changes in pH at each round in the flow cycle are recorded, and a vector called the raw flow-signal is produced. The signal processing software analyzes the raw flow signals and produces a vector of processed flow signals that are eventually stored in the BAM file [ 21 , 22 ].
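To make the flow cycle concrete, here is a minimal Python sketch of an idealized, noise-free run. It is an illustration under assumptions, not the sequencer's signal model: the sequence is expressed as the strand being synthesized (so no complementing is needed), the flow order `"ATGC"` is just the order in which the bases are listed above, and the function name `ideal_flowgram` is hypothetical. Each flow records how many consecutive bases were incorporated, so a homopolymer yields a value greater than one and a non-matching flow yields zero.

```python
from typing import List


def ideal_flowgram(read: str, flow_order: str, n_flows: int) -> List[int]:
    """Return the number of bases incorporated at each flow (no noise)."""
    signals = []
    i = 0  # index of the next base of the read still to be incorporated
    for f in range(n_flows):
        base = flow_order[f % len(flow_order)]
        run = 0
        # A flow incorporates the whole run of identical bases at once.
        while i < len(read) and read[i] == base:
            run += 1
            i += 1
        signals.append(run)
    return signals


print(ideal_flowgram("TCAGGTT", "ATGC", 12))
# [0, 1, 0, 1, 1, 0, 2, 0, 0, 2, 0, 0]
```

The two entries equal to 2 correspond to the runs GG and TT; real flow signals are noisy measurements around such ideal values.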

The flow signals are numerical integer values, usually bounded in practice. The number of flow signal points is the same as the number of bases in the flow cycle. The string defining the flow cycle is stored once in the header of the BAM file.

As also shown in the figure, each read includes information related to the quality and alignment. The key and barcode sequences are ligated (pre-pended) to the fragment. The key sequence TCAG is a control sequence to ensure correct sequencing. A barcode sequence is added to a certain group of fragments. The lower part of the figure shows a schematic representation of the fields in the SAM file. The header part includes the flow cycle and the key sequence. Each line in the SAM file represents one read, aligned to the reference genome.

We show the fields including the read ID, the physical position, the CIGAR string which represents the alignment, the bases of the DNA sequence in the read, the quality field, and the flow signals in the ZM field. Figure 2 explains the steps of the base calling by demonstrating how the flow signals are analyzed to call the bases of an example fragment using a given flow cycle.

The base calling software uses the flow signals to call the bases in the target DNA as follows: The algorithm simultaneously scans the flow signal and the bases in the flow cycle. If there is a signal peak exceeding a certain threshold, then the corresponding base in the flow-cycle is the base in the target DNA and it is reported. If the flow signal value doubles, this indicates a polymer of identical bases. The base calling software calibrates the signal values and decides the length of the homopolymer.
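The scan described above can be sketched in a few lines of Python. This is a simplified illustration, not the Torrent Suite base caller: it assumes signals are scaled so that one incorporated base corresponds to roughly `unit` (a hypothetical value of 100), and it decides the homopolymer length by rounding, which amounts to a threshold of half a unit. Real base calling calibrates the signals first.

```python
from typing import Sequence


def call_bases(flow_signals: Sequence[float], flow_order: str, unit: float = 100.0) -> str:
    """Call bases by rounding each flow signal to a homopolymer length."""
    called = []
    for f, signal in enumerate(flow_signals):
        base = flow_order[f % len(flow_order)]
        # Round to the nearest whole number of bases; noise below half a
        # unit (including small negative values) is treated as "no base".
        run_length = max(0, int(signal / unit + 0.5))
        called.append(base * run_length)
    return "".join(called)


signals = [3, 104, -5, 96, 110, 8, 197, 2, 0, 205, 1, 4]
print(call_bases(signals, "ATGC"))  # TCAGGTT
```

Signals near 200 are called as two identical bases, which corresponds to the doubling of the signal value mentioned above.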

Theoretically, the flow signal value can grow without bound for a DNA fragment containing an arbitrarily long run of the same nucleotide.

Figure 2. Base calling based on flow signals. The upper part shows an example DNA fragment to be sequenced. The second part shows the sequence of nucleotides in the flow cycle. It also shows the values of the sensed flow signals and the called bases. A flow signal value exceeding a certain threshold means that a base has hybridized to the template, and the corresponding base in the flow cycle is reported. The base calling software calibrates the signal values and decides the length of the polymer.

Our approach to improving the compression of the Ion Torrent BAM file is based on improving the compression of the flow signals. The idea of our algorithm is that reads with similar sequences aligned to the same locus should have similar flow signals.

Therefore, exploiting such similarity across multiple identical reads would lead to better compression. Our algorithm sorts the reads in the BAM file first by genomic coordinates then by their prefix via sorting the respective CIGAR string in order to bring the similar reads closer to each other. By scanning the sorted reads, the algorithm identifies blocks of reads mapped to the same locus. We collect the flow signals in each block and compress them together as detailed in the algorithm below.
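The sorting and blocking step can be sketched as follows. This is a minimal in-memory illustration, assuming reads are already parsed into simple records; the `Read` type, its field names, and `blocks_by_locus` are hypothetical, and chromosome names are ordered lexicographically here for simplicity rather than in the order given by a BAM header.

```python
from itertools import groupby
from typing import List, NamedTuple


class Read(NamedTuple):
    name: str
    chrom: str
    pos: int
    cigar: str
    flow_signals: List[int]


def blocks_by_locus(reads: List[Read]) -> List[List[Read]]:
    """Sort by coordinate, then CIGAR, and group reads sharing a locus."""
    ordered = sorted(reads, key=lambda r: (r.chrom, r.pos, r.cigar))
    return [
        list(group)
        for _, group in groupby(ordered, key=lambda r: (r.chrom, r.pos))
    ]


reads = [
    Read("r1", "chr1", 100, "50M", [0, 101, 3]),
    Read("r2", "chr1", 250, "48M", [2, 99, 0]),
    Read("r3", "chr1", 100, "48M2S", [1, 98, 5]),
    Read("r4", "chr1", 100, "50M", [0, 103, 2]),
]

for block in blocks_by_locus(reads):
    print([r.name for r in block])
# ['r3', 'r1', 'r4']
# ['r2']
```

Within a block, reads with the same CIGAR string end up adjacent, so their flow signal vectors are likely to be close to one another.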

Other fields of the BAM file are compressed using Scramble [20].

Sort the BAM file if it is not sorted by genomic coordinates. Separate the signals of the forward reads from those of the reverse ones and process each group independently in parallel using Steps 3 and 4.

Remove the flow signals from the BAM file and store them separately. Compress the remaining fields of the BAM file (sequence, quality, and other fields) using a reference-based method; we use the program Scramble [20] for this step. Define blocks of flow signals, such that the reads in each block are mapped to the same locus. Each block B can be processed in parallel using the sub-steps of Step 4.

For each D_j, allocate a vector V_j1 of n bytes to store its values. The length of the V_j2 list equals the number of values in D_j that are too large to store in V_j1; such values are very rare in practice. Concatenate F_1 and the V vectors and compress them. We use the XZ algorithm as the default method for that purpose. F_1 is a reference flow signal vector that will be used in decompression. Wait until all parallel processes finish.

Use the Linux tar package to create an archive containing the compressed files of the B blocks and the other compressed CRAM files computed in Step 1. In the actual implementation, Steps 2 and 3 are implemented together via Linux pipes. For Step 4. The other option increased the running time and did not lead to a tangible improvement in compression. So, we decided to use F_1 as the reference flow signal vector.

The XZ method is the default one. All these implementations are based on the dictionary-based approach using Lempel-Ziv decomposition. Each tool implements different tuning steps in terms of encoding and algorithm engineering. Zstd is a package developed by Facebook, also based on LZ77 but enhanced with tuned levels of compression using Finite State Entropy [24].

Zstd follows a speed-first design approach and can also provide very high compression ratios. For Step 5, we use the Linux tar package to create an archive of all compressed files. This archive includes the CRAM file for the input BAM file minus the flow signals, computed in Steps 2 and 3, and the compressed blocks for the flow signals, computed in Step 4.
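The archiving step can be mimicked with Python's standard `tarfile` module, shown here as a small in-memory sketch rather than a call to the Linux tar program. The member names are hypothetical placeholders, and the archive is written without additional compression on the assumption that its members are already compressed.

```python
import io
import tarfile
from typing import Dict


def build_archive(members: Dict[str, bytes]) -> bytes:
    """Pack already-compressed members into an uncompressed tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def read_archive(blob: bytes) -> Dict[str, bytes]:
    """Unpack the archive back into a name-to-bytes mapping."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers()}


members = {
    "sample.noflow.cram": b"placeholder for the CRAM part",       # hypothetical name
    "flow/forward/block_0001.xz": b"placeholder for one block",   # hypothetical name
}
archive = build_archive(members)
assert read_archive(archive) == members
```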

The decompression algorithm starts by unpacking the tar archive using the tar program. For each block of the compressed flow signals, the V vectors are decompressed and the D vectors are reconstructed. The vector F_3 is used to reconstruct F_4, and so on. The decompressed flow signals are finally added to the BAM file.

First, the flow signals of the forward and reverse reads are processed in parallel. Second, the compression of the blocks of flow signals can also run in parallel. Third, Scramble compresses the BAM file minus the flow signals in parallel. Finally, one can decompose the BAM file into sub-files, each corresponding to a certain genomic region.

These regions are independent from one another and they can be also processed in parallel. Parallel processing is also used during decompression. We decompress the BAM part which was compressed by Scramble in parallel with decompressing the flow signals.
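Because the blocks are independent, compressing and decompressing them in parallel is straightforward. The sketch below uses a thread pool from Python's standard library with XZ compression via `lzma`; it is an illustration only, the function names and the worker count are hypothetical, and a process pool or a native implementation may be a better fit for large workloads.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor
from typing import List


def compress_blocks_parallel(blocks: List[bytes], workers: int = 4) -> List[bytes]:
    """Compress independent blocks concurrently, preserving their order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lzma.compress, blocks))


def decompress_blocks_parallel(blobs: List[bytes], workers: int = 4) -> List[bytes]:
    """Decompress independent blocks concurrently, preserving their order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lzma.decompress, blobs))


# Synthetic stand-ins for serialized flow signal blocks.
blocks = [bytes((i * k) % 13 for i in range(2000)) for k in range(1, 6)]
blobs = compress_blocks_parallel(blocks)
assert decompress_blocks_parallel(blobs) == blocks
```

`pool.map` returns results in input order, so block boundaries and ordering are preserved without extra bookkeeping.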

Also the compressed flow signal blocks are decompressed in parallel. The benchmarking dataset included many genomic files from different technologies and different organisms. This low depth is no longer used in practice, neither in research nor in clinical diagnosis.

To cope with recent advances in the Ion Torrent technology, we compiled a dataset of Ion Torrent BAMs whose depth is similar to what is used in clinical practice (Table 1). For an up-to-date version of the kits, chemistry, and analysis package, we also added a set of three test exomes and eleven test gene panels, generated at clinical-grade quality by the Saudi Human Genome Program.

All these files are available to download from the program website. The size of the target region is The average depth is the average number of reads covering a target base. As we mentioned in the introduction, the flow signals occupy a big portion of the BAM file. In this experiment, we measured how big that portion is in the test dataset.

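As a measure of compression, we use the percent space saving, defined as follows:

$$\text{space saving} = \left(1 - \frac{\text{compressed size}}{\text{original size}}\right) \times 100\%$$

We used this measure because it directly reflects the amount of saving in physical storage, which directly leads to cost reduction.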
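As a small worked example, and assuming the usual definition of percent space saving as one minus the ratio of compressed size to original size, the measure can be computed as below. The comparison uses only compressors from Python's standard library (`lzma` for XZ and `zlib` for the DEFLATE method behind gzip) on synthetic data; Zstd is left out because it is not assumed to be available, and the numbers printed say nothing about real BAM files.

```python
import lzma
import zlib


def space_saving_percent(original_size: int, compressed_size: int) -> float:
    """Percent of storage saved: (1 - compressed / original) * 100."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return (1 - compressed_size / original_size) * 100


print(space_saving_percent(1000, 250))  # 75.0

# Synthetic, highly repetitive data standing in for flow signal bytes.
data = bytes((i * 7) % 11 for i in range(5000))
for name, compress in (("xz", lzma.compress), ("deflate", zlib.compress)):
    saving = space_saving_percent(len(data), len(compress(data)))
    print(f"{name}: {saving:.1f}% space saving")
```

A value of 75 means the compressed file occupies a quarter of the original space; a negative value would mean the output is larger than the input.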

The table shows the average file sizes and average space saving for each group of files. Supplementary File 1 includes the details for each test file. From the experiments, we observe a slight improvement in compression when the depth increases. The gene panel files with higher depth compress slightly better than those with lower depth.

The reason for this is that these public exomes were sequenced using older chemistry and an older base calling program. The new chemistry achieves more consistent readings of the signal at the same position in the read and accordingly leads to more similar flow signal values, which ultimately leads to better compression. Table 4 shows the performance of IonCRAM using these different compression options to compress the flow signal part. It can be observed that XZ achieves the best compression.

Zstd is in second place, with results very comparable to XZ.

Table 4. Space saving of IonCRAM using different options. Columns 4, 6, and 8 show the average file sizes after compression using the options xz, gzip, and Zstd, respectively. Columns 5, 7, and 9 show the percentage space saving with the options xz, gzip, and Zstd, respectively. Supplementary File 1 includes detailed experiments in tabular and graphical formats.

It is also important to speed up the transmission of data and overcome the bandwidth issues.
