This is a duplicate of a question I posted to the ARTIC pipeline GitHub issues (github.com/artic-network/fieldbioinformatics/issues/52; I can’t link it directly because my account is too new), in case there are additional community experts who may see it here.
In the ARTIC SARS-CoV-2 SOP, it states “For the current version of the ARTIC protocol it is essential to demultiplex using strict parameters to ensure barcodes are present at each end of the fragment.” This is further elaborated on in the guide here, where it is explained that the main concern is due to the possible occurrence of in silico chimeric reads. This all seems fine and we have been following the recommended stringent Guppy demux settings for nanopore-based SARS-CoV-2 sequencing.
However, on some runs we are seeing a significant percentage of reads which are not being assigned into barcode bins because of the two-barcode requirement. The extent of the issue varies from run to run, but for example in the latest run, 78% of reads were thrown out because of this requirement. Based on the plot below, I suspect that the barcode and adapter sequence are actually missing on one end of these reads. Why they are missing is a separate issue that is more ONT-specific than ARTIC-specific, although any suggestions are welcome.
In any case, if I remove the two-barcode requirement and leave the rest of the parameters as default, only 7% of the reads are unclassified. Because of uneven coverage for some of the amplicons, the result for many samples is a drastic difference in completeness of the consensus genome sequence.
In doing some benchmarking of demultiplexing and the effect of tuning various parameters, I’ve started questioning the need for the two-barcode requirement. My logic is based on the fact that there are multiple other filters and factors which mitigate the potential for issues caused by in silico chimeric reads. Although we’ve been using Guppy lately for demuxing at the same time as basecalling, the following points are specific to Porechop, as that’s what I’ve used to investigate and benchmark the issue:
Even when two barcode matches are not required, Porechop will not classify a read that has mismatched high quality barcode hits at both ends. If one end has a best hit to NB1 and a passing score, and the other end has a best hit to NB2 and a passing score within a certain distance of the first, the read is not classified. This is controlled by the --barcode_diff parameter of Porechop.
Porechop will not classify a read that has a matching adapter sequence in the middle, as would be expected for chimeric reads. This is enabled by default when demultiplexing and in fact cannot be disabled.
Perhaps most importantly, we are applying the size filtering using artic guppyplex as recommended in the SOP. Because the ncov19 amplicon sizes are fairly uniform and thus the read length distribution is fairly tight, the upper limit can be set conservatively, essentially precluding the possibility of chimeric reads passing through unless the contributing reads were too short to begin with.
Again because of the narrow read length distributions, any significant population of chimeric reads would be obvious on the read length QC plot as a second peak. In practice, there is sometimes a small peak at approximately twice the amplicon size (this is before size filtering), but it is a very small fraction of the total population.
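To make the reasoning in the points above concrete, here is a minimal Python sketch of how the layered checks combine. It is a toy model, not Porechop's or guppyplex's actual code: it collapses checks performed by separate tools into one hypothetical function, assign_barcode. The Read fields and all threshold values are assumptions chosen for illustration (amplicons of roughly 400 bp and a 400-700 bp size window).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative assumptions only; substitute the settings you actually run with.
BARCODE_THRESHOLD = 75.0           # minimum % identity for a barcode hit to count
BARCODE_DIFF = 5.0                 # minimum margin between best and competing hit
MIN_LENGTH, MAX_LENGTH = 400, 700  # size window applied by the length filter


@dataclass
class Read:
    length: int
    start_hit: Optional[Tuple[str, float]]  # (barcode name, % identity) or None
    end_hit: Optional[Tuple[str, float]]
    middle_adapter: bool = False            # adapter found inside the read


def assign_barcode(read: Read) -> Optional[str]:
    """Return a barcode bin for the read, or None if any check rejects it."""
    # An internal adapter suggests a chimera, so the read is never binned.
    if read.middle_adapter:
        return None
    # Size filter: two joined ~400 bp amplicons would exceed MAX_LENGTH.
    if not MIN_LENGTH <= read.length <= MAX_LENGTH:
        return None
    # Keep only barcode hits that pass the identity threshold.
    hits = [h for h in (read.start_hit, read.end_hit)
            if h is not None and h[1] >= BARCODE_THRESHOLD]
    if not hits:
        return None
    hits.sort(key=lambda h: h[1], reverse=True)
    best_name, best_score = hits[0]
    # A passing hit to a different barcode at the other end, scoring within
    # BARCODE_DIFF of the best hit, blocks classification.
    for name, score in hits[1:]:
        if name != best_name and best_score - score < BARCODE_DIFF:
            return None
    return best_name


reads = [
    Read(410, ("NB01", 92.0), None),               # barcode on one end only
    Read(405, ("NB01", 90.0), ("NB02", 88.0)),     # conflicting ends
    Read(820, ("NB01", 95.0), ("NB01", 93.0)),     # about twice amplicon size
    Read(400, ("NB01", 95.0), ("NB01", 93.0), middle_adapter=True),
]
results = [assign_barcode(r) for r in reads]
print(results)
```

In this model only the first read is binned. The conflicting-end read, the double-length read and the read with an internal adapter are all rejected even though two matching barcodes were never required.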
Clearly, the ideal solution is to fix the problem of apparent missing barcodes shown in the plot above. However, in the meantime we would like to make the best use of the data we have. My inclination, based on the above reasoning, is to remove the two-barcode requirement and re-process the data. I would appreciate any feedback from the ARTIC experts on the points above and if this would be considered acceptable practice.