Two-barcode demux requirement

Hello all,

This is a duplicate of a question I posted to the ARTIC pipeline GitHub issues (at github.com/artic-network/fieldbioinformatics/issues/52 – can’t actually link because I’m too new!), in case there are additional community experts who may see it here.

The ARTIC SARS-CoV-2 SOP states: “For the current version of the ARTIC protocol it is essential to demultiplex using strict parameters to ensure barcodes are present at each end of the fragment.” This is elaborated on in the guide here, which explains that the main concern is the possible occurrence of in silico chimeric reads. That all seems reasonable, and we have been following the recommended stringent Guppy demux settings for nanopore-based SARS-CoV-2 sequencing.

However, on some runs we are seeing a significant percentage of reads that are not being assigned to barcode bins because of the two-barcode requirement. The extent of the issue varies from run to run, but in the latest run, for example, 78% of reads were thrown out because of this requirement. Based on the plot below, I suspect this is because the barcode and adapter sequence are actually missing from one end of the reads. Why that is happening is another issue, and one that is more ONT-specific than ARTIC-specific, although any suggestions are welcome.

length_by_class.pdf

In any case, if I remove the two-barcode requirement and leave the rest of the parameters at their defaults, only 7% of the reads are unclassified. Because coverage is uneven across some of the amplicons, the result for many samples is a drastic difference in the completeness of the consensus genome sequence.

In doing some benchmarking of demultiplexing and the effect of tuning various parameters, I’ve started questioning the need for the two-barcode requirement. My logic is based on the fact that there are multiple other filters and factors which mitigate the potential for issues caused by in silico chimeric reads. Although we’ve been using Guppy lately to demux at the same time as basecalling, the following points are specific to Porechop, as that’s what I’ve used to investigate and benchmark the issue:

  1. Even when two barcode matches are not required, Porechop will not classify a read that has mismatched, high-quality barcode hits at its two ends. If one end’s best hit is NB1 with a passing score, and the other end’s best hit is NB2 with a passing score within a certain distance of the first, the read is left unclassified. This is controlled by Porechop’s --barcode_diff parameter. (A rough sketch of this and the following points appears after the list.)

  2. Porechop will not classify a read that has a matching adapter sequence in the middle, as would be expected for chimeric reads. This is enabled by default when demultiplexing and in fact cannot be disabled.

  3. Perhaps most importantly, we are applying the size filtering using artic guppyplex as recommended in the SOP. Because the ncov19 amplicon sizes are fairly uniform, and thus the read length distribution is fairly tight, the upper limit can be set conservatively, essentially precluding the possibility of chimeric reads passing through unless the contributing reads were too short to begin with.

  4. Again because of the narrow read length distribution, any significant population of chimeric reads would be obvious as a second peak in the read length QC plot. In practice, there is sometimes a small peak at approximately twice the amplicon size (before size filtering), but it is a very small fraction of the total population.
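To make points 1–3 concrete, here is a rough sketch of the kind of decision logic I have in mind. This is not Porechop’s actual code; the parameter names, default values, and 0–100 score scale are my own assumptions for illustration only.

```python
# Rough sketch (not Porechop's actual code) of the checks that would still
# apply with only a single barcode match required. Parameter names, defaults,
# and the 0-100 score scale are assumptions for illustration.

def classify_read(read_len, start_hits, end_hits, mid_read_adapter,
                  barcode_threshold=75.0, barcode_diff=5.0, max_length=700):
    """Return a barcode name, or None if the read should stay unclassified."""
    # Point 2: reject reads with an adapter hit in the middle (likely chimera).
    if mid_read_adapter:
        return None

    # Point 3: reject reads over the amplicon-based size cap (the guppyplex
    # length filter would remove these anyway).
    if read_len > max_length:
        return None

    # Best hit at each end: (barcode_name, score) tuples, or None if no hit.
    best_start = max(start_hits, key=lambda h: h[1], default=None)
    best_end = max(end_hits, key=lambda h: h[1], default=None)

    # Point 1: confident but mismatched hits at the two ends, with scores
    # within barcode_diff of each other, leave the read unclassified.
    if best_start and best_end:
        (name_s, score_s), (name_e, score_e) = best_start, best_end
        if (name_s != name_e
                and score_s >= barcode_threshold
                and score_e >= barcode_threshold
                and abs(score_s - score_e) < barcode_diff):
            return None

    # Otherwise classify on the single best passing hit from either end.
    candidates = [h for h in (best_start, best_end)
                  if h is not None and h[1] >= barcode_threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h[1])[0]
```

Under these assumptions, `classify_read(520, [("NB1", 92.0)], [("NB2", 90.0)], False)` stays unclassified because of the mismatched ends, while `classify_read(520, [("NB1", 92.0)], [], False)` is binned to NB1 on its single passing hit.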

Clearly, the ideal solution is to fix the problem of the apparently missing barcodes shown in the plot above. In the meantime, however, we would like to make the best use of the data we have. My inclination, based on the above reasoning, is to remove the two-barcode requirement and re-process the data. I would appreciate any feedback from the ARTIC experts on the points above and on whether this would be considered acceptable practice.

Many thanks.

Thanks for your considered question!

There are a lot of ways to answer this question, but the easiest way to really understand how well any given demultiplexing strategy is working is by reference to positive and negative controls.

Specifically:

  • negative controls (NTC, typically lab water taken through PCR) should be truly negative for on-target reads, with a tolerance for a small handful of reads (<20)
  • positive controls, particularly if they are of an unrelated organism, can indicate how much barcode crossover there is by examining alignments to that taxon in the other barcodes (see the toy calculation below)
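As a toy illustration of the second point (the barcode names and counts below are invented, and this is not part of any ARTIC tooling), the crossover rate is just the fraction of control-organism reads that ended up in the wrong bins:

```python
# Hypothetical crossover check using a positive control on barcode NB05.
# All counts are made up; in practice they would come from counting reads
# per barcode bin that align to the control organism.
control_barcode = "NB05"
control_hits = {"NB05": 48_200, "NB01": 310, "NB02": 85, "NB03": 140, "NB04": 95}

leaked = sum(n for bc, n in control_hits.items() if bc != control_barcode)
total = sum(control_hits.values())
print(f"estimated crossover rate: {leaked / total:.2%}")  # ~1.29% in this toy example
```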

If your hunch is right and these conditions are met in the controls, then it may be fine to process with single barcodes only. If you don’t have controls to guide you, I would be extremely cautious about this approach.

I would agree, though, that the length filter is highly effective at removing the bulk of chimeric reads.

Hope that helps.

@nick Thanks for your quick response. I agree that proper controls would greatly simplify the decision-making, and I’ll discuss this with the lab doing the sample collection and sequencing as they are scaling up to run more samples, but unfortunately they do not have controls for this run.

Just to play devil’s advocate (and setting aside the issue of potential amplification of contaminating sequence, which is a big issue to set aside but impossible to address without proper controls in this case), how much does cross-talk between samples really matter in this workflow? The handful of papers I’ve read that specifically discuss the issue report cross-talk rates of anywhere from a few hundredths of a percent to one or two percent. I don’t see how that is going to affect a pipeline that is based on reference-based assembly and is already designed to handle error rates in the reads well above the worst estimates of cross-talk I’ve seen in the literature. It’s not a diagnostic assay, after all (at least, that’s not how they are using it), and we’re not trying to call rare variants in a sample.
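To put rough numbers on that (the depth, cross-talk rate, and consensus threshold below are all assumptions for the sake of argument, not measurements from this run):

```python
# Back-of-the-envelope numbers, all assumed for illustration.
depth = 400                # read depth at a position in this sample's bin
crosstalk_rate = 0.02      # 2%: roughly the worst published estimate mentioned above
consensus_threshold = 0.5  # hypothetical majority-rule consensus frequency cutoff

contaminating_reads = depth * crosstalk_rate          # ~8 reads
print(f"contaminating reads at this position: ~{contaminating_reads:.0f} of {depth}")
print("enough to flip a majority-rule call:", crosstalk_rate > consensus_threshold)  # False
```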

In the specific case of this dataset, which may well be an edge case, the choice is between throwing out an additional 70% of the reads (and being left with large gaps in the consensus), relaxing the thresholds and getting significantly more complete sequences, or re-running the samples with proper controls. I guess I will see if they have the funds to do the latter before thinking too much more about the former options.

Thanks again for your feedback.

I definitely encourage the use of controls; I think they are a must.

So in a typical random genomic DNA experiment, where random sampling should give generally even coverage across the genome, low levels of contamination would not be expected to cause any significant issues with consensus sequence generation.

However, in amplicon sequencing the situation is notably different: quite often a particular sample fails to amplify some tiles at all (e.g. because of a low-copy-number viral genome or a primer-binding-site mismatch) while other samples generate high coverage of the corresponding tile. That is a potentially major confounder for consensus generation, because in that region all of your reads could be coming from other samples. If that tile contains a variant, it could end up looking like some strange recombination in downstream phylogenetic analysis.
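The arithmetic that makes this worrying is easy to sketch; the depths, cross-talk rate, and minimum-depth threshold below are assumed purely for illustration:

```python
# Illustrative numbers only: one tile that failed to amplify in this sample.
own_reads = 0                # on-target reads for the dropped-out tile
other_samples_depth = 3000   # combined depth the other samples have over that tile
crosstalk_rate = 0.01        # 1% of their reads leak into this barcode bin
min_depth = 20               # hypothetical minimum-depth threshold for calling

leaked_reads = other_samples_depth * crosstalk_rate   # 30 reads
total_depth = own_reads + leaked_reads
print(f"depth over the tile: {total_depth:.0f}, all of it from other samples")
print("passes the depth threshold anyway:", total_depth >= min_depth)  # True
```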

I wrote more of an explainer here, which you might find helpful:
https://artic.network/quick-guide-to-tiling-amplicon-sequencing-bioinformatics.html

Makes sense. Thanks again for the feedback.