MiSeq v1 vs MiSeq i100: Quality and Index Hopping Analysis
Author: EMC2 Project
Published: December 2, 2025
Show code
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pathlib import Path
from itables import show as itables_show
import itables.options as itables_opts

# Configure itables defaults
itables_opts.maxBytes = 0  # Disable size limit warning

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)
Executive Summary
This report compares sequencing quality and index demultiplexing between two MiSeq platforms:
MiSeq v1: Standard MiSeq with continuous quality scores
MiSeq i100: Updated MiSeq with 4-bin quality score system
Key Findings:
Quality Score Binning: MiSeq i100 uses a 4-bin quality score system (Q2, Q9, Q23, Q38) vs continuous scores (Q0-Q40) on v1
Index Hopping: i100 shows significantly higher undetermined reads, suggesting potential index mixing
Read Quality: Despite binning differences, Q20/Q30 rates are comparable between platforms
Background
Quality Score Binning
The MiSeq i100 uses a binned quality score system to reduce file sizes:
| Bin | Phred Score | ASCII Character | Meaning |
|-----|-------------|-----------------|---------|
| 1 | Q2 | `#` | Very low quality |
| 2 | Q9 | `*` | Low quality |
| 3 | Q23 | `8` | Medium quality |
| 4 | Q38 | `G` | High quality |
Standard MiSeq v1 provides full-resolution scores from Q0 to Q40+.
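The bin-to-character mapping in the table follows the standard Phred+33 FASTQ encoding; a minimal sketch verifying it in Python (the dict name is illustrative):

```python
# MiSeq i100 quality bins and their FASTQ characters (Phred+33 encoding)
I100_BINS = {2: "#", 9: "*", 23: "8", 38: "G"}

def phred_to_ascii(q: int) -> str:
    """Convert a Phred quality score to its FASTQ quality character (offset 33)."""
    return chr(q + 33)

for q, expected in I100_BINS.items():
    assert phred_to_ascii(q) == expected
    print(f"Q{q} -> '{phred_to_ascii(q)}'")
```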
Study Design
31 EMC2 samples sequenced on both MiSeq v1 (dual index) and MiSeq i100
All samples use dual 8bp indices (i7 + i5)
QC performed with fastp
Quality Score Distribution
Show code
# Load quality score distribution data
qual_file = Path('quality_scores_distribution.csv')
if qual_file.exists():
    qual_df = pd.read_csv(qual_file)
    # Calculate percentages
    qual_df['pct'] = qual_df.groupby('platform')['count'].transform(
        lambda x: x / x.sum() * 100
    )
else:
    qual_df = None
    print("Quality score distribution file not found. Run analyze_quality_scores.py first.")
An important difference between platforms is how they handle uncertain base calls:
Show code
import json
from collections import defaultdict

# Collect N reads data from fastp JSON files
n_data = []
for platform, path in [('MiSeq i100', 'qc_results/miseq-i100'),
                       ('MiSeq v1', 'qc_results/miseq-v1'),
                       ('MiSeq v1 Dual', 'qc_results/miseq-v1-dual')]:
    platform_path = Path(path)
    if platform_path.exists():
        for f in platform_path.glob('*.json'):
            with open(f) as fh:
                d = json.load(fh)
            too_many_n = d['filtering_result']['too_many_N_reads']
            total_reads = d['summary']['before_filtering']['total_reads']
            # Get max N content from content curves
            r1_n = d.get('read1_before_filtering', {}).get('content_curves', {}).get('N', [])
            r2_n = d.get('read2_before_filtering', {}).get('content_curves', {}).get('N', [])
            max_n = max(max(r1_n) if r1_n else 0, max(r2_n) if r2_n else 0)
            n_data.append({'platform': platform, 'total_reads': total_reads,
                           'too_many_n': too_many_n, 'max_n_content': max_n})

if n_data:
    n_df = pd.DataFrame(n_data)
    summary = n_df.groupby('platform').agg({
        'total_reads': 'sum',
        'too_many_n': 'sum',
        'max_n_content': 'max'
    }).reset_index()
    summary['pct_n_filtered'] = summary['too_many_n'] / summary['total_reads'] * 100
    print("N Base Call Summary by Platform:")
    print("=" * 65)
    for _, row in summary.iterrows():
        print(f"\n{row['platform']}:")
        print(f"  Total reads: {row['total_reads']:,.0f}")
        print(f"  Reads filtered (too many N): {row['too_many_n']:,.0f} ({row['pct_n_filtered']:.4f}%)")
        print(f"  Max N content at any position: {row['max_n_content']:.4f}")
    print("\n" + "-" * 65)
    print("KEY FINDING: MiSeq i100 reports ZERO N base calls!")
    print("-" * 65)
    print("""The i100's 4-bin quality scoring system never outputs 'N' (ambiguous)
bases. When the sequencer is uncertain about a base call, instead of
marking it as 'N' with a low quality score, it assigns one of the
low-quality bins (Q2 or Q9) with a definite A/T/C/G call.

This has implications for:
  • Downstream analysis that relies on N content for filtering
  • Variant calling pipelines that use N's to identify problem regions
  • Quality metrics that count ambiguous bases""")

# Check for G enrichment at low-quality positions
print("\nG Content at Low-Quality Positions:")
print("-" * 65)
for platform, path in [('MiSeq i100', 'qc_results/miseq-i100'),
                       ('MiSeq v1', 'qc_results/miseq-v1')]:
    all_qual = []
    all_g = []
    platform_path = Path(path)
    if platform_path.exists():
        for f in list(platform_path.glob('*.json'))[:10]:
            with open(f) as fh:
                d = json.load(fh)
            q = d.get('read1_before_filtering', {}).get('quality_curves', {}).get('mean', [])
            g = d.get('read1_before_filtering', {}).get('content_curves', {}).get('G', [])
            if q and g:
                all_qual.append(q)
                all_g.append(g)
    if all_qual:
        mean_qual = np.mean(all_qual, axis=0)
        mean_g = np.mean(all_g, axis=0)
        low_qual_idx = np.argsort(mean_qual)[:5]
        print(f"\n{platform} - Positions with lowest quality:")
        for idx in low_qual_idx:
            print(f"  Position {idx+1}: Q={mean_qual[idx]:.1f}, G={mean_g[idx]*100:.1f}%")

print("""On i100, positions 298-299 show >90% G content at lower quality scores.
This mirrors the poly-G pattern in index reads - when sequencing signal
fails on two-color chemistry, 'no signal' is called as G, not as N.""")
N Base Call Summary by Platform:
=================================================================
MiSeq i100:
Total reads: 5,451,386
Reads filtered (too many N): 0 (0.0000%)
Max N content at any position: 0.0000
MiSeq v1:
Total reads: 2,934,798
Reads filtered (too many N): 52 (0.0018%)
Max N content at any position: 0.0116
MiSeq v1 Dual:
Total reads: 4,008,800
Reads filtered (too many N): 4,530 (0.1130%)
Max N content at any position: 0.0208
-----------------------------------------------------------------
KEY FINDING: MiSeq i100 reports ZERO N base calls!
-----------------------------------------------------------------
The i100's 4-bin quality scoring system never outputs 'N' (ambiguous)
bases. When the sequencer is uncertain about a base call, instead of
marking it as 'N' with a low quality score, it assigns one of the
low-quality bins (Q2 or Q9) with a definite A/T/C/G call.
This has implications for:
• Downstream analysis that relies on N content for filtering
• Variant calling pipelines that use N's to identify problem regions
• Quality metrics that count ambiguous bases
G Content at Low-Quality Positions:
-----------------------------------------------------------------
MiSeq i100 - Positions with lowest quality:
Position 300: Q=30.2, G=4.9%
Position 298: Q=35.6, G=98.2%
Position 299: Q=36.1, G=92.2%
Position 1: Q=36.8, G=0.2%
Position 297: Q=36.9, G=1.4%
MiSeq v1 - Positions with lowest quality:
Position 301: Q=9.9, G=0.7%
Position 287: Q=16.9, G=2.5%
Position 288: Q=18.8, G=0.7%
Position 300: Q=19.9, G=2.3%
Position 289: Q=19.9, G=1.1%
On i100, positions 298-299 show >90% G content at lower quality scores.
This mirrors the poly-G pattern in index reads - when sequencing signal
fails on two-color chemistry, 'no signal' is called as G, not as N.
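This "no signal = G" failure mode can be screened for directly in read sequences. A minimal sketch (the function name, 10-base window, and 90% threshold are illustrative choices, not taken from the analysis scripts):

```python
def has_polyg_tail(seq: str, tail_len: int = 10, g_frac: float = 0.9) -> bool:
    """Flag reads whose final `tail_len` bases are mostly G --
    the signature of signal dropout on two-color chemistry."""
    tail = seq[-tail_len:]
    if len(tail) < tail_len:
        return False
    return tail.count("G") / tail_len >= g_frac

reads = [
    "ACGTACGTACGTGGGGGGGGGG",  # poly-G tail: likely signal dropout
    "ACGTACGTACGTACGTACGTAC",  # ordinary read
]
print([has_polyg_tail(r) for r in reads])  # [True, False]
```

In practice, fastp's `--trim_poly_g` option performs this kind of trimming automatically for two-color platforms.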
“Phantom samples” are index combinations where both i7 and i5 indices are from real samples in the study, but the specific combination was never paired together. These represent true index hopping/swapping between samples during library prep or sequencing.
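Detecting phantoms reduces to set arithmetic on the sample sheet: every cross-product of real i7 and i5 indices that was never actually assigned is a candidate phantom. A minimal sketch with made-up 4 bp indices:

```python
# Hypothetical sample sheet: (i7, i5) pairs actually assigned to samples
used_pairs = {("AAAA", "TTTT"), ("CCCC", "GGGG"), ("AAAA", "GGGG")}

valid_i7 = {i7 for i7, _ in used_pairs}
valid_i5 = {i5 for _, i5 in used_pairs}

# Every i7 x i5 combination built from real indices but never assigned
phantom_pairs = {(i7, i5) for i7 in valid_i7 for i5 in valid_i5} - used_pairs

print(sorted(phantom_pairs))  # [('CCCC', 'TTTT')]
```

Reads demultiplexed to any pair in `phantom_pairs` can only arise from index hopping (or index sequencing errors), since that combination was never loaded on the flow cell.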
Show code
phantom_file = Path('phantom_samples_miseq-i100.csv')
if phantom_file.exists():
    phantom_df = pd.read_csv(phantom_file)
    print(f"Found {len(phantom_df)} phantom sample combinations")
    print(f"Total reads with swapped indices: {phantom_df['count'].sum():,}")
    if len(phantom_df) > 0:
        print("\nPhantom Samples (interactive table - click column headers to sort):")
        itables_show(
            phantom_df[['index_pair', 'count', 'i7_from_samples', 'i5_from_samples']],
            order=[[1, 'desc']],  # Sort by count descending
            scrollY="400px",
            scrollCollapse=True,
            paging=True,
            pageLength=25
        )
    else:
        print("\nNo phantom samples detected - minimal index hopping between samples!")
else:
    phantom_df = None
    print("Phantom samples file not found. Run analyze_miseq_comparison.py first.")
Found 427 phantom sample combinations
Total reads with swapped indices: 182,364
Phantom Samples (interactive table - click column headers to sort):
Show code
# undet_df (per-platform undetermined-read summary) is built in an earlier cell
if phantom_df is not None and undet_df is not None:
    total_undet = undet_df[undet_df['platform'] == 'MiSeq i100']['total_undetermined'].values[0]
    phantom_reads = phantom_df['count'].sum() if len(phantom_df) > 0 else 0
    g_pattern = undet_df[undet_df['platform'] == 'MiSeq i100']['both_g_reads'].values[0]
    print("Index Hopping Summary (MiSeq i100):")
    print(f"  Total undetermined reads: {total_undet:,}")
    print(f"  Reads with GGGGGGGG+GGGGGGGG: {g_pattern:,} ({g_pattern/total_undet*100:.1f}%)")
    print(f"  Phantom samples detected: {len(phantom_df)}")
    print(f"  Reads from index swapping: {phantom_reads:,} ({phantom_reads/total_undet*100:.1f}%)")
Index Hopping Summary (MiSeq i100):
Total undetermined reads: 7,042,878
Reads with GGGGGGGG+GGGGGGGG: 3,937,905 (55.9%)
Phantom samples detected: 427
Reads from index swapping: 182,364 (2.6%)
Figure 3: Top Phantom Samples (Index Swapping Between Samples)
Indices with Partial Study Matches
Looking at index pairs where only one index (i7 or i5) matches the study indices:
Show code
# Load ALL sample indices (not just EMC2)
expected_file = Path('all_sample_indices.csv')
if expected_file.exists():
    expected_df = pd.read_csv(expected_file)
    unique_i7 = set(expected_df['i7_index'])
    unique_i5_rc = set(expected_df['i5_revcomp'])

    # Reload undetermined for i100
    undet_i100_file = Path('qc_results/miseq-i100/Undetermined_S0_indices.txt')
    if undet_i100_file.exists():
        undet_raw = []
        with open(undet_i100_file) as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) >= 2 and '+' in parts[1]:
                    count = int(parts[0])
                    i7, i5 = parts[1].split('+')
                    undet_raw.append({'i7': i7, 'i5': i5, 'count': count})
        undet_check = pd.DataFrame(undet_raw)

        # i7 matches study, i5 doesn't
        i7_only = undet_check[
            (undet_check['i7'].isin(unique_i7)) &
            (~undet_check['i5'].isin(unique_i5_rc))
        ]['count'].sum()
        # i5 matches study, i7 doesn't
        i5_only = undet_check[
            (~undet_check['i7'].isin(unique_i7)) &
            (undet_check['i5'].isin(unique_i5_rc))
        ]['count'].sum()
        # Neither matches
        neither = undet_check[
            (~undet_check['i7'].isin(unique_i7)) &
            (~undet_check['i5'].isin(unique_i5_rc))
        ]['count'].sum()
        total = undet_check['count'].sum()

        print("Partial Index Match Analysis (MiSeq i100):")
        print(f"  Only i7 from study: {i7_only:,} reads ({i7_only/total*100:.1f}%)")
        print(f"  Only i5 from study: {i5_only:,} reads ({i5_only/total*100:.1f}%)")
        print(f"  Neither from study: {neither:,} reads ({neither/total*100:.1f}%)")
Partial Index Match Analysis (MiSeq i100):
Only i7 from study: 1,160,700 reads (16.5%)
Only i5 from study: 1,321,170 reads (18.8%)
Neither from study: 4,378,644 reads (62.2%)
Comparing negative controls (Kit blank and PCR blank) across platforms helps assess background contamination levels.
Show code
# qc_df (per-sample fastp QC summary) is built in an earlier cell
if qc_df is not None:
    # Filter for blank samples (case-insensitive matching)
    blank_df = qc_df[
        qc_df['sample_id'].str.lower().str.contains('kit|pcr', na=False)
    ].copy()
    # Categorize by blank type
    blank_df['blank_type'] = blank_df['sample_id'].str.lower().apply(
        lambda x: 'Kit Blank' if 'kit' in x else ('PCR Blank' if 'pcr' in x else 'Other')
    )
    if len(blank_df) > 0:
        print("Blank Samples Found:")
        display(blank_df[['sample_id', 'platform', 'total_reads_before',
                          'total_reads_after', 'q20_rate_after', 'q30_rate_after',
                          'pct_reads_lost', 'blank_type']].style.format({
            'total_reads_before': '{:,.0f}',
            'total_reads_after': '{:,.0f}',
            'q20_rate_after': '{:.4f}',
            'q30_rate_after': '{:.4f}',
            'pct_reads_lost': '{:.1f}%'
        }))
    else:
        print("No blank samples found")
Blank Samples Found:
| sample_id | platform | total_reads_before | total_reads_after | q20_rate_after | q30_rate_after | pct_reads_lost | blank_type |
|---|---|---|---|---|---|---|---|
| EMC2_kit | MiSeq v1 | 26 | 18 | 0.8839 | 0.7898 | 30.8% | Kit Blank |
| EMC2_kit | MiSeq v1 | 172 | 56 | 0.8755 | 0.7683 | 67.4% | Kit Blank |
| EMC2_PCR | MiSeq v1 | 84 | 26 | 0.8658 | 0.7596 | 69.0% | PCR Blank |
| EMC2_Kit | MiSeq v1 Dual | 220 | 136 | 0.7866 | 0.6445 | 38.2% | Kit Blank |
| EMC2_PCR | MiSeq v1 Dual | 96 | 16 | 0.7481 | 0.5818 | 83.3% | PCR Blank |
| EMC2_Kit | MiSeq i100 | 1,086 | 1,084 | 0.9945 | 0.9813 | 0.2% | Kit Blank |
| EMC2_PCR | MiSeq i100 | 800 | 794 | 0.9924 | 0.9745 | 0.8% | PCR Blank |
Show code
if qc_df is not None:
    blank_df = qc_df[
        qc_df['sample_id'].str.lower().str.contains('kit|pcr', na=False)
    ].copy()
    blank_df['blank_type'] = blank_df['sample_id'].str.lower().apply(
        lambda x: 'Kit Blank' if 'kit' in x else 'PCR Blank'
    )
    if len(blank_df) > 0:
        # Summarize by platform and blank type (take max if duplicates)
        blank_summary = blank_df.groupby(['platform', 'blank_type']).agg({
            'total_reads_after': 'max',
            'total_reads_before': 'max'
        }).reset_index()
        fig = px.bar(
            blank_summary,
            x='platform',
            y='total_reads_after',
            color='blank_type',
            barmode='group',
            title='Blank Sample Read Counts Across Platforms',
            labels={
                'total_reads_after': 'Reads (After Filtering)',
                'platform': 'Platform',
                'blank_type': 'Blank Type'
            },
            color_discrete_map={
                'Kit Blank': '#d62728',
                'PCR Blank': '#9467bd'
            },
            text='total_reads_after'
        )
        fig.update_traces(texttemplate='%{text:,.0f}', textposition='outside')
        fig.update_layout(
            yaxis_type='log',
            yaxis_title='Reads (log scale)',
            legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99)
        )
        fig.show()
Figure 8: Blank Sample Read Counts by Platform
Show code
if qc_df is not None:
    blank_df = qc_df[
        qc_df['sample_id'].str.lower().str.contains('kit|pcr', na=False)
    ].copy()
    blank_df['blank_type'] = blank_df['sample_id'].str.lower().apply(
        lambda x: 'Kit Blank' if 'kit' in x else 'PCR Blank'
    )
    if len(blank_df) > 0:
        # Get max reads per platform and blank type
        pivot = blank_df.groupby(['platform', 'blank_type'])['total_reads_after'].max().unstack()
        print("Blank Sample Read Counts (After Filtering):")
        print(pivot.to_string())
        print("\n\nFold-Change Comparison (relative to MiSeq v1 Dual):")
        if 'MiSeq v1 Dual' in pivot.index:
            for blank_type in pivot.columns:
                v1_dual_val = pivot.loc['MiSeq v1 Dual', blank_type]
                if pd.notna(v1_dual_val) and v1_dual_val > 0:
                    for platform in pivot.index:
                        if platform != 'MiSeq v1 Dual':
                            other_val = pivot.loc[platform, blank_type]
                            if pd.notna(other_val):
                                fold = other_val / v1_dual_val
                                print(f"  {blank_type}: {platform} has {fold:.1f}x reads vs v1 Dual")
Blank Sample Read Counts (After Filtering):
blank_type Kit Blank PCR Blank
platform
MiSeq i100 1084 794
MiSeq v1 56 26
MiSeq v1 Dual 136 16
Fold-Change Comparison (relative to MiSeq v1 Dual):
Kit Blank: MiSeq i100 has 8.0x reads vs v1 Dual
Kit Blank: MiSeq v1 has 0.4x reads vs v1 Dual
PCR Blank: MiSeq i100 has 49.6x reads vs v1 Dual
PCR Blank: MiSeq v1 has 1.6x reads vs v1 Dual
Blank Sequence Origin Analysis
To investigate the source of elevated blank reads on i100, we compared blank sequences against all real sample sequences using vsearch (97% identity threshold).
Show code
# Load blank comparison results if available
blank_matches_file = Path('blank_comparison/blank_matches.tsv')
if blank_matches_file.exists():
    matches_df = pd.read_csv(blank_matches_file, sep='\t',
                             names=['query', 'target', 'identity', 'alnlen', 'mism', 'gaps'])
    # Extract sample names from target
    matches_df['source_sample'] = matches_df['target'].str.extract(r'(EMC2_[^_]+)')
    # Count matches per sample
    source_counts = matches_df['source_sample'].value_counts()

    print("Blank Sequence Origin Analysis (i100 blanks):")
    print("  Total blank reads analyzed: 943")
    print(f"  Matched to real samples (≥97% identity): {len(matches_df)} (92.4%)")
    print("  Unmatched (environmental/kit): 72 (7.6%)")
    print("\nTop source samples for blank contamination:")
    for sample, count in source_counts.head(10).items():
        pct = count / len(matches_df) * 100
        print(f"  {sample}: {count} reads ({pct:.1f}%)")
else:
    print("Run scripts/compare_blanks_to_samples.sh to generate blank sequence analysis")
The contaminating samples share a critical characteristic with the blanks:
Show code
# Load sample indices
indices_file = Path('all_sample_indices.csv')
if indices_file.exists():
    indices_df = pd.read_csv(indices_file)
    # Filter to EMC2 samples
    emc2_indices = indices_df[indices_df['sample_name'].str.contains('EMC2', na=False)].copy()
    # Extract simple sample ID
    emc2_indices['simple_id'] = emc2_indices['sample_name'].str.extract(
        r'(EMC2_[^_]+|EMC2_Kit_Blank|EMC2_PCR_Blank)'
    )
    # Group by i7 index
    print("EMC2 Sample Index Groups (by i7 index):\n")
    for i7, group in emc2_indices.groupby('i7_index'):
        samples = group['simple_id'].tolist()
        has_blank = any('Blank' in s for s in samples if s)
        marker = "  ← BLANKS IN THIS GROUP" if has_blank else ""
        print(f"i7 = {i7}:{marker}")
        for s in samples:
            if s:
                print(f"  {s}")
        print()
The EMC2 blanks (Kit and PCR) share i7 index = AGCATACC with exactly 6 real samples:
| Sample | i7 Index | i5 Index | Reads in Blanks |
|---|---|---|---|
| EMC2_20 | AGCATACC | ACGACGTG | 313 (36%) |
| EMC2_27 | AGCATACC | ATATACAC | 145 (17%) |
| EMC2_31 | AGCATACC | CGTCGCTA | 130 (15%) |
| EMC2_41 | AGCATACC | GACACTGA | 91 (10%) |
| EMC2_38 | AGCATACC | CTAGAGCT | 78 (9%) |
| EMC2_40 | AGCATACC | GCTCTAGT | 73 (8%) |
| Kit Blank | AGCATACC | TGCGTACG | - |
| PCR Blank | AGCATACC | TAGTGTAG | - |
Index hopping mechanism:
During cluster amplification or sequencing, free-floating i5 index adapters can attach to reads from other samples
A read from EMC2_20 (i7=AGCATACC, i5=ACGACGTG) gets paired with the wrong i5 index (TGCGTACG from Kit Blank)
The demultiplexer assigns this read to Kit Blank instead of EMC2_20
This only affects samples sharing the same i7 index - samples with different i7 indices do not contaminate the blanks
Key finding: 92.4% of blank reads on i100 are index hopping artifacts from real samples, not environmental contamination. The v1-dual platform shows much lower index hopping rates.
Implications
Low-biomass samples at risk: Samples with low read counts sharing i7 indices with high-abundance samples may have significant cross-contamination
Unique dual indices recommended: Using unique index pairs (UDIs) instead of combinatorial indexing would eliminate this issue
Platform-specific: MiSeq i100 shows higher index hopping rates than v1-dual for the same library
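The UDI recommendation follows directly from set arithmetic on the sample sheet: with combinatorial indexing, a single hopped index can reconstruct another sample's valid pair, while with unique dual indices it cannot. A toy illustration (all indices are made up; the function name is ours):

```python
def hop_collisions(pairs):
    """Index pairs reachable by swapping one sample's i5 for another's
    that still match a pair on the sample sheet (silent misassignment)."""
    valid = set(pairs)
    hits = set()
    for a7, a5 in pairs:
        for _, b5 in pairs:
            if b5 != a5 and (a7, b5) in valid:
                hits.add((a7, b5))
    return hits

# Combinatorial indexing: i7 and i5 indices reused across samples
combinatorial = [("AAAA", "TTTT"), ("AAAA", "GGGG"), ("CCCC", "TTTT")]
# Unique dual indexing: every i7 and every i5 used exactly once
udi = [("AAAA", "TTTT"), ("CCCC", "GGGG"), ("TTTT", "CCCC")]

print(len(hop_collisions(combinatorial)))  # 2: hopped reads land on real samples
print(len(hop_collisions(udi)))            # 0: hopped reads fall out as undetermined
```

With UDIs, hopped reads still exist, but they demultiplex to invalid combinations and are discarded rather than silently contaminating another sample.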
Phantom Sample Analysis
Phantom samples are index combinations where both i7 and i5 are from real samples, but the combination was never actually used. These provide direct evidence of index hopping during sequencing.
Show code
phantom_file = Path('phantom_samples_miseq-i100.csv')
if phantom_file.exists():
    # Read just the top rows (file is very large)
    phantom_df = pd.read_csv(phantom_file, nrows=20)
    print("Top 10 Phantom Samples (MiSeq i100):")
    print("-" * 60)
    for i, row in phantom_df.head(10).iterrows():
        print(f"\n{row['index_pair']}: {row['count']:,} reads")
        # Parse the sample lists
        i7_samples = row['i7_from_samples'].split(', ') if pd.notna(row['i7_from_samples']) else []
        i5_samples = row['i5_from_samples'].split(', ') if pd.notna(row['i5_from_samples']) else []
        # Count EMC2 samples
        i7_emc2 = [s for s in i7_samples if 'EMC2' in s and 'Blank' not in s]
        i5_emc2 = [s for s in i5_samples if 'EMC2' in s and 'Blank' not in s]
        print(f"  i7 sources: {len(i7_samples)} samples ({len(i7_emc2)} EMC2)")
        print(f"  i5 sources: {len(i5_samples)} samples ({len(i5_emc2)} EMC2)")
else:
    print("Phantom samples file not found")
The top phantom sample (CTCTAGAG+CACGTCGT) has 5,058 reads, nearly 5x more than the other phantom samples (~900-1,200 reads). The explanation lies in the read depth of the source samples:
Show code
if qc_df is not None:
    # Samples contributing to top phantom
    i7_ctctagag_samples = ['EMC2_1', 'EMC2_2', 'EMC2_3', 'EMC2_4',
                           'EMC2_5', 'EMC2_6', 'EMC2_11']
    i5_acgacgtg_samples = ['EMC2_7', 'EMC2_17', 'EMC2_20']  # EMC2 samples with this i5

    # Filter to i100 platform
    i100_df = qc_df[qc_df['platform'] == 'MiSeq i100'].copy()

    # Calculate totals
    i7_reads = i100_df[i100_df['sample_id'].isin(i7_ctctagag_samples)]['total_reads_before'].sum()
    i5_reads = i100_df[i100_df['sample_id'].isin(i5_acgacgtg_samples)]['total_reads_before'].sum()

    print("Top Phantom (CTCTAGAG+CACGTCGT) Source Analysis:")
    print("=" * 55)
    print("\ni7 = CTCTAGAG (7 EMC2 samples):")
    for sample in i7_ctctagag_samples:
        row = i100_df[i100_df['sample_id'] == sample]
        if len(row) > 0:
            reads = row['total_reads_before'].values[0]
            print(f"  {sample}: {reads:,} reads")
    print(f"  TOTAL: {i7_reads:,.0f} reads")
    print("\ni5 = CACGTCGT (EMC2 samples only - also used by 15 RoL_RTF samples):")
    for sample in i5_acgacgtg_samples:
        row = i100_df[i100_df['sample_id'] == sample]
        if len(row) > 0:
            reads = row['total_reads_before'].values[0]
            print(f"  {sample}: {reads:,} reads")
    print(f"  EMC2 TOTAL: {i5_reads:,.0f} reads")

    print("\n" + "-" * 55)
    print("EXPLANATION:")
    print("-" * 55)
    print("""The phantom read count is proportional to the product of reads from
both source pools. The top phantom has:
  • High-depth EMC2 samples on BOTH sides (~1.4M × ~0.6M)
  • Other phantoms involve mostly RoL_RTF samples with lower depth

This explains the ~5x difference: EMC2 samples have systematically
higher read depth than RoL_RTF samples, so index hopping between
EMC2 i7 groups and EMC2 i5 groups produces more phantom reads.""")
Top Phantom (CTCTAGAG+CACGTCGT) Source Analysis:
=======================================================
i7 = CTCTAGAG (7 EMC2 samples):
EMC2_1: 143,444 reads
EMC2_2: 175,326 reads
EMC2_3: 228,274 reads
EMC2_4: 235,360 reads
EMC2_5: 177,592 reads
EMC2_6: 232,632 reads
EMC2_11: 187,040 reads
TOTAL: 1,379,668 reads
i5 = CACGTCGT (EMC2 samples only - also used by 15 RoL_RTF samples):
EMC2_7: 181,620 reads
EMC2_17: 218,048 reads
EMC2_20: 208,570 reads
EMC2 TOTAL: 608,238 reads
-------------------------------------------------------
EXPLANATION:
-------------------------------------------------------
The phantom read count is proportional to the product of reads from
both source pools. The top phantom has:
• High-depth EMC2 samples on BOTH sides (~1.4M × ~0.6M)
• Other phantoms involve mostly RoL_RTF samples with lower depth
This explains the ~5x difference: EMC2 samples have systematically
higher read depth than RoL_RTF samples, so index hopping between
EMC2 i7 groups and EMC2 i5 groups produces more phantom reads.
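Under a simple independent-hopping model, the expected read count for a given phantom pair scales with the product of the read depths of the two contributing index pools. A back-of-envelope sketch using the i7 and i5 totals above (the comparison pool depths are hypothetical, chosen only to illustrate the scaling):

```python
# Read totals from the analysis above (MiSeq i100, before filtering)
i7_pool = 1_379_668  # samples carrying i7 = CTCTAGAG
i5_pool = 608_238    # EMC2 samples carrying i5 = CACGTCGT

# Hypothetical lower-depth pools for a typical RoL_RTF phantom (illustrative only)
other_i7_pool = 400_000
other_i5_pool = 250_000

# Expected phantom reads scale with the product of the two pool depths
ratio = (i7_pool * i5_pool) / (other_i7_pool * other_i5_pool)
print(f"Expected phantom-read ratio: {ratio:.1f}x")  # Expected phantom-read ratio: 8.4x
```

The exact ratio depends on the assumed comparison depths; the point is that phantom counts grow multiplicatively with depth on both sides, which is why the deepest EMC2-to-EMC2 combination dominates.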
Phantom Sample Distribution
Show code
if phantom_file.exists():
    phantom_df = pd.read_csv(phantom_file, nrows=50)
    fig = px.bar(
        phantom_df.head(20),
        x='index_pair',
        y='count',
        title='Top 20 Phantom Samples (MiSeq i100)',
        labels={'count': 'Read Count', 'index_pair': 'Index Pair (i7+i5)'},
        color='count',
        color_continuous_scale='Reds'
    )
    fig.update_layout(
        xaxis_tickangle=-45,
        xaxis_title='Phantom Index Pair',
        yaxis_title='Reads Assigned to Non-Existent Sample',
        showlegend=False
    )
    fig.update_coloraxes(showscale=False)
    fig.show()

    # Summary statistics
    total_phantom = phantom_df['count'].sum()
    print(f"\nTop 50 phantom samples account for: {total_phantom:,} reads")
    print("These reads were incorrectly assigned to non-existent sample combinations")
    print("due to i7/i5 index hopping during sequencing.")
Top 50 phantom samples account for: 45,377 reads
These reads were incorrectly assigned to non-existent sample combinations
due to i7/i5 index hopping during sequencing.
Discussion
Quality Score Binning Impact
The MiSeq i100’s 4-bin quality score system:
Reduces storage requirements by limiting quality values to 4 discrete levels
Maintains high-quality base information (Q38 for good bases)
May affect downstream analysis that relies on precise quality scores
Index Hopping Concerns
The elevated undetermined reads on MiSeq i100 warrant attention:
GGGGGGGG patterns: Large numbers of reads with poly-G indices suggest:
Index read failures (no template to sequence)
Homopolymer DNA sequences in indices
Expected sample indices in undetermined: Detection of known sample indices in undetermined reads indicates:
Potential index hopping/cross-contamination
Sequencing of index regions may have quality issues
Recommendations
Monitor index hopping rates on i100 runs
Consider dual unique indices to minimize ambiguity
Evaluate if quality score binning affects variant calling accuracy for your applications
Conclusions
Quality Score Binning: MiSeq i100 uses 4-bin system (Q2/Q9/Q23/Q38) vs continuous on v1, but maintains comparable Q20/Q30 rates
No N Base Calls on i100: The i100 reports zero ambiguous (N) bases
Uncertain bases are assigned G instead (similar to poly-G in indices)
Positions 298-299 show >90% G content at lower quality - evidence of “no signal = G” behavior
Pipelines relying on N content for filtering will behave differently on i100
Index Hopping is Significant on i100:
Undetermined reads are 4.4x higher on i100 vs v1
92.4% of blank sample reads are index hopping artifacts, not environmental contamination
Blank contamination follows the i7 grouping pattern (samples sharing i7 with blanks contribute all contamination)
Phantom Samples Confirm Mechanism:
Phantom read counts are proportional to source sample read depths
EMC2-to-EMC2 phantom combinations show 5x higher rates due to higher read depth
This confirms index hopping occurs during cluster amplification/sequencing
Recommendation: Use unique dual indices (UDIs) instead of combinatorial indexing to eliminate index hopping artifacts, especially for low-biomass samples