MiSeq Platform Comparison Report

MiSeq v1 vs MiSeq i100: Quality and Index Hopping Analysis

Author

EMC2 Project

Published

December 2, 2025

Show code
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pathlib import Path
from itables import show as itables_show
import itables.options as itables_opts

# Configure itables defaults
itables_opts.maxBytes = 0  # Disable size limit warning

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

Executive Summary

This report compares sequencing quality and index demultiplexing between two MiSeq platforms:

  • MiSeq v1: Standard MiSeq with continuous quality scores
  • MiSeq i100: Updated MiSeq with 4-bin quality score system

Key Findings:

  1. Quality Score Binning: MiSeq i100 uses a 4-bin quality score system (Q2, Q9, Q23, Q38) vs continuous scores (Q0-Q40) on v1
  2. Index Hopping: i100 produces far more undetermined reads; most carry poly-G index pairs, and a smaller fraction reflects index hopping between samples
  3. Read Quality: Despite binning differences, Q20/Q30 rates are comparable between platforms

Background

Quality Score Binning

The MiSeq i100 uses a binned quality score system to reduce file sizes:

Bin   Phred Score   ASCII Character   Meaning
1     Q2            #                 Very low quality
2     Q9            *                 Low quality
3     Q23           8                 Medium quality
4     Q38           G                 High quality

Standard MiSeq v1 provides full-resolution scores from Q0 to Q40+.
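The binning in the table above can be sketched as a simple mapping function. Note this is an illustrative sketch: the bin boundaries below are assumptions chosen midway between the published bin values, not Illumina's actual RTA thresholds.

```python
# Illustrative sketch: collapse a continuous Phred score into the four
# i100 bins listed above. The boundary values (6, 16, 31) are assumed
# midpoints between bins, not Illumina's documented thresholds.
def bin_phred_i100(q: int) -> int:
    """Map a continuous Phred score (0-40+) to the nearest i100 bin."""
    if q < 6:       # closer to Q2 than Q9
        return 2
    elif q < 16:    # closer to Q9 than Q23
        return 9
    elif q < 31:    # closer to Q23 than Q38
        return 23
    return 38

# Example: a v1-style quality vector re-binned at i100 resolution
v1_quals = [35, 12, 38, 2, 27]
print([bin_phred_i100(q) for q in v1_quals])  # [38, 9, 38, 2, 23]
```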

Study Design

  • 31 EMC2 samples sequenced on both MiSeq v1 (dual index) and MiSeq i100
  • All samples use dual 8 bp indices (i7 + i5)
  • QC performed with fastp

Quality Score Distribution

Show code
# Load quality score distribution data
qual_file = Path('quality_scores_distribution.csv')

if qual_file.exists():
    qual_df = pd.read_csv(qual_file)

    # Calculate percentages
    qual_df['pct'] = qual_df.groupby('platform')['count'].transform(
        lambda x: x / x.sum() * 100
    )
else:
    qual_df = None
    print("Quality score distribution file not found. Run analyze_quality_scores.py first.")
Show code
if qual_df is not None:
    fig = px.bar(
        qual_df,
        x='phred_score',
        y='pct',
        color='platform',
        barmode='group',
        labels={
            'phred_score': 'Phred Quality Score',
            'pct': 'Percentage of Bases (%)',
            'platform': 'Platform'
        },
        title='Quality Score Distribution: MiSeq v1 vs i100',
        color_discrete_map={
            'MiSeq v1': '#1f77b4',
            'MiSeq i100': '#ff7f0e'
        }
    )

    fig.update_layout(
        xaxis=dict(dtick=5),
        bargap=0.15,
        legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99)
    )

    fig.show()
else:
    print("No quality score data available")
Figure 1: Quality Score Distribution by Platform

Quality Binning Summary

Show code
if qual_df is not None:
    # Report the five most frequent quality scores per platform
    for platform in qual_df['platform'].unique():
        print(f"\n{platform} - Top 5 Quality Scores:")
        pdata = qual_df[qual_df['platform'] == platform].nlargest(5, 'pct')
        for _, row in pdata.iterrows():
            print(f"  Q{row['phred_score']:2d} ('{row['ascii_char']}'): {row['pct']:.1f}%")
Table 1

MiSeq v1 - Top 5 Quality Scores:
  Q38 ('G'): 61.3%
  Q37 ('F'): 14.4%
  Q34 ('C'): 3.2%
  Q33 ('B'): 2.2%
  Q36 ('E'): 2.0%

MiSeq i100 - Top 5 Quality Scores:
  Q38 ('G'): 98.8%
  Q23 ('8'): 0.9%
  Q 9 ('*'): 0.3%

N Base Calls (Ambiguous Bases)

An important difference between platforms is how they handle uncertain base calls:

Show code
import json
from collections import defaultdict

# Collect N reads data from fastp JSON files
n_data = []
for platform, path in [('MiSeq i100', 'qc_results/miseq-i100'),
                        ('MiSeq v1', 'qc_results/miseq-v1'),
                        ('MiSeq v1 Dual', 'qc_results/miseq-v1-dual')]:
    platform_path = Path(path)
    if platform_path.exists():
        for f in platform_path.glob('*.json'):
            with open(f) as fh:
                d = json.load(fh)
            too_many_n = d['filtering_result']['too_many_N_reads']
            total_reads = d['summary']['before_filtering']['total_reads']
            # Get max N content from content curves
            r1_n = d.get('read1_before_filtering', {}).get('content_curves', {}).get('N', [])
            r2_n = d.get('read2_before_filtering', {}).get('content_curves', {}).get('N', [])
            max_n = max(max(r1_n) if r1_n else 0, max(r2_n) if r2_n else 0)
            n_data.append({'platform': platform, 'total_reads': total_reads,
                           'too_many_n': too_many_n, 'max_n_content': max_n})

if n_data:
    n_df = pd.DataFrame(n_data)
    summary = n_df.groupby('platform').agg({
        'total_reads': 'sum',
        'too_many_n': 'sum',
        'max_n_content': 'max'
    }).reset_index()
    summary['pct_n_filtered'] = summary['too_many_n'] / summary['total_reads'] * 100

    print("N Base Call Summary by Platform:")
    print("=" * 65)
    for _, row in summary.iterrows():
        print(f"\n{row['platform']}:")
        print(f"  Total reads: {row['total_reads']:,.0f}")
        print(f"  Reads filtered (too many N): {row['too_many_n']:,.0f} ({row['pct_n_filtered']:.4f}%)")
        print(f"  Max N content at any position: {row['max_n_content']:.4f}")

    print("\n" + "-" * 65)
    print("KEY FINDING: MiSeq i100 reports ZERO N base calls!")
    print("-" * 65)
    print("""
The i100's 4-bin quality scoring system never outputs 'N' (ambiguous)
bases. When the sequencer is uncertain about a base call, instead of
marking it as 'N' with a low quality score, it assigns one of the
low-quality bins (Q2 or Q9) with a definite A/T/C/G call.

This has implications for:
  • Downstream analysis that relies on N content for filtering
  • Variant calling pipelines that use N's to identify problem regions
  • Quality metrics that count ambiguous bases
""")

# Check for G enrichment at low-quality positions
print("\nG Content at Low-Quality Positions:")
print("-" * 65)

for platform, path in [('MiSeq i100', 'qc_results/miseq-i100'),
                        ('MiSeq v1', 'qc_results/miseq-v1')]:
    all_qual = []
    all_g = []
    platform_path = Path(path)
    if platform_path.exists():
        for f in list(platform_path.glob('*.json'))[:10]:
            with open(f) as fh:
                d = json.load(fh)
            q = d.get('read1_before_filtering', {}).get('quality_curves', {}).get('mean', [])
            g = d.get('read1_before_filtering', {}).get('content_curves', {}).get('G', [])
            if q and g:
                all_qual.append(q)
                all_g.append(g)

        if all_qual:
            mean_qual = np.mean(all_qual, axis=0)
            mean_g = np.mean(all_g, axis=0)
            low_qual_idx = np.argsort(mean_qual)[:5]

            print(f"\n{platform} - Positions with lowest quality:")
            for idx in low_qual_idx:
                print(f"  Position {idx+1}: Q={mean_qual[idx]:.1f}, G={mean_g[idx]*100:.1f}%")

print("""
On i100, positions 298-299 show >90% G content at lower quality scores.
This mirrors the poly-G pattern in index reads - when sequencing signal
fails on two-color chemistry, 'no signal' is called as G, not as N.
""")
N Base Call Summary by Platform:
=================================================================

MiSeq i100:
  Total reads: 5,451,386
  Reads filtered (too many N): 0 (0.0000%)
  Max N content at any position: 0.0000

MiSeq v1:
  Total reads: 2,934,798
  Reads filtered (too many N): 52 (0.0018%)
  Max N content at any position: 0.0116

MiSeq v1 Dual:
  Total reads: 4,008,800
  Reads filtered (too many N): 4,530 (0.1130%)
  Max N content at any position: 0.0208

-----------------------------------------------------------------
KEY FINDING: MiSeq i100 reports ZERO N base calls!
-----------------------------------------------------------------

The i100's 4-bin quality scoring system never outputs 'N' (ambiguous)
bases. When the sequencer is uncertain about a base call, instead of
marking it as 'N' with a low quality score, it assigns one of the
low-quality bins (Q2 or Q9) with a definite A/T/C/G call.

This has implications for:
  • Downstream analysis that relies on N content for filtering
  • Variant calling pipelines that use N's to identify problem regions
  • Quality metrics that count ambiguous bases


G Content at Low-Quality Positions:
-----------------------------------------------------------------

MiSeq i100 - Positions with lowest quality:
  Position 300: Q=30.2, G=4.9%
  Position 298: Q=35.6, G=98.2%
  Position 299: Q=36.1, G=92.2%
  Position 1: Q=36.8, G=0.2%
  Position 297: Q=36.9, G=1.4%

MiSeq v1 - Positions with lowest quality:
  Position 301: Q=9.9, G=0.7%
  Position 287: Q=16.9, G=2.5%
  Position 288: Q=18.8, G=0.7%
  Position 300: Q=19.9, G=2.3%
  Position 289: Q=19.9, G=1.1%

On i100, positions 298-299 show >90% G content at lower quality scores.
This mirrors the poly-G pattern in index reads - when sequencing signal
fails on two-color chemistry, 'no signal' is called as G, not as N.
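Because the i100 reports 'no signal' as G rather than N, downstream filtering may need an explicit poly-G trimming step (fastp provides this via its --trim_poly_g option). A minimal sketch of the idea, where the 10-base minimum run length is an illustrative default rather than a threshold measured in this study:

```python
# Minimal sketch: strip a trailing poly-G run from a read, mirroring
# what fastp's --trim_poly_g option does. The minimum run length (10)
# is an illustrative default, not a value derived from this study.
def trim_poly_g(seq: str, min_len: int = 10) -> str:
    """Remove a terminal poly-G tail of at least `min_len` bases."""
    n = len(seq)
    i = n
    while i > 0 and seq[i - 1] == 'G':
        i -= 1
    # Only trim if the run is long enough to look like a signal dropout
    return seq[:i] if n - i >= min_len else seq

read = "ACGTACGTACGT" + "G" * 12
print(trim_poly_g(read))        # ACGTACGTACGT
print(trim_poly_g("ACGTGG"))    # ACGTGG (run too short, left intact)
```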

Index Hopping Analysis

Undetermined Reads Summary

Show code
undet_file = Path('undetermined_summary.csv')

if undet_file.exists():
    undet_df = pd.read_csv(undet_file)
    display(undet_df.style.format({
        'total_undetermined': '{:,.0f}',
        'unique_indices': '{:,.0f}',
        'g_pattern_reads': '{:,.0f}',
        'both_g_reads': '{:,.0f}',
        'pct_g_pattern': '{:.1f}%'
    }))
else:
    undet_df = None
    print("Undetermined summary file not found. Run analyze_miseq_comparison.py first.")
   platform    total_undetermined  unique_indices  g_pattern_reads  both_g_reads  pct_g_pattern
0  MiSeq v1             1,605,387         266,781                0             0           0.0%
1  MiSeq i100           7,042,878         463,730        4,604,819     3,937,905          65.4%
Show code
if undet_df is not None:
    fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=('Total Undetermined Reads', 'Unique Index Combinations'),
        specs=[[{'type': 'bar'}, {'type': 'bar'}]]
    )

    colors = ['#1f77b4', '#ff7f0e']

    fig.add_trace(
        go.Bar(
            x=undet_df['platform'],
            y=undet_df['total_undetermined'],
            marker_color=colors,
            text=undet_df['total_undetermined'].apply(lambda x: f'{x/1e6:.1f}M'),
            textposition='outside',
            name='Undetermined'
        ),
        row=1, col=1
    )

    fig.add_trace(
        go.Bar(
            x=undet_df['platform'],
            y=undet_df['unique_indices'],
            marker_color=colors,
            text=undet_df['unique_indices'].apply(lambda x: f'{x/1e3:.0f}K'),
            textposition='outside',
            name='Unique Indices'
        ),
        row=1, col=2
    )

    fig.update_layout(
        showlegend=False,
        title_text='Undetermined Reads Comparison'
    )

    fig.show()
Figure 2: Undetermined Reads by Platform

Phantom Samples (Index Swapping Between Samples)

“Phantom samples” are index combinations where both i7 and i5 indices are from real samples in the study, but the specific combination was never paired together. These represent true index hopping/swapping between samples during library prep or sequencing.
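The phantom set can be enumerated as the cross product of the i7 and i5 indices in use, minus the pairs actually assigned. A toy sketch (the first two pairs come from this study's sample sheet; the third is a hypothetical pair added for illustration):

```python
# Sketch: enumerate "phantom" index pairs -- every i7 x i5 combination
# drawn from real samples that was never assigned to any sample.
from itertools import product

def phantom_pairs(sample_pairs):
    """sample_pairs: iterable of (i7, i5) tuples actually assigned."""
    used = set(sample_pairs)
    i7s = {i7 for i7, _ in used}
    i5s = {i5 for _, i5 in used}
    return set(product(i7s, i5s)) - used

pairs = [('AGCATACC', 'ACGACGTG'),   # EMC2_20
         ('AGCATACC', 'ATATACAC'),   # EMC2_27
         ('CGAGCTAG', 'TGCATGCA')]   # hypothetical pair for illustration
print(sorted(phantom_pairs(pairs)))
# 2 i7 x 3 i5 = 6 combinations, minus 3 assigned -> 3 phantom pairs
```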

Show code
phantom_file = Path('phantom_samples_miseq-i100.csv')

if phantom_file.exists():
    phantom_df = pd.read_csv(phantom_file)

    print(f"Found {len(phantom_df)} phantom sample combinations")
    print(f"Total reads with swapped indices: {phantom_df['count'].sum():,}")

    if len(phantom_df) > 0:
        print("\nPhantom Samples (interactive table - click column headers to sort):")
        itables_show(
            phantom_df[['index_pair', 'count', 'i7_from_samples', 'i5_from_samples']],
            order=[[1, 'desc']],  # Sort by count descending
            scrollY="400px",
            scrollCollapse=True,
            paging=True,
            pageLength=25
        )
    else:
        print("\nNo phantom samples detected - minimal index hopping between samples!")
else:
    phantom_df = None
    print("Phantom samples file not found. Run analyze_miseq_comparison.py first.")
Found 427 phantom sample combinations
Total reads with swapped indices: 182,364

Phantom Samples (interactive table - click column headers to sort):
Show code
if phantom_df is not None and undet_df is not None:
    total_undet = undet_df[undet_df['platform'] == 'MiSeq i100']['total_undetermined'].values[0]
    phantom_reads = phantom_df['count'].sum() if len(phantom_df) > 0 else 0
    g_pattern = undet_df[undet_df['platform'] == 'MiSeq i100']['both_g_reads'].values[0]

    print("Index Hopping Summary (MiSeq i100):")
    print(f"  Total undetermined reads: {total_undet:,}")
    print(f"  Reads with GGGGGGGG+GGGGGGGG: {g_pattern:,} ({g_pattern/total_undet*100:.1f}%)")
    print(f"  Phantom samples detected: {len(phantom_df)}")
    print(f"  Reads from index swapping: {phantom_reads:,} ({phantom_reads/total_undet*100:.1f}%)")
Index Hopping Summary (MiSeq i100):
  Total undetermined reads: 7,042,878
  Reads with GGGGGGGG+GGGGGGGG: 3,937,905 (55.9%)
  Phantom samples detected: 427
  Reads from index swapping: 182,364 (2.6%)
Show code
if phantom_df is not None and len(phantom_df) > 0:
    top_phantoms = phantom_df.nlargest(20, 'count').copy()

    fig = px.bar(
        top_phantoms,
        x='count',
        y='index_pair',
        orientation='h',
        title='Top 20 Phantom Sample Index Combinations',
        labels={'count': 'Read Count', 'index_pair': 'Index Pair'},
        color='count',
        color_continuous_scale='Reds'
    )

    fig.update_layout(yaxis={'categoryorder': 'total ascending'})
    fig.show()
Figure 3: Top Phantom Samples (Index Swapping Between Samples)

Indices with Partial Study Matches

Looking at index pairs where only one index (i7 or i5) matches the study indices:

Show code
# Load ALL sample indices (not just EMC2)
expected_file = Path('all_sample_indices.csv')

if expected_file.exists():
    expected_df = pd.read_csv(expected_file)
    unique_i7 = set(expected_df['i7_index'])
    unique_i5_rc = set(expected_df['i5_revcomp'])

    # Reload undetermined for i100
    undet_i100_file = Path('qc_results/miseq-i100/Undetermined_S0_indices.txt')
    if undet_i100_file.exists():
        undet_raw = []
        with open(undet_i100_file) as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) >= 2 and '+' in parts[1]:
                    count = int(parts[0])
                    i7, i5 = parts[1].split('+')
                    undet_raw.append({'i7': i7, 'i5': i5, 'count': count})

        undet_check = pd.DataFrame(undet_raw)

        # i7 matches study, i5 doesn't
        i7_only = undet_check[
            (undet_check['i7'].isin(unique_i7)) &
            (~undet_check['i5'].isin(unique_i5_rc))
        ]['count'].sum()

        # i5 matches study, i7 doesn't
        i5_only = undet_check[
            (~undet_check['i7'].isin(unique_i7)) &
            (undet_check['i5'].isin(unique_i5_rc))
        ]['count'].sum()

        # Neither matches
        neither = undet_check[
            (~undet_check['i7'].isin(unique_i7)) &
            (~undet_check['i5'].isin(unique_i5_rc))
        ]['count'].sum()

        total = undet_check['count'].sum()

        print("Partial Index Match Analysis (MiSeq i100):")
        print(f"  Only i7 from study: {i7_only:,} reads ({i7_only/total*100:.1f}%)")
        print(f"  Only i5 from study: {i5_only:,} reads ({i5_only/total*100:.1f}%)")
        print(f"  Neither from study: {neither:,} reads ({neither/total*100:.1f}%)")
Partial Index Match Analysis (MiSeq i100):
  Only i7 from study: 1,160,700 reads (16.5%)
  Only i5 from study: 1,321,170 reads (18.8%)
  Neither from study: 4,378,644 reads (62.2%)

Sequencing Quality Metrics

QC Metrics by Platform

Show code
qc_file = Path('miseq_qc_metrics.csv')

if qc_file.exists():
    qc_df = pd.read_csv(qc_file)

    # Summary statistics
    summary = qc_df.groupby('platform').agg({
        'total_reads_after': ['count', 'mean', 'std'],
        'q20_rate_after': ['mean', 'std'],
        'q30_rate_after': ['mean', 'std'],
        'pct_reads_lost': ['mean', 'std']
    }).round(4)

    print("QC Summary by Platform:")
    display(summary)
else:
    qc_df = None
    print("QC metrics file not found. Run analyze_miseq_comparison.py first.")
QC Summary by Platform:
              total_reads_after                          q20_rate_after          q30_rate_after          pct_reads_lost
                          count         mean        std           mean     std            mean     std            mean      std
platform
MiSeq i100                   31  175545.8065 53232.3715         0.9965  0.0012          0.9864  0.0036          0.1900   0.1160
MiSeq v1                     37   78810.2703 55153.0448         0.8835  0.0087          0.7799  0.0118          5.0808  16.0907
MiSeq v1 Dual                31  122482.1290 36321.6695         0.8128  0.0140          0.6711  0.0182          8.8326  15.0575
Show code
if qc_df is not None:
    fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=('Q20 Rate (After Filtering)', 'Q30 Rate (After Filtering)'),
        horizontal_spacing=0.12
    )

    # Define consistent colors for each platform
    platform_colors = {
        'MiSeq v1': '#1f77b4',
        'MiSeq v1 Dual': '#2ca02c',
        'MiSeq i100': '#ff7f0e'
    }

    platforms = qc_df['platform'].unique()

    for i, metric in enumerate(['q20_rate_after', 'q30_rate_after'], 1):
        for platform in platforms:
            pdata = qc_df[qc_df['platform'] == platform]
            fig.add_trace(
                go.Box(
                    y=pdata[metric],
                    name=platform,
                    boxpoints='all',
                    jitter=0.3,
                    pointpos=-1.5,
                    marker_color=platform_colors.get(platform, '#999999'),
                    line_color=platform_colors.get(platform, '#999999'),
                    text=pdata['sample_id'],  # Add sample names for hover
                    hovertemplate='<b>%{text}</b><br>%{y:.4f}<extra></extra>',
                    showlegend=(i == 1)
                ),
                row=1, col=i
            )

    fig.update_layout(
        boxmode='group',
        height=500,
        margin=dict(t=60)  # Add top margin to prevent title overlap
    )

    fig.show()
Figure 4: Q20 and Q30 Rates by Platform
Show code
if qc_df is not None:
    fig = px.box(
        qc_df,
        x='platform',
        y='total_reads_after',
        color='platform',
        points='all',
        hover_data=['sample_id'],
        title='Read Depth by Platform',
        labels={
            'total_reads_after': 'Reads per Sample (After Filtering)',
            'platform': 'Platform'
        },
        color_discrete_map={
            'MiSeq v1': '#1f77b4',
            'MiSeq v1 Dual': '#2ca02c',
            'MiSeq i100': '#ff7f0e'
        }
    )

    fig.update_layout(showlegend=False)
    fig.show()
Figure 5: Read Depth Distribution by Platform

Sample-Level Comparison

Paired Comparison: MiSeq v1 Dual vs i100

Show code
comp_file = Path('shared_samples_comparison.csv')

if comp_file.exists():
    comp_df = pd.read_csv(comp_file)
    print(f"Found {len(comp_df)} samples sequenced on both platforms")
    print("\nShared samples comparison (interactive table):")
    itables_show(
        comp_df,
        order=[[0, 'asc']],
        scrollY="300px",
        paging=False
    )
else:
    comp_df = None
    print("Shared samples comparison file not found. Run analyze_miseq_comparison.py first.")
Found 31 samples sequenced on both platforms

Shared samples comparison (interactive table):
Show code
if comp_df is not None:
    fig = px.scatter(
        comp_df,
        x='v1_dual_reads',
        y='i100_reads',
        text='sample_id',
        title='Read Depth: MiSeq v1 Dual vs i100',
        labels={
            'v1_dual_reads': 'MiSeq v1 Dual (reads)',
            'i100_reads': 'MiSeq i100 (reads)'
        }
    )

    fig.update_traces(
        textposition='top center',
        textfont_size=8,
        hovertemplate='<b>%{text}</b><br>v1 Dual: %{x:,.0f}<br>i100: %{y:,.0f}<extra></extra>'
    )

    # Add 1:1 line
    max_val = max(comp_df['v1_dual_reads'].max(), comp_df['i100_reads'].max())
    fig.add_trace(
        go.Scatter(
            x=[0, max_val],
            y=[0, max_val],
            mode='lines',
            line=dict(dash='dash', color='gray'),
            name='1:1 line',
            hoverinfo='skip'
        )
    )

    fig.show()
Figure 6: Paired Read Depth Comparison
Show code
if comp_df is not None:
    fig = px.scatter(
        comp_df,
        x='v1_dual_q30',
        y='i100_q30',
        text='sample_id',
        title='Q30 Rate: MiSeq v1 Dual vs i100',
        labels={
            'v1_dual_q30': 'MiSeq v1 Dual (Q30 rate)',
            'i100_q30': 'MiSeq i100 (Q30 rate)'
        }
    )

    fig.update_traces(
        textposition='top center',
        textfont_size=8,
        hovertemplate='<b>%{text}</b><br>v1 Dual: %{x:.4f}<br>i100: %{y:.4f}<extra></extra>'
    )

    # Add 1:1 line
    fig.add_trace(
        go.Scatter(
            x=[0.7, 1.0],
            y=[0.7, 1.0],
            mode='lines',
            line=dict(dash='dash', color='gray'),
            name='1:1 line',
            hoverinfo='skip'
        )
    )

    fig.show()
Figure 7: Paired Q30 Rate Comparison

Blank Sample Comparison

Comparing negative controls (Kit blank and PCR blank) across platforms helps assess background contamination levels.

Show code
if qc_df is not None:
    # Filter for blank samples (case-insensitive matching)
    blank_df = qc_df[
        qc_df['sample_id'].str.lower().str.contains('kit|pcr', na=False)
    ].copy()

    # Categorize by blank type
    blank_df['blank_type'] = blank_df['sample_id'].str.lower().apply(
        lambda x: 'Kit Blank' if 'kit' in x else ('PCR Blank' if 'pcr' in x else 'Other')
    )

    if len(blank_df) > 0:
        print("Blank Samples Found:")
        display(blank_df[['sample_id', 'platform', 'total_reads_before',
                          'total_reads_after', 'q20_rate_after', 'q30_rate_after',
                          'pct_reads_lost', 'blank_type']].style.format({
            'total_reads_before': '{:,.0f}',
            'total_reads_after': '{:,.0f}',
            'q20_rate_after': '{:.4f}',
            'q30_rate_after': '{:.4f}',
            'pct_reads_lost': '{:.1f}%'
        }))
    else:
        print("No blank samples found")
Blank Samples Found:
    sample_id  platform       total_reads_before  total_reads_after  q20_rate_after  q30_rate_after  pct_reads_lost  blank_type
27  EMC2_kit   MiSeq v1                       26                 18          0.8839          0.7898           30.8%  Kit Blank
33  EMC2_kit   MiSeq v1                      172                 56          0.8755          0.7683           67.4%  Kit Blank
36  EMC2_PCR   MiSeq v1                       84                 26          0.8658          0.7596           69.0%  PCR Blank
66  EMC2_Kit   MiSeq v1 Dual                 220                136          0.7866          0.6445           38.2%  Kit Blank
67  EMC2_PCR   MiSeq v1 Dual                  96                 16          0.7481          0.5818           83.3%  PCR Blank
97  EMC2_Kit   MiSeq i100                  1,086              1,084          0.9945          0.9813            0.2%  Kit Blank
98  EMC2_PCR   MiSeq i100                    800                794          0.9924          0.9745            0.8%  PCR Blank
Show code
if qc_df is not None:
    blank_df = qc_df[
        qc_df['sample_id'].str.lower().str.contains('kit|pcr', na=False)
    ].copy()

    blank_df['blank_type'] = blank_df['sample_id'].str.lower().apply(
        lambda x: 'Kit Blank' if 'kit' in x else 'PCR Blank'
    )

    if len(blank_df) > 0:
        # Summarize by platform and blank type (take max if duplicates)
        blank_summary = blank_df.groupby(['platform', 'blank_type']).agg({
            'total_reads_after': 'max',
            'total_reads_before': 'max'
        }).reset_index()

        fig = px.bar(
            blank_summary,
            x='platform',
            y='total_reads_after',
            color='blank_type',
            barmode='group',
            title='Blank Sample Read Counts Across Platforms',
            labels={
                'total_reads_after': 'Reads (After Filtering)',
                'platform': 'Platform',
                'blank_type': 'Blank Type'
            },
            color_discrete_map={
                'Kit Blank': '#d62728',
                'PCR Blank': '#9467bd'
            },
            text='total_reads_after'
        )

        fig.update_traces(texttemplate='%{text:,.0f}', textposition='outside')
        fig.update_layout(
            yaxis_type='log',
            yaxis_title='Reads (log scale)',
            legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99)
        )
        fig.show()
Figure 8: Blank Sample Read Counts by Platform
Show code
if qc_df is not None:
    blank_df = qc_df[
        qc_df['sample_id'].str.lower().str.contains('kit|pcr', na=False)
    ].copy()

    blank_df['blank_type'] = blank_df['sample_id'].str.lower().apply(
        lambda x: 'Kit Blank' if 'kit' in x else 'PCR Blank'
    )

    if len(blank_df) > 0:
        # Get max reads per platform and blank type
        pivot = blank_df.groupby(['platform', 'blank_type'])['total_reads_after'].max().unstack()

        print("Blank Sample Read Counts (After Filtering):")
        print(pivot.to_string())

        print("\n\nFold-Change Comparison (relative to MiSeq v1 Dual):")
        if 'MiSeq v1 Dual' in pivot.index:
            for blank_type in pivot.columns:
                v1_dual_val = pivot.loc['MiSeq v1 Dual', blank_type]
                if pd.notna(v1_dual_val) and v1_dual_val > 0:
                    for platform in pivot.index:
                        if platform != 'MiSeq v1 Dual':
                            other_val = pivot.loc[platform, blank_type]
                            if pd.notna(other_val):
                                fold = other_val / v1_dual_val
                                print(f"  {blank_type}: {platform} has {fold:.1f}x reads vs v1 Dual")
Blank Sample Read Counts (After Filtering):
blank_type     Kit Blank  PCR Blank
platform                           
MiSeq i100          1084        794
MiSeq v1              56         26
MiSeq v1 Dual        136         16


Fold-Change Comparison (relative to MiSeq v1 Dual):
  Kit Blank: MiSeq i100 has 8.0x reads vs v1 Dual
  Kit Blank: MiSeq v1 has 0.4x reads vs v1 Dual
  PCR Blank: MiSeq i100 has 49.6x reads vs v1 Dual
  PCR Blank: MiSeq v1 has 1.6x reads vs v1 Dual

Blank Sequence Origin Analysis

To investigate the source of elevated blank reads on i100, we compared blank sequences against all real sample sequences using vsearch (97% identity threshold).
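For reference, the comparison can be driven from Python along these lines. The --usearch_global, --db, --id, --blast6out, and --strand flags are standard vsearch options; the FASTA and TSV paths below are illustrative placeholders, not the actual paths used by scripts/compare_blanks_to_samples.sh.

```python
# Sketch of the vsearch global-alignment comparison driven from Python.
# File paths are illustrative placeholders.
import shutil
import subprocess

cmd = [
    'vsearch',
    '--usearch_global', 'blank_reads.fasta',  # query: blank-sample reads
    '--db', 'all_sample_reads.fasta',         # database: real-sample reads
    '--id', '0.97',                           # 97% identity threshold
    '--blast6out', 'blank_matches.tsv',       # tabular (blast6) output
    '--strand', 'both',                       # search both strands
]

if shutil.which('vsearch'):
    subprocess.run(cmd, check=True)
else:
    print('vsearch not found on PATH; command would be:', ' '.join(cmd))
```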

Show code
# Load blank comparison results if available
blank_matches_file = Path('blank_comparison/blank_matches.tsv')

if blank_matches_file.exists():
    matches_df = pd.read_csv(blank_matches_file, sep='\t',
                             names=['query', 'target', 'identity', 'alnlen', 'mism', 'gaps'])

    # Extract sample names from target
    matches_df['source_sample'] = matches_df['target'].str.extract(r'(EMC2_[^_]+)')

    # Count matches per sample
    source_counts = matches_df['source_sample'].value_counts()

    total_blank_reads = 943  # total blank reads analyzed in the vsearch run
    matched = len(matches_df)
    unmatched = total_blank_reads - matched

    print("Blank Sequence Origin Analysis (i100 blanks):")
    print(f"  Total blank reads analyzed: {total_blank_reads}")
    print(f"  Matched to real samples (≥97% identity): {matched} ({matched/total_blank_reads*100:.1f}%)")
    print(f"  Unmatched (environmental/kit): {unmatched} ({unmatched/total_blank_reads*100:.1f}%)")
    print("\nTop source samples for blank contamination:")
    for sample, count in source_counts.head(10).items():
        pct = count / len(matches_df) * 100
        print(f"  {sample}: {count} reads ({pct:.1f}%)")
else:
    print("Run scripts/compare_blanks_to_samples.sh to generate blank sequence analysis")
Blank Sequence Origin Analysis (i100 blanks):
  Total blank reads analyzed: 943
  Matched to real samples (≥97% identity): 871 (92.4%)
  Unmatched (environmental/kit): 72 (7.6%)

Top source samples for blank contamination:
  EMC2_20: 313 reads (35.9%)
  EMC2_27: 145 reads (16.6%)
  EMC2_31: 130 reads (14.9%)
  EMC2_41: 91 reads (10.4%)
  EMC2_38: 78 reads (9.0%)
  EMC2_40: 73 reads (8.4%)
  EMC2_103910: 5 reads (0.6%)
  EMC2_7: 3 reads (0.3%)
  EMC2_100438: 3 reads (0.3%)
  EMC2_4: 3 reads (0.3%)

Index Pattern Analysis

The contaminating samples share a critical characteristic with the blanks:

Show code
# Load sample indices
indices_file = Path('all_sample_indices.csv')

if indices_file.exists():
    indices_df = pd.read_csv(indices_file)

    # Filter to EMC2 samples
    emc2_indices = indices_df[indices_df['sample_name'].str.contains('EMC2', na=False)].copy()

    # Extract simple sample ID
    emc2_indices['simple_id'] = emc2_indices['sample_name'].str.extract(r'(EMC2_[^_]+|EMC2_Kit_Blank|EMC2_PCR_Blank)')

    # Group by i7 index
    print("EMC2 Sample Index Groups (by i7 index):\n")
    for i7, group in emc2_indices.groupby('i7_index'):
        samples = group['simple_id'].tolist()
        has_blank = any('Blank' in s for s in samples if s)
        marker = " ← BLANKS IN THIS GROUP" if has_blank else ""
        print(f"i7 = {i7}:{marker}")
        for s in samples:
            if s:
                print(f"    {s}")
        print()
EMC2 Sample Index Groups (by i7 index):

i7 = AGCATACC:
    EMC2_20
    EMC2_27
    EMC2_31
    EMC2_38
    EMC2_40
    EMC2_41
    EMC2_Kit
    EMC2_PCR

i7 = ATGAGCTC:
    EMC2_7
    EMC2_8
    EMC2_9
    EMC2_12
    EMC2_100438
    EMC2_103910
    EMC2_18
    EMC2_19

i7 = CGAGCTAG:
    EMC2_17
    EMC2_21
    EMC2_26
    EMC2_28
    EMC2_29
    EMC2_32
    EMC2_36
    EMC2_39

i7 = CTCTAGAG:
    EMC2_5
    EMC2_11
    EMC2_1
    EMC2_2
    EMC2_3
    EMC2_4
    EMC2_6

Mechanism: i5 Index Hopping

The EMC2 blanks (Kit and PCR) share i7 index = AGCATACC with exactly 6 real samples:

Sample      i7 Index   i5 Index   Reads in Blanks
EMC2_20     AGCATACC   ACGACGTG   313 (36%)
EMC2_27     AGCATACC   ATATACAC   145 (17%)
EMC2_31     AGCATACC   CGTCGCTA   130 (15%)
EMC2_41     AGCATACC   GACACTGA    91 (10%)
EMC2_38     AGCATACC   CTAGAGCT    78 (9%)
EMC2_40     AGCATACC   GCTCTAGT    73 (8%)
Kit Blank   AGCATACC   TGCGTACG     -
PCR Blank   AGCATACC   TAGTGTAG     -

Index hopping mechanism:

  1. During cluster amplification or sequencing, free-floating i5 index adapters can attach to reads from other samples
  2. A read from EMC2_20 (i7=AGCATACC, i5=ACGACGTG) gets paired with the wrong i5 index (TGCGTACG from Kit Blank)
  3. The demultiplexer assigns this read to Kit Blank instead of EMC2_20
  4. This only affects samples sharing the same i7 index - samples with different i7 indices do not contaminate the blanks
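The misassignment in steps 2-3 can be sketched with a toy demultiplexer keyed on exact (i7, i5) pairs. The EMC2_20 and Kit Blank pairs come from the table above; the third pair is hypothetical:

```python
# Sketch of steps 2-3: a dual-index demultiplexer assigns reads by exact
# (i7, i5) match, so a read whose i5 hopped lands in whichever sample
# owns the observed pair. First two pairs are from the table above;
# the EMC2_17 pair is hypothetical.
sample_sheet = {
    ('AGCATACC', 'ACGACGTG'): 'EMC2_20',
    ('AGCATACC', 'TGCGTACG'): 'Kit Blank',
    ('CGAGCTAG', 'ACGACGTG'): 'EMC2_17',  # hypothetical pair
}

def demultiplex(i7, i5):
    return sample_sheet.get((i7, i5), 'Undetermined')

# A true EMC2_20 read whose i5 hopped to the Kit Blank's i5:
print(demultiplex('AGCATACC', 'TGCGTACG'))  # -> Kit Blank (misassigned)
# The same hopped i5 with a non-matching i7 stays undetermined:
print(demultiplex('CGAGCTAG', 'TGCGTACG'))  # -> Undetermined
```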

Key finding: 92.4% of blank reads on i100 are index hopping artifacts from real samples, not environmental contamination. The v1-dual platform shows much lower index hopping rates.

Implications

  1. Low-biomass samples at risk: Samples with low read counts sharing i7 indices with high-abundance samples may have significant cross-contamination
  2. Unique dual indices recommended: Using unique index pairs (UDIs) instead of combinatorial indexing would eliminate this issue
  3. Platform-specific: MiSeq i100 shows higher index hopping rates than v1-dual for the same library
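Point 2 can be sketched directly: with combinatorial indexing, a single hopped index can recreate another sample's valid pair, whereas with UDIs every i7 and i5 appears exactly once, so any single-index hop yields a pair no sample owns. Indices below are hypothetical:

```python
# Sketch of why unique dual indices (UDIs) resist hopping.
# All index sequences here are hypothetical.
combinatorial = {('AAACCC', 'GGGTTT'): 'S1',
                 ('AAACCC', 'TTTGGG'): 'S2',   # shares S1's i7
                 ('CCCAAA', 'GGGTTT'): 'S3'}   # shares S1's i5
udi = {('AAACCC', 'GGGTTT'): 'S1',
       ('CCCAAA', 'TTTGGG'): 'S2'}             # no index reused

def assign(sheet, i7, i5):
    return sheet.get((i7, i5), 'Undetermined')

# An S1 read whose i5 hops to S2's i5:
print(assign(combinatorial, 'AAACCC', 'TTTGGG'))  # misassigned to S2
print(assign(udi, 'AAACCC', 'TTTGGG'))            # Undetermined (safe)
```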

Phantom Sample Analysis

Phantom samples are index combinations where both i7 and i5 are from real samples, but the combination was never actually used. These provide direct evidence of index hopping during sequencing.

Show code
phantom_file = Path('phantom_samples_miseq-i100.csv')

if phantom_file.exists():
    # Read just the top rows (file is very large)
    phantom_df = pd.read_csv(phantom_file, nrows=20)

    print("Top 10 Phantom Samples (MiSeq i100):")
    print("-" * 60)

    for i, row in phantom_df.head(10).iterrows():
        print(f"\n{row['index_pair']}: {row['count']:,} reads")

        # Parse the sample lists
        i7_samples = row['i7_from_samples'].split(', ') if pd.notna(row['i7_from_samples']) else []
        i5_samples = row['i5_from_samples'].split(', ') if pd.notna(row['i5_from_samples']) else []

        # Count EMC2 samples
        i7_emc2 = [s for s in i7_samples if 'EMC2' in s and 'Blank' not in s]
        i5_emc2 = [s for s in i5_samples if 'EMC2' in s and 'Blank' not in s]

        print(f"  i7 sources: {len(i7_samples)} samples ({len(i7_emc2)} EMC2)")
        print(f"  i5 sources: {len(i5_samples)} samples ({len(i5_emc2)} EMC2)")
else:
    print("Phantom samples file not found")
Top 10 Phantom Samples (MiSeq i100):
------------------------------------------------------------

CTCTAGAG+CACGTCGT: 5,058 reads
  i7 sources: 7 samples (7 EMC2)
  i5 sources: 18 samples (3 EMC2)

ATAGTACC+ACGGTGTC: 1,164 reads
  i7 sources: 16 samples (0 EMC2)
  i5 sources: 12 samples (0 EMC2)

ATAGTACC+CACTCACG: 1,087 reads
  i7 sources: 16 samples (0 EMC2)
  i5 sources: 12 samples (0 EMC2)

ATAGTACC+CTCGATGA: 1,055 reads
  i7 sources: 16 samples (0 EMC2)
  i5 sources: 12 samples (0 EMC2)

GCGTATAC+ACGGTGTC: 978 reads
  i7 sources: 16 samples (0 EMC2)
  i5 sources: 12 samples (0 EMC2)

ATAGTACC+AGATATCC: 975 reads
  i7 sources: 16 samples (0 EMC2)
  i5 sources: 12 samples (0 EMC2)

CGAAGTAT+ACGGTGTC: 973 reads
  i7 sources: 16 samples (0 EMC2)
  i5 sources: 12 samples (0 EMC2)

GACATAGT+CTACACTA: 959 reads
  i7 sources: 8 samples (0 EMC2)
  i5 sources: 19 samples (3 EMC2)

ATAGTACC+CAGATAGT: 935 reads
  i7 sources: 16 samples (0 EMC2)
  i5 sources: 12 samples (0 EMC2)

CTCGACTT+CACTCACG: 916 reads
  i7 sources: 16 samples (0 EMC2)
  i5 sources: 12 samples (0 EMC2)

Why Does the Top Phantom Have 5x More Reads?

The top phantom sample (CTCTAGAG+CACGTCGT) has 5,058 reads, nearly 5x more than the other phantom samples (~900-1,200 reads each). The explanation lies in the read depths of the source samples:

Show code
if qc_df is not None:
    # Samples contributing to top phantom
    i7_ctctagag_samples = ['EMC2_1', 'EMC2_2', 'EMC2_3', 'EMC2_4', 'EMC2_5', 'EMC2_6', 'EMC2_11']
    i5_cacgtcgt_samples = ['EMC2_7', 'EMC2_17', 'EMC2_20']  # EMC2 samples with i5 = CACGTCGT

    # Filter to i100 platform
    i100_df = qc_df[qc_df['platform'] == 'MiSeq i100'].copy()

    # Calculate totals
    i7_reads = i100_df[i100_df['sample_id'].isin(i7_ctctagag_samples)]['total_reads_before'].sum()
    i5_reads = i100_df[i100_df['sample_id'].isin(i5_cacgtcgt_samples)]['total_reads_before'].sum()

    print("Top Phantom (CTCTAGAG+CACGTCGT) Source Analysis:")
    print("=" * 55)
    print("\ni7 = CTCTAGAG (7 EMC2 samples):")

    for sample in i7_ctctagag_samples:
        row = i100_df[i100_df['sample_id'] == sample]
        if len(row) > 0:
            reads = row['total_reads_before'].values[0]
            print(f"  {sample}: {reads:,} reads")
    print(f"  TOTAL: {i7_reads:,.0f} reads")

    print("\ni5 = CACGTCGT (EMC2 samples only - also used by 15 RoL_RTF samples):")
    for sample in i5_cacgtcgt_samples:
        row = i100_df[i100_df['sample_id'] == sample]
        if len(row) > 0:
            reads = row['total_reads_before'].values[0]
            print(f"  {sample}: {reads:,} reads")
    print(f"  EMC2 TOTAL: {i5_reads:,.0f} reads")

    print("\n" + "-" * 55)
    print("EXPLANATION:")
    print("-" * 55)
    print("""
The phantom read count is proportional to the product of reads from
both source pools. The top phantom has:
  • High-depth EMC2 samples on BOTH sides (~1.4M × ~0.6M)
  • Other phantoms involve mostly RoL_RTF samples with lower depth

This explains the ~5x difference: EMC2 samples have systematically
higher read depth than RoL_RTF samples, so index hopping between
EMC2 i7 groups and EMC2 i5 groups produces more phantom reads.
""")
Top Phantom (CTCTAGAG+CACGTCGT) Source Analysis:
=======================================================

i7 = CTCTAGAG (7 EMC2 samples):
  EMC2_1: 143,444 reads
  EMC2_2: 175,326 reads
  EMC2_3: 228,274 reads
  EMC2_4: 235,360 reads
  EMC2_5: 177,592 reads
  EMC2_6: 232,632 reads
  EMC2_11: 187,040 reads
  TOTAL: 1,379,668 reads

i5 = CACGTCGT (EMC2 samples only - also used by 15 RoL_RTF samples):
  EMC2_7: 181,620 reads
  EMC2_17: 218,048 reads
  EMC2_20: 208,570 reads
  EMC2 TOTAL: 608,238 reads

-------------------------------------------------------
EXPLANATION:
-------------------------------------------------------

The phantom read count is proportional to the product of reads from
both source pools. The top phantom has:
  • High-depth EMC2 samples on BOTH sides (~1.4M × ~0.6M)
  • Other phantoms involve mostly RoL_RTF samples with lower depth

This explains the ~5x difference: EMC2 samples have systematically
higher read depth than RoL_RTF samples, so index hopping between
EMC2 i7 groups and EMC2 i5 groups produces more phantom reads.
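The proportionality argument can be written down as a toy model: under a uniform hop rate, the expected count for phantom (i7_a, i5_b) scales with the product of the two source pools' depths divided by the run total. The hop rate, run total, and RoL_RTF pool sizes below are assumptions; only the EMC2 totals come from the analysis above:

```python
hop_rate = 0.002          # assumed overall hopping probability (illustrative)
total_reads = 20_000_000  # assumed run total (illustrative)

# (i7 pool reads, i5 pool reads): EMC2 totals from above, RoL_RTF assumed lower.
pools = {
    "EMC2 x EMC2": (1_379_668, 608_238),
    "RoL x RoL": (300_000, 250_000),
}

expected = {
    name: hop_rate * i7_reads * i5_reads / total_reads
    for name, (i7_reads, i5_reads) in pools.items()
}
for name, n in expected.items():
    print(f"{name}: ~{n:,.0f} expected phantom reads")
```

The absolute numbers depend entirely on the assumed hop rate, but the ratio between pools is driven by the depth products, which is what produces the ~5x gap between the top phantom and the rest.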

Phantom Sample Distribution

Show code
if phantom_file.exists():
    phantom_df = pd.read_csv(phantom_file, nrows=50)

    fig = px.bar(
        phantom_df.head(20),
        x='index_pair',
        y='count',
        title='Top 20 Phantom Samples (MiSeq i100)',
        labels={'count': 'Read Count', 'index_pair': 'Index Pair (i7+i5)'},
        color='count',
        color_continuous_scale='Reds'
    )

    fig.update_layout(
        xaxis_tickangle=-45,
        xaxis_title='Phantom Index Pair',
        yaxis_title='Reads Assigned to Non-Existent Sample',
        showlegend=False
    )
    fig.update_coloraxes(showscale=False)
    fig.show()

    # Summary statistics
    total_phantom = phantom_df['count'].sum()
    print(f"\nTop 50 phantom samples account for: {total_phantom:,} reads")
    print("These reads were incorrectly assigned to non-existent sample combinations")
    print("due to i7/i5 index hopping during sequencing.")

Top 50 phantom samples account for: 45,377 reads
These reads were incorrectly assigned to non-existent sample combinations
due to i7/i5 index hopping during sequencing.

Discussion

Quality Score Binning Impact

The MiSeq i100’s 4-bin quality score system:

  • Reduces storage requirements by limiting quality values to 4 discrete levels
  • Maintains high-quality base information (Q38 for good bases)
  • May affect downstream analysis that relies on precise quality scores
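The binning can be mimicked with a simple mapping from continuous Phred scores to the four reported levels. The bin boundaries below are illustrative assumptions; the actual RTA thresholds are not documented in this report:

```python
def bin_quality(q: int) -> int:
    """Map a continuous Phred score to the 4-bin i100 scheme.

    Boundaries are illustrative assumptions, not documented RTA values.
    """
    if q < 6:
        return 2    # '#'  very low quality
    elif q < 15:
        return 9    # '*'  low quality
    elif q < 30:
        return 23   # '8'  medium quality
    else:
        return 38   # 'G'  high quality

# Continuous v1-style scores collapse to four values:
print([bin_quality(q) for q in (0, 10, 25, 40)])  # -> [2, 9, 23, 38]

# Phred+33 ASCII encoding reproduces the characters in the table above:
print("".join(chr(q + 33) for q in (2, 9, 23, 38)))  # -> '#*8G'
```

Tools that average or threshold per-base qualities will see step changes at the bin boundaries rather than a smooth distribution.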

Index Hopping Concerns

The elevated undetermined reads on MiSeq i100 warrant attention:

  1. GGGGGGGG patterns: Large numbers of reads with poly-G indices suggest:
    • Index read failures (no template means no signal, which the instrument calls as G)
    • Genuine homopolymer G stretches in index reads
  2. Expected sample indices in undetermined reads: Detection of known sample indices among the undetermined reads indicates:
    • Potential index hopping/cross-contamination
    • Quality problems during index-read sequencing
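The poly-G check is easy to script. In a real run the index sequences would come from the Undetermined FASTQ headers; the reads below are inlined placeholders:

```python
from collections import Counter

# Placeholder index reads; real ones come from Undetermined FASTQ headers.
index_reads = ["GGGGGGGG", "AGCATACC", "GGGGGGGG", "GGGGGGGG", "CTCTAGAG"]

counts = Counter(index_reads)
poly_g = counts["G" * 8]  # fully poly-G 8bp indices
print(f"poly-G indices: {poly_g}/{len(index_reads)} "
      f"({100 * poly_g / len(index_reads):.0f}%)")
```

The same `Counter` also surfaces known sample indices hiding in the undetermined pool: any high-count non-poly-G index that matches the sample sheet is a hopping candidate.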

Recommendations

  1. Monitor index hopping rates on i100 runs
  2. Consider unique dual indices (UDIs) to minimize demultiplexing ambiguity
  3. Evaluate if quality score binning affects variant calling accuracy for your applications

Conclusions

  1. Quality Score Binning: MiSeq i100 uses 4-bin system (Q2/Q9/Q23/Q38) vs continuous on v1, but maintains comparable Q20/Q30 rates
  2. No N Base Calls on i100: The i100 reports zero ambiguous (N) bases
    • Uncertain bases are assigned G instead (similar to poly-G in indices)
    • Positions 298-299 show >90% G content at lower quality - evidence of “no signal = G” behavior
    • Pipelines relying on N content for filtering will behave differently on i100
  3. Index Hopping is Significant on i100:
    • Undetermined reads are 4.4x higher on i100 vs v1
    • 92.4% of blank sample reads are index hopping artifacts, not environmental contamination
    • Blank contamination follows the i7 grouping pattern (samples sharing i7 with blanks contribute all contamination)
  4. Phantom Samples Confirm Mechanism:
    • Phantom read counts are proportional to source sample read depths
    • EMC2-to-EMC2 phantom combinations show 5x higher rates due to higher read depth
    • This confirms index hopping occurs during cluster amplification/sequencing
  5. Recommendation: Use unique dual indices (UDIs) instead of combinatorial indexing so that hopped reads land in unused index combinations and are discarded at demultiplexing, especially for low-biomass samples
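The UDI recommendation can be enforced mechanically: a sheet is UDI-compliant only if no i7 and no i5 index is reused across samples. A sketch with placeholder index pairs:

```python
def is_udi(pairs):
    """True if every i7 and every i5 index appears in exactly one pair."""
    i7s = [i7 for i7, _ in pairs]
    i5s = [i5 for _, i5 in pairs]
    return len(set(i7s)) == len(i7s) and len(set(i5s)) == len(i5s)

# Placeholder sheets: combinatorial reuses an i7; UDI uses each index once.
combinatorial = [("i7_A", "i5_X"), ("i7_A", "i5_Y")]  # shared i7 -> hopping risk
udi = [("i7_A", "i5_X"), ("i7_B", "i5_Y")]            # each index used once

print(is_udi(combinatorial), is_udi(udi))  # -> False True
```

Running a check like this before submitting a library pool catches combinatorial layouts before they can generate hopped reads that demultiplex to valid samples.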

Report generated with Quarto and Python