Aggregate microbiome data for alluvial plot visualization

Aggregates ASV-level data with pre-calculated relative abundance (RA) to a chosen taxonomic level and computes the mean composition of each group, averaging across samples so that every sample carries equal weight. Optionally completes missing taxa with zeros to ensure the rectangular structure required by alluvial/Sankey plots.

Usage

prepare_alluvial_data(
  data,
  tax_level,
  group_col = "Group",
  SampleID_col = "SampleID",
  groups = NULL,
  complete_zero = TRUE,
  clean_taxonomy = TRUE,
  preserve_higher_taxonomy = FALSE,
  hierarchy = c("Domain", "Phylum", "Class", "Order", "Family", "Genus"),
  clean_levels = c("Phylum", "Class", "Order", "Family", "Genus")
)

Arguments

data: Data frame with RA already calculated at ASV level.
tax_level: Character string specifying the taxonomic level to aggregate to (e.g. "Family").
group_col: Character string specifying the grouping column, e.g. "SampleType" (default: "Group").
SampleID_col: Character string specifying the sample ID column (default: "SampleID").
groups: Character vector of groups to include (NULL = all groups present in group_col).
complete_zero: Logical; if TRUE, completes missing taxa with RA = 0 within each sample before summarizing (default: TRUE).
clean_taxonomy: Logical; if TRUE, cleans taxonomy using replace_incertae_sedis_NAs (default: TRUE).
preserve_higher_taxonomy: Logical; if TRUE, keep higher taxonomic ranks up to tax_level (default: FALSE).
hierarchy: Taxonomic hierarchy for cleaning (default: c("Domain","Phylum","Class","Order","Family","Genus")).
clean_levels: Taxonomic levels to clean (default: c("Phylum","Class","Order","Family","Genus")).

Value

A data frame with one row per combination of group_col and tax_level, containing the following columns:

RA: Mean relative abundance of the taxon within the group, normalized so that values lie in 0–1 and sum to 1 within each group.
<group_col>: Grouping variable used on the alluvial x-axis (the column named by group_col).
<tax_level>: Taxon identifier at the aggregation level (the column named by tax_level).
Higher taxonomy columns: Higher ranks up to tax_level; present only when preserve_higher_taxonomy = TRUE.

`complete_zero` behavior

TRUE: Completes missing taxa within each sample with RA = 0, so taxa are averaged over all samples.
FALSE: Taxa absent from some samples are averaged only over the samples where they appear, which can inflate the group mean of sporadic taxa.
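
The difference between the two modes comes down to the denominator of the per-group mean. The base R sketch below illustrates this for a single taxon before the within-group normalization is applied; the vector ra_B and the use of NA to mark a sample without a row are illustrative assumptions, not part of the function's interface.

```r
# Per-sample RA of a hypothetical taxon B in a group of two samples.
# NA marks a sample in which B has no row at all (illustrative encoding).
ra_B <- c(S1 = NA, S2 = 0.5)

# complete_zero = FALSE: the sample without a row is skipped,
# so the mean is taken over one sample only.
mean_present <- mean(ra_B, na.rm = TRUE)

# complete_zero = TRUE: the missing sample counts as RA = 0,
# so the mean is taken over both samples.
ra_B_filled <- ifelse(is.na(ra_B), 0, ra_B)
mean_all <- mean(ra_B_filled)

mean_present  # 0.5
mean_all      # 0.25
```

These are pre-normalization means; the RA values returned by the function are additionally rescaled so that they sum to 1 within each group.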

`preserve_higher_taxonomy` behavior

TRUE: Keeps higher ranks up to tax_level. Note: rows created by the final grid completion may have missing (NA) higher-rank labels unless those columns are also completed or filled.
FALSE: Only returns tax_level plus grouping columns.
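
If missing higher-rank labels on completed rows are a problem, one possible workaround is to fill them afterwards from rows where the label is known. The sketch below assumes a hypothetical result res with Phylum as the only higher rank and assumes that each Class maps to exactly one Phylum; it is not something the function does itself.

```r
library(dplyr)

# Hypothetical output with preserve_higher_taxonomy = TRUE: the G2/Bacilli
# row was created by grid completion, so its Phylum label is missing.
res <- tibble::tibble(
  Group  = c("G1", "G1", "G2", "G2"),
  Phylum = c("Proteobacteria", "Firmicutes", "Proteobacteria", NA),
  Class  = c("Gammaproteobacteria", "Bacilli", "Gammaproteobacteria", "Bacilli"),
  RA     = c(0.7, 0.3, 1.0, 0.0)
)

# Lookup of the higher rank for each Class, built from rows where it is known
lookup <- res %>%
  filter(!is.na(Phylum)) %>%
  distinct(Class, Phylum)

# Replace the partially missing Phylum column with the looked-up labels
res_filled <- res %>%
  select(-Phylum) %>%
  left_join(lookup, by = "Class") %>%
  relocate(Phylum, .before = Class)
```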

Data processing in the function step-by-step

1. Optionally clean taxonomy using replace_incertae_sedis_NAs.
2. Aggregate ASV-level RA to tax_level within each sample using sum().
3. If complete_zero = TRUE, add missing taxa per sample with RA = 0 (taxa are taken from all values observed in data[[tax_level]]).
4. Compute the mean RA per group_col (mean across samples), then normalize within each group so sum(RA) = 1.
5. Ensure a complete group_col × tax_level grid (missing combinations are set to RA = 0).
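
The aggregation, completion, averaging, and grid steps described above can be approximated with dplyr and tidyr as follows. This is a minimal sketch under stated assumptions, not the function's actual source: the column names are fixed to SampleID, Group, Class, and RA, the input asv is hypothetical, and taxonomy cleaning is omitted.

```r
library(dplyr)
library(tidyr)

# Hypothetical ASV-level input: sample S2 contains two ASVs of Class A
asv <- tibble::tibble(
  SampleID = c("S1", "S2", "S2", "S2"),
  Group    = c("G1", "G1", "G1", "G1"),
  Class    = c("A", "A", "A", "B"),
  RA       = c(1.0, 0.3, 0.2, 0.5)
)

# Sum ASV-level RA to the chosen rank within each sample
per_sample <- asv %>%
  group_by(Group, SampleID, Class) %>%
  summarise(RA = sum(RA), .groups = "drop")

# Add taxa missing from a sample with RA = 0 (the complete_zero = TRUE case)
per_sample <- per_sample %>%
  complete(nesting(Group, SampleID), Class, fill = list(RA = 0))

# Mean across samples per group, then normalize within each group
per_group <- per_sample %>%
  group_by(Group, Class) %>%
  summarise(RA = mean(RA), .groups = "drop") %>%
  group_by(Group) %>%
  mutate(RA = RA / sum(RA)) %>%
  ungroup()

# Make sure every Group x Class combination has a row
per_group <- per_group %>%
  complete(Group, Class, fill = list(RA = 0))
```

With this input, Class A averages to 0.75 and Class B to 0.25 in group G1, which matches the complete_zero = TRUE result in the Examples section.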

Examples

library(dplyr)
library(tidyr)

# Toy example showing how complete_zero changes group means:
# Taxon B is absent from sample S1 (no row), but present in S2.
toy <- tibble::tibble(
  SampleID = c("S1","S2","S2"),
  Group    = c("G1","G1","G1"),
  Class    = c("A","A","B"),
  RA       = c(1.0, 0.5, 0.5)
)

# Without completion: B is averaged only over samples where it appears (inflated)
out_no0 <- prepare_alluvial_data(
  data = toy,
  tax_level = "Class",
  group_col = "Group",
  SampleID_col = "SampleID",
  complete_zero = FALSE,
  clean_taxonomy = FALSE
)

# With completion: missing taxa count as 0 in those samples
out_0 <- prepare_alluvial_data(
  data = toy,
  tax_level = "Class",
  group_col = "Group",
  SampleID_col = "SampleID",
  complete_zero = TRUE,
  clean_taxonomy = FALSE
)

out_no0 %>% arrange(Class)
#> # A tibble: 2 × 3
#>   Group Class    RA
#>   <chr> <chr> <dbl>
#> 1 G1    A       0.6
#> 2 G1    B       0.4
out_0   %>% arrange(Class)
#> # A tibble: 2 × 3
#>   Group Class    RA
#>   <chr> <chr> <dbl>
#> 1 G1    A      0.75
#> 2 G1    B      0.25