Aggregate microbiome data for alluvial plot visualization
Source: R/alluvial_data.r
Aggregates ASV-level data with pre-calculated relative abundance (RA) to a chosen
taxonomic level and computes mean group compositions (each sample contributes equally
to its group mean). Optionally completes missing taxa with zeros to ensure the
rectangular structure required by alluvial/Sankey plots.
Usage
prepare_alluvial_data(
  data,
  tax_level,
  group_col = "Group",
  SampleID_col = "SampleID",
  groups = NULL,
  complete_zero = TRUE,
  clean_taxonomy = TRUE,
  preserve_higher_taxonomy = FALSE,
  hierarchy = c("Domain", "Phylum", "Class", "Order", "Family", "Genus"),
  clean_levels = c("Phylum", "Class", "Order", "Family", "Genus")
)
Arguments
- data
Data frame with RA already calculated at ASV level.
- tax_level
Character string specifying the taxonomic level to aggregate to (e.g. "Family").
- group_col
Character string specifying the grouping column (e.g. "SampleType").
- SampleID_col
Character string specifying the sample ID column (default: "SampleID").
- groups
Character vector of groups to include (NULL = all groups present in group_col).
- complete_zero
Logical; if TRUE, completes missing taxa with RA = 0 within each sample before summarizing (default: TRUE).
- clean_taxonomy
Logical; if TRUE, cleans taxonomy using replace_incertae_sedis_NAs (default: TRUE).
- preserve_higher_taxonomy
Logical; if TRUE, keeps higher taxonomic ranks up to tax_level (default: FALSE).
- hierarchy
Taxonomic hierarchy for cleaning (default: c("Domain","Phylum","Class","Order","Family","Genus")).
- clean_levels
Taxonomic levels to clean (default: c("Phylum","Class","Order","Family","Genus")).
Value
A data frame with mean relative abundance per group_col and tax_level.
- RA
Mean relative abundance per group after normalization (0–1).
- <group_col>
Grouping variable used on the alluvial x-axis (the column named by group_col).
- <tax_level>
Taxon identifier at the aggregation level (the column named by tax_level).
- Higher taxonomy columns
Only when preserve_higher_taxonomy = TRUE: higher ranks up to tax_level.
complete_zero behavior
- TRUE
Completes missing taxa within each sample with RA = 0, so taxa are averaged over all samples.
- FALSE
Taxa absent from some samples are averaged only over the samples where they appear (can inflate sporadic taxa).
preserve_higher_taxonomy behavior
- TRUE
Keeps higher ranks up to tax_level. Note: rows created by the final completion may have missing higher-rank labels unless those columns are also completed/filled.
- FALSE
Only returns tax_level plus grouping columns.
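The note about missing higher-rank labels can be handled after the call. The following is a minimal base R sketch, not part of the package: the data frame `out` is a hypothetical result built by hand, the taxon names are illustrative, and the approach assumes each taxon at the aggregation level maps to exactly one higher-rank label.

```r
# Hypothetical result with preserve_higher_taxonomy = TRUE: the G2/Bacilli row
# stands in for a row created by grid completion, so its Phylum label is missing.
out <- data.frame(
  Group  = c("G1", "G1", "G2", "G2"),
  Phylum = c("Bacteroidota", "Firmicutes", "Bacteroidota", NA),
  Class  = c("Bacteroidia", "Bacilli", "Bacteroidia", "Bacilli"),
  RA     = c(0.6, 0.4, 1, 0),
  stringsAsFactors = FALSE
)

# Build a Class -> Phylum lookup from the rows that do carry a label
known <- unique(out[!is.na(out$Phylum), c("Class", "Phylum")])

# Fill the missing labels by matching on the aggregation level
missing <- is.na(out$Phylum)
out$Phylum[missing] <- known$Phylum[match(out$Class[missing], known$Class)]

out
```

A taxon that never appears with a label stays NA under this approach, which makes any remaining gaps easy to spot.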
Data processing in the function step-by-step
1. Optionally clean taxonomy using replace_incertae_sedis_NAs.
2. Aggregate ASV-level RA to tax_level within each sample using sum().
3. If complete_zero = TRUE, add missing taxa per sample with RA = 0 (taxa are taken from all values observed in data[[tax_level]]).
4. Compute the mean RA per group_col (mean across samples), then normalize within each group so sum(RA) = 1.
5. Ensure a complete group_col × tax_level grid (missing combinations are set to RA = 0).
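The aggregation steps listed above (everything after the optional taxonomy cleaning) can be sketched in base R as follows. This is an illustration written from the description on this page, not the package's source code: the helper name `sketch_group_means` and the toy data are assumptions, and taxonomy cleaning, the `groups` filter, and higher-rank columns are left out.

```r
# Illustrative sketch of the aggregation steps; hypothetical helper, not the package code
sketch_group_means <- function(data, tax_level, group_col = "Group",
                               SampleID_col = "SampleID", complete_zero = TRUE) {
  keys <- c(SampleID_col, group_col, tax_level)

  # Sum ASV-level RA to the chosen level within each sample
  per_sample <- aggregate(data["RA"], by = data[keys], FUN = sum)

  # Optionally add taxa missing from a sample with RA = 0
  if (complete_zero) {
    samples <- unique(data[c(SampleID_col, group_col)])
    taxa <- data.frame(unique(data[[tax_level]]), stringsAsFactors = FALSE)
    names(taxa) <- tax_level
    grid <- merge(samples, taxa)  # no shared columns, so this is a cross join
    per_sample <- merge(grid, per_sample, all.x = TRUE)
    per_sample$RA[is.na(per_sample$RA)] <- 0
  }

  # Mean across samples per group, then rescale each group to sum to 1
  group_mean <- aggregate(per_sample["RA"],
                          by = per_sample[c(group_col, tax_level)], FUN = mean)
  group_mean$RA <- group_mean$RA /
    ave(group_mean$RA, group_mean[[group_col]], FUN = sum)

  # Complete the group x taxon grid with RA = 0
  full <- expand.grid(unique(data[[group_col]]), unique(data[[tax_level]]),
                      stringsAsFactors = FALSE)
  names(full) <- c(group_col, tax_level)
  out <- merge(full, group_mean, all.x = TRUE)
  out$RA[is.na(out$RA)] <- 0
  out[order(out[[group_col]], out[[tax_level]]), ]
}

# Toy input: class B is missing from S1 (group G1) and from all of group G2
toy2 <- data.frame(
  SampleID = c("S1", "S2", "S2", "S3"),
  Group    = c("G1", "G1", "G1", "G2"),
  Class    = c("A", "A", "B", "A"),
  RA       = c(1.0, 0.5, 0.5, 1.0),
  stringsAsFactors = FALSE
)

res_0   <- sketch_group_means(toy2, "Class", complete_zero = TRUE)
res_no0 <- sketch_group_means(toy2, "Class", complete_zero = FALSE)
```

In both results the G2/B combination is present with RA = 0. With `complete_zero = TRUE` this comes from the per-sample completion; with `complete_zero = FALSE` it is only added by the final grid completion.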
Examples
library(dplyr)
library(tidyr)
# Toy example showing how complete_zero changes group means:
# Taxon B is absent from sample S1 (no row), but present in S2.
toy <- tibble::tibble(
  SampleID = c("S1", "S2", "S2"),
  Group    = c("G1", "G1", "G1"),
  Class    = c("A", "A", "B"),
  RA       = c(1.0, 0.5, 0.5)
)
# Without completion: B is averaged only over samples where it appears (inflated)
out_no0 <- prepare_alluvial_data(
  data = toy,
  tax_level = "Class",
  group_col = "Group",
  SampleID_col = "SampleID",
  complete_zero = FALSE,
  clean_taxonomy = FALSE
)
# With completion: missing taxa count as 0 in those samples
out_0 <- prepare_alluvial_data(
  data = toy,
  tax_level = "Class",
  group_col = "Group",
  SampleID_col = "SampleID",
  complete_zero = TRUE,
  clean_taxonomy = FALSE
)
out_no0 %>% arrange(Class)
#> # A tibble: 2 × 3
#>   Group Class    RA
#>   <chr> <chr> <dbl>
#> 1 G1    A       0.6
#> 2 G1    B       0.4
out_0 %>% arrange(Class)
#> # A tibble: 2 × 3
#>   Group Class    RA
#>   <chr> <chr> <dbl>
#> 1 G1    A      0.75
#> 2 G1    B      0.25
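The two results above can be checked by hand. The arithmetic below is a small base R sketch that follows the averaging and normalization described on this page; the variable names are illustrative only.

```r
# RA of classes A and B in the two G1 samples (B has no row in S1)
ra_A <- c(S1 = 1.0, S2 = 0.5)
ra_B <- c(S2 = 0.5)

# complete_zero = FALSE: B is averaged over S2 only (means 0.75 and 0.50),
# then the group is rescaled to sum to 1
means_no0 <- c(A = mean(ra_A), B = mean(ra_B))
check_no0 <- means_no0 / sum(means_no0)   # A = 0.6, B = 0.4

# complete_zero = TRUE: B counts as 0 in S1, so both means use two samples
# (means 0.75 and 0.25, which already sum to 1)
means_0 <- c(A = mean(ra_A), B = mean(c(0, ra_B)))
check_0 <- means_0 / sum(means_0)         # A = 0.75, B = 0.25
```

Without completion, B's unnormalized mean is 0.5 rather than 0.25, which is why its share of the group rises from 0.25 to 0.4 after rescaling.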