Prepares long-format data for barplot creation at a specified taxonomic level. Handles unknown taxonomy, groups low-abundance taxa, and aggregates relative abundances (RA). Expects input in which RA has already been calculated at the ASV level on a 0–1 scale.
Usage
process_barplot_data(
  data,
  tax_level,
  group_vars,
  SampleID_col = "SampleID",
  low_abundance_threshold = 0.01,
  low_abundance_basis = c("post_aggregation", "per_sample"),
  keep_ratype = c("collapse", "separate"),
  tax_original_suffix = "_original",
  preserve_higher_taxonomy = FALSE,
  clean_taxonomy = TRUE,
  hierarchy = c("Domain", "Phylum", "Class", "Order", "Family", "Genus"),
  clean_levels = c("Phylum", "Class", "Order", "Family", "Genus"),
  agg_fun = c("sum", "mean"),
  normalize_by = NULL,
  drop_zero = TRUE,
  unknown_label = "unknown",
  low_label = "low abundant"
)
Arguments
- data
A data frame with RA (relative abundance) already calculated per-ASV.
- tax_level
Character string specifying the taxonomic level to plot (e.g., "Family", "Genus").
- group_vars
Character vector of grouping variables (e.g. SampleID, SampleType).
- SampleID_col
Character string specifying the sample ID column (default: "SampleID").
- low_abundance_threshold
Numeric threshold below which taxa are considered low abundance (default: 0.01).
- low_abundance_basis
When to identify low-abundance taxa:
- "per_sample"
Mark taxa as low abundant at individual sample level BEFORE aggregation
- "post_aggregation"
Mark taxa as low abundant AFTER aggregating across samples (default)
- keep_ratype
Controls whether low-abundance taxa are collapsed or kept separate:
- "collapse"
Relabel taxa below threshold as "low abundant" and collapse into one bin (default)
- "separate"
Create a <tax_level>_original column; flagged taxa still receive the low-abundance or unknown plot label, but their original identities are not merged
- tax_original_suffix
Suffix used to create the original-taxon column when keep_ratype="separate" (default: "_original"). Example: tax_level="Class" -> "Class_original".
- preserve_higher_taxonomy
Logical, whether to keep higher taxonomic levels (default: FALSE).
- clean_taxonomy
Logical, whether to clean taxonomy using replace_incertae_sedis_NAs (default: TRUE).
- hierarchy
Taxonomic hierarchy for cleaning (default: c("Domain", "Phylum", "Class", "Order", "Family", "Genus")).
- clean_levels
Levels to clean (default: c("Phylum", "Class", "Order", "Family", "Genus")). Uses replace_incertae_sedis_NAs.
- agg_fun
Aggregation function for RA when summarizing groups: "sum" or "mean" (default: "sum").
- normalize_by
Character vector (or single string) giving grouping columns within which RA should sum to 1. If NULL: normalize within SampleID_col if present, else within group_vars excluding taxonomy columns.
- drop_zero
Logical; if TRUE, drop rows with RA <= 0 at the end (default: TRUE). Useful to avoid inflating plotting keys with structural zeros.
- unknown_label
Label used for unknown taxonomy (default: "unknown").
- low_label
Label used for low-abundance bin (default: "low abundant").
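The difference between the two low_abundance_basis settings can be illustrated with a small base-R sketch. This is not the function's internal code: it assumes aggregation by mean across samples within Group, and the toy data frame and the helper columns low_per_sample and low_post_agg are hypothetical names introduced for illustration.

```r
# Toy long-format data: C2 is rare in S1 but common in S2.
toy <- data.frame(
  SampleID = c("S1", "S1", "S2", "S2"),
  Group    = c("A", "A", "A", "A"),
  Class    = c("C1", "C2", "C1", "C2"),
  RA       = c(0.995, 0.005, 0.70, 0.30),
  stringsAsFactors = FALSE
)
threshold <- 0.01

# "per_sample": flag each sample/taxon row on its own RA, before aggregation.
toy$low_per_sample <- toy$RA < threshold

# "post_aggregation": aggregate first (mean RA per Group and Class is an
# assumption of this sketch), then flag on the aggregated value.
agg <- aggregate(RA ~ Group + Class, data = toy, FUN = mean)
agg$low_post_agg <- agg$RA < threshold

toy[toy$low_per_sample, c("SampleID", "Class")]  # C2 is flagged in S1 only
agg                                              # C2 has mean RA 0.1525, not flagged
```

With a per-sample basis the same taxon can be labelled differently across samples, whereas the post-aggregation basis gives one decision per aggregated group.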
Value
A data frame with processed relative abundances at the specified taxonomic level.
- RA
Relative abundance after aggregation and normalization (0–1).
- <tax_level>
Taxon labels used for plotting (may include low_label/unknown_label).
- <tax_level>_original
Only when keep_ratype="separate": original taxon identity.
- <Higher>_true
Only when preserve_higher_taxonomy=TRUE and keep_ratype="separate": true higher ranks.
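To make "aggregation and normalization" concrete, the following base-R sketch aggregates ASV rows to a taxonomic level and then rescales RA so that it sums to 1 within each sample. It assumes aggregation with sum and normalization within SampleID (the documented fallback when normalize_by is NULL). The toy data is hypothetical, and the code is an illustration, not the function's implementation.

```r
# Toy ASV-level data; S2 sums to 0.5 (e.g. after some ASVs were filtered out).
asv <- data.frame(
  SampleID = c("S1", "S1", "S1", "S2", "S2"),
  Class    = c("C1", "C1", "C2", "C1", "C2"),
  RA       = c(0.5, 0.3, 0.2, 0.3, 0.2),
  stringsAsFactors = FALSE
)

# Aggregate ASV rows to Class level within each sample (as with agg_fun = "sum").
agg <- aggregate(RA ~ SampleID + Class, data = asv, FUN = sum)

# Rescale so RA sums to 1 within each SampleID.
agg$RA <- agg$RA / ave(agg$RA, agg$SampleID, FUN = sum)

# Per-sample totals after normalization.
tapply(agg$RA, agg$SampleID, sum)
```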
Low abundance handling (controlled by keep_ratype)
- collapse
Relabel taxa below the threshold as "low abundant" and collapse into a single bin.
- separate
Keep each original taxon in <tax_level>_original, while <tax_level> is replaced by low_label/unknown_label for flagged taxa.
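The two modes above can be sketched in base R as follows. This is an illustration under stated assumptions rather than the function's actual code: the toy data frame d and the objects collapsed and separate are hypothetical, and flagged taxa are taken to be those with RA strictly below the threshold.

```r
# Class-level RA for one sample; C3 and C4 fall below the threshold.
d <- data.frame(
  SampleID = "S1",
  Class    = c("C1", "C2", "C3", "C4"),
  RA       = c(0.60, 0.385, 0.008, 0.007),
  stringsAsFactors = FALSE
)
threshold <- 0.01
low_label <- "low abundant"
is_low <- d$RA < threshold

# "collapse": relabel flagged taxa, then sum them into a single bin.
collapsed <- d
collapsed$Class[is_low] <- low_label
collapsed <- aggregate(RA ~ SampleID + Class, data = collapsed, FUN = sum)

# "separate": keep the original identity in Class_original and replace only
# the plotting label, so flagged taxa stay on their own rows.
separate <- d
separate$Class_original <- separate$Class
separate$Class[is_low] <- low_label

nrow(collapsed)  # 3 rows: C1, C2 and the "low abundant" bin
nrow(separate)   # 4 rows: C3 and C4 keep their own rows
```

In a stacked barplot both outputs give the same bar heights per label; the separate form additionally lets you trace which taxa make up the low-abundance segment.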
Higher taxonomy handling (when preserve_higher_taxonomy = TRUE)
- collapse
Higher taxonomy columns are also relabeled to "low abundant"/"unknown".
- separate
True higher taxonomy is re-attached via <tax_level>_original into <Higher>_true columns (e.g. Phylum_true).
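Re-attaching true higher ranks in separate mode amounts to a left join on the original-taxon column. The base-R sketch below shows the idea with a hypothetical lookup table and a hypothetical separate-mode result; how the function builds its own lookup internally is not documented here, so treat this as an assumption.

```r
# Lookup of true higher ranks per original Class (built before relabeling).
lookup <- data.frame(
  Class_original = c("C1", "C2", "C3"),
  Phylum_true    = c("P1", "P2", "P2"),
  stringsAsFactors = FALSE
)

# A "separate"-style result: C3 carries the low-abundance plot label.
separate <- data.frame(
  SampleID       = "S1",
  Class          = c("C1", "C2", "low abundant"),
  Class_original = c("C1", "C2", "C3"),
  RA             = c(0.6, 0.395, 0.005),
  stringsAsFactors = FALSE
)

# Left join on Class_original so every row regains its true Phylum,
# including rows whose plot label was replaced.
out <- merge(separate, lookup, by = "Class_original", all.x = TRUE)
out[, c("Class", "Class_original", "Phylum_true", "RA")]
```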
Examples
library(dplyr)
# Minimal toy dataset (already aggregated at ASV level with RA in 0–1 range)
toy <- tibble::tibble(
  SampleID = c("S1", "S1", "S2", "S2"),
  Group    = c("A", "A", "A", "A"),
  Phylum   = c("P1", "P2", "P1", "P2"),
  Class    = c("C1", "C2", "C1", "C2"),
  RA       = c(0.9, 0.1, 0.6, 0.4)
)
# Collapse low-abundance taxa into a single bin
out_collapse <- process_barplot_data(
  data = toy,
  tax_level = "Class",
  group_vars = "SampleID",
  low_abundance_threshold = 0.2,
  keep_ratype = "collapse",
  clean_taxonomy = FALSE
)
# Keep low-abundance taxa separate but flagged
out_separate <- process_barplot_data(
  data = toy,
  tax_level = "Class",
  group_vars = "SampleID",
  low_abundance_threshold = 0.2,
  keep_ratype = "separate",
  clean_taxonomy = FALSE
)
# In separate mode, an additional column "Class_original" appears
names(out_separate)
#> [1] "SampleID" "Class" "Class_original" "RA"