GLAMR Imports

This page gives an overview of how data is imported into GLAMR.

Sample metadata

Spreadsheet

Information about samples in GLAMR is collected mainly in an Excel spreadsheet, which is accessed through Google Sheets. This sheet is the authoritative source for sample IDs (see below) and is used to import data for processing by the GLAMR bioinformatics pipelines. It is also the main record for most sample metadata, such as collection time, location, and associated environmental measurements. Because of this central role, edit access is restricted to those who have been trained on GLAMR data entry.

GLAMR specific IDs

Bio-samples: IDs follow the format “bios_{numeric}”
- These are analogous to BioSamples in NCBI SRA and contain
Observations: IDs follow the format “samp_{numeric}”
- These are analogous to runs in NCBI SRA. Each is a unique observation of a bio-sample and must be associated with a bios_{numeric} ID. In most cases an observation is a single sequencing effort, e.g. metagenome, amplicon target, or transcriptomic data. A single bio-sample may have multiple observations associated with it to account for different sequencing methodologies or primer sets. Observations can also represent results from other assays, such as metabolomics.
Sets / studies / projects: IDs follow the format “set_{numeric}”
- Bio-samples can be grouped into sets, typically associated with a particular project or paper.
Papers: IDs follow the format “paper_{numeric}”
- Whenever possible, samples are linked to associated papers.
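
As an illustration of the ID formats listed above, the following is a minimal Python sketch that classifies an identifier by its prefix. It assumes that {numeric} means one or more decimal digits with no padding rule; the function name classify_id and the pattern table are hypothetical and are not part of the GLAMR codebase.

```python
import re

# Assumed patterns: a fixed prefix, an underscore, then one or more digits.
# The exact numeric rules (padding, length) are not stated on this page.
ID_PATTERNS = {
    "bio-sample": re.compile(r"bios_\d+"),
    "observation": re.compile(r"samp_\d+"),
    "set": re.compile(r"set_\d+"),
    "paper": re.compile(r"paper_\d+"),
}


def classify_id(identifier):
    """Return the kind of GLAMR ID, or None if no pattern matches."""
    for kind, pattern in ID_PATTERNS.items():
        # fullmatch rejects trailing characters such as a stray ")".
        if pattern.fullmatch(identifier):
            return kind
    return None


print(classify_id("bios_12"))   # bio-sample
print(classify_id("samp_7)"))   # None
```

A check like this could be run on spreadsheet rows before import to catch malformed IDs early.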

Local Import

Content for importing local data goes here…

NCBI SRA Import

Content for importing from NCBI SRA goes here…

Directory Structure

Samples

GLAMR primarily follows a sample-centric workflow, and thus files are organized into sample directories. These are stored in data/omics/{sample_type}s/{SampleID}
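
The path template above can be sketched in Python as follows. The helper name sample_dir, the sample type "metagenome", and the ID "samp_471" are hypothetical values chosen for illustration; only the template data/omics/{sample_type}s/{SampleID} comes from this page.

```python
from pathlib import PurePosixPath


def sample_dir(sample_type, sample_id, root="data/omics"):
    """Build the sample directory path from the template
    data/omics/{sample_type}s/{SampleID}."""
    # The template pluralizes the sample type by appending "s".
    return PurePosixPath(root) / f"{sample_type}s" / sample_id


# Hypothetical sample type and ID, for illustration only.
print(sample_dir("metagenome", "samp_471"))
# data/omics/metagenomes/samp_471
```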

Projects

To organize samples and facilitate easier browsing, sample directories are also symbolically linked into project folders using this file structure: data/projects/{sample_type}s/{SampleID}
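
A minimal sketch of how such a link could be created with Python's standard library is shown below. It follows the two path templates exactly as written on this page; the function name link_sample_into_projects and its arguments are hypothetical, and the GLAMR pipelines may create these links differently.

```python
from pathlib import Path


def link_sample_into_projects(data_root, sample_type, sample_id):
    """Symlink a sample directory into the projects tree.

    Source: {data_root}/omics/{sample_type}s/{sample_id}
    Link:   {data_root}/projects/{sample_type}s/{sample_id}
    """
    data_root = Path(data_root)
    source = data_root / "omics" / f"{sample_type}s" / sample_id
    link = data_root / "projects" / f"{sample_type}s" / sample_id

    if not source.is_dir():
        raise FileNotFoundError(f"sample directory not found: {source}")

    # Create the parent folder of the link if it does not exist yet.
    link.parent.mkdir(parents=True, exist_ok=True)

    # Skip creation if the link is already there, so reruns are harmless.
    if not link.is_symlink():
        # An absolute target is an assumption made here for simplicity.
        link.symlink_to(source.resolve(), target_is_directory=True)
    return link
```

The sketch leaves an existing link untouched, so it can be rerun after new samples are added.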

Reference data

Reference data used by the GLAMR pipelines is stored in data/reference, including:

Taxonomic annotation:
- kraken
- sourmash
Functional annotation:
- koFamScan
- UniRef100
- Bakta
Bin QC and annotation:
- CheckM
- checkm2
- GTDBtk
- semibin
- gunc
Specialized annotations:
- deeparg
- genomad
- virsorter
- AntiSmash
- BiG-SCAPE
- Custom BLAST queries

Per-sample outputs

Key pipeline outputs and their locations are described on the pages for the respective data types.