
Getting Started with glyparse

Your Universal Glycan Text Translator 🔄

Welcome to the world of glycan text parsing! If you’ve ever worked with glycan data from different sources, you know the frustration: every database, software tool, and research group seems to have its own way of writing glycan structures as text.

That’s where glyparse comes to the rescue! 🚀

Think of glyparse as your universal glycan translator — it can read glycan structures written in many different “languages” and convert them all into a unified format that your computer can understand and work with.

Note: All functions in glyparse return glyrepr::glycan_structure objects. If you are unfamiliar with glyrepr, see the glyrepr documentation first.

library(glyparse)

The Babel Tower of Glycan Text Formats 🗼

Before we dive in, let’s see what we’re dealing with. Here’s the same N-glycan core structure written in different formats:

Format | Example | Where You’ll See It
IUPAC-condensed | Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc | Literature, UniCarbKB
IUPAC-short | Mana3(Mana6)Manb4GlcNAcb4GlcNAc | Literature, UniCarbKB
IUPAC-extended | alpha-D-Man-(1->3)-[alpha-D-Man-(1->6)]-beta-D-Man-(1->4)-beta-D-GlcNAc-(1->4)-D-GlcNAc | Literature, UniCarbKB
GlycoCT | Complex multi-line format | Literature, GlycomeDB
WURCS | WURCS=2.0/3,5,4/[...]/1-1-2-3-3/a4-b1_b4-c1... | Literature, GlyTouCan
Linear Code | Ma3(Ma6)Mb4GNb4GNb | Literature
pGlyco | (N(N(H(H(H))))) | pGlyco software results
StrucGP | A2B2C1D1E2fedcba | StrucGP software results

Confusing, right? 😵‍💫 glyparse understands them all!

Your Parsing Toolkit 🛠️

glyparse provides seven specialized parsers, each optimized for a specific format:

All parsers follow the same pattern: pass in a character vector of glycan strings and get back a glyrepr::glycan_structure vector.

Part 0: auto_parse()

Don’t know what you’re dealing with? Give it to auto_parse()! This function tries to identify the format automatically and use the appropriate parser. Even input with mixed formats is supported.

x <- c(
  "Gal(b1-3)GalNAc(b1-",
  "(N(F)(N(H(H(N))(H(N(H))))))",
  "WURCS=2.0/3,3,2/[a2122h-1b_1-5][a1122h-1b_1-5][a1122h-1a_1-5]/1-2-3/a4-b1_b3-c1"
)
auto_parse(x)
#> <glycan_structure[3]>
#> [1] Gal(b1-3)GalNAc(b1-
#> [2] Hex(??-?)HexNAc(??-?)Hex(??-?)[HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)[dHex(??-?)]HexNAc(??-
#> [3] Man(a1-3)Man(b1-4)Glc(b1-
#> # Unique structures: 3
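To build some intuition for what "identifying the format" can mean, here is a minimal base-R sketch that guesses a format from surface features of each string. The function name `guess_format()` and its heuristics are assumptions made up for illustration; this is not how auto_parse() is implemented, and it only covers a few of the formats.

```r
# Hypothetical helper: guess a glycan text format from surface features.
# Illustrative only; NOT glyparse's actual detection logic.
guess_format <- function(x) {
  vapply(x, function(s) {
    if (startsWith(s, "WURCS=")) {
      "WURCS"                                   # WURCS strings carry a fixed prefix
    } else if (startsWith(s, "RES\n")) {
      "GlycoCT"                                 # GlycoCT opens with a RES section
    } else if (grepl("^\\([A-Za-z()]+\\)$", s)) {
      "pGlyco"                                  # only letters and parentheses
    } else if (grepl("\\([ab?][0-9?]+-", s)) {
      "IUPAC-condensed"                         # linkages such as (b1-3) or (a1-
    } else {
      "unknown"
    }
  }, character(1), USE.NAMES = FALSE)
}

x <- c(
  "Gal(b1-3)GalNAc(b1-",
  "(N(F)(N(H(H(N))(H(N(H))))))",
  "WURCS=2.0/3,3,2/[a2122h-1b_1-5][a1122h-1b_1-5][a1122h-1a_1-5]/1-2-3/a4-b1_b3-c1"
)
guess_format(x)
#> [1] "IUPAC-condensed" "pGlyco"          "WURCS"
```

Because the guess is made per element, a vector with mixed formats poses no special problem for this kind of approach.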

Part 2: Database Formats — The Heavy Hitters 💪

GlycoCT: The Precision Format

GlycoCT is used in the literature and in databases like GlycomeDB. It is more verbose than the IUPAC notations but unambiguous: residues are listed in a RES section and the linkages between them in a LIN section:

glycoct <- paste0(
  "RES\n",
  "1b:b-dglc-HEX-1:5\n",
  "2b:b-dgal-HEX-1:5\n",
  "3b:a-dgal-HEX-1:5\n",
  "LIN\n",
  "1:1o(4+1)2d\n",
  "2:2o(3+1)3d"
)
parse_glycoct(glycoct)
#> <glycan_structure[1]>
#> [1] Gal(a1-3)Gal(b1-4)Glc(b1-
#> # Unique structures: 1
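If you are curious how the RES and LIN sections relate to the parsed result, the following base-R sketch pulls the example apart by hand. It assumes the simple line shapes seen in this example only (basetype residues and single-position linkages); the variable names are made up for illustration and nothing here reflects how parse_glycoct() works internally.

```r
glycoct <- paste0(
  "RES\n", "1b:b-dglc-HEX-1:5\n", "2b:b-dgal-HEX-1:5\n", "3b:a-dgal-HEX-1:5\n",
  "LIN\n", "1:1o(4+1)2d\n", "2:2o(3+1)3d"
)

# Split into lines and separate the two sections.
lines <- strsplit(glycoct, "\n", fixed = TRUE)[[1]]
lin_start <- which(lines == "LIN")
res_lines <- lines[2:(lin_start - 1)]
lin_lines <- lines[(lin_start + 1):length(lines)]

# Residue lines look like "<id>b:<anomer>-<stem>-<superclass>-<ring>".
res_m <- do.call(rbind, regmatches(
  res_lines, regexec("^([0-9]+)b:([a-z])-([a-z]+)-([A-Z]+)-", res_lines)
))
residues <- data.frame(
  id = as.integer(res_m[, 2]), anomer = res_m[, 3], stem = res_m[, 4],
  stringsAsFactors = FALSE
)

# Linkage lines look like "<n>:<parent>o(<parent_pos>+<child_pos>)<child>d".
lin_m <- do.call(rbind, regmatches(
  lin_lines,
  regexec("^[0-9]+:([0-9]+)[a-z]\\(([0-9]+)\\+([0-9]+)\\)([0-9]+)[a-z]$", lin_lines)
))
links <- data.frame(
  parent = as.integer(lin_m[, 2]), parent_pos = lin_m[, 3],
  child_pos = lin_m[, 4], child = as.integer(lin_m[, 5]),
  stringsAsFactors = FALSE
)

# Combine the child's anomer with both positions to get IUPAC-style labels.
child_anomer <- residues$anomer[match(links$child, residues$id)]
links$label <- paste0(child_anomer, links$child_pos, "-", links$parent_pos)
links$label
#> [1] "b1-4" "a1-3"
```

The two labels line up with the linkages in the parsed structure above: Gal(b1-4)Glc and Gal(a1-3)Gal.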

WURCS: The Complex Structure Format

WURCS (Web3 Unique Representation of Carbohydrate Structures) is used in literature for complex structures and in databases like GlyTouCan:

wurcs <- paste0(
  "WURCS=2.0/3,3,2/",
  "[a2122h-1b_1-5][a1122h-1b_1-5][a1122h-1a_1-5]/",
  "1-2-3/a4-b1_b3-c1"
)
parse_wurcs(wurcs)
#> <glycan_structure[1]>
#> [1] Man(a1-3)Man(b1-4)Glc(b1-
#> # Unique structures: 1
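A WURCS string looks opaque, but it is made of a few slash-separated sections. The base-R sketch below splits the example into those sections, following the WURCS 2.0 layout as I understand it (version, counts, unique residues, residue sequence, linkages). The variable names are made up for illustration, and the code is not part of glyparse.

```r
wurcs <- paste0(
  "WURCS=2.0/3,3,2/",
  "[a2122h-1b_1-5][a1122h-1b_1-5][a1122h-1a_1-5]/",
  "1-2-3/a4-b1_b3-c1"
)

# Unique residues are the bracketed descriptors.
unique_res <- regmatches(wurcs, gregexpr("\\[[^]]*\\]", wurcs))[[1]]

# Split on "/" and read the count section (second field) plus the last two
# fields, so a "/" inside a bracketed descriptor would not shift the result.
fields <- strsplit(wurcs, "/", fixed = TRUE)[[1]]
n <- length(fields)
counts <- as.integer(strsplit(fields[2], ",", fixed = TRUE)[[1]])
res_seq <- as.integer(strsplit(fields[n - 1], "-", fixed = TRUE)[[1]])
linkages <- strsplit(fields[n], "_", fixed = TRUE)[[1]]

counts      # unique residues, residues, linkages
#> [1] 3 3 2
unique_res
#> [1] "[a2122h-1b_1-5]" "[a1122h-1b_1-5]" "[a1122h-1a_1-5]"
res_seq     # which unique residue each position (a, b, c) uses
#> [1] 1 2 3
linkages
#> [1] "a4-b1" "b3-c1"
```

Reading the linkages against the residue sequence gives the same picture as the parsed output: residue b attaches through its position 1 to position 4 of residue a, and residue c attaches to position 3 of residue b.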

Linear Code: The Simplified Format

Linear Code is a compact format used in the literature. It abbreviates each monosaccharide to one or two letters (here, M for Man and GN for GlcNAc):

linear_code <- "Ma3(Ma6)Mb4GNb4GNb"
parse_linear_code(linear_code)
#> <glycan_structure[1]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> # Unique structures: 1
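To make the abbreviation scheme concrete, here is a toy base-R translation of this one example into IUPAC-condensed text. It is a sketch under strong assumptions: the lookup table `lc_codes` holds only the two codes used here, every linkage is assumed to start at position 1 of the child residue, and `to_iupac()` is a made-up name. The real parse_linear_code() handles far more than this.

```r
# Only the codes needed for this example; real Linear Code has many more.
lc_codes <- c(M = "Man", GN = "GlcNAc")

to_iupac <- function(lc) {
  # Branches use () in Linear Code and [] in IUPAC-condensed.
  out <- chartr("()", "[]", lc)
  # A token is: monosaccharide code, anomer (a/b), optional parent position.
  m <- gregexpr("[A-Z]+[ab][0-9]?", out)
  tokens <- regmatches(out, m)[[1]]
  code <- sub("[ab][0-9]?$", "", tokens)
  anomer <- sub("^[A-Z]+([ab])[0-9]?$", "\\1", tokens)
  pos <- sub("^[A-Z]+[ab]", "", tokens)
  # The reducing-end residue has no parent position, so its linkage stays open.
  converted <- paste0(
    lc_codes[code], "(", anomer, "1-", pos, ifelse(pos == "", "", ")")
  )
  regmatches(out, m) <- list(converted)
  out
}

to_iupac("Ma3(Ma6)Mb4GNb4GNb")
#> [1] "Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-"
```

The string matches the parsed output above, which shows how closely the two notations correspond for a simple structure like this one.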

Part 3: Software-Specific Formats — The Specialists 🔬

pGlyco Format: Proteomics Tool

If you work with glycoproteomics, you might encounter pGlyco’s parenthetical notation:

pglyco <- "(N(F)(N(H(H(N))(H(N(H))))))"
parse_pglyco_struc(pglyco)
#> <glycan_structure[1]>
#> [1] Hex(??-?)HexNAc(??-?)Hex(??-?)[HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)[dHex(??-?)]HexNAc(??-
#> # Unique structures: 1

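This cryptic notation actually represents a complex N-glycan: each letter is one monosaccharide, and each pair of parentheses wraps a residue together with the residues attached to it.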
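A quick way to sanity-check a pGlyco string is to count its monosaccharides. The base-R sketch below does that for the example. The letter mapping (H = Hex, N = HexNAc, F = dHex) is inferred from the parsed output above, and `pglyco_codes` is a name made up for illustration, not part of glyparse.

```r
pglyco <- "(N(F)(N(H(H(N))(H(N(H))))))"

# Letter-to-residue mapping, inferred from the parsed output of this example.
pglyco_codes <- c(H = "Hex", N = "HexNAc", F = "dHex")

# Drop the parentheses, then split into single letters.
residue_letters <- strsplit(gsub("[()]", "", pglyco), "")[[1]]

# Translate letters to generic residue names and tabulate them.
residue_names <- unname(pglyco_codes[residue_letters])
composition <- table(factor(residue_names, levels = pglyco_codes))

as.integer(composition)   # counts of Hex, HexNAc, dHex
#> [1] 4 4 1
```

Four Hex, four HexNAc, and one dHex is exactly what the parsed structure contains, so the counts and the parser agree.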

StrucGP Format: Alphabetical System

StrucGP uses a letter-based encoding system:

strucgp <- "A2B2C1D1E2F1fedD1E2edcbB5ba"
parse_strucgp_struc(strucgp)
#> <glycan_structure[1]>
#> [1] Hex(??-?)HexNAc(??-?)Hex(??-?)[HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)[dHex(??-?)]HexNAc(??-
#> # Unique structures: 1

The Bottom Line 🎯

glyparse transforms the chaos of glycan text formats into order. No matter where your glycan data comes from (databases, literature, or software tools), you can now parse it into a glyrepr::glycan_structure object for further analysis. In fact, the glyread package uses these parsing functions internally when reading output from common glycopeptide identification software.

Next steps:

Happy parsing! 🧬✨
