Welcome to the world of glycan text parsing! If you’ve ever worked with glycan data from different sources, you know the frustration: every database, software tool, and research group seems to have its own way of representing glycan structures as text.
That’s where glyparse comes to the rescue! 🚀
Think of glyparse as your universal glycan translator — it can read glycan structures written in many different “languages” and convert them all into a unified format that your computer can understand and work with.
Note: All functions in glyparse return glyrepr::glycan_structure objects. If you are unfamiliar with glyrepr, see the glyrepr documentation.
library(glyparse)
Before we dive in, let’s see what we’re dealing with. Here’s the same N-glycan core structure written in different formats:
| Format | Example | Where You’ll See It |
|---|---|---|
| IUPAC-condensed | Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc | Literature, UniCarbKB |
| IUPAC-short | Mana3(Mana6)Manb4GlcNAcb4GlcNAc | Literature, UniCarbKB |
| IUPAC-extended | alpha-D-Man-(1->3)-[alpha-D-Man-(1->6)]-beta-D-Man-(1->4)-beta-D-GlcNAc-(1->4)-D-GlcNAc | Literature, UniCarbKB |
| GlycoCT | Complex multi-line format | Literature, GlycomeDB |
| WURCS | WURCS=2.0/3,5,4/[...]/1-1-2-3-3/a4-b1_b4-c1... | Literature, GlyTouCan |
| Linear Code | Ma3(Ma6)Mb4GNb4GNb | Literature |
| pGlyco | (N(N(H(H(H))))) | pGlyco software results |
| StrucGP | A2B2C1D1E2fedcba | StrucGP software results |
Confusing, right? 😵‍💫 glyparse understands them all!
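The formats in the table differ enough on the surface that a few of them can be told apart by eye, or by a couple of regular expressions. The sketch below is a rough illustration in base R only. `guess_format()` is a hypothetical helper, its heuristics are assumptions made for this example, and it is not how glyparse itself detects formats.

```r
# Hypothetical helper: guess a glycan text format from surface features.
# Illustrative heuristics only; this is NOT glyparse's detection logic.
guess_format <- function(x) {
  guess_one <- function(s) {
    # WURCS strings always start with the "WURCS=" tag
    if (startsWith(s, "WURCS=")) return("WURCS")
    # GlycoCT condensed starts with its residue section header
    if (startsWith(s, "RES")) return("GlycoCT")
    # IUPAC-condensed linkages look like "(a1-", "(b1-" or "(??-"
    if (grepl("\\([ab?][0-9?]-", s)) return("IUPAC-condensed")
    # pGlyco strings are nothing but letters wrapped in nested parentheses
    if (grepl("^\\([()A-Za-z]*\\)$", s)) return("pGlyco")
    "unknown"
  }
  vapply(x, guess_one, character(1), USE.NAMES = FALSE)
}

examples <- c(
  "Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc",
  "(N(N(H(H(H)))))",
  "WURCS=2.0/3,3,2/[a2122h-1b_1-5][a1122h-1b_1-5][a1122h-1a_1-5]/1-2-3/a4-b1_b3-c1",
  "RES\n1b:b-dglc-HEX-1:5",
  "A2B2C1D1E2fedcba"
)
guessed <- guess_format(examples)
print(guessed)
```

Formats that the sketch does not cover, such as the StrucGP string in the last element, fall through to "unknown".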
glyparse provides eight specialized parsers, each optimized for a specific format:
- parse_iupac_condensed(): The most common format
- parse_iupac_short(): Compact literature format
- parse_iupac_extended(): Verbose formal format
- parse_glycoct(): Database standard format
- parse_wurcs(): Modern standardized format
- parse_linear_code(): Linear Code format
- parse_pglyco_struc(): pGlyco software format
- parse_strucgp_struc(): StrucGP software format
All parsers follow the same pattern:
- Input: a character vector of glycan structure strings
- Output: a glyrepr::glycan_structure object that you can analyze
auto_parse()
Don’t know what you’re dealing with? Give it to auto_parse()! This function tries to identify the format of each string automatically and applies the appropriate parser. A single input vector may even mix formats.
x <- c(
"Gal(b1-3)GalNAc(b1-",
"(N(F)(N(H(H(N))(H(N(H))))))",
"WURCS=2.0/3,3,2/[a2122h-1b_1-5][a1122h-1b_1-5][a1122h-1a_1-5]/1-2-3/a4-b1_b3-c1"
)
auto_parse(x)
#> <glycan_structure[3]>
#> [1] Gal(b1-3)GalNAc(b1-
#> [2] Hex(??-?)HexNAc(??-?)Hex(??-?)[HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)[dHex(??-?)]HexNAc(??-
#> [3] Man(a1-3)Man(b1-4)Glc(b1-
#> # Unique structures: 3
Let’s start with the IUPAC formats.
The IUPAC-condensed format is widely used in scientific literature and databases like UniCarbKB.
Want to know more about IUPAC-condensed format? Check this out!
# Single structure
iupac_condensed <- "Neu5Ac(a2-3)Gal(b1-4)[Fuc(a1-3)]GlcNAc(b1-4)Gal(b1-4)Glc(a1-"
parse_iupac_condensed(iupac_condensed)
#> <glycan_structure[1]>
#> [1] Neu5Ac(a2-3)Gal(b1-4)[Fuc(a1-3)]GlcNAc(b1-4)Gal(b1-4)Glc(a1-
#> # Unique structures: 1
# Multiple structures at once
glycans <- c(
"Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-", # N-glycan core
"Gal(b1-3)GalNAc(b1-", # O-glycan core 1
"Neu5Ac(a2-3)Gal(b1-3)[GlcNAc(b1-6)]GalNAc(b1-" # O-glycan core 2
)
parse_iupac_condensed(glycans)
#> <glycan_structure[3]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [2] Gal(b1-3)GalNAc(b1-
#> [3] Neu5Ac(a2-3)Gal(b1-3)[GlcNAc(b1-6)]GalNAc(b1-
#> # Unique structures: 3
The compact IUPAC-short format is popular in research papers because it saves space:
# The same structures in short format
iupac_short <- c(
"Mana3(Mana6)Manb4GlcNAcb4GlcNAcb-",
"Galb3GalNAcb-",
"Neu5Aca3Galb3(GlcNAcb6)GalNAcb-"
)
parse_iupac_short(iupac_short)
#> <glycan_structure[3]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [2] Gal(b1-3)GalNAc(b1-
#> [3] Neu5Ac(a2-3)Gal(b1-3)[GlcNAc(b1-6)]GalNAc(b1-
#> # Unique structures: 3
Notice how much more compact this is! The short format drops the anomeric carbon from each linkage, and the parser fills it in: C1 for most residues and C2 for sialic acids, so Neu5Aca3 is read as Neu5Ac(a2-3).
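To make that inference concrete, here is a minimal base-R sketch that expands IUPAC-short text into IUPAC-condensed text. `short_to_condensed()` is a hypothetical helper written for this illustration. It assumes a/b anomers, single-digit positions, round-bracket branches, and that only Neu5Ac and Neu5Gc link through C2. It is not the implementation used by parse_iupac_short().

```r
# Hypothetical helper: expand IUPAC-short text into IUPAC-condensed text.
# Assumptions: a/b anomers, single-digit positions, "(...)" branches,
# and Neu5Ac/Neu5Gc as the only residues linked through C2.
short_to_condensed <- function(x) {
  # Branches: "(...)" in short format become "[...]" in condensed format
  x <- chartr("()", "[]", x)
  # Sialic acids link through their C2: "Neu5Aca3" -> "Neu5Ac(a2-3)"
  x <- gsub("(Neu5Ac|Neu5Gc)([ab])([0-9])", "\\1(\\22-\\3)", x)
  # Everything else links through C1: "Mana3" -> "Man(a1-3)".
  # The lookbehind skips linkages that were already expanded above.
  x <- gsub("(?<!\\()([ab])([0-9])", "(\\11-\\2)", x, perl = TRUE)
  # Reducing end: trailing "b-" -> "(b1-"
  sub("([ab])-$", "(\\11-", x)
}

iupac_short <- c(
  "Mana3(Mana6)Manb4GlcNAcb4GlcNAcb-",
  "Galb3GalNAcb-",
  "Neu5Aca3Galb3(GlcNAcb6)GalNAcb-"
)
condensed <- short_to_condensed(iupac_short)
print(condensed)
```

For these three inputs the expanded strings agree with the structures printed by parse_iupac_short() above. Real data needs the parser, which handles cases this sketch ignores.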
The verbose IUPAC-extended format includes full chemical names and stereochemistry:
iupac_extended <- paste0(
"α-D-Manp-(1→3)[α-D-Manp-(1→6)]-β-D-Manp-(1→4)",
"-β-D-GlcpNAc-(1→4)-β-D-GlcpNAc-(1→"
)
parse_iupac_extended(iupac_extended)
#> <glycan_structure[1]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> # Unique structures: 1
GlycoCT appears in the literature and in databases like GlycomeDB. It is more verbose, listing residues (RES) and linkages (LIN) in separate sections, but it is also very precise:
glycoct <- paste0(
"RES\n",
"1b:b-dglc-HEX-1:5\n",
"2b:b-dgal-HEX-1:5\n",
"3b:a-dgal-HEX-1:5\n",
"LIN\n",
"1:1o(4+1)2d\n",
"2:2o(3+1)3d"
)
parse_glycoct(glycoct)
#> <glycan_structure[1]>
#> [1] Gal(a1-3)Gal(b1-4)Glc(b1-
#> # Unique structures: 1
WURCS (Web3 Unique Representation of Carbohydrate Structures) is used in literature for complex structures and in databases like GlyTouCan:
wurcs <- paste0(
"WURCS=2.0/3,3,2/",
"[a2122h-1b_1-5][a1122h-1b_1-5][a1122h-1a_1-5]/",
"1-2-3/a4-b1_b3-c1"
)
parse_wurcs(wurcs)
#> <glycan_structure[1]>
#> [1] Man(a1-3)Man(b1-4)Glc(b1-
#> # Unique structures: 1
Linear Code is a simplified format used in literature for complex structures:
linear_code <- "Ma3(Ma6)Mb4GNb4GNb"
parse_linear_code(linear_code)
#> <glycan_structure[1]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> # Unique structures: 1
If you work with glycoproteomics, you might encounter pGlyco’s parenthetical notation:
pglyco <- "(N(F)(N(H(H(N))(H(N(H))))))"
parse_pglyco_struc(pglyco)
#> <glycan_structure[1]>
#> [1] Hex(??-?)HexNAc(??-?)Hex(??-?)[HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)[dHex(??-?)]HexNAc(??-
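#> # Unique structures: 1
This cryptic notation actually represents a complex N-glycan. As the parsed output shows, N stands for HexNAc, H for Hex, and F for dHex (fucose), while every linkage is left unspecified (??-?).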
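A quick way to sanity-check a pGlyco string before parsing is to tally its letters into a monosaccharide composition. The sketch below uses base R only. The letter-to-residue mapping is read off the parsed output above (N = HexNAc, H = Hex, F = dHex); any other pGlyco symbols are not handled here, and the variable names are made up for this illustration.

```r
pglyco <- "(N(F)(N(H(H(N))(H(N(H))))))"

# Letter-to-residue mapping, taken from the parsed output shown above.
# Assumption: only these three symbols occur in the string.
residue_names <- c(H = "Hex", N = "HexNAc", F = "dHex")

# Drop the parentheses and split what remains into single letters
symbols <- strsplit(gsub("[()]", "", pglyco), "")[[1]]

# Translate the letters and count how often each residue occurs
residues <- unname(residue_names[symbols])
composition <- table(factor(residues, levels = unname(residue_names)))
print(composition)
```

The counts should agree with the residues in the structure printed by parse_pglyco_struc(): four Hex, four HexNAc, and one dHex.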
StrucGP uses a letter-based encoding system:
strucgp <- "A2B2C1D1E2F1fedD1E2edcbB5ba"
parse_strucgp_struc(strucgp)
#> <glycan_structure[1]>
#> [1] Hex(??-?)HexNAc(??-?)Hex(??-?)[HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)[dHex(??-?)]HexNAc(??-
#> # Unique structures: 1
glyparse transforms the chaos of glycan text formats into order. Wherever your glycan data comes from (databases, literature, or software tools), you can parse it into a glyrepr::glycan_structure object for further analysis. In fact, the glyread package uses these parsing functions internally when reading output from common glycopeptide identification software.
Next steps:
- glyrepr package for structure manipulation
- glymotif for motif analysis of your parsed structures
- glyexp for experimental data analysis
- glycoverse ecosystem!
Happy parsing! 🧬✨