InChI vs SMILES: When to Use Which Chemical Identifier Format

InChI vs SMILES: SMILES round-trips structures fast; InChIKey is the universal database key. Decision rules by workflow plus the stereo gotcha to avoid.

ChemStitch, May 18, 2026

SMILES is what you paste into a tool. InChI is what you search in a database. Both are line-notation identifiers for chemical structures; they look superficially similar (a string of characters that round-trips to a 2D structure) but they serve different jobs and break in different ways. Picking the wrong one for a workflow — SMILES into a PubChem deduplication, InChI into an AI retrosynthesis tool — produces silent failures that are hard to diagnose later.

This post compares InChI and SMILES on the dimensions that matter at the bench: structure round-trip, canonicalization, stereochemistry handling, database lookup, tool ecosystem support, and human readability. The SMILES reference post covers the syntax in detail; this one is the picking-between-them post.

What each format is, in one sentence

SMILES (Simplified Molecular Input Line Entry System, Weininger 1988) is a human-readable graph traversal. Atoms appear in the order you would walk through the structure; bonds, branches, and ring closures are punctuation. CCO is ethanol; c1ccccc1 is benzene. The OpenSMILES specification is the open community-maintained reference; the original Daylight spec is the historical canon.

InChI (International Chemical Identifier, IUPAC 2005) is a machine-generated canonical identifier produced by a single official algorithm. It is a layered structure: a formula layer, a connections layer, a hydrogen layer, and optional charge, stereo, isotope, fixed-H, and reconnected layers. InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 is ethanol. The InChI Trust maintains the algorithm and software.
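
The layer structure is easy to see by splitting the string on "/". The following is a minimal Python sketch, not the InChI Trust's parser: the function name split_inchi_layers is made up for illustration, and it assumes a simple standard InChI in which each layer prefix appears once (non-standard strings with fixed-H or reconnected layers can repeat prefixes and would need more care).

```python
def split_inchi_layers(inchi: str) -> dict[str, str]:
    """Split a simple standard InChI into version, formula, and prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0]}          # "1S" means version 1, standard
    if len(parts) > 1:
        layers["formula"] = parts[1]        # the formula layer has no prefix letter
    for part in parts[2:]:
        if part:                            # ignore empty segments
            layers[part[0]] = part[1:]      # prefix letter: c, h, q, p, b, t, m, s, i
    return layers


ethanol = split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol)
# {'version': '1S', 'formula': 'C2H6O', 'c': '1-2-3', 'h': '3H,2H2,1H3'}
```

Here c is the connections layer and h is the hydrogen layer; a stereo-defined molecule would add b (double bond) and t (tetrahedral) entries.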

Criteria that matter for picking between them

Six dimensions practitioners care about when these identifiers move between tools, databases, and publications:

Round-trip: can the identifier reliably be converted to and from a 2D structure without information loss?
Canonicalization: do two valid identifiers for the same molecule match exactly, or do they differ depending on how the structure was drawn?
Stereochemistry handling: does the identifier capture R/S and E/Z reliably?
Database lookup: which identifier do PubChem, ChemSpider, and patent databases use as their internal key?
Tool support: do the editors, AI tools, and scripts in your workflow accept both?
Human readability: can a chemist sanity-check the identifier by eye?

Round-trip: SMILES is forgiving, InChI is opinionated

A SMILES string is a description of a graph traversal. Many valid SMILES describe the same molecule — ethanol is CCO, OCC, C(C)O, and C(O)C. Round-tripping a SMILES through a tool and back gives you *a* valid SMILES for the molecule, not necessarily the one you started with. This is usually fine: the structure is preserved even if the string changes.

InChI is the opposite. The algorithm produces exactly one InChI string per molecule, by construction. Round-tripping is deterministic: every tool that implements InChI correctly produces the same string for the same molecule. The cost is that the string is unreadable — you cannot reconstruct the structure by eye the way you can with SMILES. The benefit is that string comparison is structure comparison.

The canonical-SMILES caveat: RDKit, OpenBabel, and other toolkits produce canonical SMILES, an algorithmically determined "one true" SMILES per molecule. Different toolkits use different canonicalization algorithms, so RDKit canonical SMILES and OpenBabel canonical SMILES for the same molecule may differ. Canonical SMILES from the same toolkit are comparable; canonical SMILES across toolkits are not. InChI does not have this problem because there is only one algorithm.

Canonicalization and dedup

If you have a list of 10,000 compounds from a vendor file and you want to find duplicates, the identifier comparison strategy matters:

Comparing raw input SMILES catches only string-identical duplicates — misses the same molecule entered with different SMILES.
Comparing canonical SMILES from one toolkit catches structural duplicates within that toolkit’s conventions — but cross-toolkit comparison is unreliable.
Comparing InChI catches structural duplicates universally — this is what PubChem and ChemSpider use internally.
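
The three strategies can be compared side by side on a toy list. This is a sketch that assumes RDKit is installed with its InChI support enabled; count_distinct and the five-entry vendor list are invented for illustration.

```python
from rdkit import Chem


def count_distinct(smiles_list):
    """Count distinct entries under three comparison strategies."""
    raw, canonical, inchi = set(), set(), set()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue                            # skip strings RDKit cannot parse
        raw.add(smi)                            # plain string comparison
        canonical.add(Chem.MolToSmiles(mol))    # RDKit canonical SMILES
        inchi.add(Chem.MolToInchi(mol))         # standard InChI
    return len(raw), len(canonical), len(inchi)


# Three spellings of ethanol and two of benzene (hypothetical vendor file)
vendor = ["CCO", "OCC", "C(C)O", "c1ccccc1", "C1=CC=CC=C1"]
print(count_distinct(vendor))  # (5, 2, 2)
```

Raw strings see five entries; both canonical forms collapse them to two structures. The difference between the last two columns only shows up when a second toolkit produces some of the canonical SMILES.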

For dedup, InChI is the default choice. The shorter InChIKey (a fixed-length hash of the InChI: 27 characters in three hyphen-separated blocks of 14, 10, and 1) is what databases actually index because it fits in a database column and supports fast lookup. LFQSCWFLJHTTHZ-UHFFFAOYSA-N is the InChIKey for ethanol. PubChem’s API accepts InChIKey directly; PubChem programmatic access docs describe the lookup endpoints.
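
Because the InChIKey has a fixed shape, a cheap format check catches truncated or mangled keys before they reach a database query. A minimal sketch using only the standard library; parse_inchikey and the dictionary field names are illustrative, and the check validates shape only, not whether the key belongs to any real compound.

```python
import re

# Block 1: 14 letters hashed from the formula and connectivity.
# Block 2: 8 letters hashed from the remaining layers, then a standard (S)
#          or non-standard (N) flag, then an InChI version letter.
# Block 3: 1 letter protonation flag (N for neutral).
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def parse_inchikey(key: str) -> dict:
    """Split a well-formed InChIKey into its blocks; raise on a bad shape."""
    match = INCHIKEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    connectivity, other_layers, std_flag, version, protonation = match.groups()
    return {
        "connectivity_block": connectivity,
        "other_layers_hash": other_layers,
        "is_standard": std_flag == "S",
        "version_letter": version,
        "protonation_flag": protonation,
    }


ethanol_key = parse_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
print(ethanol_key["connectivity_block"], ethanol_key["is_standard"])
# LFQSCWFLJHTTHZ True
```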

Stereochemistry

Both formats handle stereochemistry, but with different conventions and different failure modes.

SMILES stereochemistry: tetrahedral chirality with @ and @@ on the atom; E/Z geometry with / and \ on the bond. F/C=C/F is trans-1,2-difluoroethylene; F/C=C\F is cis. The convention is local — the symbol describes the bond geometry relative to its neighbors. Writing SMILES with stereochemistry by hand is error-prone; tools that round-trip stereo are reliable but the human-eye check is hard.

InChI stereochemistry: the stereo layer (/t and /b) is computed from the structure and is part of the canonical form. Stereo parities are unambiguous because the algorithm assigns them relative to its own canonical atom numbering. The cost is that you cannot read R or S off the InChI by eye: you get the algorithm’s normalized parity marks, not the chemist’s CIP descriptors.

Stereochemistry round-trip gotcha: A SMILES string that omits stereo is interpreted as "stereochemistry unspecified", which is not the same thing as a defined racemate or a single enantiomer. If you generate InChI from such a SMILES, the InChI also lacks the stereo layer. The stereo-unspecified record and the single enantiomer therefore have different InChIKeys: the first block (14 characters, hashed from the formula and connectivity) is identical, but the second block differs because it hashes the stereo layer. Searching PubChem with the connectivity-only InChIKey prefix (first 14 characters) finds both; searching with the full InChIKey finds only one. This trips people up when they expect "the same molecule" to match.
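
Grouping keys by their first block makes the behaviour concrete. The sketch below uses only the standard library; group_by_connectivity is an illustrative helper, and the three alanine keys are the values these records are believed to carry in public databases, so regenerate them with your own toolkit before relying on them.

```python
from collections import defaultdict


def group_by_connectivity(keys):
    """Group full InChIKeys by their 14-character connectivity block."""
    groups = defaultdict(list)
    for key in keys:
        groups[key.split("-")[0]].append(key)
    return dict(groups)


# Assumed keys for alanine records (verify before use)
alanine_keys = [
    "QNAYBMKLOCPYGJ-REOHSEJHSA-N",  # L-alanine
    "QNAYBMKLOCPYGJ-UWTATZPHSA-N",  # D-alanine
    "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",  # alanine, stereo unspecified
]

groups = group_by_connectivity(alanine_keys)
print(len(set(alanine_keys)), "distinct full keys")   # 3 distinct full keys
print(len(groups), "connectivity group")              # 1 connectivity group
```

A full-key match treats these as three different compounds; a first-block match treats them as one skeleton. Which answer is correct depends on whether the workflow cares about stereochemistry.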

Database lookup — which one PubChem actually uses

PubChem indexes every compound record by InChIKey alongside its own compound ID (CID). SMILES are accepted as a query format (the API parses them and resolves them to the matching record) but InChIKey is the key that carries across databases. ChemSpider, ChEMBL, DrugBank, and the EPA CompTox dashboard each assign their own record IDs, and all of them index and expose InChIKey. If your workflow involves cross-referencing compounds across databases, InChIKey is the lingua franca.
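
For scripted lookups, the same query can be expressed against PubChem's PUG REST interface. The sketch below only builds the request URL and wraps an optional fetch; cid_lookup_url and fetch_cids are illustrative names, and the response is returned as parsed JSON without assuming its exact shape. Check the PubChem programmatic access docs for rate limits before calling it in a loop.

```python
import json
import urllib.parse
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(identifier: str, namespace: str = "inchikey") -> str:
    """Build a PUG REST URL that asks for the CIDs matching an identifier."""
    if namespace not in {"inchikey", "smiles"}:
        raise ValueError("this sketch only covers inchikey and smiles lookups")
    encoded = urllib.parse.quote(identifier, safe="")  # escape brackets, '=', '#'
    return f"{PUG_REST}/compound/{namespace}/{encoded}/cids/JSON"


def fetch_cids(identifier: str, namespace: str = "inchikey"):
    """Perform the lookup over the network and return the parsed JSON body."""
    with urllib.request.urlopen(cid_lookup_url(identifier, namespace), timeout=10) as resp:
        return json.load(resp)


print(cid_lookup_url("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))
# https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/LFQSCWFLJHTTHZ-UHFFFAOYSA-N/cids/JSON
```

InChIKeys contain only letters and hyphens, so they drop into a URL path untouched. SMILES with stereo bonds contain slashes, which is one more reason to prefer the key for lookups.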

SMILES is what you paste into a structure editor (Ketcher, ChemDraw, MarvinSketch, ChemStitch) because it round-trips to a structure faster — the editor parses the graph traversal directly. The structure-editor comparison post covers which editors handle which input formats. InChI requires an extra round-trip through the InChI software (the official algorithm is in a C library; ChemDraw, ChemStitch, RDKit, and OpenBabel all wrap it).

Tool support

Both formats are widely supported, but the assumed default varies by tool category:

Tool category	Default input	InChI support
Structure editors (ChemDraw, Ketcher, ChemStitch)	SMILES, MOL	Import/export; not the primary surface
Cheminformatics toolkits (RDKit, OpenBabel)	SMILES	Full API for InChI generation and parsing
Public databases (PubChem, ChemSpider, ChEMBL)	InChIKey (internal); SMILES (query)	Primary identifier
Patent databases (Reaxys, SciFinder)	Structure drawing; SMILES query	InChIKey indexed
AI tools (retrosynthesis, property prediction)	SMILES	Variable; some accept InChI, most expect SMILES
ELN systems (Benchling, Signals, Dotmatics)	SMILES, MOL	Import/export

The rough pattern: if you are working with a molecule (drawing, predicting, sending to AI), reach for SMILES. If you are storing or looking up a molecule (database key, dedup, cross-reference), reach for InChI or InChIKey.

Human readability

SMILES is readable enough that a chemist can recognize structural features by eye. c1ccc2ccccc2c1 is naphthalene — the trained eye sees two fused aromatic rings in the lowercase c letters and the ring-closure digits. Reading SMILES is a skill experienced chemists develop because it makes copy-paste workflows trustworthy — you can spot a wrong structure before it propagates.

InChI is not designed for human reading. The layered structure encodes information for the algorithm to parse, not for a chemist to interpret. The hydrogen layer, the stereo layer, and the isotope layer are computed representations. Looking at InChI=1S/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H and identifying naphthalene by eye is possible (C10H8 in the formula layer, h1-8H in the hydrogen layer marking eight CH atoms) but slow. For sanity-checking, SMILES wins.

Verdict by reader-type

If you are drawing structures or sending them to AI tools: SMILES. It is what every editor and AI tool expects, and it round-trips fast. Use canonical SMILES from a single toolkit if you need same-toolkit dedup.
If you are storing structures in a database or deduplicating a vendor file: InChIKey. It is universal, it is short enough to index, and it is what PubChem and ChemSpider use internally.
If you are publishing in a chemistry journal: both. The convention since the mid-2010s is to provide SMILES (for readability) and InChI / InChIKey (for unambiguous identification) in the supporting information.
If you are doing patent work: InChIKey for database searches, MOL files for figure-quality structure rendering. SMILES is acceptable but not the default in IP workflows.
If you are writing a script: pick the one your input source uses. Convert with RDKit or OpenBabel when you need to switch.

Standard InChI vs. non-standard InChI: Standard InChI uses the default layer flags and is what every public database produces. Non-standard InChI includes optional flags (fixed-H, tautomer-specific representations) that change the canonical form. Two molecules can have different non-standard InChIs but identical standard InChIs. Always specify "standard InChI" in publications and database queries unless you have a specific reason to use non-standard variants.

Where each format breaks

SMILES limitations: aromatic perception varies across toolkits (Daylight model vs RDKit model vs OpenSMILES) — a SMILES that round-trips in one toolkit may not in another. Stereochemistry written by hand is error-prone. Polymers, mixtures, and organometallics have no universally accepted SMILES convention.

InChI limitations: standard InChI normalizes mobile-hydrogen tautomers between heteroatoms (2-pyridone and 2-hydroxypyridine have the same standard InChI), which is correct for dedup but wrong if you care about which tautomer was drawn. The normalization is not complete: keto-enol pairs such as acetone and propen-2-ol keep different standard InChIs. Stereochemistry edge cases are covered unevenly: P-stereocenters and allene-type axial chirality are encoded, though the round-trip to a 2D structure may not preserve the original drawing convention, while atropisomerism is not captured by the standard stereo layer. Salts and mixtures are represented as multi-component InChI but the convention for which component is "primary" varies.
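
Tautomer normalization is easy to observe on a pair of mobile-hydrogen tautomers. A small sketch assuming RDKit with InChI support; the 2-pyridone and 2-hydroxypyridine pair is chosen here for illustration, and std_inchi is a made-up helper name.

```python
from rdkit import Chem


def std_inchi(smiles: str) -> str:
    """Return the standard InChI RDKit generates for a SMILES string."""
    return Chem.MolToInchi(Chem.MolFromSmiles(smiles))


pyridone = std_inchi("O=c1cccc[nH]1")      # lactam form, hydrogen on nitrogen
hydroxypyridine = std_inchi("Oc1ccccn1")   # lactim form, hydrogen on oxygen

# The two drawings differ as SMILES but collapse to one standard InChI,
# because the mobile hydrogen is recorded as shared between N and O.
print(pyridone == hydroxypyridine)
```

If the drawn tautomer matters downstream, store the input SMILES or MOL block alongside the InChI rather than relying on the InChI alone.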

For practical work, the heuristic is: if SMILES and InChI disagree about whether two structures are the same, InChI is usually right but the disagreement is the signal — investigate the structure, do not just pick the answer you prefer.

Converting between them

RDKit (Python): Chem.MolToInchi(mol) and Chem.MolFromInchi(inchi). OpenBabel: obabel -:"CCO" -oinchi at the command line for a single SMILES, or obabel input.smi -oinchi for a file. Online: NCI Chemical Identifier Resolver accepts either input and converts on the fly, which is useful for one-off lookups without a Python environment.
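
A round trip through the RDKit calls looks roughly like this. It is a sketch that assumes an RDKit build with InChI support; ethanol is used only because its identifiers appear earlier in this post.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")             # SMILES -> molecule

inchi = Chem.MolToInchi(mol)                # molecule -> standard InChI
key = Chem.MolToInchiKey(mol)               # molecule -> InChIKey
key_from_inchi = Chem.InchiToInchiKey(inchi)  # InChI -> InChIKey, no molecule needed

back = Chem.MolFromInchi(inchi)             # InChI -> molecule
smiles_again = Chem.MolToSmiles(back)       # molecule -> RDKit canonical SMILES

print(inchi)         # InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3
print(key)           # LFQSCWFLJHTTHZ-UHFFFAOYSA-N
print(smiles_again)  # CCO
```

Both Chem.MolFromSmiles and Chem.MolFromInchi return None on input they cannot parse, so a script processing untrusted files should check for that before converting.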

In ChemStitch, drawing a structure produces both SMILES and InChI as properties of the molecule, alongside MW, logP, and the other RDKit-computed descriptors. The conversion is local (no API call), so cross-format identifiers are always available without leaving the editor.