InChI vs SMILES: When to Use Which Chemical Identifier Format
InChI vs SMILES: SMILES round-trips structures fast; InChIKey is the universal database key. Decision rules by workflow plus the stereo gotcha to avoid.
SMILES is what you paste into a tool. InChI is what you search in a database. Both are line-notation identifiers for chemical structures; they look superficially similar (a string of characters that round-trips to a 2D structure) but they serve different jobs and break in different ways. Picking the wrong one for a workflow — SMILES into a PubChem deduplication, InChI into an AI retrosynthesis tool — produces silent failures that are hard to diagnose later.
This post compares InChI and SMILES on the dimensions that matter at the bench: structure round-trip, canonicalization, stereochemistry handling, database lookup, tool ecosystem support, and human readability. The SMILES reference post covers the syntax in detail; this post is about choosing between the two.
What each format is, in one sentence
SMILES (Simplified Molecular Input Line Entry System, Weininger 1988) is a human-readable graph traversal. Atoms appear in the order you would walk through the structure; bonds, branches, and ring closures are punctuation. CCO is ethanol; c1ccccc1 is benzene. The OpenSMILES specification is the open community-maintained reference; the original Daylight spec is the historical canon.
InChI (International Chemical Identifier, IUPAC 2005) is a machine-generated canonical identifier produced by a single official algorithm. It is a layered structure: a formula layer, a connections layer, a hydrogen layer, and optional charge, stereo, isotope, fixed-H, and reconnected layers. InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 is ethanol. The InChI Trust maintains the algorithm and software.
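The layer structure is easier to see once the string is split on its / separators. The following is a minimal sketch using only the Python standard library. The helper name split_inchi_layers is hypothetical, and the sketch ignores the fact that layer prefixes can repeat after a fixed-H or reconnected layer.

```python
def split_inchi_layers(inchi: str) -> list[tuple[str, str]]:
    """Split an InChI string into (layer_name, body) pairs, in order."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, *rest = inchi[len(prefix):].split("/")
    layers = [("version", version)]
    for position, layer in enumerate(rest):
        if not layer:
            continue
        # The formula layer has no prefix letter; every other layer
        # starts with a lowercase letter such as c, h, q, b, t, or i.
        if position == 0 and not layer[0].islower():
            layers.append(("formula", layer))
        else:
            layers.append((layer[0], layer[1:]))
    return layers


# Ethanol, using the InChI quoted above.
for name, body in split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"):
    print(f"{name:8s} {body}")
```

For ethanol this yields the version, the formula layer, the connections layer (c), and the hydrogen layer (h), in that order.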
Criteria that matter for picking between them
Six dimensions practitioners care about when these identifiers move between tools, databases, and publications:
- Round-trip: can the identifier reliably be converted to and from a 2D structure without information loss?
- Canonicalization: do two valid identifiers for the same molecule match exactly, or do they differ depending on how the structure was drawn?
- Stereochemistry handling: does the identifier capture R/S and E/Z reliably?
- Database lookup: which identifier do PubChem, ChemSpider, and patent databases use as their internal key?
- Tool support: do the editors, AI tools, and scripts in your workflow accept both?
- Human readability: can a chemist sanity-check the identifier by eye?
Round-trip: SMILES is forgiving, InChI is opinionated
A SMILES string is a description of a graph traversal. Many valid SMILES describe the same molecule — ethanol is CCO, OCC, C(C)O, and C(O)C. Round-tripping a SMILES through a tool and back gives you *a* valid SMILES for the molecule, not necessarily the one you started with. This is usually fine: the structure is preserved even if the string changes.
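To make the "many strings, one graph" point concrete, here is a toy sketch in plain Python. It is not how real toolkits work: the parser (a hypothetical parse_acyclic_smiles) understands only the atoms C, N, and O, implicit single bonds, and parentheses, with no rings, charges, or aromaticity, and the canonical form is found by brute force over atom orderings, which is only feasible for a handful of atoms.

```python
from itertools import permutations


def parse_acyclic_smiles(smiles: str) -> tuple[list[str], list[tuple[int, int]]]:
    """Parse a tiny SMILES subset into (atoms, bonds)."""
    atoms: list[str] = []
    bonds: list[tuple[int, int]] = []
    prev = None   # index of the atom the next atom will bond to
    stack = []    # attachment points saved when a branch opens
    for ch in smiles:
        if ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        elif ch in "CNO":
            atoms.append(ch)
            index = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, index))
            prev = index
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return atoms, bonds


def canonical_form(atoms, bonds):
    """Smallest (labels, edges) pair over all atom renumberings."""
    best = None
    for perm in permutations(range(len(atoms))):
        # perm[old_index] is the new index assigned to that atom.
        labels = [""] * len(atoms)
        for old, new in enumerate(perm):
            labels[new] = atoms[old]
        edges = sorted(tuple(sorted((perm[a], perm[b]))) for a, b in bonds)
        candidate = (tuple(labels), tuple(edges))
        if best is None or candidate < best:
            best = candidate
    return best


ethanol_strings = ["CCO", "OCC", "C(C)O", "C(O)C"]
forms = {canonical_form(*parse_acyclic_smiles(s)) for s in ethanol_strings}
print(len(forms))  # all four strings describe one graph
```

Dimethyl ether (COC) has the same atoms but a different canonical form, because the oxygen sits in the middle of the chain.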
InChI is the opposite. For a given set of options, the algorithm produces exactly one InChI string per structure, by construction. Round-tripping is deterministic: every tool that wraps the official InChI library produces the same standard InChI for the same molecule. The cost is that the string is hard to read; you cannot reconstruct the structure by eye the way you can with SMILES. The benefit is that string comparison is structure comparison.
Canonicalization and dedup
If you have a list of 10,000 compounds from a vendor file and you want to find duplicates, the identifier comparison strategy matters:
- Comparing raw input SMILES catches only string-identical duplicates — misses the same molecule entered with different SMILES.
- Comparing canonical SMILES from one toolkit catches structural duplicates within that toolkit’s conventions — but cross-toolkit comparison is unreliable.
- Comparing standard InChI catches structural duplicates regardless of which toolkit or drawing produced the input, which is why PubChem and ChemSpider compute and index it for every compound.
For dedup, InChI is the default choice. The shorter InChIKey (a fixed-length hash of the InChI: 27 characters including hyphens, in blocks of 14, 10, and 1) is what databases actually index because it fits in a fixed-width column and supports fast exact-match lookup. The first block hashes the connectivity; the second covers the remaining layers, such as stereochemistry and isotopes. LFQSCWFLJHTTHZ-UHFFFAOYSA-N is the InChIKey for ethanol. PubChem’s API accepts InChIKey directly; the PubChem programmatic access docs describe the lookup endpoints.
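A dedup pass over a vendor file can be sketched as a group-by on the InChIKey column. The sketch below assumes the keys have already been computed by a toolkit; the function names and the VEND-... catalog IDs are made up for illustration, and the regular expression checks only the shape of the key (14, 10, and 1 uppercase letters), not whether it corresponds to a real structure.

```python
import re
from collections import defaultdict

# Shape check only: three hyphen-separated blocks of uppercase letters.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_inchikey(text: str) -> bool:
    return INCHIKEY_RE.fullmatch(text) is not None


def find_duplicates(records):
    """Map each InChIKey seen more than once to its catalog IDs."""
    groups = defaultdict(list)
    for catalog_id, key in records:
        if not is_inchikey(key):
            raise ValueError(f"{catalog_id}: malformed InChIKey {key!r}")
        groups[key].append(catalog_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical vendor rows: (catalog ID, InChIKey).
vendor_rows = [
    ("VEND-001", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),  # ethanol
    ("VEND-002", "UHOVQNZJYSORNB-UHFFFAOYSA-N"),  # benzene
    ("VEND-003", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),  # ethanol again
]
print(find_duplicates(vendor_rows))
```

Grouping on the full key treats stereoisomers as different compounds. Grouping on the first 14 characters instead would merge them, which is sometimes what a purchasing workflow wants and rarely what a screening workflow wants.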
Stereochemistry
Both formats handle stereochemistry, but with different conventions and different failure modes.
SMILES stereochemistry: tetrahedral chirality with @ and @@ on the atom; E/Z geometry with / and \ on the bond. F/C=C/F is trans-1,2-difluoroethylene; F/C=C\F is cis. The convention is local: each / or \ says which side of the double bond a substituent sits on, and whether a center is @ or @@ depends on the order in which its neighbors appear in the string. Writing SMILES with stereochemistry by hand is error-prone; tools that round-trip stereo are reliable, but the human-eye check is hard.
InChI stereochemistry: the stereo layers (/b for double bonds; /t, /m, and /s for tetrahedral centers) are computed from the structure and are part of the canonical form. The parities are unambiguous because they are expressed relative to the algorithm’s canonical atom numbering. The cost is that you cannot read R or S off the InChI by eye: you get the algorithm’s normalized parity marks, not the chemist’s CIP descriptors.
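Because stereo lives in its own layers, a quick string-level check can tell whether two InChI strings differ only in stereochemistry. This is a rough sketch, not a substitute for regenerating the InChI from a stereo-free structure: strip_stereo is a hypothetical helper that simply drops any layer starting with b, t, m, or s. The two inputs are the standard InChI strings for (E)- and (Z)-but-2-ene.

```python
STEREO_PREFIXES = ("b", "t", "m", "s")


def strip_stereo(inchi: str) -> str:
    """Drop the /b, /t, /m, and /s layers from an InChI string."""
    head, *layers = inchi.split("/")
    # The formula layer starts with an uppercase letter, so it is kept.
    kept = [layer for layer in layers if not layer.startswith(STEREO_PREFIXES)]
    return "/".join([head, *kept])


def differ_only_in_stereo(inchi_a: str, inchi_b: str) -> bool:
    return inchi_a != inchi_b and strip_stereo(inchi_a) == strip_stereo(inchi_b)


e_butene = "InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3+"
z_butene = "InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3-"
print(differ_only_in_stereo(e_butene, z_butene))
```

A pair flagged this way is the usual source of a dedup surprise: one vendor drew the double bond geometry and the other did not, or drew the opposite one.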
Database lookup — which one PubChem actually uses
PubChem assigns its own compound ID (CID) to each record, but it computes and indexes a standard InChIKey for every compound, so an InChIKey query is an exact-match lookup. SMILES are also accepted as a query format; the service parses and standardizes the structure before matching it to a record. ChemSpider, ChEMBL, DrugBank, and the EPA CompTox dashboard likewise keep their own accession numbers and expose InChIKey as a searchable field. If your workflow involves cross-referencing compounds across databases, InChIKey is the lingua franca.
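An InChIKey lookup against PubChem can be sketched with the standard library alone. The URL pattern below follows PubChem's PUG REST interface (compound, then the inchikey namespace, then the cids operation); the function names are hypothetical, and the fetch step is left uncalled here because it needs network access and should respect PubChem's usage limits.

```python
import json
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(inchikey: str) -> str:
    """Build the PUG REST URL that returns the CIDs matching an InChIKey."""
    return f"{PUG_REST}/compound/inchikey/{inchikey}/cids/JSON"


def fetch_cids(inchikey: str) -> dict:
    """Perform the lookup and return the parsed JSON response as-is."""
    with urlopen(cid_lookup_url(inchikey), timeout=30) as response:
        return json.load(response)


# Ethanol, using the InChIKey quoted earlier in this post.
print(cid_lookup_url("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))
```

Only the InChIKey form is shown because the key is URL-safe as written. SMILES strings can contain characters such as / and # that need extra care in a URL path.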
SMILES is what you paste into a structure editor (Ketcher, ChemDraw, MarvinSketch, ChemStitch) because it round-trips to a structure faster — the editor parses the graph traversal directly. The structure-editor comparison post covers which editors handle which input formats. InChI requires an extra round-trip through the InChI software (the official algorithm is in a C library; ChemDraw, ChemStitch, RDKit, and OpenBabel all wrap it).
Tool support
Both formats are widely supported, but the assumed default varies by tool category:
| Tool category | Default input | InChI support |
|---|---|---|
| Structure editors (ChemDraw, Ketcher, ChemStitch) | SMILES, MOL | Import/export; not the primary surface |
| Cheminformatics toolkits (RDKit, OpenBabel) | SMILES | Full API for InChI generation and parsing |
| Public databases (PubChem, ChemSpider, ChEMBL) | InChIKey or SMILES (query) | Indexed for every compound alongside the database's own accession ID |
| Curated literature and patent databases (Reaxys, SciFinder) | Structure drawing; SMILES query | InChIKey indexed |
| AI tools (retrosynthesis, property prediction) | SMILES | Variable; some accept InChI, most expect SMILES |
| ELN systems (Benchling, Signals, Dotmatics) | SMILES, MOL | Import/export |
The rough pattern: if you are working with a molecule (drawing, predicting, sending to AI), reach for SMILES. If you are storing or looking up a molecule (database key, dedup, cross-reference), reach for InChI or InChIKey.
Human readability
SMILES is readable enough that a chemist can recognize structural features by eye. c1ccc2ccccc2c1 is naphthalene — the trained eye sees two fused aromatic rings in the lowercase c letters and the ring-closure digits. Reading SMILES is a skill experienced chemists develop because it makes copy-paste workflows trustworthy — you can spot a wrong structure before it propagates.
InChI is not designed for human reading. The layered structure encodes information for the algorithm to parse, not for a chemist to interpret. The hydrogen layer, the stereo layer, and the isotope layer are computed representations. Looking at InChI=1S/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H and identifying naphthalene by eye is possible (C10H8 in the formula layer, and /h1-8H placing one hydrogen on each of eight carbons, which leaves the two ring-fusion carbons bare) but slow. For sanity-checking, SMILES wins.
Verdict by reader-type
- If you are drawing structures or sending them to AI tools: SMILES. It is what every editor and AI tool expects, and it round-trips fast. Use canonical SMILES from a single toolkit if you need same-toolkit dedup.
- If you are storing structures in a database or deduplicating a vendor file: InChIKey. It is toolkit-independent, it is short enough to index, and it is the field PubChem and ChemSpider expose for exact-match search.
- If you are publishing in a chemistry journal: both. The convention since the mid-2010s is to provide SMILES (for readability) and InChI / InChIKey (for unambiguous identification) in the supporting information.
- If you are doing patent work: InChIKey for database searches, MOL files for figure-quality structure rendering. SMILES is acceptable but not the default in IP workflows.
- If you are writing a script: pick the one your input source uses. Convert with RDKit or OpenBabel when you need to switch.
Where each format breaks
SMILES limitations: aromatic perception varies across toolkits (Daylight model vs RDKit model vs OpenSMILES) — a SMILES that round-trips in one toolkit may not in another. Stereochemistry written by hand is error-prone. Polymers, mixtures, and organometallics have no universally accepted SMILES convention.
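InChI limitations: standard InChI normalizes mobile-hydrogen tautomers by default (the amide and imidic acid forms of formamide share one standard InChI). This is correct for dedup but wrong if you care about which tautomer was drawn. The normalization is also partial: keto-enol pairs such as acetone and propen-2-ol keep distinct standard InChIs unless a non-standard option is enabled. Stereochemistry coverage has gaps as well: tetrahedral centers (including phosphorus) and double bonds are encoded, but atropisomers are not distinguished in InChI v1, and the round-trip to a 2D structure may not preserve the original drawing convention. Salts and mixtures are represented as multi-component InChI, with components ordered by the algorithm rather than by which one a chemist would call "primary".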
For practical work, the heuristic is: if SMILES and InChI disagree about whether two structures are the same, InChI is usually right but the disagreement is the signal — investigate the structure, do not just pick the answer you prefer.
Converting between them
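RDKit (Python): Chem.MolFromSmiles(smiles) parses the input, Chem.MolToInchi(mol) and Chem.MolToInchiKey(mol) generate the identifiers, and Chem.MolFromInchi(inchi) goes the other way. OpenBabel: obabel -:"CCO" -oinchi converts a single SMILES at the command line, and obabel input.smi -oinchi converts a file. Online: the NCI Chemical Identifier Resolver accepts either input and converts on the fly, which is useful for one-off lookups without a Python environment.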
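Put together, the RDKit route looks roughly like the sketch below. It assumes an RDKit build with InChI support (the standard pip and conda packages include it). The two wrapper function names are hypothetical; the Chem calls inside them are RDKit's own. RDKit is imported inside the functions only so that the module still loads on a machine without it.

```python
def identifiers_from_smiles(smiles: str) -> dict[str, str]:
    """Return canonical SMILES, InChI, and InChIKey for one SMILES input."""
    from rdkit import Chem  # lazy import: RDKit is an optional dependency here

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    inchi = Chem.MolToInchi(mol)
    return {
        "canonical_smiles": Chem.MolToSmiles(mol),  # RDKit's canonical form
        "inchi": inchi,
        "inchikey": Chem.InchiToInchiKey(inchi),
    }


def smiles_from_inchi(inchi: str) -> str:
    """Convert an InChI back to RDKit canonical SMILES."""
    from rdkit import Chem

    mol = Chem.MolFromInchi(inchi)
    if mol is None:
        raise ValueError(f"could not parse InChI: {inchi!r}")
    return Chem.MolToSmiles(mol)
```

Calling identifiers_from_smiles("OCC") should give the ethanol InChI and InChIKey quoted earlier in this post, along with a canonical SMILES that is specific to RDKit.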
In ChemStitch, drawing a structure produces both SMILES and InChI as properties of the molecule, alongside MW, logP, and the other RDKit-computed descriptors. The conversion is local (no API call), so cross-format identifiers are always available without leaving the editor.