
How We Localized 600 Marketplace Listings Across 8 Languages in 2026: The Step-by-Step Workflow That Survived Every Platform Review

03 Jun 2026, 3 min read

When a multi-account operation starts losing listings, the first instinct is to blame the infrastructure. A flagged storefront, a suppressed product page, a marketplace account that suddenly stops converting: the usual suspects are the IP, the fingerprint, or the proxy. But after our team localized a 600-listing catalog across eight markets last quarter, the quietest cause of trouble turned out to be none of those. It was the words inside the listing.

A product title that reads as awkward, duplicated, or machine-stamped in the local language does not just convert poorly. On several marketplaces it trips the same quality and authenticity checks that operators work so hard to avoid at the account level. Below is the workflow we built to take one English catalog into eight languages without a single listing being pulled in review, and the parts of it we would keep if we had to start over tomorrow.

Where multilingual listings actually break

Before writing a line of localized copy, it helps to know exactly where the failures happen. We saw three.

The first is review. Marketplace quality systems increasingly read unnatural phrasing as a sign of a low-quality or duplicated listing. That kind of signal suppresses a page rather than rejecting it outright, so you lose visibility without ever getting a clear reason why. The second is drift. Across a 600-item catalog, the same source term can come out three different ways in the target language across three sessions, and that inconsistency reads as a sloppy or untrustworthy store. The third is compliance. In regulated categories, a single mistranslated claim or spec is not a style issue; it is a liability.

None of this is hypothetical. Independent benchmarking found that baseline systems average between ten and fifteen errors per text before any structured review, and academic work on machine translation has shown that every model carries its own distinct failure patterns, with longer passages raising the risk further. That last point matters for catalogs specifically, because product descriptions are exactly the long, detail-dense passages where a single model is most likely to slip.

It also matters because, as anyone scaling a store knows, each market brings its own language and buyer expectations. You are not translating one catalog once. You are producing a coherent, native-reading store eight separate times.

Step 1: We locked terminology before translating a line

The first thing we did was resist the urge to translate. Instead we built a term base: a fixed list of every product name, spec unit, material, and brand term, with the one approved rendering for each in every target language. If a model later produced a different version, it was wrong by definition, not by opinion.

This step is unglamorous and it is the one that saved us the most rework. For operators running stores across Amazon, eBay, Lazada, Shopee, and Shopify, the term base does double duty: it keeps each store internally consistent, and it gives every later quality check a clear standard to measure against. Decide what must never vary before you let any automated system touch the copy.
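A term base of this kind can be sketched as a small lookup plus a check that runs on every translated listing. The version below is a minimal illustration, not the tooling we used: the `TermEntry` structure, the sample entries, and the `find_violations` helper are all hypothetical names, and the plain substring match ignores inflection, so in languages such as Polish it would flag correctly declined forms and needs something smarter in practice.

```python
from dataclasses import dataclass


@dataclass
class TermEntry:
    source: str                # English source term
    approved: dict[str, str]   # language code -> the one approved rendering


# Hypothetical entries, for illustration only.
TERM_BASE = [
    TermEntry("stainless steel", {"de": "Edelstahl", "pl": "stal nierdzewna"}),
    TermEntry("water bottle", {"de": "Trinkflasche", "pl": "butelka na wodę"}),
]


def find_violations(source_text: str, translated_text: str, lang: str) -> list[str]:
    """Return source terms whose approved rendering is missing from the translation."""
    violations = []
    src = source_text.casefold()
    out = translated_text.casefold()
    for entry in TERM_BASE:
        if entry.source.casefold() not in src:
            continue  # this listing does not use the term
        approved = entry.approved.get(lang)
        # A missing entry for the language counts as a violation too:
        # nothing has been approved yet, so nothing can be correct.
        if approved is None or approved.casefold() not in out:
            violations.append(entry.source)
    return violations
```

Called as `find_violations("Stainless steel water bottle", "Wasserflasche aus Edelstahl", "de")`, the check reports `water bottle`, because the approved German rendering in this sample term base is `Trinkflasche`.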

Step 2: We stopped trusting any single model

Here is the shift that changed our numbers. The research above points to one conclusion: hallucinations are model-specific. The model that nails German tone may fabricate a measurement in Polish; the one that handles Romance languages cleanly may flatten a formal register elsewhere. If you pick one model, you inherit its specific blind spots, and you do not find out where they are until a listing is already live.

So we stopped picking. Instead of choosing a single output, we compared many and kept the one that survived agreement. In practice that meant moving the catalog through MachineTranslation.com, an AI translator which compares the outputs of 22 AI models and selects the translation that most of them agree on. The logic is simple: an error one model invents is unlikely to be repeated independently by the majority, so the majority output filters out the idiosyncratic mistakes that cause suppression and drift.
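The majority logic can be illustrated with a few lines of voting code. This is a simplified sketch under a strong assumption: it treats two outputs as agreeing only when they match after whitespace and case are normalized. How the platform itself measures agreement is not something we can describe, and a production system would more plausibly compare similarity than exact text. The function names are ours.

```python
from collections import Counter
from typing import Optional


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivially different outputs count as equal."""
    return " ".join(text.split()).casefold()


def majority_output(candidates: list[str]) -> tuple[Optional[str], float]:
    """Return (winning translation, share of candidates that agree with it).

    The winner is None when no output is shared by more than half the candidates.
    """
    if not candidates:
        return None, 0.0
    counts = Counter(normalize(c) for c in candidates)
    top_key, top_count = counts.most_common(1)[0]
    share = top_count / len(candidates)
    if top_count * 2 <= len(candidates):
        return None, share  # tie or plurality only: no real majority
    # Return the first original candidate that belongs to the winning group.
    winner = next(c for c in candidates if normalize(c) == top_key)
    return winner, share
```

The second return value is worth keeping: the agreement share is a natural input for deciding which segments deserve a closer look.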

The effect on the error profile was the part we did not fully expect. Individual top-tier models hallucinate or fabricate content between 10% and 18% of the time on translation tasks, according to data synthesized from Intento and the WMT24 findings alongside internal benchmarks. Routed through majority agreement across 22 models, that figure drops to under 2%, a reduction in critical error risk of roughly 80% to 90% depending on where in that range a model starts. For a catalog, the more telling number was consistency: the same internal benchmarks put terminology and register consistency above 96% across multi-document workflows, against an industry baseline near 78% for single-model output at the same volume. That gap is the difference between a store that reads as one coherent seller and one that reads as eight different people guessing.

Step 3: We sent only the disagreements to a human

Comparison does not solve everything. When the models split evenly, or when the segment sat in a regulated category, majority output was not a strong enough guarantee. So we routed only that subset to a human reviewer inside the same platform, rather than re-checking all 600 listings by hand.
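The routing rule in this step reduces to a small predicate. The sketch below is illustrative only: the `Segment` fields, the list of regulated categories, and the 0.5 agreement threshold are assumptions chosen to mirror the two conditions described above, not settings taken from any platform.

```python
from dataclasses import dataclass

# Assumed for illustration; which categories count as regulated is a business decision.
REGULATED_CATEGORIES = {"supplements", "cosmetics", "electronics"}


@dataclass
class Segment:
    listing_id: str
    category: str
    agreement: float  # share of models backing the top output, 0.0 to 1.0


def needs_human_review(segment: Segment, min_agreement: float = 0.5) -> bool:
    """Escalate when the category is regulated or the models form no clear majority."""
    if segment.category in REGULATED_CATEGORIES:
        return True
    return segment.agreement <= min_agreement


def split_queue(segments: list[Segment]) -> tuple[list[Segment], list[Segment]]:
    """Return (auto_publish, human_review), preserving the input order in each list."""
    auto, review = [], []
    for seg in segments:
        (review if needs_human_review(seg) else auto).append(seg)
    return auto, review
```

Note that a regulated segment is escalated even when agreement is high, which matches the point that majority output alone was not a strong enough guarantee there.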

This is where the workflow earns its efficiency. Human review on the full catalog would have erased any reason to use AI at all. Human review on the narrow slice where models disagreed gave us a 100% accuracy guarantee exactly where the stakes were highest, while leaving the clear-cut majority of the catalog to move at machine speed. The European data made the triage worth it: single models plateau around 84% to 87% accuracy for French, German, and Spanish and fall to roughly 76% for Polish, while the compared-and-verified approach held 93% to 95% across Western and Southern Europe and lifted Polish to 88%.

Step 4: We localized market by market, inside isolated profiles

The final discipline was operational, not linguistic. We localized one market at a time and kept each store's working environment fully separated, the same way a careful operator already isolates accounts to prevent linking. The benefit here is not only safety but also content integrity. Working one market at a time, against one locked term base, keeps a store reading as a single coherent seller instead of a patchwork.

The principle that protects you from account linking, keeping each identity clean and self-contained, is the same principle that keeps a localized catalog consistent. Treat each market's content as its own closed system with one source of truth, and you stop importing the small inconsistencies that quality systems are trained to notice.
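On the content side, "closed system with one source of truth" can be modeled as one read-only workspace per market, processed one after another. This is a sketch under assumptions: `MarketWorkspace`, `build_workspace`, and `localize_catalog` are hypothetical names, the `translate` callable stands in for whatever translation step you use, and nothing here touches browser profiles or account isolation.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Callable, Mapping


@dataclass(frozen=True)
class MarketWorkspace:
    market: str                   # placeholder market code, e.g. "de"
    term_base: Mapping[str, str]  # source term -> approved rendering, read-only


def build_workspace(market: str, terms: dict[str, str]) -> MarketWorkspace:
    """Copy the terms and wrap them read-only so no later step can edit them in place."""
    return MarketWorkspace(market, MappingProxyType(dict(terms)))


def localize_catalog(
    markets: dict[str, dict[str, str]],
    listings: list[str],
    translate: Callable[[str, MarketWorkspace], str],
) -> dict[str, list[str]]:
    """Process one market at a time; each run sees only its own workspace."""
    results: dict[str, list[str]] = {}
    for market, terms in markets.items():
        workspace = build_workspace(market, terms)
        results[market] = [translate(text, workspace) for text in listings]
    return results
```

Because each workspace holds its own frozen copy of the term base, a fix made for one market cannot silently leak into another, which is the content equivalent of keeping identities self-contained.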

What the workflow changed

Across all eight languages and 600 listings, nothing was pulled in review. Terminology held consistent above the 96% mark, which meant near-zero post-launch corrections for naming and specs. The rework that usually swallows a localization project, the back-and-forth of fixing inconsistent product names after they are already live, mostly disappeared, because the term base and the majority check caught those problems before publication rather than after.
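One way to catch naming drift before publication is a report of every source term that has been rendered more than one way in the same language. The sketch below assumes you can extract `(language, source_term, rendering_used)` records from your drafts. That record format and the `drift_report` name are ours for illustration, and the report says nothing about how the 96% consistency figure was measured.

```python
from collections import defaultdict


def drift_report(records: list[tuple[str, str, str]]) -> dict[tuple[str, str], set[str]]:
    """Map (language, source term) to its renderings wherever more than one was used.

    Each record is (language, source_term, rendering_used).
    """
    seen: dict[tuple[str, str], set[str]] = defaultdict(set)
    for lang, source_term, rendering in records:
        # Compare case-insensitively so capitalization alone is not counted as drift.
        seen[(lang, source_term)].add(rendering.casefold())
    return {key: forms for key, forms in seen.items() if len(forms) > 1}
```

An empty report means every term was rendered one way per language. Any entry in it is a naming decision that should be settled in the term base before the listing goes live.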

“The mistake we see operators make is treating localization as a translation problem when it is really a risk problem,” says Ofer Tirosh, CEO of Tomedes. “The question is not which model is best. It is how you keep one bad output from ever reaching a live listing. Comparing many models and escalating only the disagreements is how you do that at scale.”

If you are localizing at scale, start here

If you take four things from how we ran this, take these. Build the term base before you translate, so every later check has a standard. Treat any single model's output as a first draft, never a final answer, because its blind spots are invisible until they cost you a listing. Escalate only the disagreements to a human, so review stays affordable. And localize market by market inside separated environments, so consistency and account hygiene reinforce each other instead of competing.

Cross-border selling rewards operators who look professional in every market they enter. The accounts and the infrastructure get most of the attention, but the words are what the buyer and the marketplace actually read. Get the words right at scale, and a lot of the problems you were bracing for never arrive.
