Customers started getting help inside the app, in their own dialect.
10M+
App users impacted

The client is Saudi Arabia's largest grocery retailer, with more than 10 million app users. We built a conversational AI agent that lives inside their mobile app, on both Android and iOS, in eight weeks. The agent handles natural-language product search in Saudi Arabic dialect, converts a customer's recipe into a basket from a single message, answers product and support questions from the retailer's own knowledge base, and hands off cleanly to a human agent through Genesys Cloud CX when its confidence drops. The architecture was designed to extend to voice and the e-commerce site without rebuilding the AI brain.
What was actually broken
The problem wasn't that the retailer didn't have customer support. It was that the support they had wasn't working the way customers actually shop.
When someone opened the app and searched for "chicken," they got chicken. The keyword match worked. When they typed "I want to make a chicken biryani for six people," the app had no idea what to do. The intent was there. The catalogue was there. The connection between the two was missing.
At the same time, customer support flowed through a call centre and a generic web form, with no triage, no self-service, and no way for a customer to get a fast answer without waiting for a human. Millions of users, thousands of daily interactions, all of it handled the hard way.
The brief was to move the simpler interactions into the app itself, in Arabic, with the kind of contextual help that knows the catalogue and respects how Saudi customers actually talk. "Knows the catalogue" and "respects how Saudis actually talk" turned out to be the two hardest parts.
Why Saudi dialect is hard
Modern Standard Arabic is the language of newspapers, textbooks, and government documents. Almost nobody uses it to talk about groceries.
Saudi dialect is what people actually speak. It has its own vocabulary, its own grammar shortcuts, its own preferred words for everyday objects. A customer typing into a grocery app isn't going to ask for "أرز" formally. They might write "ruz" in Latin script, or use a colloquial word for a specific type of rice, or write in a mix of Arabic and English ("ruz basmati 5kg"). They might use Egyptian or Levantine words because they grew up watching Egyptian and Lebanese television. They might use Saudi-specific terms that don't appear in any Arabic textbook.
Off-the-shelf Arabic NLP doesn't handle this well. The base models are trained predominantly on MSA and a handful of major dialects. Saudi-specific phrasing falls into the gaps. We've seen models that handle MSA at 95 percent accuracy fall to 60 percent on Saudi dialect for the same intents.
The fix was two-part. We built a careful dataset of real Saudi grocery queries, sourced with the client from their anonymised support history. Not synthetic. Actual customer messages, labelled by intent. That gave us ground truth for what real customers actually type. Then we tuned the prompt and the retrieval layer for the kinds of dialectal variation that dataset surfaced. The model doesn't try to normalise the input into MSA before processing. It works with the dialect as written. Teaching it which dialectal terms map to which catalogue items was mostly a retrieval problem, not a generation problem.
Why "knows the catalogue" is harder than it sounds
A grocery catalogue has tens of thousands of SKUs. Many are nearly indistinguishable to a non-expert: three brands of basmati rice in five-kilogram bags, two on promotion this week, one out of stock for three days at this specific store.
The customer doesn't know any of this. They want rice. The agent has to figure out which rice, in stock at the right store, at a price the customer is going to accept.
We built the catalogue layer as a typed retrieval system. The agent doesn't generate product names; it queries a live index and returns specific SKUs with stock status, price, and any current promotion. The model handles the language. The catalogue handles the catalogue. Keeping that separation clean was important: if the model is allowed to make up product names, you ship a tool that confidently recommends products that don't exist. We've seen that failure mode in production AI elsewhere. We weren't going to repeat it.
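The separation between language and catalogue can be sketched in a few lines of Python. This is a minimal illustration, not the production system: the `Product` fields, the in-memory `CATALOGUE` list standing in for the live index, and the function names are all assumptions made for the example. The point it shows is that the reply is assembled only from retrieved rows, so a product name can never come from the model's imagination.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Product:
    sku: str
    name: str
    price_sar: float
    in_stock: bool
    promotion: Optional[str] = None


# Hypothetical in-memory stand-in for the live catalogue index.
CATALOGUE = [
    Product("RICE-001", "Basmati rice 5kg, brand A", 42.0, True, "10% off"),
    Product("RICE-002", "Basmati rice 5kg, brand B", 39.5, False),
    Product("RICE-003", "Basmati rice 5kg, brand C", 45.0, True),
]


def search_catalogue(term: str, in_stock_only: bool = True) -> list[Product]:
    """Return real catalogue rows matching the term, cheapest first."""
    term = term.lower()
    hits = [p for p in CATALOGUE if term in p.name.lower()]
    if in_stock_only:
        hits = [p for p in hits if p.in_stock]
    return sorted(hits, key=lambda p: p.price_sar)


def render_reply(term: str) -> str:
    """Build the reply only from retrieved rows, so every name is a real SKU."""
    hits = search_catalogue(term)
    if not hits:
        return f"No in-stock match for '{term}'."
    lines = []
    for p in hits:
        promo = f" ({p.promotion})" if p.promotion else ""
        lines.append(f"{p.sku}: {p.name}, SAR {p.price_sar:.2f}{promo}")
    return "\n".join(lines)
```

When nothing matches, the function says so instead of improvising, which is the behaviour the paragraph above argues for.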
The catalogue layer also does fuzzy matching across dialectal variation. A customer asking for "خبز عربي" (Arabic bread) gets the same result set as a customer asking for "khubz" or "pita" or "Saudi flatbread." The synonym map is curated, not magical: we built it with the retailer's category managers, who know which products customers actually substitute for each other in practice.
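A curated synonym map of this kind can be pictured as a plain lookup table. The sketch below is illustrative only: the entries, the canonical term names, and the two-word phrase limit are assumptions, and the real map was built with the category managers rather than typed in by hand like this. Note that `clean` only tidies Unicode, case, and whitespace; it does not rewrite dialect into MSA.

```python
import unicodedata

# Hypothetical curated map: surface form -> canonical catalogue term.
SYNONYMS = {
    "خبز عربي": "arabic_bread",
    "khubz": "arabic_bread",
    "pita": "arabic_bread",
    "saudi flatbread": "arabic_bread",
    "أرز": "rice",
    "رز": "rice",
    "ruz": "rice",
}


def clean(text: str) -> str:
    """Light cleanup only: Unicode NFKC, casefold, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(text.split())


# Clean the keys once so lookups compare like with like.
LOOKUP = {clean(surface): term for surface, term in SYNONYMS.items()}


def canonical_terms(query: str) -> list[str]:
    """Match whole words and two-word phrases against the curated map."""
    tokens = clean(query).split()
    found = []
    for size in (2, 1):  # try two-word phrases before single words
        for i in range(len(tokens) - size + 1):
            term = LOOKUP.get(" ".join(tokens[i:i + size]))
            if term and term not in found:
                found.append(term)
    return found
```

Matching on whole tokens rather than substrings is a deliberate choice in this sketch: a substring match would find "pita" inside "hospital".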
The recipe-to-basket feature
The most fun feature, and the most useful one, is recipe-to-basket.
A customer types "I'm making kabsa for six people tonight." The agent identifies kabsa as a Saudi rice dish, generates the ingredient list scaled to six servings, maps each ingredient to a specific SKU in the catalogue, and returns a basket the customer can review and check out.
This sounds straightforward. It isn't. Kabsa, like most regional dishes, has variations. The Riyadh version isn't quite the Hijazi version. The ingredients a Saudi grandmother would use aren't always the ingredients a younger cook would buy. The catalogue might not have the specific brand a customer was expecting. The customer might have allergies, dietary restrictions, or preferences the agent can't infer from "kabsa for six."
The design choice was to be explicit about all of it. The agent generates the basket and shows its reasoning: here's the dish I think you mean, here's the variation I'm assuming, here are the ingredients scaled to six, here are the SKUs I'm picking, here's why. The customer can edit any of it before checkout.
The alternative, having the agent silently make all these choices and produce a basket, got tested in early prototypes and didn't work. Customers either accepted baskets they didn't actually want, or they bounced off the experience because they didn't trust what was being added. Showing the reasoning made the difference.
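The explicit basket described above can be sketched as a data structure that carries its own reasoning. Everything specific here is hypothetical: the recipe quantities are illustrative rather than a real kabsa recipe, the SKUs are invented, and in the real system the SKU choice would come from the catalogue layer, not a hard-coded dictionary.

```python
import math
from dataclasses import dataclass, field


@dataclass
class BasketLine:
    ingredient: str
    quantity: float
    unit: str
    sku: str
    reason: str  # shown to the customer next to the line


@dataclass
class Basket:
    dish: str
    variation: str
    servings: int
    lines: list[BasketLine] = field(default_factory=list)


# Hypothetical recipe data: quantities for BASE_SERVINGS servings.
BASE_SERVINGS = 4
RECIPE = [
    ("basmati rice", 600.0, "g"),
    ("chicken", 1200.0, "g"),
    ("tomato", 3.0, "pcs"),
]

# Hypothetical ingredient -> (SKU, reason) choices.
SKU_CHOICES = {
    "basmati rice": ("RICE-001", "in stock and on promotion"),
    "chicken": ("CHKN-014", "fresh whole chicken, in stock"),
    "tomato": ("VEG-203", "loose tomatoes, sold per piece"),
}


def build_basket(servings: int, variation: str = "Riyadh-style") -> Basket:
    """Scale the recipe linearly and attach a SKU and a reason to each line."""
    factor = servings / BASE_SERVINGS
    basket = Basket(dish="kabsa", variation=variation, servings=servings)
    for ingredient, base_qty, unit in RECIPE:
        qty = base_qty * factor
        if unit == "pcs":
            qty = float(math.ceil(qty))  # whole items round up
        sku, reason = SKU_CHOICES[ingredient]
        basket.lines.append(BasketLine(ingredient, qty, unit, sku, reason))
    return basket


def explain(basket: Basket) -> str:
    """Render the assumptions and picks so the customer can review them."""
    header = f"{basket.dish} ({basket.variation}), scaled to {basket.servings} servings"
    rows = [
        f"- {line.ingredient}: {line.quantity:g} {line.unit} -> {line.sku} ({line.reason})"
        for line in basket.lines
    ]
    return "\n".join([header, *rows])
```

Calling `explain(build_basket(6))` yields a header naming the dish and assumed variation, followed by one editable line per ingredient with its reason.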
Handing off to a human, properly
The agent doesn't know everything. The interesting design question is what happens when it doesn't.
The wrong answer, and the common one in production AI products, is to make the agent guess and hope. The customer asks something the agent can't handle. The agent produces a plausible-sounding response that's wrong, the customer acts on it, the support team finds out three days later when the customer is angry.
We built a confidence model into the agent. Every response carries a confidence score, calibrated against the type of question and the strength of the retrieval. When the score drops below a threshold, the agent stops trying. It tells the customer "Let me connect you to someone who can help," logs the conversation, and routes through Genesys Cloud CX to a human agent with the full context.
The human agent picks up where the AI agent left off. They see the customer's question, the agent's reasoning, the SKUs it considered, and the reason it bailed. They don't have to re-elicit the problem. The handoff feels continuous to the customer.
The confidence threshold is tuned continuously based on the support team's feedback. Too high and the agent passes too many things to humans, defeating the point. Too low and the agent answers things it shouldn't.
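The threshold logic itself is small. The sketch below assumes a confidence score already calibrated to the range 0 to 1; the `AgentTurn` fields, the 0.7 threshold, and the shape of the handoff payload are hypothetical. The actual routing call into Genesys Cloud CX is deliberately left out, since its details are not covered here.

```python
from dataclasses import dataclass


@dataclass
class AgentTurn:
    question: str
    draft_answer: str
    confidence: float          # assumed calibrated, in [0, 1]
    skus_considered: list[str]
    low_confidence_reason: str = ""


HANDOFF_THRESHOLD = 0.7  # hypothetical value; tuned from support feedback


def route(turn: AgentTurn, threshold: float = HANDOFF_THRESHOLD) -> dict:
    """Answer when confident; otherwise hand off with the full context."""
    if turn.confidence >= threshold:
        return {"action": "answer", "message": turn.draft_answer}
    return {
        "action": "handoff",
        "message": "Let me connect you to someone who can help.",
        # Context the human agent sees, so nothing has to be re-elicited.
        "context": {
            "question": turn.question,
            "draft_answer": turn.draft_answer,
            "skus_considered": turn.skus_considered,
            "reason": turn.low_confidence_reason,
            "confidence": turn.confidence,
        },
    }
```

Raising `threshold` sends more conversations to humans; lowering it lets the agent answer more on its own, which is the trade-off described above.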
What shipped in eight weeks
Eight weeks is short for a project of this shape. A few choices made it possible.
The backend is Python. The AI runs on Google Gemini 2.5 Flash with multilingual reranking, vector retrieval targets under 200 milliseconds, and custom Kotlin and Swift SDKs wrap a single Unified Chat API for both platforms. The same AI brain can be connected to voice channels and the e-commerce site without rebuilding; only the channel layer changes.
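The idea that only the channel layer changes can be illustrated with a thin adapter pattern. This is a rough sketch under stated assumptions, not the Unified Chat API itself: `brain_reply` is a placeholder for the shared agent logic, and the payload shapes for the mobile and voice channels are invented for the example.

```python
from typing import Protocol


def brain_reply(text: str) -> str:
    """Placeholder for the shared agent logic (hypothetical)."""
    return f"You asked: {text}"


class Channel(Protocol):
    def receive(self, raw: dict) -> str: ...
    def send(self, reply: str) -> dict: ...


class MobileChatChannel:
    """Adapter for in-app chat payloads (assumed shape)."""

    def receive(self, raw: dict) -> str:
        return raw["message"]

    def send(self, reply: str) -> dict:
        return {"type": "text", "body": reply}


class VoiceChannel:
    """Adapter for a voice channel carrying transcripts (assumed shape)."""

    def receive(self, raw: dict) -> str:
        return raw["transcript"]

    def send(self, reply: str) -> dict:
        return {"type": "speech", "text": reply}


def handle(channel: Channel, raw: dict) -> dict:
    """Same brain for every channel; only the adapter differs."""
    return channel.send(brain_reply(channel.receive(raw)))
```

Adding a new surface then means writing one more adapter class, with no change to `brain_reply`.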
We worked closely with the retailer's mobile and operations teams from day one, not "after the model was ready." The agent is integrated into the existing app, not bolted on. Data flow with the customer support system was designed in parallel with the model work. Integration testing started in week three, not week six.
We used real customer conversations as test data from the start, with anonymisation. The model never saw a synthetic prompt in development. Every prompt it was tested against came from an actual customer interaction.
Where this is now
The agent is live on Android and iOS. The iteration phase is ongoing with the retailer's mobile and operations teams. We're tracking adoption, containment rate (the percentage of conversations the agent fully handles without human help), and basket conversion (the percentage of conversations that produce a checkout). We will publish the numbers once the post-integration review is complete.
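Both metrics are simple ratios over conversation logs. A minimal sketch, assuming each log record carries two hypothetical boolean fields, `handed_off` and `checked_out`; the sample records below are made up purely to show the arithmetic and are not results from the project.

```python
from typing import Callable


def rate(conversations: list[dict], predicate: Callable[[dict], bool]) -> float:
    """Share of conversations satisfying the predicate; 0.0 for an empty log."""
    if not conversations:
        return 0.0
    return sum(1 for c in conversations if predicate(c)) / len(conversations)


def containment_rate(conversations: list[dict]) -> float:
    """Conversations fully handled by the agent, without a human handoff."""
    return rate(conversations, lambda c: not c["handed_off"])


def basket_conversion(conversations: list[dict]) -> float:
    """Conversations that ended in a checkout."""
    return rate(conversations, lambda c: c["checked_out"])


# Made-up sample records, for illustration only.
SAMPLE_LOGS = [
    {"handed_off": False, "checked_out": True},
    {"handed_off": False, "checked_out": False},
    {"handed_off": True, "checked_out": False},
    {"handed_off": False, "checked_out": False},
]
```

On the four sample records, three are contained and one produces a checkout, so the two rates are 3/4 and 1/4 respectively.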
The thing we can say now is that the team is using the agent's conversation logs to find product gaps and merchandising opportunities they hadn't seen before. The agent isn't just answering questions. It's producing a stream of structured data about what customers are looking for, in their own words, at scale. That second-order benefit is the one most retail AI projects miss.
For anyone considering a similar project: the hard parts aren't the AI. They're the language, the catalogue, the handoff, and the integration. Get those right and the AI part takes care of itself.


