"The Great Disappointment": Why Smart Agents Stumble over Dirty Data, and How the Business Glossary Is Becoming a Major Asset of the AI Era
Article date: 03.02.2026
Author: Veronika Salagina
Reading time: 5 minutes
The end of the "Magic Pill" era
We are living through a singular moment. Two years after the generative AI boom, when every self-respecting C-level manager demanded that a "smart agent" be embedded in CRM or ERP, the market is entering the stage of a severe hangover. Expectations were inflated to the limit: we were promised digital assistants that would take over procurement planning, automatic contract approval and impeccable customer service.
The reality turned out to be more prosaic and tougher. Large organisations that have already spent millions of dollars on pilot projects and licenses are increasingly finding the same thing: "smart agents don't work." They produce hallucinations, confuse counterparties, make mutually exclusive predictions, and—worst of all for business—generate convincing lies with perfect grammar but disastrous content.
It is useless to blame the algorithms. The problem is not in the software but in the "fuel". AI is a mirror of data: if chaos is fed into the mirror, the business sees not the future but a magnified mess. In this article we will go beyond the banal "Garbage In, Garbage Out" and analyse three key points:
- 1. Why the market is overheated, and why even $14 billion of investment does not protect against dirty data.
- 2. Why the ideal precondition for AI is not a "sterile database" but a well-built business glossary and a Master Data Management (MDM) system.
- 3. How AI itself helps restore order (and a case study of solving the duplicates problem).
The Ghost of Scale AI: Why Billions Can't Buy Quality
To understand the depth of the crisis, consider the case that became the "style icon" of the AI race in 2025. Meta* invested $14.3 billion in the startup Scale AI, a company that was supposed to supply the technology giant with high-quality data for training super-intelligence.
It would seem the formula for success was all there: an unlimited budget, the best minds, access to enormous computing power. Yet two months after the Scale AI team was integrated into Meta* Superintelligence Labs, personnel changes and conflicts began. Meta* researchers publicly expressed dissatisfaction: the data from the key partner turned out to be of poor quality, and they preferred to work with competitors, Mercor and Surge.
This case is a textbook set of symptoms for the entire industry:
- Overheating of the market: Investors pressure vendors, demanding quick results. Scale AI, having received a giant contract, could not keep up the quality of its data labeling.
- Bureaucracy and haste: In the race for leadership (especially against the backdrop of lagging behind OpenAI and Google), basic data auditing was sacrificed for speed.
- The illusion of "cheap money": Management believed that the quality problem could be solved by simply increasing the budget.
Business conclusion: If an organisation like Meta*, with its engineering culture and access to talent, stumbles over the quality of labeling and data, what hope is there for an ordinary retail or manufacturing enterprise? Investing in algorithms without investing in data hygiene is like buying a Ferrari for off-road racing: you will simply get bogged down, and recovering the car will cost more than buying a new one.
A Smart Agent in a Crazy World: Where does Digital Noise come from
Why is the data of modern corporations so toxic for AI? The answer lies in the evolution of corporate systems. Twenty years ago a company had a single 1C or SAP database. Today it has a hybrid infrastructure: a legacy ERP, Excel graveyards (thousands of files passed between departments), a SaaS CRM, IoT device logs and data from aggregators.
An AI agent, unlike a human, cannot "guess". A person who sees "LLC Romashka", "Ромашка LLC" and "Romashka LLC" will understand that this is one counterparty. For AI, these are three different entities.
The main categories of "noise" that kills AI:
- 1. Duplication and synonymy (The problem of Aliases): The product "Bolt M5x20 steel" and "Fasteners met. 5*20" for the inventory analysis system are two different items. The AI cannot give an accurate answer to the question "How many M5 fasteners do we have in stock?" because it does not know that these strings are semantically equal.
- 2. Outdated hierarchies: Missing parent-child relationships in the product reference books. AI will not build a recommendation system if it does not understand that the "iPhone 15" belongs to the "Smartphones" category and the "iPhone 15 case" is an accessory for it, not a separate type of device.
- 3. Invalid connections: Sales transactions linked to a non-existent customer code or a closed contract.
- 4. Shadow AI: 74% of ChatGPT accounts in companies are created by employees without the IT department's knowledge. Employees "feed" corporate data into public models; the data leaks out, generalised answers come back with no reference to the corporate taxonomy, and those answers are uploaded back into the knowledge base. The result is informational smog.
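To make the duplication problem above concrete, here is a minimal sketch of why naive string comparison fails and how even simple normalisation plus fuzzy matching recovers the link. The `normalize` helper and its list of legal-form tokens are illustrative assumptions; real MDM platforms match on hundreds of attributes, not names alone.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lower-case, drop legal-form tokens and sort words so that
    'LLC Romashka' and 'Romashka LLC' collapse to the same key."""
    tokens = name.lower().replace('"', " ").replace(",", " ").split()
    legal_forms = {"llc", "ltd", "ooo", "inc"}  # assumed, extend as needed
    return " ".join(sorted(t for t in tokens if t not in legal_forms))

def similarity(a: str, b: str) -> float:
    """Fuzzy similarity of two normalised names, in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Exact string equality sees three entities; normalisation sees one.
print(similarity("LLC Romashka", "Romashka LLC"))  # 1.0
print(similarity("LLC Romashka", "Vasilek Ltd"))   # well below any match threshold
```

A production system would combine this with phonetic codes, tax IDs and embedding-based matching, but the principle is the same: semantic equality must be computed, because it is not present in the raw strings.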
The traditional approach of IT departments has always been: "Let's buy another server, another database management system, and write another ETL script." But this does not solve the problem of semantics. AI does not need a "perfectly clean" data warehouse. AI needs an "understandable" data warehouse.
Paradigm Shift: From Ideal Conditions to a Business Glossary
Here we come to the key thesis. A business customer often says: "My data is dirty, so let's do a general cleaning first, build perfect reference books, and only then implement AI." This is a mistake. Demanding "ideal conditions" is a road to nowhere: the data will be outdated again the day after the cleanup.
Products (AI agents) do not require ideal conditions, but a single communication language. In IT architecture, this language is called Business Glossary.
What is a Business Glossary in the context of AI?
It is not just a file that spells out abbreviations. It is an active semantic layer that connects physical fields in databases (technical column names) with business-friendly metrics.
Case study: The difference between "Data" and "Term"
- Physical layer: sales_2026 table, cust_id column (technical key), amt column, date column.
- Semantic layer (Term): "Revenue from new customers in February."
- Term attribute: SELECT SUM(amt) FROM sales_2026 WHERE cust_id IN (SELECT id FROM customers WHERE reg_date BETWEEN '2026-02-01' AND '2026-02-28').
- Business definition: Money received from customers who registered in February.
The AI agent does not have to guess how to calculate this metric. It consults the glossary, gets a ready-made SQL snippet, and knows that this snippet has been approved by the CFO. A well-structured glossary, rather than "perfectly clean" raw data, is the prerequisite for success: the glossary allows the data to remain "noisy" at the physical level while staying coherent at the logical one.
Expert assessment:
The introduction of a generative business glossary is changing roles. Data engineers are no longer mere data janitors for data scientists; they become knowledge architects who define the rules for mapping raw data into business terms.
Order out of Chaos: How AI Treats AI
We found out that a Business glossary is necessary. But creating it manually in a large corporation (tens of thousands of entities) is a process that can take years. And here we see an amazing effect: AI is both the cause of the problem and its solution.
Modern data management platforms (both MDM and specialised semantic layers) use generative AI to autonomously restore order. The process looks like this:
- 1. Metadata stack scanning: The platform's AI agent "opens" the data warehouses. It does not read the records themselves but the structure: table names, schemas, views, developer comments and, most importantly, query logs.
- 2. Analysis of user behaviour: AI looks at how analysts and business users access the data. What joins do they write? Which fields do they combine most often? If hundreds of analysts join the contracts table with the clients table on the client_tax_id field in their SQL queries, there is an unspoken but critically important business relationship between these entities.
- 3. Term hypothesis generation: Based on these patterns, the system proposes: "I see that the client_tax_id field is used often. Perhaps we should create a business term for the Counterparty's INN. I found 15 spelling variants of this field across databases and suggest combining them into a single attribute."
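Step 2 above can be sketched as counting which column pairs co-occur in `JOIN ... ON` clauses of a query log. The regex and the sample queries are illustrative assumptions; production tools parse SQL with a real parser, but the signal extracted is the same.

```python
import re
from collections import Counter

# A few raw SQL strings stand in for the warehouse query log.
query_log = [
    "SELECT * FROM contracts c JOIN clients k ON c.client_tax_id = k.client_tax_id",
    "SELECT k.name FROM clients k JOIN contracts c ON k.client_tax_id = c.client_tax_id",
    "SELECT * FROM sales s JOIN clients k ON s.cust_id = k.id",
]

join_keys = Counter()
for q in query_log:
    # Capture the two column names on each side of an ON equality.
    for left, right in re.findall(r"ON\s+\w+\.(\w+)\s*=\s*\w+\.(\w+)", q, re.I):
        join_keys[tuple(sorted((left, right)))] += 1

# Frequent join keys are candidates for glossary terms.
for key, n in join_keys.most_common():
    print(key, n)
```

Here `client_tax_id` surfaces as the most-used join key, which is exactly the evidence the platform would cite when proposing a "Counterparty's INN" term.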
Thus, AI doesn't just require cleanliness — it provides it by learning more about how people actually use data, despite its dirtiness.
Case study: The problem of copies and the "Golden Record"
Let's look at a specific, most expensive corporate data problem — the problem of duplicates. It costs companies millions of dollars due to erroneous shipments, double bonus payments, and the inability to compile consolidated financial statements.
How it used to be (the era of manual labour): A team of contractors in India or a regional service centre was hired. They opened two databases side by side, visually compared names such as "MetallInvest LLC" and "Metall-Invest LLC", and manually merged the entries. Slow, expensive and subjective.
How it works now (AI + MDM): A modern approach to Master Data Management (MDM) combined with AI completely reverses the process. This is the approach of creating a "Golden Record".
Stage 1: Clustering.
Machine learning algorithms (clustering, K-means, etc.) scan millions of records of counterparties or nomenclature. They are not looking for exact matches. They're looking for similarities. The degree of similarity is calculated by hundreds of attributes: address, phone number, mail domain name, INN, checking account.
Stage 2: Matching.
The system builds graphs of connections. It sees that "Petrov Ivan Ivanovich" from the CRM database and "I.I. Petrov" from the accounting database use the same phone number in their contact information. The probability of a match: 98%.
Stage 3: Survivorship.
This is the most important step. AI doesn't just glue records together. It applies quality rules to select the best attribute.
Example: In system A the counterparty has an old legal address (from 2005); in system B, a new one (from 2023). Following the trust policy, the AI takes the address from system B as the more recent. Or it takes the INN from the tax reporting system and the contact email from the CRM, because it is more up to date there.
Result: We do not get an average record, but a reference object assembled according to the "best of the best" principle. The AI agent, responding to a request about a client, receives a link not to three duplicates, but to one "Golden Record".
This approach turns traditional ETL processes on their head. Previously we adjusted the data to fit the schema; now we adjust the schema (the glossary) to fit reality, and AI handles the "golden" merging of entities.
Hybrid approach: When privacy becomes a new dimension of quality
When we talk about "dirty data", we usually mean errors. But there is another type of contamination — personal data and sensitive information.
Employees actively upload trade secrets to public neural networks: from strategic presentations to source code. The volume of such uploads has increased 30-fold over the past year. This creates a paradox: a company can clean up its warehouses and build an ideal glossary, but if an AI agent quotes an NDA-protected code fragment or a credit card number that accidentally got into a transaction log, the company gets not just a hallucination but a lawsuit.
The technological answer is Differential Privacy. Google Research has proposed a solution: AI models can be taught to "forget" specific data while maintaining common patterns. Mathematical "noise" is introduced into the learning process, which does not allow the neural network to memorise exact lines.
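The core mechanism can be sketched with the Laplace mechanism for releasing a noisy count, the textbook building block of differential privacy. This is a toy illustration under stated assumptions (sensitivity 1, a single count query); training-time methods such as DP-SGD apply the same idea to gradients rather than to query results.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(true_count: int, epsilon: float = 1.0,
             sensitivity: float = 1.0) -> float:
    """Release a count with epsilon-differential privacy: the noise is
    large enough to hide whether any single record is in the data."""
    return true_count + laplace_noise(sensitivity / epsilon)

# Smaller epsilon -> more noise -> stronger privacy. The released value
# tracks the truth but never exactly pins down one record.
print(dp_count(1000, epsilon=0.5))
```

The trade-off is explicit: the privacy budget epsilon is a dial between analytical accuracy and the guarantee that no exact line can be memorised or reconstructed.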
For businesses, this means a new requirement for "cleanliness." It is not enough that the data is accurate. They need to be depersonalised at the architectural level. Modern semantic layer systems should be able to replace sensitive attributes on the fly when building datasets for LLM training. This transforms AI platforms from analytics tools into compliance control tools.
Architecture of the future: MDM as a launching pad
Summing up the technological review, we can deduce the formula for preparing for AI. Companies that successfully implement smart agents go through three mandatory stages.
Stage 1: MDM and NSI — "Skeleton".
This is the foundation. If the same product can be recorded in three different systems under different codes, implementing top-level AI is pointless. MDM-class systems (in Russian practice, 1C:MDM NSI Management) take on the function of the "Golden Record". Businesses often perceive MDM as a "boring" infrastructure burden, but it is an investment with a double payoff:
- Without AI: MDM removes errors in procurement and reporting (the ROI is obvious).
- With AI: MDM becomes the single source of truth for algorithms.
Stage 2: Semantic layer / Glossary — "Muscles".
MDM knows that "Object A" and "Object B" are the same thing. The glossary explains to the AI what to do with this object: it translates business metrics ("Order Profitability") into machine code. Without this layer, AI is a child with an excellent memory but no understanding of context.
Stage 3: Orchestration and Security (AI Governance) — "Skin".
This is the security perimeter that keeps data from leaking and agents from hallucinating beyond the boundary of trust. It includes API interaction control and protection against prompt injection.
A look into 2026: The market of maturity
At the beginning of 2026 we see a clear market stratification. "Blind" investments in AI are over; the era of Data-Centric AI is beginning.
We are witnessing the abandonment of illusions. The days when a CEO demanded to "just attach ChatGPT" to the corporate portal and waited for magic are receding into the past. It has become clear that a large language model (LLM) is, in essence, a very expensive compiler, and a compiler needs high-quality source code (data). If the source is garbage, the compiler will produce garbage, only very quickly.
Expert forecasts:
- 1. Budgets are migrating. By 2027, budgets for Data Quality and MDM will overtake budgets for purchasing LLMs. Companies will understand that fine-tuning a model on their own data is effective only when that data has been combed and unified.
- 2. The growth of semantic markup platforms. Solutions such as illumex, Alation, Collibra and their Russian counterparts will become a mandatory layer of corporate architecture, just like a firewall.
- 3. The end of "heroic" data wrangling. Data scientists will stop spending 80% of their time cleaning data (this sad statistic resurfaces year after year). Automated metadata management pipelines will take over this work.
Order as a strategy
The disappointment of large organisations in "smart agents" is a useful disappointment. It sobers the market and brings us back to the basics. We stopped believing in magic and started believing in engineering.
Dirty data is not just a technical flaw. This is a reflection of the immaturity of business processes. It is impossible to build a digital twin of a company if the company itself does not know how it works: there is no single definition of "active customer", there is no single directory of branches, there is no product naming policy.
The only way to achieve AGI (artificial general intelligence) on an enterprise scale is not through increasing computing power but through restoring semantic order. Generative business glossaries and intelligent master data management are not just IT projects; they are projects in engineering corporate consciousness.
While some are chasing AGI "hallucinations", smart companies are treating the "hallucinations" of their data. And they will become the masters of the market in 2030.
Meta* — recognised as an extremist organisation in the Russian Federation