Meesho Talk to Data: Achieving 100% Effective Accuracy in BigQuery Conversational Analytics

May 29, 2026

Project Overview

Meesho partnered with Wohlig Transformations to build a BigQuery Conversational Analytics agent for its Fulfilment & Experience (FnE) team, achieving 100% effective accuracy on 40 held-out evaluation prompts with every result manually verified row-by-row.

The Fulfilment & Experience (FnE) team at Meesho manages end-to-end operations for India’s largest social-commerce platform using a 10-table mercury dataset in BigQuery, comprising more than 224 columns and 64 defined business metrics.

Wohlig developed a BigQuery Conversational Analytics agent, published as FnE_v2_14Apr in BigQuery Studio’s Agent Catalog, enabling business users to ask natural-language questions and receive SQL queries, BigQuery jobs, result tables, auto-generated visualizations, plain-English insights, and knowledge-source references.

The engagement was executed in two phases. Round 1 achieved 89% effective accuracy across 57 evaluation queries, while Round 2 achieved 100% effective accuracy across 40 held-out evaluation prompts after a dataset quality improvement initiative and complete re-engineering of the knowledge base.

The Challenge

SQL Dependency Across Business Operations

The mercury dataset contained 10 tables, 64 metrics, and 68 business rules. Operational questions around refund clearance, LDR breaches, RTO percentages, NPS, and dispatch performance required analysts to manually write SQL.

Dataset Quality Constraints

The initial implementation reached 89% effective accuracy. Investigation showed that the remaining performance gap stemmed primarily from inconsistencies within the underlying dataset rather than limitations of the conversational analytics agent.

Evaluation Trustworthiness

Enterprise-grade analytics systems cannot rely on LLM-generated evaluation. Every output needed verification at the data level through direct comparison against trusted reference outputs.

Spark-to-BigQuery Conversion Challenges

During migration and reconciliation, 11 Spark normalization issues were discovered and corrected before outputs could be trusted as ground truth.

Overfitting Risk

A critical requirement was ensuring that evaluation prompts remained completely separate from training and verification datasets to avoid artificially inflated results.

Key Objectives

Enable natural-language access to operational data without requiring SQL expertise.
Validate outputs through row-by-row and cell-by-cell verification.
Maintain complete separation between verified training queries and evaluation prompts.
Ground responses in schema definitions, glossary terms, joins, and business rules.
Create a repeatable methodology capable of scaling across future datasets.

The Solution: Two-Round Conversational Analytics Program

Round 1: Initial Dataset Evaluation

The first version of the agent was evaluated against 57 reference queries and achieved 89% effective accuracy, with the following verdict breakdown:

21 MATCH
30 ACCEPTABLE
6 genuine failures

Analysis revealed that remaining inaccuracies were driven by dataset quality issues rather than agent logic.

Round 2: FnE Cleaned Dataset

After Meesho engineered a cleaned version of the dataset, Wohlig rebuilt the schema understanding, instructions, glossary, and verified query set.

The updated agent achieved:

31 MATCH (77.5%)
5 NEAR_MATCH (12.5%)
4 ACCEPTABLE (10%)
0 NOT_MATCH

This resulted in 100% effective accuracy across all 40 held-out evaluation prompts.
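The headline figures for both rounds can be reproduced from the verdict counts. The sketch below is illustrative only: it assumes that "effective accuracy" means the share of verdicts other than NOT_MATCH, and it records the six Round 1 failures under NOT_MATCH. Neither assumption is spelled out in the engagement reports, and the function names are hypothetical.

```python
from collections import Counter

# Verdict counts as reported for each round.
# Assumption: the 6 "genuine failures" in Round 1 are recorded as NOT_MATCH.
ROUND_1 = Counter({"MATCH": 21, "ACCEPTABLE": 30, "NOT_MATCH": 6})
ROUND_2 = Counter({"MATCH": 31, "NEAR_MATCH": 5, "ACCEPTABLE": 4, "NOT_MATCH": 0})

# Assumption: every verdict except NOT_MATCH counts as effectively correct.
PASSING = {"MATCH", "NEAR_MATCH", "ACCEPTABLE"}


def effective_accuracy(verdicts: Counter) -> float:
    """Return the percentage of prompts whose verdict is in PASSING."""
    total = sum(verdicts.values())
    if total == 0:
        raise ValueError("no verdicts to score")
    passed = sum(count for verdict, count in verdicts.items() if verdict in PASSING)
    return 100.0 * passed / total


def verdict_shares(verdicts: Counter) -> dict:
    """Return each verdict's share of the total, in percent."""
    total = sum(verdicts.values())
    if total == 0:
        raise ValueError("no verdicts to score")
    return {verdict: 100.0 * count / total for verdict, count in verdicts.items()}


if __name__ == "__main__":
    print(f"Round 1: {effective_accuracy(ROUND_1):.1f}%")
    print(f"Round 2: {effective_accuracy(ROUND_2):.1f}%")
    print(verdict_shares(ROUND_2))
```

Under these assumptions, Round 1 works out to 51 of 57 prompts and Round 2 to 40 of 40, which is consistent with the reported 89% and 100%.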

Schema and Knowledge Foundation

The solution incorporated:

10 table descriptions
8 join relationships
88 glossary terms
225 documented columns
20+ critical field recommendations

The glossary included detailed formula definitions and denominator disambiguation to eliminate ambiguity in business metrics.

Structured Instruction Framework

The agent was governed by:

83 MUST/NEVER rules
21 instruction categories
14 BAD/GOOD examples
Mandatory filters
Bucket-metric and date-anchor mappings

This transformed loosely defined business rules into deterministic operational behavior.
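One way to picture how MUST/NEVER rules become checkable behavior is a small lint pass over generated SQL. This is a hypothetical illustration, not the mechanism used in the engagement, where the rules were written as agent instructions. The rule contents, the order_date column, and the function names below are all invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    kind: str      # "MUST" (pattern has to appear) or "NEVER" (pattern must not appear)
    pattern: str   # regular expression applied to the SQL text
    message: str   # human-readable description reported on violation


# Hypothetical rules for illustration; the real rule set is not published here.
RULES = [
    Rule("MUST", r"\bWHERE\b.*\border_date\b", "filter on order_date"),
    Rule("NEVER", r"SELECT\s+\*", "no SELECT *"),
]


def violations(sql: str, rules=RULES) -> list:
    """Return the messages of every rule the SQL text breaks, in rule order."""
    found = []
    for rule in rules:
        if rule.kind not in ("MUST", "NEVER"):
            raise ValueError(f"unknown rule kind: {rule.kind}")
        hit = re.search(rule.pattern, sql, flags=re.IGNORECASE | re.DOTALL) is not None
        if (rule.kind == "MUST" and not hit) or (rule.kind == "NEVER" and hit):
            found.append(rule.message)
    return found


if __name__ == "__main__":
    print(violations("SELECT * FROM orders"))
    print(violations("SELECT order_id FROM orders WHERE order_date >= '2026-01-01'"))
```

A regex check like this is deliberately shallow. It catches missing mandatory filters and banned constructs, but it cannot judge whether a formula or denominator is correct.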

Verified Query Set

A set of 73 verified queries was created:

57 queries from the MIA Metric List
16 supplementary patterns

The verified query set maintained zero overlap with evaluation prompts.
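A zero-overlap claim of this kind can be checked mechanically. The sketch below assumes that prompts are compared as text after light normalization (lowercasing, punctuation removal, whitespace collapsing). The sample prompts and function names are hypothetical, and the exact procedure used in the engagement is not described in this case study.

```python
import re


def normalize_prompt(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def find_overlap(verified, evaluation) -> list:
    """Return evaluation prompts whose normalized text also appears in the verified set."""
    verified_keys = {normalize_prompt(question) for question in verified}
    return sorted(p for p in evaluation if normalize_prompt(p) in verified_keys)


if __name__ == "__main__":
    # Hypothetical prompts, for illustration only.
    verified = ["What is the RTO percentage last week?"]
    evaluation = ["what is the rto percentage last week", "Show NPS by month"]
    leaked = find_overlap(verified, evaluation)
    print(leaked or "no overlap")
```

Exact matching after normalization only catches near-verbatim leaks. Paraphrased prompts that target the same metric would still need a human review.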

Three-Way Validation Framework

Outputs were validated across:

Spark on Dataproc ↔ BigQuery Conversion ↔ Agent SQL

This process achieved:

97/97 functional parity
40/40 evaluation queries validated
57/57 golden queries validated
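A three-way parity check can be sketched as a pairwise comparison of normalized result sets. The normalization rules below (empty strings treated as NULL, surrounding whitespace stripped, numerics rounded to six decimal places) are assumptions chosen for illustration. The case study does not list the 11 normalization issues that were actually corrected, and the function names are hypothetical.

```python
from collections import Counter
from decimal import Decimal


def normalize_cell(value):
    """Map equivalent representations of a cell to one canonical value (assumed rules)."""
    if value is None:
        return None
    if isinstance(value, str):
        stripped = value.strip()
        return stripped if stripped else None   # treat "" like NULL
    if isinstance(value, (int, float, Decimal)):
        return round(float(value), 6)           # absorb tiny float differences
    return value


def normalize_rows(rows) -> Counter:
    """Return an order-insensitive multiset of normalized rows."""
    return Counter(tuple(normalize_cell(cell) for cell in row) for row in rows)


def three_way_parity(spark_rows, bigquery_rows, agent_rows) -> dict:
    """Compare the three result sets pairwise after normalization."""
    spark, bigquery, agent = map(normalize_rows, (spark_rows, bigquery_rows, agent_rows))
    return {
        "spark_vs_bigquery": spark == bigquery,
        "bigquery_vs_agent": bigquery == agent,
        "spark_vs_agent": spark == agent,
    }


if __name__ == "__main__":
    spark = [("A", 1, None), ("B", 2.5, "x ")]
    bigquery = [("B", Decimal("2.5"), "x"), ("A", 1.0, "")]
    agent = [("A", 1, None), ("B", 2.5, "x")]
    print(three_way_parity(spark, bigquery, agent))
```

Using a multiset rather than a plain set keeps duplicate rows significant, so a query that accidentally fans out a join does not pass as equal.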

Technology Stack

BigQuery Conversational Analytics API
BigQuery Studio (Agents Preview)
Dataplex
Dataproc
BigQuery Mercury Dataset
Gemini

Key Benefits & Results

Reliable Evaluation

Previous: LLM-as-judge evaluation.

Solution: Cell-by-cell manual verification.

Result: Every evaluation verdict was based on actual output data.

Accuracy Improvement

Previous: 89% effective accuracy.

Solution: Cleaned dataset and rebuilt knowledge framework.

Result: 100% effective accuracy on 40 held-out evaluation prompts.

Reference Consistency

Previous: Spark and BigQuery output drift.

Solution: Three-way validation methodology.

Result: 97/97 functional parity and 11 normalization issues identified.

Reduced Hallucinations

Previous: Risk of invented fields and incomplete logic.

Solution: 88-term glossary, 83 instruction rules, and worked examples.

Result: Zero genuine logic failures.

Evaluation Integrity

Previous: Risk of training/evaluation overlap.

Solution: Strict separation between verification and evaluation datasets.

Result: Production-safe and auditable evaluation process.

Improved User Experience

Previous: Manual SQL creation for every business question.

Solution: Conversational Analytics Agent in BigQuery Studio.

Result: SQL generation, execution, visualization, insights, and follow-up recommendations delivered through a single interface.

Technical Innovation

Three-Way Validation Methodology

Every output was validated through direct reconciliation across Spark, BigQuery, and generated SQL before the agent was evaluated.

Manual Row-Level Verification

No LLM-as-judge approach was used. Outputs were evaluated through direct row, column, and cell comparisons. Every NEAR_MATCH and ACCEPTABLE verdict was documented with written justification.
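A reviewer doing this kind of comparison benefits from a listing of exactly which cells differ. The sketch below assumes both result tables share the same columns and have already been sorted into the same row order. The column names and helper names are hypothetical, and the final verdict is deliberately left to the human reviewer rather than computed.

```python
from typing import Any, NamedTuple


class CellDiff(NamedTuple):
    row: int          # zero-based row index
    column: str       # column name
    reference: Any    # value in the trusted reference output
    agent: Any        # value in the agent's output


def diff_cells(columns, reference_rows, agent_rows) -> list:
    """List every cell where the agent output differs from the reference output."""
    if len(reference_rows) != len(agent_rows):
        raise ValueError(
            f"row count differs: {len(reference_rows)} reference vs {len(agent_rows)} agent"
        )
    diffs = []
    for index, (ref_row, agent_row) in enumerate(zip(reference_rows, agent_rows)):
        if len(ref_row) != len(columns) or len(agent_row) != len(columns):
            raise ValueError(f"row {index} does not match the column list")
        for name, ref_value, agent_value in zip(columns, ref_row, agent_row):
            if ref_value != agent_value:
                diffs.append(CellDiff(index, name, ref_value, agent_value))
    return diffs


if __name__ == "__main__":
    columns = ["week", "rto_pct"]   # hypothetical column names
    reference = [("W1", 12.5), ("W2", 13.0)]
    agent = [("W1", 12.5), ("W2", 13.1)]
    for diff in diff_cells(columns, reference, agent):
        print(diff)
```

An empty list means the two tables agree cell for cell. A non-empty list gives the reviewer the concrete evidence needed to write the justification for a non-exact verdict.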

Deterministic Operating Framework

The solution incorporated:

88 glossary terms
83 MUST/NEVER rules
14 worked examples
Fixed data windows
1.5 TB query scan limits

This established a predictable and controlled operational environment.
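The scan limit can be thought of as a guard applied before a query runs. The sketch below is a minimal illustration that assumes the 1.5 TB figure is a decimal terabyte count and that a byte estimate is available up front; how the limit was actually enforced is not described in this case study. With the google-cloud-bigquery client, such an estimate would typically come from a dry-run job's total_bytes_processed, and QueryJobConfig's maximum_bytes_billed setting can enforce a cap on the server side.

```python
# Assumption: "1.5 TB" is read as decimal terabytes (10**12 bytes each).
SCAN_LIMIT_BYTES = int(1.5 * 10**12)


def within_scan_limit(estimated_bytes: int, limit: int = SCAN_LIMIT_BYTES) -> bool:
    """Return True if a query's estimated scan size is at or below the limit."""
    if estimated_bytes < 0:
        raise ValueError("estimated_bytes must be non-negative")
    return estimated_bytes <= limit


if __name__ == "__main__":
    for estimate in (10**12, 2 * 10**12):
        status = "allowed" if within_scan_limit(estimate) else "blocked"
        print(f"{estimate} bytes: {status}")
```

A client-side guard like this gives a clear error before any job is submitted, while a server-side cap protects against estimates that turn out to be wrong.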

Repeatable Methodology

The same methodology that identified dataset limitations in Round 1 enabled 100% effective accuracy after data improvements in Round 2, demonstrating repeatability across evolving datasets.

Wohlig’s Approach

Collected and audited six input artifacts including tables, metrics, business rules, dimensions, and evaluation prompts.
Performed Spark-to-BigQuery conversion with detailed parity validation and correction of normalization issues.
Developed schema documentation, glossary definitions, join mappings, and column descriptions.
Created structured instruction frameworks, business logic rules, and worked examples.
Built a verified query repository with complete separation from evaluation datasets.
Trained, evaluated, and published the final agent in BigQuery Studio while delivering full evaluation reports to Meesho and Google for independent review.

About Meesho

Meesho is India’s leading social-commerce platform, enabling millions of small businesses to sell online. Its Fulfilment & Experience (FnE) organization manages the end-to-end customer journey from order placement through delivery, claims, returns, customer support, and post-sale operations.

About Wohlig Transformations Pvt. Ltd.

Founded in 2015, Wohlig Transformations specializes in GenAI and DevOps solutions, with more than 160 professionals serving clients across India and the United Kingdom.

Detailed Case Study: https://youtu.be/OuRF7hlSNNc
