Meesho Talk to Data: Achieving 100% Effective Accuracy in BigQuery Conversational Analytics
Project Overview
Meesho partnered with Wohlig Transformations to build a BigQuery Conversational Analytics agent for its Fulfilment & Experience (FnE) team, achieving 100% effective accuracy on 40 held-out evaluation prompts with every result manually verified row-by-row.
The FnE team manages end-to-end operations for India’s largest social-commerce platform using the 10-table mercury dataset in BigQuery, which comprises more than 224 columns and 64 defined business metrics.
Wohlig developed a BigQuery Conversational Analytics agent, published as FnE_v2_14Apr in BigQuery Studio’s Agent Catalog, enabling business users to ask natural-language questions and receive SQL queries, BigQuery jobs, result tables, auto-generated visualizations, plain-English insights, and knowledge-source references.
The engagement was executed in two phases. Round 1 achieved 89% effective accuracy across 57 evaluation queries, while Round 2 achieved 100% effective accuracy across 40 held-out evaluation prompts after a dataset quality improvement initiative and complete re-engineering of the knowledge base.
The Challenge
SQL Dependency Across Business Operations
The mercury dataset contained 10 tables, 64 metrics, and 68 business rules. Answering operational questions about refund clearance, LDR breaches, RTO percentages, NPS, and dispatch performance required analysts to write SQL by hand.
Dataset Quality Constraints
The initial implementation reached 89% effective accuracy. Investigation showed that the remaining performance gap stemmed primarily from inconsistencies within the underlying dataset rather than limitations of the conversational analytics agent.
Evaluation Trustworthiness
Enterprise-grade analytics systems cannot rely on LLM-generated evaluation. Every output needed verification at the data level through direct comparison against trusted reference outputs.
Spark-to-BigQuery Conversion Challenges
During migration and reconciliation, 11 Spark normalization issues were discovered and corrected before outputs could be trusted as ground truth.
Overfitting Risk
A critical requirement was ensuring that evaluation prompts remained completely separate from training and verification datasets to avoid artificially inflated results.
Key Objectives
Enable natural-language access to operational data without requiring SQL expertise.
Validate outputs through row-by-row and cell-by-cell verification.
Maintain complete separation between verified training queries and evaluation prompts.
Ground responses in schema definitions, glossary terms, joins, and business rules.
Create a repeatable methodology capable of scaling across future datasets.
The Solution: Two-Round Conversational Analytics Program
Round 1: Initial Dataset Evaluation
The first version of the agent was evaluated against 57 reference queries and achieved 89% effective accuracy, consisting of:
21 MATCH
30 ACCEPTABLE
6 genuine failures
Analysis revealed that remaining inaccuracies were driven by dataset quality issues rather than agent logic.
Round 2: FnE Cleaned Dataset
After Meesho engineered a cleaned version of the dataset, Wohlig rebuilt the schema understanding, instructions, glossary, and verified query set.
The updated agent achieved:
31 MATCH (77.5%)
5 NEAR_MATCH (12.5%)
4 ACCEPTABLE (10%)
0 NOT_MATCH
Together, these verdicts gave 100% effective accuracy across all 40 held-out evaluation prompts.
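The arithmetic behind both rounds can be sketched as follows. The sketch assumes that "effective accuracy" means the share of prompts judged MATCH, NEAR_MATCH, or ACCEPTABLE, which is consistent with the reported counts but is not defined explicitly in this case study. The function names and the use of NOT_MATCH for Round 1's six genuine failures are illustrative choices.

```python
# Assumption: these three verdicts count toward effective accuracy.
EFFECTIVE_VERDICTS = {"MATCH", "NEAR_MATCH", "ACCEPTABLE"}


def effective_accuracy(verdict_counts: dict[str, int]) -> float:
    """Share of evaluated prompts whose verdict counts as effective."""
    total = sum(verdict_counts.values())
    if total == 0:
        raise ValueError("no verdicts to score")
    effective = sum(
        count
        for verdict, count in verdict_counts.items()
        if verdict in EFFECTIVE_VERDICTS
    )
    return effective / total


def verdict_shares(verdict_counts: dict[str, int]) -> dict[str, float]:
    """Share of the evaluation set that received each verdict."""
    total = sum(verdict_counts.values())
    return {verdict: count / total for verdict, count in verdict_counts.items()}


# Counts as reported for each round; Round 1's genuine failures are
# recorded here under NOT_MATCH (an assumption about labelling).
round_1 = {"MATCH": 21, "ACCEPTABLE": 30, "NOT_MATCH": 6}
round_2 = {"MATCH": 31, "NEAR_MATCH": 5, "ACCEPTABLE": 4, "NOT_MATCH": 0}

print(f"Round 1: {effective_accuracy(round_1):.0%}")  # 51 of 57 prompts
print(f"Round 2: {effective_accuracy(round_2):.0%}")  # 40 of 40 prompts
print(verdict_shares(round_2))
```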
Schema and Knowledge Foundation
The solution incorporated:
10 table descriptions
8 join relationships
88 glossary terms
225 documented columns
20+ critical field recommendations
The glossary included detailed formula definitions and denominator disambiguation to eliminate ambiguity in business metrics.
Structured Instruction Framework
The agent was governed by:
83 MUST/NEVER rules
21 instruction categories
14 BAD/GOOD examples
Mandatory filters
Bucket-metric and date-anchor mappings
This transformed loosely defined business rules into deterministic operational behavior.
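One way to keep such a rule set maintainable is to store each rule as structured data and render the instruction text from it. The following is a minimal sketch of that idea, not the format Wohlig actually used; the `Rule` class, the `render_instructions` function, the category names, and the two example rules are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

ALLOWED_KINDS = ("MUST", "NEVER")


@dataclass(frozen=True)
class Rule:
    category: str  # one of the instruction categories
    kind: str      # "MUST" or "NEVER"
    text: str      # the behaviour being required or forbidden


def render_instructions(rules: list[Rule]) -> str:
    """Group rules by category and render them as plain instruction text."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for rule in rules:
        if rule.kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown rule kind: {rule.kind!r}")
        grouped[rule.category].append(f"- {rule.kind} {rule.text}")
    sections = [
        f"[{category}]\n" + "\n".join(grouped[category])
        for category in sorted(grouped)
    ]
    return "\n\n".join(sections)


# Hypothetical rules for illustration only; not taken from the real rule set.
example_rules = [
    Rule("Schema", "NEVER", "reference a column that is not in the documented schema"),
    Rule("Date anchors", "MUST", "filter on the date column mapped to the requested metric"),
]

print(render_instructions(example_rules))
```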
Verified Query Set
A set of 73 verified queries was created:
57 queries from the MIA Metric List
16 supplementary patterns
The verified query set maintained zero overlap with evaluation prompts.
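A zero-overlap guarantee of this kind can be checked mechanically before each evaluation run. The sketch below is one possible check, assuming prompts are available as plain strings; the function names and the sample prompts are hypothetical. It only catches duplicates that differ in case, punctuation, or spacing, so paraphrased duplicates would still need human review.

```python
import re


def normalize(prompt: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", prompt.lower())
    return " ".join(cleaned.split())


def find_overlap(verified: list[str], evaluation: list[str]) -> set[str]:
    """Return normalized prompts that appear in both sets."""
    return {normalize(p) for p in verified} & {normalize(p) for p in evaluation}


# Hypothetical prompts for illustration only.
verified_prompts = [
    "What is the RTO percentage by week?",
    "Show refund clearance time by category",
]
evaluation_prompts = [
    "Which carriers had the most LDR breaches last month?",
    "what is the RTO percentage, by week",  # leaked near-duplicate
]

leaks = find_overlap(verified_prompts, evaluation_prompts)
if leaks:
    print("Overlap found, evaluation set is not held out:", sorted(leaks))
```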
Three-Way Validation Framework
Outputs were validated across:
Spark on Dataproc ↔ BigQuery Conversion ↔ Agent SQL
This process achieved:
97/97 functional parity
40/40 evaluation queries validated
57/57 golden queries validated
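The comparison at the heart of a three-way check can be sketched as a cell-by-cell diff between result tables. This is an illustrative sketch rather than the tooling used in the engagement: it assumes each result is a list of row tuples already sorted by the same keys, and the function names, tolerance, and sample rows are hypothetical.

```python
from math import isclose


def cells_equal(left, right, rel_tol=1e-9):
    """Compare two cells, tolerating tiny floating-point differences."""
    numeric = (int, float)
    if isinstance(left, numeric) and isinstance(right, numeric):
        return isclose(left, right, rel_tol=rel_tol)
    return left == right


def diff_tables(left_rows, right_rows):
    """Return (row_index, column_index, left_value, right_value) mismatches."""
    if len(left_rows) != len(right_rows):
        return [("row_count", None, len(left_rows), len(right_rows))]
    mismatches = []
    for r, (left_row, right_row) in enumerate(zip(left_rows, right_rows)):
        if len(left_row) != len(right_row):
            mismatches.append((r, None, len(left_row), len(right_row)))
            continue
        for c, (lv, rv) in enumerate(zip(left_row, right_row)):
            if not cells_equal(lv, rv):
                mismatches.append((r, c, lv, rv))
    return mismatches


def three_way_parity(spark_rows, bigquery_rows, agent_rows):
    """Diff Spark against converted BigQuery SQL, then BigQuery against agent SQL."""
    return {
        "spark_vs_bigquery": diff_tables(spark_rows, bigquery_rows),
        "bigquery_vs_agent": diff_tables(bigquery_rows, agent_rows),
    }


# Hypothetical result rows: (date, order count, rate).
spark_rows = [("2024-01-01", 120, 0.125)]
bigquery_rows = [("2024-01-01", 120, 0.125)]
agent_rows = [("2024-01-01", 120, 0.130)]

print(three_way_parity(spark_rows, bigquery_rows, agent_rows))
```

An empty list on both legs corresponds to functional parity for that query; any entry pinpoints the row and column a reviewer should inspect.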
Technology Stack
BigQuery Conversational Analytics API
BigQuery Studio (Agents Preview)
Dataplex
Dataproc
BigQuery Mercury Dataset
Gemini
Key Benefits & Results
Reliable Evaluation
Previous: LLM-as-judge evaluation.
Solution: Cell-by-cell manual verification.
Result: Every evaluation verdict was based on actual output data.
Accuracy Improvement
Previous: 89% effective accuracy.
Solution: Cleaned dataset and rebuilt knowledge framework.
Result: 100% effective accuracy on 40 held-out evaluation prompts.
Reference Consistency
Previous: Spark and BigQuery output drift.
Solution: Three-way validation methodology.
Result: 97/97 functional parity, with 11 normalization issues identified and corrected.
Reduced Hallucinations
Previous: Risk of invented fields and incomplete logic.
Solution: 88-term glossary, 83 instruction rules, and worked examples.
Result: Zero genuine logic failures in Round 2 (0 NOT_MATCH across 40 prompts).
Evaluation Integrity
Previous: Risk of training/evaluation overlap.
Solution: Strict separation between verification and evaluation datasets.
Result: Production-safe and auditable evaluation process.
Improved User Experience
Previous: Manual SQL creation for every business question.
Solution: Conversational Analytics Agent in BigQuery Studio.
Result: SQL generation, execution, visualization, insights, and follow-up recommendations delivered through a single interface.
Technical Innovation
Three-Way Validation Methodology
Every output was validated through direct reconciliation across Spark, BigQuery, and generated SQL before the agent was evaluated.
Manual Row-Level Verification
No LLM-as-judge approach was used. Outputs were evaluated through direct row, column, and cell comparisons. Every NEAR_MATCH and ACCEPTABLE verdict was documented with written justification.
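The requirement that every NEAR_MATCH and ACCEPTABLE verdict carries a written justification can be enforced at the point where reviewers record results. The following is a minimal sketch of such a record, assuming verdicts are logged per prompt; the `Verdict` class, its field names, and the sample entries are hypothetical.

```python
from dataclasses import dataclass

VERDICTS = ("MATCH", "NEAR_MATCH", "ACCEPTABLE", "NOT_MATCH")
NEEDS_JUSTIFICATION = {"NEAR_MATCH", "ACCEPTABLE"}


@dataclass(frozen=True)
class Verdict:
    prompt_id: str
    verdict: str
    justification: str = ""

    def __post_init__(self):
        # Reject labels outside the agreed verdict scale.
        if self.verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {self.verdict!r}")
        # Partial-credit verdicts must explain why the output is still usable.
        if self.verdict in NEEDS_JUSTIFICATION and not self.justification.strip():
            raise ValueError(f"{self.verdict} for {self.prompt_id} needs a justification")


# Hypothetical review entries for illustration only.
exact = Verdict("P01", "MATCH")
partial = Verdict("P02", "NEAR_MATCH", "Same rows; rate rounded to 2 decimals instead of 4.")

try:
    Verdict("P03", "ACCEPTABLE")
except ValueError as error:
    print(error)
```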
Deterministic Operating Framework
The solution incorporated:
88 glossary terms
83 MUST/NEVER rules
14 worked examples
Fixed data windows
1.5 TB query scan limits
This established a predictable and controlled operational environment.
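The case study does not say how the scan limit was enforced. One common pattern is to estimate a query's scan size first and refuse to run it above the cap; in the `google-cloud-bigquery` Python client, a dry-run job reports `total_bytes_processed`, and `QueryJobConfig` accepts `maximum_bytes_billed`. The sketch below keeps to the standard library and shows only the guard logic. It assumes "1.5 TB" means decimal terabytes, and the names `SCAN_LIMIT_BYTES`, `ScanLimitExceeded`, and `check_scan_estimate` are hypothetical.

```python
# Assumption: 1.5 TB in decimal units (use 1.5 * 2**40 for tebibytes).
SCAN_LIMIT_BYTES = 1_500_000_000_000


class ScanLimitExceeded(Exception):
    """Raised when a query's estimated scan size is above the cap."""


def check_scan_estimate(estimated_bytes: int, limit_bytes: int = SCAN_LIMIT_BYTES) -> float:
    """Return the fraction of the limit used, or raise if the cap is exceeded."""
    if estimated_bytes < 0:
        raise ValueError("estimated_bytes must be non-negative")
    if estimated_bytes > limit_bytes:
        raise ScanLimitExceeded(
            f"estimated scan of {estimated_bytes} bytes exceeds limit of {limit_bytes} bytes"
        )
    return estimated_bytes / limit_bytes


# Hypothetical estimate, for example taken from a dry-run job.
print(f"{check_scan_estimate(750_000_000_000):.0%} of the scan limit")
```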
Repeatable Methodology
The same methodology that identified dataset limitations in Round 1 enabled 100% effective accuracy after data improvements in Round 2, demonstrating repeatability across evolving datasets.
Wohlig’s Approach
Collected and audited six input artifacts including tables, metrics, business rules, dimensions, and evaluation prompts.
Performed Spark-to-BigQuery conversion with detailed parity validation and correction of normalization issues.
Developed schema documentation, glossary definitions, join mappings, and column descriptions.
Created structured instruction frameworks, business logic rules, and worked examples.
Built a verified query repository with complete separation from evaluation datasets.
Trained, evaluated, and published the final agent in BigQuery Studio while delivering full evaluation reports to Meesho and Google for independent review.
About Meesho
Meesho is India’s leading social-commerce platform, enabling millions of small businesses to sell online. Its Fulfilment & Experience (FnE) organization manages the end-to-end customer journey from order placement through delivery, claims, returns, customer support, and post-sale operations.
About Wohlig Transformations Pvt. Ltd.
Founded in 2015, Wohlig Transformations specializes in GenAI and DevOps solutions, with more than 160 professionals serving clients across India and the United Kingdom.


