Блог/ru
2020-08-02 19:56

Enabling the data-driven organisation with Text2SQL

As our world is getting more global and dynamic, businesses are more and more dependent on data for making informed, objective and timely decisions. However, as of now, unleashing the full potential of organisational data is often a privilege of a handful of data scientists and analysts. Most employees don’t master the conventional data science toolkit (SQL, Python, R etc.). To access the desired data, they go via an additional layer where analysts or BI teams “translate” the prose of business questions into the language of data. The potential for friction and inefficiency on this journey is high — for example, the data might be delivered with delays or even when the question has already become obsolete. Information might get lost along the way when the requirements are not accurately translated into analytical queries. Besides, generating high-quality insights requires an iterative approach which is discouraged with every additional step in the loop. On the other side, these ad-hoc interactions create disruption for expensive data talent and distract them from more strategic data work, as described in these “confessions” of a data scientist:

"When I was at Square and the team was smaller we had a dreaded “analytics on-call” rotation. It was strictly rotated on a weekly basis, and if it was your turn up you knew you would get very little “real” work done that week and spend most of your time fielding ad-hoc questions from the various product and operations teams at the company (SQL monkeying, we called it). There was cutthroat competition for manager roles on the analytics team and I think this was entirely the result of managers being exempted from this rotation — no status prize could rival the carrot of not doing on-call work."[1]

Indeed, wouldn’t it be cool to talk directly to your data instead of having to go through multiple rounds of interaction with your data staff? This vision is embraced by conversational interfaces which allow humans to interact with data using language, our most intuitive and universal channel of communication. After parsing a question, an algorithm encodes it into a structured logical form in the query language of choice, such as SQL. Thus, non-technical users can chat with their data and quickly get their hands on specific, relevant and timely information, without making the detour via a BI team. In this article, we will consider the different implementation aspects of Text2SQL and focus on modern approaches with the use of Large Language Models (LLMs), which achieve the best performance as of now (cf. [2]; for a survey over alternative approaches beyond LLMs, readers are referred to [3]). The article is structured according to the following “mental model” of the main elements to consider when planning and building an AI feature: