Expand my Community achievements bar.

Generative SQL with Microsoft GitHub Copilot, Visual Studio Code and Data Distiller

Avatar

Level 4

10/12/24

Visual Studio Code (VS Code) is one of the most popular, lightweight, and versatile code editors available today. Developed by Microsoft, it’s designed to work with a wide range of programming languages and tools, offering support for extensions that make it even more powerful. When paired with SQL, VS Code becomes a powerful environment for querying databases, managing data, and writing efficient SQL scripts.

Why Visual Studio Code for SQL?

  • Download JSON: DBVisualizer and many other tools charge for downloading JSON data as part of their premium offerings, which can be a significant limitation, especially when working with Data Distiller. This editor, however, provides much more flexibility without those restrictions.
  • Lightweight and Fast: VS Code is designed to be lightweight, allowing you to run it efficiently on various systems, including machines with limited resources. Despite its small footprint, it packs powerful features.
  • Run Python Notebooks: Visual Studio Code supports Jupyter-style Python notebooks through the Jupyter extension. This extension allows you to create, edit, and run Jupyter notebooks (.ipynb files) directly within the VS Code interface, offering a similar experience to JupyterLab. Here’s what it provides:
  • IntelliSense for SQL: With SQL extensions installed, you get features like IntelliSense (auto-completion, syntax highlighting, and error checking), making writing complex SQL queries faster and less error-prone.
  • Integration with GitHub Copilot: With GitHub Copilot integrated into VS Code, SQL developers can benefit from AI-powered code suggestions while writing queries. Copilot can assist by auto-completing SQL statements, suggesting optimized queries, or even generating complex SQL code based on comments or partial queries, dramatically speeding up query development and reducing errors. This is particularly useful for beginners or when working with complex SQL joins, aggregations, and subqueries.

Enhance Secure Access of Visual Studio Code with IP Whitelisting in Data Distiller

Since this client is installed on your local machine, if your administrator has concerns about accessing data in AEP from a different machine, you should request that they enable the Data Distiller whitelisting feature. This ensures that only IP addresses from the corporate network are allowed access. Even if someone attempts to spoof an IP and send a query, the responses would still be restricted to the corporate network, preventing unauthorized access.

This feature was released in Data Distiller in October 2024. A blog and a tutorial will come soon.

Coding Assistants Boost Productivity of Power Users

If you're someone who isn't interested in understanding the SQL being generated, there are clear limitations to what coding assistants—or any assistants—can do for you. Without grasping the nuances, you'll likely struggle with suggestions that may be incorrect or overly complex, making the process more difficult. This is why conversations about developer productivity assume that developers have the expertise to use these assistants to automate parts of the code they already understand. These tools are not a substitute for a skilled SQL developer.

 

The same principle applies to any AI assistants or task agents. If you don't understand what's happening and why, there's a high risk that a simple mistake could lead to system-wide issues. AI assistants can be very convincing, but they can also provide incorrect answers without you realizing it. This highlights a broader challenge in communication with AI—if you're not fact-checking or fully understanding the context, you could find yourself in serious trouble.

Managing Expectations with Coding Assistants

When working with coding assistants like GitHub Copilot, it's crucial to manage expectations, especially when dealing with complex coding tasks. While Copilot excels at speeding up coding workflows and providing useful suggestions for standard operations, it encounters limitations in handling deeply nested data structures, such as arrays and maps. These structures, which are common in Data Distiller queries and data models, often require a nuanced understanding of context and relationships that current language models struggle to fully grasp. In our experiments with SQL, especially in scenarios involving complex subqueries, joins, and deeply nested data, we've found that Copilot may struggle to generate accurate or optimal code.

 

Retrieval-Augmented Generation (RAG) is a technique that can help alleviate some of these limitations. In a RAG-based system, the model augments its generation by retrieving relevant information from a knowledge base or external documents. This approach can improve accuracy in tasks like querying complex datasets because it combines generated content with factual information retrieval, making the model more context-aware.

However, there are trade-offs with using RAG-based approaches:

  • Restrictiveness: While RAG can help improve the precision of the generated SQL queries by providing more context, it also makes the query generation process more restrictive. The model's output becomes heavily dependent on the retrieved information, meaning it is less likely to generate creative or flexible queries. This can be beneficial when accuracy is critical, but it can also limit the flexibility needed for more exploratory or dynamic queries.
  • Dependency on Knowledge Base: RAG systems rely on a well-curated knowledge base or set of documents for retrieval. If the knowledge base is incomplete, outdated, or lacks detail on specific database schemas, the quality of the suggestions can be limited. This can still result in gaps when dealing with custom or less-documented data models.
  • Performance: RAG-based models require additional steps to retrieve and process information, which will increase the latency of generating suggestions or you have to put in more resources. This might not be an issue for smaller tasks, but for larger, complex queries, it could impact the overall efficiency and cost.

How GitHub Copilot Works in Visual Studio Code

When you install GitHub Copilot extension in Visual Studio Code, it integrates into your development environment, offering code suggestions as you type. Here's how it works:

Access & Pricing

Check this page

GitHub Copilot isn't free, but it's highly affordable considering the significant time it can save during development. What is cool is that you can get a free trial for a month to try and see if it meets your needs.

Privacy Considerations

You can read about it here.

 

GitHub Copilot takes user privacy and security seriously. Here's how your data is handled when using the tool based on the information available

  • Contextual Suggestions Only: Copilot uses the code in your current file and project to make context-aware suggestions. However, it does not access or share the broader content of your private repositories or any code outside the active session to provide its suggestions.
  • No Use of Private Code for Training: While GitHub Copilot was trained on public code repositories, it does not use your private repository data for training its underlying model. Private code remains private, and Copilot only leverages the context within your active session for generating recommendations.
  • Data Collection for Feedback: Copilot collects limited data, such as whether you accept or reject a suggestion, to help improve its performance. However, it does not collect or store the specific code you're working on unless you explicitly share it through feedback mechanisms.

Try the Tutorial

The link is here.