What is dlt?
These were my initial questions when I first encountered the dlt (data load tool) package (https://dlthub.com/) several months ago. dlt is a Python library for loading data from sources into destinations. It ships with a number of verified sources, such as Google Analytics, and predefined destinations, such as BigQuery. In that sense, it is yet another answer to the data integration challenges that SaaS companies like Fivetran, Stitch, or Airbyte are also trying to solve.
I found dlt interesting, but I initially assumed that major players like Fivetran or Airbyte already covered the majority of use cases for most companies. As a result, I closed the dlt documentation and didn't experiment with it for several months.
Then, I needed to build a stock market pipeline from an API endpoint not covered by tools like Fivetran. I remembered the dlt package, revisited it, and built an MVP of the pipeline within an hour. I was stunned by how quickly I achieved this! That was my eureka moment with dlt.
So, why do you need it?
First, dlt is a very lightweight library that is easy to get started with. Its main components are pipelines, which connect sources to destinations (often databases). dlt retrieves data from the source endpoint and stores it in the destination along with some metadata. A key feature is schema inference: the library derives a database schema from the JSON returned by your API call and saves the data accordingly, which is particularly powerful for prototyping. If you need more control, you can also provide your own schema to the pipeline.
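To make this concrete, here is a minimal sketch of such a pipeline, along the lines of my stock market MVP. The API URL, the response shape (a list of JSON records), and the `symbol` parameter are placeholders I invented for illustration; the dlt calls themselves (`dlt.resource`, `dlt.pipeline`, `pipeline.run`) are the library's standard API, with DuckDB used here as a convenient local destination.

```python
import dlt
import requests


@dlt.resource(name="stock_prices", write_disposition="append")
def stock_prices(symbol: str = "AAPL"):
    """Yield daily price records for one ticker from a hypothetical REST endpoint."""
    response = requests.get(
        "https://api.example.com/v1/daily-prices",  # placeholder URL, not a real endpoint
        params={"symbol": symbol},
        timeout=30,
    )
    response.raise_for_status()
    # dlt accepts dicts or lists of dicts; the schema is inferred from the JSON structure
    yield from response.json()


pipeline = dlt.pipeline(
    pipeline_name="stock_market",
    destination="duckdb",      # local DuckDB file, handy for prototyping
    dataset_name="stock_data",
)

load_info = pipeline.run(stock_prices())
print(load_info)  # summary of the load: tables created, row counts, schema changes
```

Running this creates (or reuses) a local DuckDB database, infers a `stock_prices` table from the JSON fields, and appends the new rows on each run; swapping `destination="duckdb"` for `"bigquery"` or another supported destination is the only change needed to load the same data into a warehouse.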
Whenever I need to pull data from an API, I now reach for dlt first. It gives me a straightforward, consistent way to structure ingestion code and makes it quick to get a pipeline into production.
In upcoming articles, I will explore some sample projects with dlt and discuss the interplay between dlt, dbt, and DuckDB.