Reduce that Snowflake spend!

Maciej Tarsa

Snowflake is a powerful data warehousing platform that makes it incredibly easy to onboard new workloads and data sources. In the early stages, the focus is often on rapid development and feature expansion—not cost. But as usage scales, so do expenses, and without proactive management, organizations can find themselves facing significantly higher bills than anticipated.

There are numerous posts out there with steps to optimise your Snowflake costs, so we won’t repeat them here. Instead, we’re presenting some of the more interesting cost optimisations that we’ve used and come across. That being said - don’t forget to reduce that auto-suspend!

Right-size compute

General-purpose warehouses are handy - you can assign the same warehouse to all users and they can take advantage of query co-location. But soon some workloads will require more compute, and you may end up increasing the size of your general-purpose warehouse. Now even the tiniest of queries will potentially cost twice as much or more. It’s a double-edged sword. The choice between consolidating your warehouses into a single one and splitting one into several is never final - you should revisit it regularly. And the best solution is probably a mix of both.

For example, if you have multiple workloads of similar complexity running at the same time, but hitting different warehouses, it may be a good idea to consolidate these into a single warehouse. However, if you have many different workloads (especially if some are light queries and some heavier queries) hitting the same warehouse, it may be worth considering splitting it into multiple warehouses. Light queries can run on the smallest warehouse, while heavier ones run on bigger warehouses. The benefit here is that a light query will not spin up an expensive warehouse.
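To make that concrete, here’s a minimal sketch of such a split - the warehouse and user names are made up for illustration and the sizes are just a starting point:

    -- Sketch: split one general-purpose warehouse into two, sized for the
    -- workloads that hit them. Warehouse and user names are illustrative.
    CREATE WAREHOUSE IF NOT EXISTS wh_light
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 60            -- seconds of idle time before suspending
      AUTO_RESUME = TRUE
      INITIALLY_SUSPENDED = TRUE;

    CREATE WAREHOUSE IF NOT EXISTS wh_heavy
      WAREHOUSE_SIZE = 'LARGE'
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE
      INITIALLY_SUSPENDED = TRUE;

    -- Point the light workload at the small warehouse so its queries
    -- never spin up the expensive one.
    ALTER USER reporting_service SET DEFAULT_WAREHOUSE = wh_light;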

Reduce query predicates

Snowflake comes out of the box with some effective caching mechanisms. For example, the query result cache will return the same result if Snowflake receives exactly the same query, the underlying data hasn’t changed, and the result was last retrieved within the last 24 hours. When a result is served from this cache, no virtual warehouse is activated and hence no warehousing costs are accrued - often the most expensive part of using Snowflake.

Now imagine that you have a service that queries Snowflake for some aggregated data and includes a query predicate like timestamp <= CURRENT_DATE(). There are also regular ingests of fresh data into the table being queried. Seems like the right solution: each query should receive the most up-to-date data. But what if the service querying Snowflake doesn’t need today’s data? In one of our projects we came across an example where only data up to the previous day was actually being used. As a result, the query cache was not being utilised (the underlying data kept changing), and each query was spinning up a virtual warehouse. Reducing the predicate to yesterday’s date would have greatly increased the use of query caching in this instance.
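Here’s a minimal before/after sketch of that predicate change, with made-up table and column names standing in for the real ones:

    -- Illustrative table and column names; the point is the shape of the predicate.
    -- Before: "everything up to and including today" - every fresh ingest changes
    -- the answer, so a warehouse has to run the query again each time.
    SELECT region, SUM(amount) AS total_amount
    FROM sales_events
    WHERE event_date <= CURRENT_DATE()
    GROUP BY region;

    -- After: the consumer only ever needed data up to the end of the previous day,
    -- so cap the predicate there and give repeated, identical queries a much better
    -- chance of being answered from the result cache instead of a warehouse.
    SELECT region, SUM(amount) AS total_amount
    FROM sales_events
    WHERE event_date < CURRENT_DATE()
    GROUP BY region;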

Reduce frequency of data ingests

We live in a world where everyone wants fresh data as soon as possible. But that’s not necessarily what data warehousing is all about. It may be worth reconsidering how often new data is ingested into your system and whether you actually need it that often. As usual, this is a trade-off - if you ingest data less often, your individual compute operations will take longer. If you’re thinking of reducing ingest frequency, a good rule of thumb is to check how long the current ingests take - if it’s less than the minimum 60 seconds of warehouse time you’ll be charged for, it may be worth reducing that frequency. Serverless tasks can be a saviour here as well: if your individual tasks take less than 60 seconds and you are not co-locating queries on the same warehouse, they can be a great solution, especially as they currently cost 0.9x of regular compute.
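As a rough sketch of what that might look like - task, table and stage names and the schedule are placeholders - a serverless task is simply one created without the WAREHOUSE parameter:

    -- Sketch of a serverless ingest task: omit the WAREHOUSE parameter and
    -- Snowflake manages the compute, so a 20-second load is billed for roughly
    -- 20 seconds rather than a 60-second warehouse minimum.
    CREATE OR REPLACE TASK ingest_orders_hourly
      SCHEDULE = '60 MINUTE'                               -- revisit how fresh the data really needs to be
      USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'  -- starting size for the serverless compute
    AS
      COPY INTO raw.orders
      FROM @raw.orders_stage
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

    ALTER TASK ingest_orders_hourly RESUME;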

Reduce data scanned

You may be using some pre-set views that load many columns when only a few are actually used in your heavy-hitting queries. Your original view may have included only a couple of columns, but then another requirement resulted in more columns being added. In the meantime, your original workload still uses the same columns, but now scans a lot more data to produce the same result.

This was the case in one of the environments we looked at recently. Snowflake was receiving a large number of queries using a view containing 20 different columns, of which only 4 were actually being used. After some experimentation, we reduced the bytes scanned per query from an average of 300MB to 80MB and the total execution time from an average of 800ms to 300ms. While the compute cost mostly remained the same (because of the minimum 60-second charge), this reduced the load on that warehouse and freed up resources for additional workloads without increasing compute size and costs.
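The shape of the fix was roughly the following - a narrower view exposing only the columns the workload actually needs (all names here are illustrative rather than taken from that environment):

    -- Sketch of the fix: a narrow view exposing only the columns the heavy
    -- workload reads, instead of the ~20-column view it was pointed at.
    CREATE OR REPLACE VIEW analytics.orders_narrow_v AS
    SELECT
        order_id,
        customer_id,
        order_date,
        total_amount
    FROM analytics.orders;          -- the underlying base table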

Inefficient queries

Sometimes you won’t be able to control all the queries coming into your Snowflake environment. Other teams will send in inefficient, unoptimised queries, and these may make up a big chunk of your Snowflake spend. Suggest some changes to the teams running those queries - but also give them a hand. A message asking them to try and optimise those queries is a good start, but if that doesn’t get you anywhere, work with them directly. They may not have the knowledge or resources to make that change. Another good idea is to create some views for their workloads - that way you control at least part of the execution of those heavy workloads. It’s all about communication.
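Before starting that conversation, it helps to know where the spend is actually going. A query along these lines against the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view (which lags real time by up to around 45 minutes) is one way to surface the heaviest offenders - treat it as a starting point rather than a definitive report:

    -- Rank the last week of queries by elapsed time and bytes scanned,
    -- per user and warehouse, to see who and what to talk about first.
    SELECT
        user_name,
        warehouse_name,
        COUNT(*)                             AS query_count,
        SUM(total_elapsed_time) / 1000       AS total_elapsed_s,
        SUM(bytes_scanned) / POWER(1024, 3)  AS gb_scanned
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
    GROUP BY user_name, warehouse_name
    ORDER BY total_elapsed_s DESC
    LIMIT 20;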

Are you using Snowflake for the correct thing?

Snowflake excels as a data warehousing solution, but can get pretty expensive when used for more operational workloads involving a lot of reads and writes. We’ve seen organisations use Snowflake as their API backend, with queries from the API making up about half of their Snowflake compute costs. While a setup like this may be quick and easy to achieve - the data is already in Snowflake and your API layer can easily connect to it - it may not be the best use of Snowflake. Each query coming from the API spins up your warehouse if it isn’t up already (unless the result cache can be used). Depending on how many requests you receive, this may lead to your warehouse being up most of the time.

This is where other technologies may be more suitable. One solution we would suggest is looking at Iceberg Tables. If you save the table being queried by the API in the open Iceberg table format, other query engines can read it - this may be as simple as querying it directly from an AWS Lambda. Other potential solutions include exporting data from Snowflake into another database for operational querying, or saving pre-computed results to an S3 bucket.
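As a rough sketch of the Iceberg route - assuming you already have an external volume configured (my_s3_volume below is a placeholder) and an edition and region that support Iceberg tables - it could look something like this:

    -- Sketch: materialise the API-facing aggregate as a Snowflake-managed
    -- Iceberg table on an external volume, so engines outside Snowflake can
    -- read the files without waking a warehouse. All names are placeholders.
    CREATE ICEBERG TABLE analytics.api_daily_summary
      CATALOG = 'SNOWFLAKE'
      EXTERNAL_VOLUME = 'my_s3_volume'
      BASE_LOCATION = 'api_daily_summary/'
    AS
    SELECT region, order_date, SUM(amount) AS total_amount
    FROM analytics.orders
    GROUP BY region, order_date;

Offloading those operational reads from the warehouse is also a nice segue to our next point…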

Talk to your teams and look beyond Snowflake

Snowflake is just one part of the puzzle. Time and time again we see teams working in silos, often making assumptions about what the requirements are or creating whole solutions because of a single request. Cooperation with other teams can often clarify what is actually needed - align what your team is working on with what others are planning or implementing already - and even avoid duplication of efforts. For example, if Snowflake’s baked-in caching can be utilised, perhaps there is no need for the service reading data from Snowflake to implement their own caching solution. It’s also important to learn how—and when—to say ‘no.’ Sure, that feature someone mentioned might be cool and useful to some users, but does it align with the strategic goals of your platform?

Other things to consider

And just for completeness, here’s a list of other commonly used ways to optimise your Snowflake costs (a few of them are sketched as example statements after the list):

  • Reduce warehouse auto-suspend
  • Reduce warehouse size
  • Set minimum clusters to 1
  • Ensure correct table clustering
  • Drop unused tables
  • Lower data retention
  • Use transient tables
  • Enable query timeouts
  • ...
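
For a few of these, here’s roughly what the statement boils down to - names and values below are placeholders, so pick settings that match your own workloads:

    -- A few of the items above, as the statements they usually boil down to.
    ALTER WAREHOUSE my_wh SET AUTO_SUSPEND = 60;                     -- suspend after 60s idle
    ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XSMALL';             -- right-size downwards
    ALTER WAREHOUSE my_wh SET MIN_CLUSTER_COUNT = 1;                 -- multi-cluster: scale in when quiet
    ALTER WAREHOUSE my_wh SET STATEMENT_TIMEOUT_IN_SECONDS = 600;    -- stop runaway queries
    ALTER TABLE staging.events SET DATA_RETENTION_TIME_IN_DAYS = 1;  -- lower Time Travel retention
    CREATE TRANSIENT TABLE staging.events_tmp LIKE staging.events;   -- transient: no Fail-safe storage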

Takeaways

This list goes on. There are many ways of optimising Snowflake, but these are some of the more interesting aspects we came across. Perhaps in a few months we’ll have enough examples to write another piece.

If you want to chat about or get help with optimising or implementing Snowflake for your data stack, please reach out! We also work with numerous other technologies and love creating solutions that are fit for purpose for our clients, so don’t be shy.