Migrating from one database system to another is typically a significant undertaking. There needs to be a compelling reason to justify the substantial effort, cost, and risk involved in such a migration.
The most challenging scenario is migrating a highly optimized, computationally intensive workload that operates against large volumes of data. With mission-critical, high-performance database environments, the effort required, the potential disruptions, and the overall risks are all amplified.
Businesses considering migrating to Snowflake must carefully plan and execute the process meticulously. In this article, we talk about the intricacies of Snowflake migration, as well as the best practices that should help you navigate this major transition with ease.
Why migrate to Snowflake? 6 reasons
Snowflake Data Cloud offers multiple advantages for businesses that want to take their data-to-insight capabilities to the next level. Migrating to Snowflake will be particularly beneficial in the following scenarios:
Your system performance struggles
As data volumes and complexity increase, legacy data platforms struggle to process complex queries effectively. Poor system performance is often an early indicator that it's time to modernize your data infrastructure by migrating to Snowflake or another modern data platform.
Snowflake can automate resource optimization, tune queries, and facilitate easier maintenance compared to legacy data platforms. With each platform update, Snowflake continues to improve efficiency and accelerate performance for its clients. For example, Snowflake has decreased query compilation time by 16% since June 2019.
Snowflake's automated resource optimization allows it to dynamically adjust computing resources to handle increasing data volumes and complex queries. Its automated query tuning optimizes queries for maximum performance, removing the need for manual tuning. Snowflake's cloud-based architecture also simplifies maintenance compared to on-premises legacy systems.
You waste hours on analytics
Legacy data infrastructures, coupled with poor data quality, suboptimal data storage, and inefficient pipelines, create bottlenecks that hamper your business intelligence and analytics efforts. With today's technology and analytics capabilities, work that currently drags on for hours should take no more than minutes.
Snowflake's scalable computing power, separation of storage and compute, columnar data storage format, and support for parallel processing can significantly improve query performance, especially for analytical workloads that involve scanning and aggregating large volumes of data.
By leveraging all of those capabilities, we’ve built a Snowflake-powered modern data platform that helped one of our clients, Trans.eu, cut reporting time from 14 hours to just a few minutes. Read the full case study to see how it was done.
You find it hard to scale
Migrating to Snowflake is an effective way to address the scalability challenges of existing data warehousing solutions. If your current data warehouse struggles to handle increasing data volumes and concurrent user queries, Snowflake can provide the necessary scalability.
One key differentiator is Snowflake's automatic concurrency scaling feature. This algorithm-driven system can scale computing resources up or down automatically to handle fluctuating workloads, without manual intervention. According to Snowflake's research, this feature can improve overall performance for concurrent queries by up to 84%.
Your data warehousing costs skyrocket
Another major reason to migrate to Snowflake is cost optimization. Unlike on-premises or traditional data warehouse solutions, Snowflake's data cloud lets you pay only for the resources you actually use. Fully managed solutions like this one help businesses minimize their total cost of ownership by eliminating software license fees, infrastructure management, and maintenance spending.
Snowflake has brought cost optimization beyond just cutting infrastructure costs – they provide clients with native cost and workload optimization features. Snowflake's cost management framework consists of three parts: visibility, control, and optimization. Each attribute represents a set of features and best practices that help you explore current data warehouse costs, enforce cost control and query limits, and discover resources that require fine-tuning.
Your infrastructure isn’t secure and doesn’t support data sharing
Secure data sharing protects not only your customers and partners but also your business reputation. If your current infrastructure offers limited data sharing and collaboration opportunities, Snowflake migration is an optimal solution. First of all, the platform protects your information: when you share database objects with other Snowflake accounts, the actual data isn't transferred or copied. Instead, the sharing happens in the services layer and metadata store.
Moreover, all the objects you share are read-only, meaning that your data consumers won't be able to modify or delete them. To guarantee that your shared data is safe, partner with a tech provider that has strong cybersecurity policies in place.
Sharing data with other accounts is also seamless and quick: access to shared data is granted instantly and can be revoked just as instantly. For businesses, prompt data sharing translates to effective cooperation and informed, timely decision-making.
You lack real-time analytics
Processing data right when it becomes available enables businesses to analyze customer behavior, detect fraud, and improve operational efficiency. If your existing system can't handle high throughput, we recommend migrating to Snowflake. The platform includes a built-in Kafka connector designed to streamline the data pipeline all the way from raw data sources to visualization dashboards. Unlike layered legacy platforms, Snowflake's multi-cluster architecture can handle multiple high-velocity data streams at once. As a result, businesses gain real-time insights into their data, presented on Tableau dashboards.
Some of the many benefits of Snowflake.
Snowflake migration challenges
Here are some of the most common challenges you’ll have to face when migrating to Snowflake:
Adjusting the information architecture
Migrating to Snowflake poses challenges in setting up the information architecture, because existing data structures, types, and sources have to be aligned with Snowflake's multi-cluster shared data architecture. Additionally, migrating from legacy systems often requires reconciling differences in schema designs, handling varying data formats, and addressing data quality issues.
Defining optimized data storage, data access controls, and leveraging Snowflake's features for performance enhancements requires careful planning. Ensuring a seamless transition while accommodating business requirements, security, and performance optimization further complicates the information architecture setup during the migration.
Migrating historical data
Migrating historical data to Snowflake involves extracting it from legacy systems, transforming it to align with Snowflake's architecture, and loading it into the new data platform. Complexities arise from differences in schema designs, data types, and storage mechanisms between the legacy system and Snowflake.
Legacy system architecture, connectivity, and data quality issues can bottleneck efficient extraction and transfer of data. Managing large datasets, optimizing compression and resource usage, validating the loaded data, and making the most of Snowflake's cloud features are the other key challenges.
Using Snowflake's PUT and COPY commands allows for a seamless and scalable migration, but potential bottlenecks include data formatting and staging, concurrency and parallel loading, file size and compression, and error handling and logging.
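As a rough illustration, here is what a minimal internal-stage load with those commands might look like; the stage, table, and file names are hypothetical, and the PUT command is run from a client such as SnowSQL rather than the web UI:

```sql
-- Create a named internal stage with a CSV file format for the legacy exports
CREATE STAGE IF NOT EXISTS legacy_migration_stage
  FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1);

-- Upload the exported files from the local machine to the stage
-- (gzip compression is applied on the fly)
PUT file:///exports/orders_*.csv @legacy_migration_stage AUTO_COMPRESS = TRUE;

-- Bulk-load the staged files into the target table, skipping bad rows
-- instead of aborting the whole load
COPY INTO analytics.public.orders
  FROM @legacy_migration_stage
  ON_ERROR = 'CONTINUE';
```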
Tools like Fivetran, Qlik Replicate, and Matillion offer ease of use and low-code/no-code solutions. These tools provide features such as connectivity to more sources, job scheduling, orchestration, and full automation. For more information, consult this article on migrating historical data to Snowflake.
Migrating the pipelines and code
Manually converting existing data pipelines and code to work in Snowflake can be very challenging. Differences in data processing paradigms, query languages, and overall system architecture between legacy systems and Snowflake make the migration complex. Legacy systems often use their own specialized SQL dialects and procedural languages (such as Netezza SQL or Teradata SQL) that require significant effort to adapt for Snowflake.
Instead of manual conversion, consider using specialized software to automate the translation of queries from legacy SQL dialects to Snowflake's dialect.
Adapting ETL processes to ELT
Snowflake's architecture is optimized for ELT (Extract, Load, Transform) rather than traditional ETL workflows (see our earlier article ETL vs ELT: what's the difference and when to use them? for a detailed comparison of these two data ingestion approaches). In Snowflake, data is first loaded into external stages or internal tables, and then transformed using SQL-based transformations. Adapting existing ETL processes to this ELT approach can require:
- modifying data pipelines to load data directly into Snowflake without the need for a separate transformation step,
- leveraging Snowflake's built-in features like Snowpipe for continuous data ingestion, which may require rethinking the overall data pipeline architecture,
- rewriting transformation logic using Snowflake's SQL-based transformations, which may differ from the transformations used in the previous ETL tools,
- integrating with Snowflake's external stages, which allow for loading data directly from cloud storage (e.g., S3, GCS, Azure Blob) without the need for a separate data loading step.
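To make the ELT pattern concrete, here is a minimal sketch of the load-then-transform flow, assuming a hypothetical @raw_stage containing one JSON document per row and illustrative table names:

```sql
-- "EL": land the raw JSON as-is into a single VARIANT column
CREATE TABLE IF NOT EXISTS raw.sales_events (payload VARIANT);

COPY INTO raw.sales_events
  FROM @raw_stage/sales/
  FILE_FORMAT = (TYPE = JSON);

-- "T": transform inside Snowflake with plain SQL, casting the
-- semi-structured payload into a typed reporting table
CREATE OR REPLACE TABLE analytics.daily_sales AS
SELECT
  payload:order_id::STRING           AS order_id,
  payload:amount::NUMBER(12,2)       AS amount,
  TO_DATE(payload:sold_at::STRING)   AS sold_date
FROM raw.sales_events;
```

In practice, the transformation step would typically be version-controlled in a tool like dbt rather than run ad hoc.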
Optimizing usage costs
Snowflake's pricing model is based on a consumption-based approach, where customers pay for the compute and storage resources they use. This differs from traditional data warehouse pricing, which often involved upfront hardware and software investments, followed by ongoing maintenance and support costs.
Effectively managing Snowflake's costs involves a learning curve. When you're starting your journey with Snowflake, you need to understand compute usage, storage usage, concurrency and scaling, and workload management in order to optimize and manage them effectively.
This needs to be coupled with an understanding of Snowflake's pricing model, usage patterns, and optimization techniques, especially if you are transitioning from a fixed-cost data warehouse model.
Snowflake migration guide: best practices
So, how do you transfer data to Snowflake? It is obviously a hefty and challenging task. Below, we outline the critical stages in the process, as well as the best practices that should facilitate and speed it up.
Prep a thorough inventory
Understanding what is in scope of the migration process is the critical first step. It will help you ensure the scope of the migration is well-defined and then well-executed. The key areas to identify include:
- Data sources and their owners,
- Destinations (understanding the dependencies and data flows to downstream systems or end-user tools),
- Database objects (cataloging tables, schemas, databases, stored procedures, jobs, and gathering statistics),
- Ingestion pipelines and jobs (mapping database objects to ingestion processes and understanding data volumes and frequencies),
- Transformations (documenting the existing data transformations, as they may need to be adjusted for Snowflake),
- Reporting mechanisms (identifying the reporting tools and the reports that need to be migrated or replicated).
Settle on the information architecture
Start by implementing a solid information architecture that includes users, roles, accounts, databases, and schemas. This provides a strong foundation for the rest of the migration process.
A key advantage of Snowflake is its RBAC system, which offers highly granular control and flexibility, but there’s a learning curve when implementing it for the first time. Consult our earlier text on data access policies in Snowflake to find out more about it.
A best practice is to create an access role for each database object and virtual warehouse, as well as a functional role for each function, like Data Analyst, within the organization. This allows the access roles to be assigned to the functional roles, which can then be assigned to individual users.
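A minimal sketch of that pattern, with hypothetical role, database, warehouse, and user names, might look like this:

```sql
-- Access role scoped to one database and one virtual warehouse
CREATE ROLE IF NOT EXISTS ar_sales_db_read;
GRANT USAGE ON DATABASE sales_db TO ROLE ar_sales_db_read;
GRANT USAGE ON ALL SCHEMAS IN DATABASE sales_db TO ROLE ar_sales_db_read;
GRANT SELECT ON ALL TABLES IN DATABASE sales_db TO ROLE ar_sales_db_read;
GRANT USAGE ON WAREHOUSE analyst_wh TO ROLE ar_sales_db_read;

-- Functional role that mirrors a job function and inherits the access role
CREATE ROLE IF NOT EXISTS fr_data_analyst;
GRANT ROLE ar_sales_db_read TO ROLE fr_data_analyst;

-- Finally, assign the functional role to an individual user
GRANT ROLE fr_data_analyst TO USER jdoe;
```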
Choose your authentication method(s)
You have the following methods available to choose from:
- Basic authentication: this is the default and least secure option, requiring only a username and password.
- Federated Authentication/SSO: Snowflake supports SAML 2.0-based single sign-on (SSO) with external identity providers like ADFS, Azure AD, and Google Workspace. This allows users to authenticate through the external IdP.
- Key Pair Authentication: Snowflake supports key pairs as a more secure alternative to basic authentication, and it also supports rotating public keys (see the sketch after this list).
- OAuth: Snowflake supports the OAuth 2.0 protocol, allowing authorized access to Snowflake without sharing or storing user credentials. This can be used with applications like Tableau, Looker, and Alation.
- Multi-Factor Authentication: Snowflake supports optional MFA, powered by Duo Security, to increase login security. MFA is enabled on a per-user basis, and Snowflake recommends requiring it for users with the ACCOUNTADMIN role at a minimum.
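For example, registering and rotating a key pair for a service user boils down to a couple of ALTER USER statements; the user name and the truncated key values below are placeholders:

```sql
-- Attach an RSA public key to a user (the private key stays with the client);
-- generate the pair locally, e.g. with openssl, before running this
ALTER USER etl_service_user SET RSA_PUBLIC_KEY = 'MIIBIjANBgkqh...';

-- Key rotation: Snowflake supports two active public keys, so you can
-- register the new key before retiring the old one
ALTER USER etl_service_user SET RSA_PUBLIC_KEY_2 = 'MIIBIjANBgkqh...';
ALTER USER etl_service_user UNSET RSA_PUBLIC_KEY;
```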
Pick the best migration approach
You basically have two options when it comes to migration approaches: lift and shift, or a complete redesign (modernization).
Lift and shift:
The "lift and shift" approach refers to the process of moving data from one platform to another without making any changes to the data or its structure.
- Pros: relatively simple and quick process, can handle large data volumes, avoids breaking custom applications.
- Cons: doesn't take advantage of Snowflake's features, can be inefficient with redundant/unnecessary data, misses opportunities to improve security and compliance.
Modernization:
In the context of Snowflake migration, modernization refers to the process of making changes to the data or how it is stored and accessed in order to improve efficiency, security, and performance.
- Pros: allows leveraging Snowflake's capabilities to improve performance, efficiency, security and compliance. Provides an opportunity for strategic data management.
- Cons: more complex and time-consuming, requires significant investment and expertise, can disrupt existing applications and processes.
Identifying the best migration approach for an organization's data environment should involve getting in touch with the key stakeholders to determine several key factors:
- Understanding the driving force behind the EDW (Enterprise Data Warehouse) migration - e.g. expiring licenses, hard deadlines,
- Quantifying the scope of the migration - number of tables/views, amount of raw data,
- Identifying any redundant, unused, or temporary data assets that can be removed to reduce the migration load,
- Determining the optimal time to extract the data with minimal impact on business operations,
- Considering any upcoming release cycles that could affect the data model and workloads.
It's critical for IT experts to:
- Assess the available bandwidth between on-premises data centers and cloud providers,
- Identify and factor in any workload dependencies,
- Designate all legacy data assets as hot, cold or warm to finalize the migration plan and refresh/sync strategies.
Simultaneously, operations teams need to:
- Determine lead times for staging server procurement,
- Manage the security approvals needed to move data assets from on-premises servers.
The overall goal is to gather comprehensive information to develop a well-planned, efficient migration strategy that minimizes disruption to the business.
Prepare the data (when modernizing)
You have to prepare the data for ingestion before transferring it into Snowflake. Examine the source data and identify any discrepancies or inconsistencies in data formats, data types, column names, and so on. Cleanse the data to address quality issues such as missing values, duplicates, or outliers.
Transform the data to match the expected data structure and data types required by Snowflake's data model. Ensure that the data is properly formatted and follows Snowflake's naming conventions for tables, columns, and other objects. Validate that the transformed data aligns with Snowflake's data types, constraints, and other requirements.
You should also look into the appropriate data partitioning and clustering strategies to optimize Snowflake's performance.
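For instance, on a large fact table you might define a clustering key that matches the most common filter columns and then check how well the table is clustered; the table and column names here are purely illustrative:

```sql
-- Cluster the table on the columns most queries filter by
ALTER TABLE analytics.fact_orders CLUSTER BY (order_date, region);

-- Inspect clustering depth and overlap for those columns
SELECT SYSTEM$CLUSTERING_INFORMATION('analytics.fact_orders', '(order_date, region)');
```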
Ingest the data
Determine the appropriate data ingestion method for your case: Snowpipe is the best choice for continuous data loading, external stages work well for loading data from cloud storage, and manual bulk loading is also an option.
If using Snowpipe for continuous data loading, configure the necessary infrastructure, including cloud storage buckets, notification services, and Snowpipe definitions. Set up the Snowpipe to automatically ingest data as it becomes available in the source locations, leveraging Snowflake's real-time data loading capabilities.
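Assuming a storage integration is already in place, a Snowpipe setup can be as small as an external stage plus an auto-ingest pipe; the bucket, integration, table, and pipe names below are hypothetical:

```sql
-- External stage pointing at the bucket the source system drops files into
CREATE STAGE IF NOT EXISTS landing_stage
  URL = 's3://my-company-landing/events/'
  STORAGE_INTEGRATION = s3_int;   -- assumes an existing storage integration

-- Pipe that loads new files automatically as cloud event notifications arrive;
-- the target table raw.events is assumed to have a single VARIANT column
CREATE PIPE IF NOT EXISTS events_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO raw.events
  FROM @landing_stage
  FILE_FORMAT = (TYPE = JSON);
```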
For loading data from cloud storage, such as Amazon S3, Azure Blob Storage, or Google Cloud Storage, create external stages in Snowflake to represent the source locations. Develop the necessary scripts or workflows to load the data from the external stages into the target Snowflake tables, ensuring that the data formats and schemas are compatible.
If performing manual bulk loading, utilize Snowflake's COPY INTO command to load large datasets from local or cloud-based sources. This approach may be suitable for initial data loads or for less frequent updates, and it provides more control over the data ingestion process.
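A bulk load of this kind is typically a single COPY INTO statement; the stage, file pattern, and table names in this sketch are illustrative:

```sql
-- Bulk-load Parquet exports from an external stage into a target table,
-- matching stage columns to table columns by name
COPY INTO analytics.customers
  FROM @azure_archive_stage/customers/
  PATTERN = '.*2024.*[.]parquet'
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```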
Handle dependencies
Identify all the dependencies between the objects that need to be migrated: direct circular references, indirect circular references, and self-references.
Next, create a hierarchy of objects based on their dependencies. Remember that circular references can cause issues during the migration process, as they can lead to infinite loops, so they must be handled separately.
Based on the dependency hierarchy, migrate the objects in the correct order, starting with the objects that have no dependencies and then working up the hierarchy. This text on database object dependency analysis sheds even more light on handling dependencies when migrating to Snowflake.
Leverage Snowflake’s security features
The key here is to leverage Snowflake's built-in encryption and access control features, while also implementing any additional security requirements your organization has before migrating data.
Use the existing system security as a starting point, but review roles and users to determine if any need to be implemented differently in Snowflake. Consider the following:
- In Snowflake, all data is encrypted at rest (using AES-256) and in transit (using TLS, the successor to SSL).
- Snowflake also automatically decrypts and re-encrypts data when it is transformed or operated on.
- If your organization requires additional security measures, implement them before moving any data to Snowflake.
- By default, Snowflake allows connections from any IP address, but administrators can create network policies to allow or deny access for specific IP addresses (see the sketch after this list).
- Alternatively, you can use native cloud provider security features like AWS PrivateLink or Azure Private Link to establish a direct private connection between your organization and Snowflake, bypassing the public internet.
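Creating and enforcing a network policy takes two statements; the policy name and IP ranges below are placeholders:

```sql
-- Allow only the corporate VPN range and block one specific address within it
CREATE NETWORK POLICY corp_only_policy
  ALLOWED_IP_LIST = ('198.51.100.0/24')
  BLOCKED_IP_LIST = ('198.51.100.23');

-- Enforce the policy for the whole account
ALTER ACCOUNT SET NETWORK_POLICY = corp_only_policy;
```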
Validate and test
Verify the integrity and completeness of the migrated data in Snowflake. Perform comprehensive data validation checks, such as reconciling row counts, comparing sample data, and checking for missing or duplicate data.
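One lightweight way to reconcile row counts is to load the legacy counts into a small control table and compare them against Snowflake's own metadata; the control table name here is hypothetical, and HASH_AGG can be used for deeper content-level checks:

```sql
-- Flag any table whose Snowflake row count differs from the legacy export
SELECT c.table_name,
       c.legacy_row_count,
       t.row_count AS snowflake_row_count
FROM   migration_control.row_counts c
JOIN   information_schema.tables t
  ON   t.table_name = UPPER(c.table_name)
WHERE  NVL(t.row_count, 0) <> c.legacy_row_count;

-- Content-level fingerprint of a table, for comparison with a hash
-- computed the same way on the source side
SELECT HASH_AGG(*) FROM analytics.customers;
```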
Test the functionality of Snowflake, including query performance, data transformations, and integration with downstream applications. Execute a test plan to verify proper platform functioning, such as running test queries and validating data processing.
Conduct end-to-end testing to ensure the migrated data meets the organization's requirements. Involve stakeholders, define test cases, and address any gaps or discrepancies.
Thorough data validation and functional testing are crucial to building confidence in the Snowflake platform and resolving issues before production.
Automate whatever you can
When migrating to Snowflake, there are several aspects that can be automated to improve efficiency and consistency. Automate the process of loading data into Snowflake using tools like Snowpipe, which can continuously ingest data from cloud storage or event-driven sources. Develop scripts or workflows to load data from various source systems into Snowflake’s external stages or directly into tables. Implement error handling and retries to ensure reliable and resilient data ingestion.
You can also automate the data transformation and cleansing processes using Snowflake's built-in capabilities, such as SQL transformations, stored procedures, and user-defined functions. Leverage Snowflake's support for data pipelines and integrate with tools like dbt to manage and version-control the data transformation logic. Implement automated data validation checks to ensure the transformed data meets the required quality standards.
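As an example of the automation available natively, a scheduled task can rebuild a reporting table on a fixed cadence without any external orchestrator; the warehouse and table names are illustrative:

```sql
-- Rebuild the reporting table every hour on a dedicated warehouse
CREATE OR REPLACE TASK refresh_daily_sales
  WAREHOUSE = transform_wh
  SCHEDULE = '60 MINUTE'
AS
  CREATE OR REPLACE TABLE analytics.daily_sales AS
  SELECT TO_DATE(payload:sold_at::STRING)      AS sold_date,
         SUM(payload:amount::NUMBER(12,2))     AS total_amount
  FROM   raw.sales_events
  GROUP BY 1;

-- Tasks are created suspended; resume to start the schedule
ALTER TASK refresh_daily_sales RESUME;
```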
Automate the creation and management of Snowflake database objects, such as tables, views, and materialized views, using scripts or infrastructure as code (IaC) tools like Terraform. Automate the deployment of schema changes, ensuring consistency and reducing manual effort. Implement a versioning and deployment process to manage the evolution of the Snowflake data model.
Automate the analysis and implementation of data partitioning and clustering strategies based on usage patterns and query requirements. Develop scripts or workflows to monitor Snowflake performance and automatically adjust partitioning, clustering, and other optimization settings. Integrate with tools like Snowflake's Query History and Query Profiler to analyze and optimize query performance.
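The ACCOUNT_USAGE views make the monitoring part straightforward; for example, a query like the following (a sketch, not a full tuning workflow) surfaces the slowest recent queries and how much of each table they scan:

```sql
-- Longest-running queries of the past week, with partition-pruning stats,
-- as candidates for clustering or warehouse-sizing adjustments
SELECT query_text,
       warehouse_name,
       total_elapsed_time / 1000 AS elapsed_seconds,
       partitions_scanned,
       partitions_total
FROM   snowflake.account_usage.query_history
WHERE  start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time DESC
LIMIT 20;
```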
Oracle to Snowflake migration best practices
Oracle is a traditional on-premises relational database, while Snowflake is a cloud-native data warehouse. The migration process involves extracting data from Oracle, transforming it to align with Snowflake's data model, and loading it into Snowflake.
- Oracle is a traditional RDBMS that supports complex transactions with ACID compliance. Snowflake, being a cloud data platform, requires adjustments to schema and queries for optimization.
- For data extraction and loading, use tools like Oracle Data Integrator (ODI) or third-party ETL tools like Informatica, Talend, or Matillion optimized for Snowflake.
- Methods that work best with this migration are direct database links or batch data exports using Oracle tools (e.g., Data Pump).
- Oracle PL/SQL code needs conversion to Snowflake SQL, and Oracle-specific functions and procedures need alternatives in Snowflake.
Teradata to Snowflake migration best practices
Teradata is a traditional on-premises data warehouse, while Snowflake is a cloud-native platform. The migration process may involve extracting data from Teradata, transforming it to fit Snowflake's data model, and loading it into Snowflake.
- Teradata is optimized for complex queries across large datasets. When migrating, it’s crucial to redesign these queries for Snowflake's architecture to maintain performance.
- For data extraction and loading, make use of Teradata’s TPT (Teradata Parallel Transporter) or similar ETL tools that support Snowflake.
- Batch extraction and staging through cloud storage platforms or direct data movement services will work best with this migration process.
- Just like with Oracle, Teradata SQL scripts and stored procedures need conversion to Snowflake SQL. Focus on converting Teradata-specific functions and optimizing for Snowflake’s compute and storage separation.
Snowflake's design and architecture eliminate several administrative tasks and complexities associated with traditional on-premises data warehouse systems, simplifying the migration process. When migrating from Teradata to Snowflake, the following will no longer be relevant, so you don’t need to worry about them:
- Data distribution and primary indexes: Snowflake does not require primary indexes or pre-distribution of data to compute nodes. The architecture separates compute and storage, allowing for instant scaling of parallel compute nodes without the need for data redistribution.
- Indexing and query optimization: Snowflake has a built-in optimizer designed for MPP (Massively Parallel Processing) and the cloud. It automatically optimizes parallel execution plans, eliminating the need to manage indexes (USIs and NUSIs).
- Workload management: Snowflake's multi-cluster architecture allows you to create separate virtual warehouses for different workloads, avoiding resource contention.
- Statistics collection: Snowflake automatically captures statistics, relieving DBAs from the task of setting up jobs to collect statistics for performance tuning.
- Capacity planning: With Snowflake's pay-as-you-go model and elastic storage/compute architecture, there is no risk of over- or under-configuring the system.
Hadoop to Snowflake migration best practices
Hadoop is a distributed file system and processing framework, while Snowflake is a fully managed data warehouse. The migration process here will involve extracting data from Hadoop data lakes or data warehouses, and loading it into Snowflake.
- With this migration, the focus is more on optimizing data partitioning and clustering in Snowflake. If your data is already in a clean, processed format, data transformation efforts will be minimal.
- Hadoop commonly uses file formats like Parquet, Avro, or ORC stored in HDFS. Snowflake handles these formats but requires data transformation for optimal performance.
- For data extraction and loading, use Apache Sqoop for bulk data transfer or custom scripts using Apache Spark to read from HDFS and write to Snowflake.
- Use external stages in Snowflake to load data directly from Hadoop-compatible object storage, if possible.
How long does it take to migrate to Snowflake?
The time it takes to migrate data to Snowflake can vary significantly based on several key factors:
- Data volume: the amount of data being migrated is a primary driver of migration duration. Larger data volumes will naturally take longer to migrate.
- Data complexity: the structure and complexity of the data, such as the number of tables, columns, relationships, and data types, can impact migration times.
- Network bandwidth: the available network bandwidth between the source data systems and Snowflake can greatly influence migration speeds. Higher bandwidth allows for faster data transfers.
- Snowflake provisioned resources: the compute and storage resources provisioned for the target Snowflake account can affect ingest rates. More powerful resources can enable faster data loading.
- Migration approach: the specific migration tools and techniques used can impact performance. Some approaches, like bulk unload/load, may be faster than incremental change data capture.
As a general guideline, a typical data migration to Snowflake can take anywhere from a few hours to several days or even weeks, depending on the scale of the data environment.
For a rough estimate, a migration of a few terabytes of data with moderate complexity could take anywhere from a few hours to a couple of days, assuming adequate network bandwidth and Snowflake provisioning. Larger migrations of 10+ terabytes could take a week or more.
The best approach is to carefully assess the specific data landscape, network capabilities, and Snowflake configuration, then plan and test the migration accordingly to arrive at a realistic timeline.
Take the next step towards Snowflake migration
Migrating to Snowflake can be a transformative journey for your business, unlocking the power of a modern, scalable, and efficient data platform. From automated data ingestion and transformation to performance optimization and disaster recovery, the opportunities to leverage Snowflake's capabilities are vast. But the true value lies in how you strategically implement and manage this powerful platform.
As a certified Snowflake partner and a data analytics consultancy with software development capabilities, we can help you leverage the platform's built-in scalability features to efficiently manage varying workloads without manual scaling. Snowflake handles the underlying resource management, and we make the most of this so you can fully focus on your core business.
We are also well aware of the platform's cost optimization opportunities and can help you analyze your current data spending, implement cost control measures, set query limits, and, as a result, maximize your cost savings.
If you’re interested in taking the next step in your transition to Snowflake, just send us a message via this contact form and we’ll be in touch promptly to schedule a free data consulting session. Don't miss out on this chance to revolutionize your data management and unlock the full potential of Snowflake.