Everything hinges on data these days. What products to sell, how to produce, market and sell them, whom to sell to, and how to make customer experience unique – companies across industries are trying hard to embed data-driven insights into every business decision, customer interaction, and process. In the meantime, analytics and data warehousing are gaining traction as enablers of decision-making across organizations today.
Let’s investigate the concept of an enterprise data warehouse and study its architecture and types. We will also define which DWH capabilities make companies seek data warehousing services to deploy one and outline factors guiding the data warehouse tech stack selection.
What is an enterprise data warehouse?
An enterprise data warehouse (EDW) is a solution that collects business information from multiple heterogeneous sources and then consolidates and stores it in a centralized repository ready for analytics and reporting.
The fundamental purpose of an enterprise data warehouse is to break down data silos, restructure corporate information and make it easily accessible for companies to make timely evidence-driven decisions and operate efficiently in a rapidly changing environment.
Enterprise data warehouse vs data mart
Data marts (sometimes referred to as ‘traditional’ or ‘usual’ data warehouses) are actually subsets of an enterprise data warehouse. They have the same functionality as enterprise data warehouses – collecting data from different sources and making it readily available for analysis. However, they have a narrower scope as they are designed to meet the needs of particular departments or lines of business. This means that they collect not all the data available, but the one that its users (sales, marketing, HR, etc.) need. Data marts are usually created for the sake of speed, since their requests are more specific and their queries run against a smaller amount of information.
Enterprise data warehouses vs operational data stores (ODS)
Operational data stores contain up-to-date information in its original format consolidated from disparate operational sources to enable real-time reporting. Although it may seem rather similar to what a typical data warehouse does, they have rather significant distinctions in terms of:
- Data timeliness
Enterprise data warehouses store historical as well as current data, while ODS storage capabilities are limited to maintaining the most recent records, making it a go-to tactical tool.
- Data aggregation
Enterprise data warehouses host information after it undergoes particular transformations (cleansing, enrichment, reformatting, etc.), while ODS keep information in its raw state in order not to sacrifice the speed of analysis and reporting.
- Query complexity
Enterprise data warehouses are built for running complex analytics queries on huge volumes of information, while ODS are used for relatively simpler queries run on real-time data – for example, to quickly identify the reason behind a failed transaction.
As you see, data marts and ODS are not substitutes for a data warehouse, but rather complement the whole enterprise data warehousing environment.
Why build an enterprise data warehouse
Consolidate corporate information
You want to assess your business performance at an enterprise scale, run risk analytics, conduct strategic planning, or perform any other activity, which requires the analysis of cross-functional historical information. However, you cannot do that because the information you have is either scattered across the company’s systems (ERPs, CRMs, etc.) or randomly piled up in a data lake and ungoverned.
Separate operational and analytics workloads
Accelerating the speed of analysis is a competitive advantage you need to embrace, but running analytics queries against your OLTP databases is extremely slow and results in system failures, as it involves reading millions of rows of data.
Centralize data governance and management
You don’t have a unified approach to data governance and management, which leads to data inconsistencies, data redundancy, varying data quality, data access constraints as a result of security policies, duplicated efforts, etc. All this tangles up analytics workflows and in some cases even compromises the analytics results.
Automate data management
All data management activities (the extraction, cleansing, transformation, etc.) are manual and therefore resource-consuming and error-prone.
Facilitate self-service BI
Your business users are rather dependent on IT teams for day-to-day reporting. This leads to tech specialists becoming overburdened with routine analytics activities on the one hand, while business users can’t obtain the insights they need.
6 components of an enterprise data warehouse
It may seem that an enterprise data warehouse is merely a repository that is connected to your data sources (CRM, IoT devices, SaaS apps, etc.) from one side and to the BI and analytics software from the other. In truth, an enterprise data warehousing solution is the whole data processing and storage environment, which consists of the following components:
- An extract, transform, and load (ETL) or an extract, load, and transform (ELT) solution are tools that ingest information from the source system and process it until it’s suitable for permanent storage. It’s one of the core elements because when performing enterprise-grade analytics, companies have to rely on numerous data sources, which besides storing different types of data, each has its own data model and generates information at various speeds.
- A staging area is a temporary raw data repository sitting in between data sources and its permanent storage. It serves to host the data during the transformation stage. This element is typical for the solutions built with the ETL approach but can be omitted if the transformations are performed in the data warehouse database.
- A relational data warehouse database, which maintains integrated and subject-oriented information.
- Dimensional data marts built for specific groups of users.
- OLAP cubes storing pre-aggregated data, which help streamline analysis and facilitate ad-hoc capabilities for business users.
- Data governance component is responsible for defining processes and policies for managing data quality, security, compliance, metadata management, etc.
Enterprise data warehouse architecture
Modern enterprise data warehouse solutions are usually built according to the three-tier architecture, which includes:
- Data warehouse server (bottom tier)
- OLAP server (middle tier)
- Data access layer (top tier)
However, the architectural approach should be dictated by the specific needs of a company, so in some cases other design methods (e.g. a one-tier or a two-tier architecture) may prove to be more suitable.
Enterprise data warehouse typology
When setting up an enterprise data warehouse, businesses have to choose whether to do that in the cloud, on-premises, or via a hybrid environment.
This option implies deploying and running an enterprise data warehouse on the server located onsite with the help of an in-house IT team. It is a sound choice for companies, which need comprehensive control over the data warehouse hardware and software infrastructure to:
- Ensure DWH’s high availability and security
- Comply with data regulations, which require keeping data onsite.
However, be cautious, as the option requires heavy upfront investments (hardware acquisition, software licenses, IT resources, etc.), comprehensive capacity planning (since on-premises EDWs are not easy and quick to scale), and an experienced IT team to keep the system running efficiently.
When an enterprise data warehouse is hosted and managed in the public cloud, then all hardware-related costs, software licenses, infrastructure audits and maintenance are on the vendor.
This option, representing a core trend of data warehouse modernization, is beneficial in many ways, and this is why cloud data warehouse adoption rates are constantly increasing. Among the significant benefits are:
- Quick deployment and fast and cost-effective scaling of storage and computational resources, as you can either add resources manually or set up automatic up-or-down scaling.
- Minimized upfront costs since you don’t need to buy hardware and most vendors offer a pay-as-you-go model.
- Fault tolerance and disaster recovery due to the distributed nature of the cloud data warehouses. If one server fails, others activate and substitute the failed one automatically.
- Low maintenance, since although cloud data warehouses still require administration, you do not have to allocate a whole team for that. Hardware and software updates are also automatic and are already covered by the subscription fee.
Cloud data warehouses may seem an optimal option but for two things: data compliance and hidden costs. Regardless of the fact that major cloud vendors meet multiple compliance requirements, there are still cases when you must not store data in the public cloud (for example, national infrastructure data, industrial secrets, etc.). As for costs, even though cloud vendors do offer calculators and free plans to estimate possible expenditures, pricing still isn’t as straightforward as one imagines. Through the efforts of skillful marketers, such factors as extra fees for moving your data out of the cloud (egress fees), hot data storage, excess compute, geo-redundancy, etc. may pass undetected.
Getting the best of both worlds (on-premises and cloud data warehouses) is a go-to variant for companies that:
- Want to operate efficiently in the cloud while still meeting the strictest regulatory requirements and addressing data latency issues.
- Have solid (in-house or outsourced) expertise in developing and maintaining data warehousing solutions in hybrid environments.
Need expert help to choose optimal technology for your DWH?
How to choose an enterprise data warehouse platform
As the field of data analytics is being heavily invested in, the diversity of data warehouse software technologies may be overwhelming. A properly selected tech stack can decrease your enterprise data warehouse implementation and maintenance costs, while an unsuitable solution can result in negative ROI. So when assessing enterprise data warehouse technologies, you need to consider multiple factors:
When choosing an appropriate platform, besides targeting your current data volumes, you have to also take into account your annual information growth rates as well as strategic data management and analytics objectives, including plans to support your decision-making with new data sources. It will help you avoid overwhelming storage costs and EDW scaling inflexibility.
Tip: If your current and target data volumes are measured in terabytes, petabytes, etc., you may want to consider cloud data warehouses with separate storage and compute costs or those that charge for cold storage at a discounted price.
Your enterprise data warehousing solution should be able to ingest, consolidate and store your specific information, be it real-time or historical, structured or unstructured, coming in bulk or streams.
Tip: Make sure the software you are considering can integrate with the existing source systems via pre-built connectors or open APIs. In case you have plans to run analysis on huge volumes of unstructured data or enable streaming analytics, you would want to consider a data warehouse platform that can be seamlessly complemented with a data lake and an ODS.
You do not want to build an expensive data warehousing solution that nobody uses because of poor user experience. The two main reasons lying behind an enterprise data warehouse’s slow performance are the storage and computational bottlenecks.
Therefore, make sure that the system you are planning to use can promptly upscale to accommodate the spikes. To choose the technology that suits your use cases (data preparation, data analysis, data reporting, etc.) best, you have to consider:
- Data warehouse users (data analysts, executives, operational managers, etc.)
- Number of end-users
- How often the solution would be used
- Query concurrency, etc.
To ensure the enterprise data warehouse is available for users anytime they need but in case of system failures the data could still be restored within a reasonable time, your data warehousing solution should support such functionality as:
- Automatic data backup
- Data restoration capabilities (for example, from a snapshot taken a day earlier)
- Geo-redundancy (in case of natural disasters)
- Continuous system health monitoring, etc.
One of the challenges when dealing with the enterprise data warehousing solutions is making data accessible to end-users without sacrificing its security. To ensure this, your data warehousing platform should support:
- Granular access control to grant data access to specific users or user groups on row and column levels.
- Multi-factor user authentication to let a data warehouse user access corporate information only after verifying their identity 2+ times.
- Information encryption performed by default or configured manually, at rest and in transit, etc. to safeguard sensitive information (personal records, financial data, etc.).
- Dynamic data masking for hiding data elements from certain users and groups.
- Compliance with regional and industry-specific data standards and regulations (HIPAA, GDPR, PCI DSS, etc.)
A data warehouse is only viable when it can be integrated with analytics and BI services; otherwise, it is just an expensive archive. Many vendors offer considerable discounts for product packs (data warehouse services pre-integrated with BI and analytics services from the same vendor) to help you streamline the deployment process and save resources. However, to avoid vendor lock-ins and future-proof your solution, we recommend you choose EDW platforms with rich on-premises and cloud data integration capabilities (open APIs, pre-built connectors, etc.) to make it as vendor-agnostic as possible.
We’ve already mentioned that if you are going for the on-premises data warehouse solution, be ready for heavy upfront investments for hardware acquisition, software licensing, and personnel. Besides that, you would also need to cover the ongoing hardware and software updates, physical space, power consumption, etc.
What concerns the cloud EDWs, most vendors offer on-demand plans (when you pay for the resources you used) and pre-purchase plans (when you reserve storage and compute resources ahead) to choose from. To choose the optimal platform configuration and an optimal pricing option, you have to define:
- Current data volumes
- Target data volumes
- Number of users and their goals (how often they are going to use the solution and what for)
- Data transformation complexity, etc.
Tip: Cloud EDW cost efficiency may be achieved with decoupled storage and compute resources, data compressions, flexible indexing, automatic workload management with pause and resume capabilities, materialized view support, and query results caching among others.
When dealing with on-premises solutions, maintenance can be a huge pain point. You have to ensure there is always an IT team in place to make quick adjustments to the data warehouse without any disruptions to your day-to-day business operations.
In the case of cloud solutions, you may find data warehousing services with a high level of self-optimization, but some companies prefer to still maintain the data warehouse manually for the sake of greater control and flexibility.
If you have decided on cloud deployment, keep in mind that cloud vendors do everything possible to make you stick with them no matter what. They may charge you extra for moving data out of their cloud, integration with third party solutions, running multi-cloud, and whatnot. So if you are looking for a truly omnicloud experience or want to go for hybrid data warehousing, the technology choice may be rather limited.
If you want to get more value out of your data, consider building an enterprise data warehouse to support and streamline your data-driven journey. There has never been a better time, with the volume of information is growing, computing powers are increasing, and the technology is already here, getting more advanced and affordable.