With data volumes growing massively and data storage costs decreasing, businesses are learning to take advantage of big data. Unfortunately, instead of reaping instant benefits, business users have realized that, despite the data governance frameworks they already have, they need to involve an extended set of data analytics services to solve challenges related to multiple data formats and security.
For this reason, big data governance requires a different approach to ensure that the right data can be accessed by the right people, who can then use it in data-driven decision-making.
Big data governance Q&A
A big data governance plan should be tailored to business needs and industry laws while taking into account the essential characteristics and requirements of big data processing. For instance, data governance for healthcare and retail would both cover personal information, but the security measures for transferring it would differ in each case.
Big data governance vs data governance: what’s the difference?
Actually, there shouldn’t be any difference: the same principles should apply to both. Big data governance is very similar to traditional data governance in terms of challenges and principles. However, the former has to take into account a number of characteristics peculiar to big data:
- The volume of big data can reach petabytes and more
- Big data can be structured, semi-structured, or unstructured
- Big data repositories span files, NoSQL databases, data lakes, and streams
- Data is extracted from internal and external sources, including connected devices
- Data is processed in real time
As traditional data governance tools on the market today can’t support big data processing requirements, it’s imperative for businesses to rethink their data analytics strategy and for existing technologies to mature and evolve to meet new challenges.
Who is responsible for data governance?
Data governance crosses departmental borders and influences each department’s bottom line, for better or worse. It requires collaboration across the entire enterprise and clearly defined roles and data ownership levels. When everybody knows who has data-related authority and responsibilities, it becomes possible to avoid chaos and mistakes and to socialize data governance across the organization.
Depending on the organizational size and goals, the following roles may be needed for a powerful big data team:
- A data governance committee consists of top managers who are responsible for data strategy creation or approval, prioritization of projects, and authorization of data policies and standards.
- Chief data officers (CDOs) participate in data strategy development, oversee data framework implementation, and use data as a strategic asset. They create data standards, policies, and practices, and grow corporate data cultures.
- Big data architects are proficient in relevant technologies and understand the relationships between them. They are responsible for designing big data processing solutions that address the organization’s data-related problems. They can be deeply engaged in data governance, automation, and security.
- Data engineers set up systems for accumulating, cleaning, and organizing data from multiple sources, and transferring it to data warehouses.
- Data scientists/data analysts are responsible for analyzing large sets of structured and unstructured data, creating algorithms and predictive models, and extracting trends and insights relevant to the business.
- Data owners are team members who use data and are accountable for data assets in terms of quality and security within their teams.
- Data stewards closely cooperate with data owners, overseeing how the latter execute the data strategy and whether they follow data policies and standards. They also participate in training new data owners. As big data is often collected but left unused for lack of qualified professionals, demand for data stewards and related expertise is currently significantly higher.
- Data users are team members who interact with data to perform their daily activities. They enter data, access different datasets, and generate reports.
- IT teams are responsible for technology implementation and customization, development of additional features for big data processing, audit, security, and maintenance.
It’s clear that big data technologies can take over some responsibilities of the above-mentioned roles, like architects and data scientists, or some of the roles can merge and combine responsibilities. However, it’s important to involve the required professionals throughout the technology implementation.
What are data governance levels?
It’s important to understand that not all data can be governed in the same way, particularly when we deal with big data. There can be three levels of governance:
- Strictly governed data is already vetted, rationalized, organized, and optimized for performance.
- Loosely governed data can be of two types. It can be data used by data scientists to run experiments, make approximations, and search for trends and patterns. Or it can be data that doesn’t require preparation or needs a minimal amount of rationalization, like key IDs.
- Non-governed data is raw data in its purest form, with no additional keys. Such data can be used for ‘schema on read’ analysis—data is stored in an unorganized and unstructured format and gets organized for particular purposes.
It’s important to monitor data across all governance levels—data is constantly changing and can require a shift to another governance level.
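As a toy illustration of ‘schema on read’ for non-governed data, here is a minimal Python sketch: raw records are stored as-is, with no enforced structure, and a schema is imposed only at read time for a particular purpose. The field names and coercion rules are hypothetical.

```python
import json

# Raw, non-governed records as they might land in a data lake:
# no enforced schema, inconsistent fields across events.
raw_events = [
    '{"user": "u1", "amount": "19.99", "ts": "2023-05-01"}',
    '{"user": "u2", "amount": 5, "extra_field": true}',
    '{"user": "u3"}',
]

def read_with_schema(line):
    """Impose a schema at read time: pick the fields this analysis
    needs and coerce them to consistent types."""
    record = json.loads(line)
    return {
        "user": str(record.get("user", "")),
        # Coerce amount to float, defaulting to 0.0 when absent
        "amount": float(record.get("amount", 0.0)),
    }

events = [read_with_schema(line) for line in raw_events]
print(events)
```

The storage side stays cheap and flexible; the cost of structuring the data is paid only by the analyses that actually read it.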
Big data governance must-haves
When it comes to big data, even such common procedures as accumulation and storage turn into challenges, let alone analysis and forecasting. Here are some of the must-haves that can make a difference here.
A big data governance framework
Big data and rigid control don’t fit together. To allow for different levels of governance, it’s necessary to develop a framework that will keep everyone in the company on the same page. Each enterprise can have its own unique framework aligned with business objectives and vision, but to achieve sustainable governance, it’s necessary to take into account the following components.
To make sure relevant data is collected and processed, everyone moves in the same direction, and there are metrics to measure progress and success, it’s necessary to explain from top to bottom why big data governance is essential (perhaps by using data storytelling) and to develop a big data mission and vision based on these objectives.
A big data governance strategy requires a professional team that will obtain, manage, use, and protect data. Based on the organizational structure, it’s necessary to establish which internal roles you need—data architects, data scientists, data owners, data stewards, or others. Once the roles are assigned, it’s possible to delegate authority and responsibility for correct data sharing and use.
Communication opportunities and barrier-free access to data should be provided to make employees feel like a team rather than isolated stakeholders. It’s also important to establish an ongoing training program and enroll everyone in a data role in relevant big data governance training.
Another important point is to inspire a data culture within a data governance team. Ideally, it should be a culture of participation, sustainability, and enablement of data quality and compliance.
Big data governance management, together with data owners and data stewards, should develop a set of rules and regulations, such as data policies and standards, to regulate data capture, management, usage, and protection. All the actors in a big data governance process should be aware of how data is used (regular data audits help ensure this), understand compliance laws and internal practices, know how to act within the legislation, and use data correctly and legally.
Above all, businesses implement big data governance to keep data safe. In addition to the powerful control mechanisms of enterprise cybersecurity, employees who interact with data should be aware of sensitive data security practices and follow the established rules during data processing and change. There should be a system of access levels that regulates who can view and change different types of data.
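A system of access levels can be sketched as a simple role-to-permission mapping. The role names and actions below are hypothetical; a production setup would rely on an identity provider, database-level permissions, and audit logging rather than an in-memory dictionary.

```python
# Hypothetical role names and actions, for illustration only.
PERMISSIONS = {
    "data_user":    {"read"},
    "data_owner":   {"read", "write"},
    "data_steward": {"read", "write", "approve"},
}

def can_perform(role: str, action: str) -> bool:
    """Return True if the given role is allowed to perform the action.
    Unknown roles get no permissions by default (deny by default)."""
    return action in PERMISSIONS.get(role, set())

print(can_perform("data_owner", "write"))  # True
print(can_perform("data_user", "write"))   # False
```

The deny-by-default behavior for unknown roles is the important design choice: access must be explicitly granted, never assumed.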
Extended data warehouse architecture
Are traditional enterprise data warehouses (EDWs) dead? Of course not. However, to realize the business impact of big data, a new kind of architecture is needed, one that combines an EDW environment with innovative technologies able to process multi-structured data. For this purpose, the extended data warehouse architecture, or XDW, was introduced. Let’s review its layers and components.
The data layer
The data layer stores massive amounts of structured and unstructured data. It can be raw data stored on-premises in relational databases, NoSQL databases, distributed file systems, or in the cloud via services like AWS or Microsoft Azure.
The layer can also include real-time streaming data: large stream-processed chunks of data, continuously generated by multiple sources and processed in motion (as opposed to data that is first stored and indexed prior to processing). It can be in-app activity, social media sentiment, telemetry from IoT devices, and more.
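To illustrate the ‘data in motion’ idea, here is a minimal Python sketch that maintains running per-device aggregates as telemetry readings arrive, without storing and indexing them first. The generator stands in for a real stream (e.g. Kafka or Kinesis), and the sensor names and values are invented.

```python
def telemetry_stream():
    """Stand-in for a real event stream: records are yielded one at a
    time, as they would arrive from connected devices."""
    readings = [("sensor-1", 21.5), ("sensor-2", 19.8), ("sensor-1", 22.1)]
    for reading in readings:
        yield reading

# Update running totals per device while the data is in motion;
# nothing is persisted or indexed before processing.
totals, counts = {}, {}
for device, value in telemetry_stream():
    totals[device] = totals.get(device, 0.0) + value
    counts[device] = counts.get(device, 0) + 1

averages = {device: totals[device] / counts[device] for device in totals}
print(averages)
```

The same incremental pattern, scaled out over a stream processor, is what lets streaming pipelines answer questions continuously instead of after a nightly load.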
The integration and ingestion layer
This layer is used to add data into the data layer. In addition to traditional integration with meticulously designed ETL processes, it’s possible to use a data refinery here. The latter ingests raw structured and unstructured data in batch and in real time from sources such as IoT devices or social media, transforms it into useful information, and feeds it to other XDW components.
A data refinery is used to determine the value of big data. A rough analysis makes it possible to quickly understand which data is useful and interesting. The process requires flexible data governance, as the resulting data may not require integration and quality processing (but flexibility doesn’t exclude security and privacy).
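The ‘rough analysis’ step of a refinery can be as simple as profiling field coverage across raw records to decide which data is worth full integration and quality processing. A minimal Python sketch, with hypothetical record fields:

```python
import json
from collections import Counter

# Raw mixed-source records, as a refinery might ingest them.
raw = [
    '{"patient_id": "p1", "visit_date": "2023-01-10", "note": ""}',
    '{"patient_id": "p2", "note": "follow-up"}',
    '{"patient_id": "p3", "visit_date": "2023-02-02"}',
]

# Rough analysis: how often is each field actually populated?
# High coverage is a quick signal that a field is worth refining.
records = [json.loads(line) for line in raw]
coverage = Counter()
for record in records:
    for key, value in record.items():
        if value not in ("", None):
            coverage[key] += 1

profile = {field: count / len(records) for field, count in coverage.items()}
print(profile)
```

A field populated in every record (here `patient_id`) is a candidate for strict governance and integration, while sparse fields may stay loosely governed until an analysis needs them.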
The processing layer
This is where a traditional EDW sits, taking in all data, structuring it into a format suitable for querying by SQL and OLAP data warehouse servers, and pushing it to BI tools. It’s still the best source of clean, reliable, and consistent data for critical analysis in financial or regulatory fields. It’s also the source of data for KPIs and other standard metrics used by various departments within a company.
Investigative technologies, such as Hadoop or Spark, deal with more unusual types of data and various experiments. They explore big data sources using such analytical methods as data mining, pattern analysis, or even custom investigations. The use scenarios of such technologies vary from simple sandboxes for experiments to full-scale analytical platforms. In any case, they allow analyzing large data volumes at high speed and feeding this data into an EDW, a real-time analysis engine, or standalone business apps.
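The pattern-analysis work that Hadoop and Spark distribute across a cluster follows a map/reduce shape. Here is a toy, single-machine Python sketch of that shape counting token frequencies in log-like events; the events themselves are invented, and a real Spark job would apply the same two phases across many nodes.

```python
from functools import reduce
from itertools import chain

# Toy log-like events; a cluster would process millions of these.
events = ["login error", "login ok", "timeout error", "login error"]

# Map phase: emit (token, 1) pairs from every event.
pairs = chain.from_iterable(
    ((token, 1) for token in event.split()) for event in events
)

# Reduce phase: aggregate the counts per token.
def merge(acc, pair):
    token, n = pair
    acc[token] = acc.get(token, 0) + n
    return acc

counts = reduce(merge, pairs, {})
print(counts)  # {'login': 3, 'error': 3, 'ok': 1, 'timeout': 1}
```

Because both phases operate on independent records and associative aggregations, the same logic parallelizes naturally, which is exactly what makes these frameworks fast on large volumes.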
The analytics and BI layer
Here technologies for data visualization and cloud business intelligence allow data scientists and analysts to explore data, ask it questions, build and interact with visualizations, and more.
Another component is a platform that supports streaming analytics and development of real-time analytical apps. Its application use cases cover fraud detection, traffic flow optimization, risk analytics, etc. The platform is tightly integrated with other components, like EDWs or investigative technologies, to freely transfer data to and from them.
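As a simplified example of the kind of real-time rule such a platform might run, the sketch below flags a transaction when it exceeds three times the rolling average of recent amounts for the same account. The threshold and window size are arbitrary assumptions; production fraud detection would combine many such signals with trained models.

```python
from collections import deque

WINDOW = 5  # number of recent amounts to average per account (arbitrary)

def detect(transactions):
    """Return (account, amount) pairs that look anomalous against
    each account's own recent history."""
    history = {}
    flagged = []
    for account, amount in transactions:
        window = history.setdefault(account, deque(maxlen=WINDOW))
        # Flag when the amount exceeds 3x the rolling average so far.
        if window and amount > 3 * (sum(window) / len(window)):
            flagged.append((account, amount))
        window.append(amount)
    return flagged

txns = [("acc1", 20), ("acc1", 25), ("acc1", 22), ("acc1", 300)]
print(detect(txns))  # [('acc1', 300)]
```

The key property is that the rule needs only a small, bounded amount of state per account, so it can run on events in motion instead of querying a warehouse after the fact.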
All these components can’t function in isolation from each other—all of them must be brought together, complemented by data governance.
Itransition’s big data governance project
For one of our data analytics projects, Itransition partnered with a US-based multinational company providing advanced pharmaceutical analytics and technologies. The customer had accumulated over 500 million records of more than 50 thousand patients, not to mention petabytes of proprietary data. However, their legacy system limited their ability to get more value out of this growing data, so they approached us to help them create a business intelligence project plan, migrate to the cloud, and improve their data management capabilities.
We redeveloped a BI platform
The customer’s data analytics platform comprised a toolset for generating reports based on multiple structured and unstructured data sources. The system couldn’t support the company’s needs or adapt to the changing market, so it required dramatic reengineering and optimization in terms of UI, data processing, and report generation. Itransition developed a new BI platform on the ASP.NET MVC framework, with Microsoft SQL Server as the database engine, delivering redesigned functionality with greater flexibility and scalability. The result was 3-5x faster SQL querying and reduced RAM and CPU usage.
We delivered a data management and data visualization app
The legacy platform didn’t support multiple data source formats and had an outdated ETL configuration, which slowed down data processing. As a result, data processing could take days, with some sources excluded from processing. Additionally, non-tech users couldn’t participate in ETL processes and needed a user-friendly interface to interact with data.
We developed a data management app and integrated it with several database engines (Oracle, Microsoft SQL Server) and Apache Hadoop to enable distributed storage and processing of large datasets. This enabled 10x faster data processing along with lower memory and storage usage. The app also became accessible to non-tech users, who could visualize data and get reports within minutes. The system was able to process various data sources, transform data, and prepare different output forms, be it databases or files. This way, users were able to deliver prepared data to other destinations, like cloud storage, FTP servers, or other teams.
We migrated to the cloud
To maintain high system performance as the number of users was constantly growing, we initiated the system migration from the on-premises server to the cloud. Our DevOps specialists audited the existing infrastructure and prepared a migration roadmap. We designed a scalable and secure cloud infrastructure based on AWS DevOps practices. As a result, the customer got a virtual private cloud with private and public subnets, defined network gateways, and fine-tuned security settings.
To ensure security of large volumes of sensitive data, we used Amazon S3. Critical data was backed up via AWS tools. We utilized Amazon RDS to create and save automated backups of database instances. For enhanced security, we used AWS services to store passwords and license codes as encrypted parameters and enabled secure configuration of managed instances and password reset.
We delivered excellent results for the long run
Our solution is now used by many leading pharmaceutical corporations, enabling them to handle data of multiple formats from various sources and manage their data assets efficiently and securely with big data governance tools.
Big data is disrupting traditional data management. Taking into account the predictions for the future of big data, enterprises consider it urgent to seek new ways and new technological solutions that can help process large amounts of multi-format data in an efficient and secure way. Big data governance is an essential component of a brand-new approach to data, and it’s important to get it right by means of a tailored framework and infrastructure.