With data volumes growing massively and data storage costs decreasing, businesses are learning to take advantage of big data. Instead of reaping instant benefits, however, business users have found that the data governance frameworks they already have are not enough: they have to involve an extended set of data analytics services to solve the challenges of multiple data formats and security.
For this reason, big data governance requires a different approach to ensure that the right data can be accessed by the right people, who can then use it in data-driven decision-making.
A big data governance plan should be tailored to business needs and industry laws while taking into account essential characteristics and requirements of big data processing.
Actually, there shouldn’t be any difference—the same principles should apply to both. Big data governance is very similar to traditional data governance in terms of challenges and principles. However, the former has to take into account a number of peculiar big data characteristics:
As traditional data governance tools on the market today can’t support big data processing requirements, it’s imperative for businesses to rethink their data analytics strategy and for existing technologies to mature and evolve to meet new challenges.
Data governance crosses departmental borders and influences each department’s bottom line, for better or worse. It requires collaboration across the entire enterprise and clearly defined roles and data ownership levels. When everybody knows who has data-related authority and responsibilities, it becomes possible to avoid chaos and mistakes and to socialize data governance across the company.
Depending on the organizational size and goals, the following roles may be needed for a powerful big data team:
It’s clear that big data technologies can replace some of the above-mentioned roles, like architects and data scientists, and that some roles can merge and combine responsibilities. However, it’s important to involve the required professionals throughout the technology implementation.
It’s important to understand that not all data can be governed in the same way, particularly when we deal with big data. There can be three levels of governance:
It’s important to monitor data across all governance levels—data is constantly changing and can require a shift to another governance level.
When it comes to big data, even such common procedures as accumulation and storage turn into challenges, let alone analysis and forecasting. Here are some of the must-haves that can make a difference.
Big data and rigid control don’t fit together. To allow for different levels of governance, it’s necessary to develop a framework that will keep everyone in the company on the same page. Each enterprise can have its own unique framework aligned with business objectives and vision, but to achieve sustainable governance, it’s necessary to take into account the following components.
It’s necessary to explain from top to bottom why big data governance is essential (maybe by using data storytelling) and to develop a big data mission and vision based on these objectives. This way, relevant data is collected and processed, everyone moves in the same direction, and there are metrics available to measure progress and success.
A big data governance strategy requires a professional team that will obtain, manage, use, and protect data. Based on the organizational structure, it’s necessary to establish which internal roles you need—data architects, data scientists, data owners, data stewards, or others. Once the roles are assigned, it’s possible to delegate authority and responsibility for correct data sharing and use.
Communication opportunities and barrier-free access to data should be provided to make employees feel they are a team rather than isolated stakeholders. It’s also important to establish an ongoing training program and enroll all data roles into related big data governance training.
Another important point is to inspire a data culture within a data governance team. Ideally, it should be a culture of participation, sustainability, and enablement of data quality and compliance.
Big data governance management together with data owners and data stewards should develop a set of rules and regulations, like data policies and standards, to regulate data capture, management, usage and protection. All the actors of a big data governance process should be aware of data usage and compliance laws and internal practices, know how to act within the legislation, and use data correctly and legally.
Above all, businesses implement big data governance to keep data safe. In addition to the powerful control mechanisms of enterprise cybersecurity, employees who interact with data should be aware of sensitive data security practices and follow the established rules when processing and changing data. There should be a system of access levels that regulates who can view and change different types of data.
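A system of access levels can be sketched in a few lines. The example below is only an illustration: the sensitivity tiers (`PUBLIC`, `INTERNAL`, `RESTRICTED`), the role-to-level mapping, and the function names are all assumptions made for the sketch, not a prescribed scheme. A real setup would rely on the access controls of the underlying data platform.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical sensitivity tiers; a higher value means more sensitive data."""
    PUBLIC = 1
    INTERNAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class Role:
    name: str
    max_view: Sensitivity  # most sensitive tier the role may read
    max_edit: Sensitivity  # most sensitive tier the role may change


# Assumed mapping of roles to access levels, for illustration only.
ROLES = {
    "analyst": Role("analyst", Sensitivity.INTERNAL, Sensitivity.PUBLIC),
    "data_steward": Role("data_steward", Sensitivity.RESTRICTED, Sensitivity.INTERNAL),
    "data_owner": Role("data_owner", Sensitivity.RESTRICTED, Sensitivity.RESTRICTED),
}


def can_view(role_name: str, level: Sensitivity) -> bool:
    """Unknown roles get no access; known roles can read up to max_view."""
    role = ROLES.get(role_name)
    return role is not None and level <= role.max_view


def can_edit(role_name: str, level: Sensitivity) -> bool:
    """Unknown roles get no access; known roles can change up to max_edit."""
    role = ROLES.get(role_name)
    return role is not None and level <= role.max_edit
```

With this mapping, an analyst can read internal data but cannot change it, while a data owner can both read and change restricted data. Keeping view and edit limits separate mirrors the idea that viewing and changing data are regulated independently.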
Are traditional enterprise data warehouses (EDW) dead? Of course not. However, to see any benefits from the business impacts of big data, there’s a need for a new kind of architecture that combines an EDW environment and innovative technologies able to process multi-structured data. For this purpose, an extended data warehouse architecture, or XDW, was introduced. Let’s review its layers and components.
The data layer stores massive amounts of structured and unstructured data. It can be raw data stored on-premises in relational databases, NoSQL databases, distributed file systems, or in the cloud via services like AWS or Microsoft Azure.
The layer can also include real-time streaming data—large stream-processed chunks of data, continuously generated by multiple sources and used in motion through the server (as opposed to data that is first stored and indexed prior to processing). It can be in-app activity, social media sentiment, telemetry from IoT devices, and more.
This layer is used to add data into the data layer. In addition to traditional integration with meticulously designed ETL processes, here it’s possible to use a data refinery. The latter ingests raw structured and unstructured data in batch and in real time from such sources as IoT devices or social media, transforms it into useful information, and feeds it to other XDW components.
A data refinery is used to determine the value of big data. By means of a rough analysis, it’s possible to understand which data is useful and quickly discover interesting data. The process requires flexible data governance as the resulting data may not require integration and quality processing (but flexibility doesn’t exclude security and privacy).
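One simple form of such a rough analysis is a completeness check on a batch of raw records. The sketch below is a minimal illustration under stated assumptions: the function names, the sample field names (`device_id`, `temp`), and the 0.8 threshold are hypothetical, and a real data refinery would apply far richer profiling.

```python
from collections import Counter


def profile_batch(records, required_fields):
    """Return, per required field, the share of records where it is present and non-empty."""
    filled = Counter()
    for record in records:
        for field in required_fields:
            if record.get(field) not in (None, ""):
                filled[field] += 1
    total = len(records)
    return {
        field: (filled[field] / total if total else 0.0)
        for field in required_fields
    }


def looks_useful(profile, threshold=0.8):
    """Treat a batch as worth keeping if every required field is filled often enough."""
    return all(share >= threshold for share in profile.values())


# Hypothetical raw telemetry batch with some gaps.
batch = [
    {"device_id": "a1", "temp": 21.5},
    {"device_id": "a2", "temp": None},
    {"device_id": "", "temp": 19.0},
    {"device_id": "a4", "temp": 20.1},
]
profile = profile_batch(batch, ["device_id", "temp"])
print(profile, looks_useful(profile))
```

A check like this doesn’t clean or integrate anything. It only helps decide quickly whether a source deserves further attention, which is the kind of flexible governance the refinery calls for.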
This is where a traditional EDW sits, taking all data, structuring it into a format suitable for querying through SQL and data warehouse OLAP servers, and pushing it to BI tools. It’s still the best source of clean, reliable, and consistent data for critical analysis in financial or regulatory fields. It’s also the source of data for KPIs and other standard metrics used by various departments within a company.
Investigative technologies, such as Hadoop or Spark, deal with more unusual types of data and various experiments. They explore big data sources and support such analytical methods as data mining, pattern analysis, or even custom investigations. The use scenarios of such technologies vary from simple sandboxes for experiments to full-scale analytical platforms. In any case, they make it possible to analyze large data volumes at high speed and to use this data in an EDW, a real-time analysis engine, or standalone business apps.
Here technologies for data visualization and cloud business intelligence allow data scientists and analysts to explore data, ask it questions, build and interact with visualizations, and more.
Another component is a platform that supports streaming analytics and development of real-time analytical apps. Its application use cases cover fraud detection, traffic flow optimization, risk analytics, etc. The platform is tightly integrated with other components, like EDWs or investigative technologies, to freely transfer data to and from them.
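To give a feel for analyzing data in motion, here is a minimal sketch of a sliding-window check of the kind a fraud detection app might run on each incoming event. The class name, the event shape (an account ID plus a timestamp in seconds), and the limits are assumptions for illustration. The sketch also assumes events arrive in timestamp order, which production streaming platforms can’t take for granted.

```python
from collections import defaultdict, deque


class VelocityCheck:
    """Flags an account that produces too many events within a time window."""

    def __init__(self, max_events=3, window_seconds=60):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # account_id -> recent timestamps

    def observe(self, account_id, timestamp):
        """Record one event; return True if the account now exceeds the limit."""
        window = self._recent[account_id]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        return len(window) > self.max_events


check = VelocityCheck(max_events=2, window_seconds=10)
flags = [check.observe("acc-1", t) for t in (0, 3, 5, 30)]
print(flags)  # [False, False, True, False]
```

Each event is evaluated as it arrives, using only a small amount of per-account state, rather than being stored and indexed first and queried later.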
All these components can’t function in isolation from each other—all of them must be brought together, complemented by data governance.
For one of our data analytics projects, Itransition partnered with a US-based multinational company that provided advanced pharmaceutical analytics and technologies. The customer accumulated over 500 million patient records of more than 50 thousand patents, not to mention petabytes of proprietary data. However, their legacy system limited their ability to get more value out of this growing data, so they approached us to help them create a business intelligence project plan, migrate to the cloud, and improve data management capabilities.
The customer’s data analytics platform comprised a toolset for generating reports based on multiple structured and unstructured data sources. The system couldn’t support the company’s needs and adapt to the changing market, so it required dramatic reengineering and optimization in terms of UI, data processing, and report generation.
Itransition developed a new BI platform on the ASP.NET MVC framework, with Microsoft SQL Server as a database engine, where we delivered the redeveloped functionality, flexibility, and scalability. It resulted in 3-5x faster SQL querying and reduced RAM and CPU usage.
The legacy platform didn’t support multiple data source formats and had an outdated ETL configuration, which slowed down data processing. As a result, data processing could take days, with some sources excluded from processing. Additionally, non-tech users couldn’t participate in ETL processes and needed a user-friendly interface to interact with data.
We developed a data management app and integrated it with several database engines (Oracle, Microsoft SQL) and Apache Hadoop to enable distributed storage and processing of large datasets. It enabled 10x faster data processing and reduced memory and space usage. The app also became accessible to non-tech users, who could visualize data and get reports within minutes. The system was able to process various data sources, transform data, and prepare different output forms, be it databases or files. This way, users were able to deliver prepared data to other destinations, like cloud storage, FTP servers, or other teams.
To maintain high system performance as the number of users kept growing, we initiated the system migration from the on-premises server to the cloud. Our DevOps specialists audited the existing infrastructure and prepared a migration roadmap. We designed a scalable and secure cloud infrastructure and deployed it to AWS. As a result, the customer got a virtual private cloud with private and public subnets, defined network gateways, and fine-tuned security settings.
To ensure security of large volumes of sensitive data, we used Amazon S3. Critical data was backed up via AWS tools. We utilized Amazon RDS to create and save automated backups of database instances. For enhanced security, we used AWS services to store passwords and license codes as encrypted parameters and enabled secure configuration of managed instances and password reset.
Our solution is now used by many leading pharmaceutical corporations, enabling them to handle data of multiple formats from various sources and manage their data assets efficiently and securely with big data governance tools.
Big data is disrupting traditional data management. Taking into account the predictions for the future of big data, enterprises consider it urgent to seek new ways and new technological solutions that can help process large amounts of multi-format data in an efficient and secure way. Big data governance is an essential component of a brand-new approach to data, and it’s important to get it right by means of a tailored framework and infrastructure.