In the last decade we’ve seen explosive growth in data, driven by advances in wireless connectivity and compute capacity, and by the proliferation of Internet of Things (IoT) devices. Data now drives significant portions of our lives, from crowdsourced restaurant recommendations to artificial intelligence systems identifying more effective medical treatments. The same is true of business, which is becoming increasingly data-driven in its quest to improve products, operations, and sales. And there are no signs of this trend slowing down: market intelligence firm IDC predicts the volume of data created each year will top 160 ZB by 2025, 1 a tenfold increase over the amount of data created in 2017.
This enormous amount of data has spurred the growth of data applications—applications that leverage data to create value for customers. Working with large amounts of data is a domain unto itself, requiring investment in specialized platforms to gather, organize, and surface that data. A robust and well-designed data platform will ensure application developers can focus on what they do best—creating new user experiences and platform features to help their customers—without having to spend significant effort building and maintaining data systems.
We created this report to help product teams, most of which are not well versed in working with significant volumes of fast-changing data, to understand, evaluate, and leverage modern data platforms for building data applications. By offloading the work of data management to a well-designed data platform, teams can focus on delivering value to their customers without worrying about data infrastructure concerns.
This first chapter provides an introduction to data applications and some of the most common use cases. For each use case, you will learn what features a data platform needs to best support data applications of this type. This understanding of important data platform features will prepare you for Chapter 2, where you will learn how to evaluate modern data platforms, enabling you to confidently consider the merits of potential solutions. In Chapter 3 we’ll explore design considerations for scalability, a critical requirement for meeting customer demand and enabling rapid growth. This chapter includes examples to show you how to put these best practices into action. Chapter 4 covers techniques for efficiently transforming raw data within the context of a data application and includes real-world examples. In addition to consuming data, teams building effective data applications need to consider how to share data with customers or partners, which you will learn about in Chapter 5. Finally, we will conclude in Chapter 6 with key takeaways and suggestions for further reading.
Throughout this report we provide examples of how to build data applications using Snowflake, a modern platform that enables data application developers to realize the full potential of the cloud while reducing costs and simplifying infrastructure.
The Snowflake Data Cloud is a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. 2 Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Wherever data or users live, Snowflake delivers a single, seamless experience across multiple public clouds.
Data applications are customer- or employee-facing applications that process large volumes of complex and fast-changing data, embedding analytics capabilities that allow users to harness their data directly within the application. Data applications are typically built by software companies that market their applications to other businesses. As you learn about some of the most common use cases of data applications in this chapter, you will get a sense of the breadth of this landscape. Truly, we are living in a time when most applications are becoming data applications.
Retail tracking systems, such as those used by grocery stores to track shopping habits and incentivize shoppers, are data applications. Real-time financial fraud detection, assembly line operation monitoring, and machine learning systems improving security threat detection are all data applications. Data applications embed tools, including dashboards and data visualizations, that enable customers to better understand and leverage their data. For example, an online payments platform with an integrated dashboard enables businesses to analyze seasonal trends and forecast inventory needs for the coming year.
As shown in Figure 1-1, data applications provide these services by embedding data platforms to process a wide variety of datasets, making this data actionable to customers and partners through a user interface layer.
In the following sections we’ll review five of the most common use cases of data applications. For each case we will highlight key data platform considerations, which we will then cover in more detail in Chapter 2.
The use cases we will cover are:
Customer 360
Applications in marketing or sales automation that require a complete view of the customer relationship to be effective. Examples include targeted email campaigns and generating personalized offers using historical and real-time data.
IoT (Internet of Things)
Applications that use large volumes of time-series data from IoT devices and sensors to make predictions or decisions in near real time. Inventory management and utility monitoring are examples of IoT data applications.
Application health and security
Applications for identification of potential security threats and monitoring of application health through analysis of large volumes of current and historical data. Examples include analyzing logs to predict threats and real-time monitoring of application infrastructure to prevent downtime.
Machine learning and data science
Applications focusing on the training and deployment of machine learning models in order to build predictive applications, such as recommendation engines based on purchase history and clickstream data.
Embedded analytics
Data-intensive applications that deliver branded analysis and visualizations, enabling users to leverage insights within the context of the application.
From clickstreams telling the story of how a user engages digitally to enriching customer information with third-party data sources, it is now possible to get a holistic, 360-degree view of customers. Bringing together data on customers enables highly personalized, targeted advertising and customer segmentation, leading to more opportunities to cross-sell and upsell. Both through better understanding of customers and by taking advantage of machine learning, you can create compelling experiences to drive conversion.
The challenge with Customer 360 applications is dealing with the massive amount and variety of data available. Basic data, including contact and demographic information, can be purchased from third-party sources. As this data tends to be stored in customer relationship management (CRM) solutions, it is typically well structured and available either as a point-in-time export or via an API. Interaction data shows how a customer interacts with digital content. This can include tracking interaction with links in marketing emails, counting the number of times a whitepaper is downloaded, and using web analytics to understand the path users take through a website. Interaction data is typically semi-structured and requires more data processing to realize its value.
Realizing value from customer data involves bringing together the various data types to run analysis and build machine learning models. To support these endeavors a data platform needs not only to be able to ingest all these different types of data but also to gain insights from the available data through analysis and machine learning. We will talk about data platform needs in this area in “Machine Learning and Data Science”.
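To make the idea of bringing these data types together concrete, here is a minimal Python sketch. It assumes a structured CRM export and semi-structured interaction events stored as JSON lines; the field names (`customer_id`, `event`, `region`) and the sample records are hypothetical and chosen only for illustration. A real data platform would perform this join at scale, but the shape of the work is similar.

```python
import json
from collections import Counter

# Hypothetical structured CRM export: one well-defined row per customer.
crm_rows = [
    {"customer_id": 1, "name": "Ada", "region": "EMEA"},
    {"customer_id": 2, "name": "Grace", "region": "AMER"},
]

# Hypothetical semi-structured interaction events, one JSON document per line.
# Note that the documents do not all share the same fields.
raw_events = [
    '{"customer_id": 1, "event": "email_click", "meta": {"campaign": "spring"}}',
    '{"customer_id": 1, "event": "whitepaper_download"}',
    '{"customer_id": 2, "event": "email_click"}',
]


def build_profiles(crm_rows, raw_events):
    """Join CRM records with per-customer counts of each interaction type."""
    counts = {}
    for line in raw_events:
        event = json.loads(line)
        counts.setdefault(event["customer_id"], Counter())[event["event"]] += 1
    return {
        row["customer_id"]: {
            **row,
            "interactions": dict(counts.get(row["customer_id"], {})),
        }
        for row in crm_rows
    }


profiles = build_profiles(crm_rows, raw_events)
```

Each resulting profile combines the structured CRM fields with a summary of the semi-structured interaction data, which is the kind of unified view that analysis and model training would start from.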
IoT data applications analyze large volumes of time-series data from IoT devices, sometimes requiring near-real-time analytics. Enabled by the confluence of widespread wireless connectivity and advances in hardware miniaturization, IoT devices have proliferated across multiple industries. From connected refrigerators to inventory management devices and fleets of on-demand bicycle and scooter rentals, the IoT has created an entirely new segment of data, with spending in this sector estimated to have reached $742 billion in 2020. 3
A smart factory offers some good examples of IoT data applications. 4 Real-time sensor data can be transformed into insights for human or autonomous decision making, enabling automated restocking when inventory levels dip below a threshold and visualization of operational status to monitor equipment health.
A theme in IoT applications is the need to both gather data and relay that information to be consumed by larger systems. IoT devices use sensors to gather data which is then published over a wireless connection. We have all experienced the patchy nature of wireless networks, with dropped calls and unreliable internet connections. These problems exist in IoT networks as well, resulting in some data from IoT devices arriving out of chronological order. If an IoT data application is monitoring the health of factory equipment it is important to be able to reconstruct the timeline to detect and track issues reliably.
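One common way to reconstruct a timeline from out-of-order arrivals is to buffer events briefly and release them in event-time order. The following Python sketch illustrates the idea under stated assumptions: each event carries a numeric device timestamp, and `max_delay` is a hypothetical tolerance for how late an event may arrive. It is not a description of how any particular platform handles ordering.

```python
import heapq


def reorder(events, max_delay):
    """Yield (event_time, payload) pairs in event-time order.

    Events wait in a min-heap until the newest event time seen so far is at
    least max_delay ahead of them, then they are released oldest first.
    """
    heap = []
    newest = None
    for seq, (event_time, payload) in enumerate(events):
        # seq breaks ties so payloads are never compared with each other.
        heapq.heappush(heap, (event_time, seq, payload))
        newest = event_time if newest is None else max(newest, event_time)
        while heap and heap[0][0] <= newest - max_delay:
            released_time, _, released_payload = heapq.heappop(heap)
            yield released_time, released_payload
    # The input is exhausted, so flush whatever is still buffered.
    while heap:
        released_time, _, released_payload = heapq.heappop(heap)
        yield released_time, released_payload


# Sensor readings listed in arrival order; the timestamps are out of sequence.
arrivals = [(1, "a"), (3, "c"), (2, "b"), (5, "e"), (4, "d")]
ordered = list(reorder(arrivals, max_delay=2))
```

The trade-off is latency versus completeness: a larger `max_delay` tolerates later arrivals but holds data back longer, and an event that arrives later than the tolerance would still be emitted out of order.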
In addition to supporting semi-structured data and the ability to efficiently order time-series data, a data platform supporting IoT use cases must be able to quickly scale up to service the enormous amount of data produced by IoT devices. As IoT data is often consumed in aggregate, creation of aggregates directly from streaming inputs is an important feature for data platforms as well.
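As an illustration of creating aggregates directly from a stream, the sketch below computes per-device averages over fixed (tumbling) time windows as readings arrive. The tuple layout `(device_id, timestamp_seconds, value)` and the 60-second window are assumptions made for the example.

```python
from collections import defaultdict


def window_averages(readings, window_seconds):
    """Average readings per (device_id, window_start) in a single pass."""
    sums = defaultdict(lambda: [0.0, 0])  # running [total, count] per window
    for device_id, timestamp, value in readings:
        # Align the timestamp to the start of its window.
        window_start = timestamp - timestamp % window_seconds
        acc = sums[(device_id, window_start)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Hypothetical readings: (device_id, timestamp_seconds, value).
readings = [("pump", 0, 10.0), ("pump", 30, 20.0), ("pump", 65, 30.0)]
averages = window_averages(readings, window_seconds=60)
```

Because only a running total and count are kept per window, the raw readings do not need to be retained to serve the aggregate.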
It comes as no surprise that as the volume of data has grown rapidly, so has the ability to leverage data science to make predictions. From reducing factory downtime by predicting equipment failures before they occur to preventing security breaches through rapid detection of malicious actors, data science and machine learning have played a significant role across many industries. 5
As with the Customer 360 use case, data applications leveraging machine learning require ingestion of large amounts of different types of data, making support for data pipelines essential. Efficient use of compute resources is also important, as generating predictions from a machine learning model can be extremely resource-intensive. The elasticity of cloud-first systems (discussed in Chapter 2) can ensure that expensive compute resources are provisioned only when needed.
The development process for machine learning can benefit from significant amounts of data to construct and train models. A data platform with the ability to quickly and efficiently make copies of data to support experimentation will increase the velocity of machine learning development.
For data science and analysis, a data platform should support popular languages such as SQL to provide direct access to underlying data without the need for middleware to translate queries. External libraries for data analysis and machine learning can greatly streamline the process of building models, so support for leveraging third-party packages is also important.
Application health and security data applications analyze large volumes of log data to identify potential security threats and monitor application health. Many new businesses have been formed specifically around the need to process and understand log data from these sources. These businesses turn log data into insights for customers through application health dashboards and security threat detection. In the security domain, machine learning has improved malware classification and network analysis. 6
The ability to rapidly act on data is a critical feature of data applications in this area. Thus, real-time, fast data ingestion is a key requirement for data platforms supporting application health and security applications. Delays in surfacing data for analysis represent time lost for identifying and mitigating security issues. Often, triage involves looking back to observe events that led up to a security incident. Being able to time travel and observe data in a previous state can help piece together what led to a security breach.
Much of the data related to application health and security comes from log files. These can take up a significant amount of space, especially if you want to be able to time travel to previous versions. The ability to cheaply store this data while enabling analysis is another important data platform feature in this space.
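Conceptually, observing data in a previous state can be thought of as replaying a history of changes up to a chosen moment. The Python sketch below shows that idea with a hypothetical in-memory change log of `(timestamp, key, value)` entries, where a value of `None` marks a deletion. It is only a mental model and makes no claim about how any data platform implements time travel.

```python
def state_as_of(change_log, as_of):
    """Rebuild key/value state by replaying changes made at or before as_of."""
    state = {}
    for timestamp, key, value in sorted(change_log, key=lambda change: change[0]):
        if timestamp > as_of:
            break
        if value is None:
            state.pop(key, None)  # a None value records a deletion
        else:
            state[key] = value
    return state


# Hypothetical configuration history leading up to an incident.
change_log = [
    (1, "firewall_rule", "allow"),
    (5, "firewall_rule", "deny"),
    (7, "admin_user", "enabled"),
]
before_change = state_as_of(change_log, as_of=4)
after_change = state_as_of(change_log, as_of=7)
```

Retaining the full history is what makes this kind of look-back possible, which is also why storage cost matters for these workloads.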
The value of data applications in this space lies not only in enabling rapid identification of issues, but also in the ability to act on findings when they occur. Integrating data applications with ticketing and alerting systems will ensure customers are notified in a timely fashion, and further integration with third-party services will allow for direct action to be taken. For example, if a data application monitoring cloud security identifies an issue with a compute instance, it could terminate it and then send an alert to the team indicating that the issue has already been taken care of.
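The act-then-notify flow in that example can be sketched in a few lines of Python. The function name, the shape of the `finding` dictionary, and the two callables are hypothetical; in practice they would wrap a cloud provider's API and an alerting or ticketing integration.

```python
def handle_finding(finding, terminate_instance, send_alert):
    """Act on a security finding first, then tell the team what was done."""
    instance_id = finding["instance_id"]
    terminate_instance(instance_id)
    send_alert(
        f"Instance {instance_id} terminated after finding: {finding['issue']}"
    )


# In-memory stand-ins for the real integrations, used here for illustration.
terminated, alerts = [], []
handle_finding(
    {"instance_id": "i-123", "issue": "open admin port"},
    terminate_instance=terminated.append,
    send_alert=alerts.append,
)
```

Passing the integrations in as callables keeps the decision logic separate from any specific ticketing, alerting, or cloud service.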
Customers rely on data from the applications they use to drive business decisions. Embedded analytics refers to data applications that provide data insights to customers from within the application. 7 For example, a point of sale application with embedded demand forecasting provides additional value to customers beyond the primary function of the application. Leveraging application data to provide these additional services enables companies building data applications to generate new revenue streams by selling these extended services and thereby differentiate themselves from competitors.
Without embedded analytics, application users are limited in the value they can get from their data. They may request exports of their data, but this is inferior to an embedded experience due to the loss of context when data is exported from an application. Application users then must interact with multiple systems: the data application and third-party business intelligence (BI) and visualization tools. Customers must also contend with the additional cost and delay of storing and processing exported data. Instead, a data platform that supports embedding of third-party tools for data visualization and exploration will enable users to stay within the data application. This lets them work with fresh data and reduces overhead in supporting exports from the data application.
Because customers access embedded analytics on demand, it is not easy to predict usage. An elastic compute environment will ensure that you can deliver on performance service-level agreements (SLAs) during peak load, with the added benefit that you will not pay for idle resources when load subsides. Data platforms that can scale up and down automatically to meet variable demand patterns will offload this burden from the data application team. You will learn more about different approaches for scaling resources in Chapter 3.
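A simple way to picture automatic scaling is as a rule that sizes compute to current demand within fixed bounds. The sketch below is an illustrative policy only: the function name, the per-cluster capacity, and the minimum and maximum limits are all assumptions, and real platforms apply their own scaling logic.

```python
import math


def clusters_needed(concurrent_queries, queries_per_cluster, min_clusters, max_clusters):
    """Size compute to demand, clamped between a minimum and a maximum."""
    needed = math.ceil(concurrent_queries / queries_per_cluster)
    return max(min_clusters, min(max_clusters, needed))


# Hypothetical capacity of 8 concurrent queries per cluster, 1 to 4 clusters.
idle = clusters_needed(0, 8, min_clusters=1, max_clusters=4)
busy = clusters_needed(20, 8, min_clusters=1, max_clusters=4)
peak = clusters_needed(100, 8, min_clusters=1, max_clusters=4)
```

The lower bound keeps the application responsive when the first request arrives, while the upper bound caps spend during unexpected spikes.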
Data platforms that support embedded analytics applications need support for standard SQL and the ability to isolate workloads. Support for standard SQL will enable embedding of popular BI tools, reducing demand on product teams to build these tools in-house. The ability to isolate workloads from different customers is important to prevent performance degradation.
Data applications provide value by harnessing the incredible amount and variety of data available to drive new and existing business opportunities. In this chapter we introduced data applications and five major use cases where data applications are making a significant impact: Customer 360, IoT, application health and security, machine learning and data science, and embedded analytics.
With an understanding of the key requirements in each use case, you are now ready to learn what to look for when evaluating data platforms.
6 For more on this topic, see Machine Learning and Security by Clarence Chio and David Freeman (O’Reilly).
7 For more on how Snowflake addresses the challenges product teams face when building embedded analytics applications, see the “How Snowflake Enables You to Build Scalable Embedded Analytics Apps” whitepaper.