The notion of data visualization has been around for years. Visualization provides a means by which you can gain insight into your data. For some, visualization can be a report that appears in their email inbox or on a web page. For others, it could be a grid of numbers with a chart in Excel. The key is to provide the correct, most efficient approach to allow users to view what there is to see.
Grids (such as tables and spreadsheets) are a simple and very powerful form of visualization, but when it comes to presenting results, they tend to take up a lot of real estate—so much real estate that you may need to scroll through subsets at a time. Because scrolling can be very time-consuming and tedious, important information may be overlooked, if ever even seen. Other kinds of visualization can enhance users’ ability to interpret high volume. For example, the scatter plot is ideal for revealing patterns and anomalies in large data sets.
Here, we will go into some detail on visualization—starting with the limitations of traditional grids for querying and reporting, and then showing some effective visualization means you might consider.
Limitations of the Grid Format
Not so long ago, people would run a report per a set of parameters and sift through the details. In some instances, data would be transferred to a spreadsheet for some additional analysis. Once in a spreadsheet, consumers could apply graphs to spot trends or color-coding to find outliers. With smaller amounts of data, these tasks were important, but fairly basic.
Many analyses can produce very large sets of data. Analysts often miss important information, because the sheer amount of data can be overwhelming. Thus, decision making suffers. Spreadsheets exacerbate this issue. Though spreadsheets are very user-friendly, the ability to store millions of cells by row and column inhibits the ability to see the entire data set at once.
It is far easier to spot trends, patterns, anomalies, and so on when you can examine the data in its entirety in a single view. If the data set is so large that you find yourself scrolling up and down and side to side, your analysis will suffer. It is virtually impossible to remember all of the data as you scroll. Enter the scatter plot.
From Grid to Scatter Plot
To meet the challenge posed by larger data sets, software vendors have begun creating new ways to query data that promote better visualization. Here, we compare results in a grid format with the same data in a scatter plot. The demonstration starts with a small data set made up of sales and marketing data for a year. After you see how that data set maps to a scatter plot, we add the Market and Product dimensions to increase the size of the data set. Important information, hidden in the detail of a grid, becomes readily apparent in the scatter plot.
Let’s get started. The above graphic shows results in a simple grid. In this case, Smart View was used to query an Essbase database and populate an Excel spreadsheet. Grids show values explicitly. For example, cell B2 reflects the value associated with Jan Sales: $31,538. The graphic below presents the same data in a scatter plot.
So, what is the difference? Well, instead of the 24 numbers shown in the grid, we see 12 data points in the scatter plot. The scatter plot still reflects all 24 values, except that the data is plotted by sales and marketing. This is a subtle but extremely powerful difference. Also, instead of showing the numeric value in each cell, a plot point represents the value as an intersection with each axis. One drawback is that, initially, it may be difficult to discern which data point is which. We will demonstrate techniques to address this issue in the next section. But first, let’s review how the scatter plot works. Then we will look at larger examples using the same size scatter plot object.
Creating plots in a scatter plot is a matter of simple geometry. In our example, each data point reflects a given month. So, to find the data point for sales and marketing for January, locate 31,538 on the x axis (Sales) and draw a line upward, parallel to the y axis. Locate 5,223 on the y axis (Marketing) and draw a line to the right, parallel to the x axis. As shown in the graphic below, the data point is found at the intersection of the two lines. The lines are sometimes referred to as drop lines.
Now that you have the basics, let’s expand the data set to include market data. The graphic below shows that as the domain size increases, the grid takes up more space. For each market, we need two rows by 12 columns to present 24 values.
With 20 markets, we need 40 rows by 12 columns to present 480 values. In a scatter plot, the 20 additional markets are presented as 240 marks in the same space used before we introduced the additional markets, as shown below.
When we add the Product dimension with its 12 members to the mix, the spreadsheet gets even larger and more unwieldy, while the scatter plot simply adds more data points. The graphic below shows the scatter plot. (The spreadsheet version is too large to show.)
It may be useful to look at the progression of the scatter plots we have discussed. The graphic below demonstrates the increased data points plotted as we add 20 markets, followed by 12 products. Throughout, the footprint used remains the same. In contrast, the number of rows in the spreadsheet increases tremendously as we add markets and products.
More interesting though is the information that can be gleaned by a quick glance. The rightmost scatter plot in Figure 6-33 has some stunning information. What appeared to be a consistent pattern has been broken, yielding very interesting data in both the upper-left and lower-right portions of the result. We refer to these anomalies as outliers , and their location in the scatter plot is critical. In this case, some of the outliers are good news and some are potentially bad news.
Considering that our analysis compares sales to marketing, the outliers found at the lower right are good news. These data points represent our ability to sell with little to no marketing expense. In contrast, the outliers found at the upper left warrant further investigation, because the opposite is true—the company is spending money on marketing with little return in sales. These outliers may be understandable; for example, a new product entering the market could have this kind of profile. If the product has been around for some time, however, this result is cause for attention.
Improving the Presentation of Data in a Scatter Plot
One challenge presented by the scatter plot is visibility. For example, it is hard to tell which market-by-product-by-time combination is associated with a given data point. To solve this problem, we can associate dimensional information with any combination of color, shape, size, and/or text. Color, shape, and size generally provide the most value. Text is nice to use with smaller sets of data. A legend is also important to aid comprehension. Another useful technique is to make the values along the axes suit the range of values contained in the chart. Let’s take a look at a few examples.
The graphic above presents a side-by-side view of our very first scatter plot—the one showing sales and marketing by month—with text labels added to each data point. The left chart reveals that, because the data points are clustered so close together, much of the text cannot be displayed. The solution, as shown in the right chart, is to use more appropriate axes values in order to provide the space necessary to display the text labels.
As you might guess, the challenge with text for larger data sets is that text uses a good deal of space. In essence, for larger data sets, the problem is the same one that led to using a scatter plot in the first place: lack of real estate. So, let’s look at some other techniques for improving comprehension using larger data sets.
In the next example, we turn our attention away from months and focus on market. The graphic below contains the scatter plot originally shown above (sales and marketing by month by market by product) with color assigned to the Market dimension. A legend maps a color to each of the markets.
We can quickly determine a pattern in the data after color is applied. The cluster of good outliers (lower right) is from Massachusetts, while the two clusters of poor outliers are from Nevada (leftmost) and New York State, respectively. We have figured out that our problems are isolated to particular markets. This is very helpful information that would be a challenge to discover using a spreadsheet.
Now, by applying shape to product data in the graphic below, we can see other patterns emerge. Particular products are associated with the good and poor outliers.
Referring to the legend, which contains a subset of information, we discover that Cola is the product that is selling very well in Massachusetts (good outlier). The poor performing product is Old Fashioned in New York State. With the information gleaned from increased visibility, we can contact the product managers for those markets and determine a course of action.
Up to this point, we have used two measures, Sales and Marketing, on the x and y axes. Each data point represents the intersection of the values associated with these measures in terms of months, market, and products. We then highlighted markets with color and products with shapes. Now we are going to layer on yet another dimension member: ending inventory. We will use size to represent the value of the ending inventory—the larger the plot point, the greater number of units in inventory. The graphic below confirms that in the poorest performing area (upper left), the data points are indeed larger than those found in the best performing area (lower right).
From the above graphic, we can conclude that we may be able to apply production capacity from the products at the upper left to the more profitable products at the lower right. After all, we are producing more product than we can possibly sell. Also, if this excess inventory could be sold in the markets found at the lower right, a simple distribution change would help. In either case, having the information quickly provides us with the power to make a better decision.
The preceding examples demonstrated some basic functionality shared by many different visualization tools. We showed how properties such as shape, color, size, and text can be used to identify the data points and reveal patterns within the data. Most of what we demonstrated had to do with outliers. These values are simply anomalies that appear outside a typical pattern. Visualizations such as the scatter plot are great for this type of analysis.
Other Types of Visualizations
In general, visualizations should provide the most appropriate method for presenting the data, given the analytic need. Our intent was to present an alternative approach to the grid and chart-based methods. If you are simply after a report comparing this year to last, a grid or bar chart may fit the need. However, when heavy data volumes or no clear-cut output is needed—meaning the exercise is more analytic than static—other visualizations are a better fit.
- Many other types of visualizations are possible, including the following:
- Bar charts are a great way to compare data across categories or to break data down into stacked bars.
- Text tables (also called cross-tabs or pivot tables) provide an easy way to display the numbers associated with categorical data.
- Line charts connect individual data points in a data view. They provide a simple way to visualize a sequence of values, and they are especially useful when you want to see trends over time.
- Heat maps are a great way to compare categorical data using color. They are typically constructed as a table using colored squares to represent the data and a continuous range of colors. Heat maps allow you to see variations in the data via variations in color.
- Gantt charts are typically used when you want to display the duration of one or more categories of interest against the progression of time.
In this writing, we have covered the wide variety of functionality that you may want in a visualization tool. Whichever tools you choose, ensure that you are getting the features you need to produce the types of ad hoc reports and custom reports that your users want.