What is data exploration?
Data exploration involves looking at different data sets to identify and catalog their key characteristics. It's the critical first step in full-fledged data analysis, before the data is run through a model—as such, it's sometimes called exploratory data analysis (EDA).
Exploration of a data set is ultimately done by a data analyst or data scientist, but analytics tools help make the process more efficient and dynamic.
Why is exploration important to data analysis?
Data exploration allows data analysts to develop a general but valuable understanding of individual data sets before delving into the true nitty-gritty of analysis and interpretation. Factors examined include size, accuracy, the presence of patterns, and correlation to other relevant data trends or data sets.
Insights made during data exploration help determine the value of recently ingested data sets and identify relationships that may be connected to data analytics trends.
For example, if an analyst detects a pattern in a data set that seemingly indicates the growing profitability of a product, the analyst knows that the set merits the in-depth examination that's possible with advanced analytics. By contrast, a data set full of missing values, duplicates, or redundancies isn't ready for the analytics process.
Key purposes of data exploration include the following:
Examining categorical variable characteristics
Imagine a data set containing details of various laptop computers. Each feature gets its own categorical column: brand name, size, color, processor manufacturer, hard drive space, screen type, and so on. The categories with the lowest number of unique values are processor manufacturer and color.
Just like that, the exploration techniques of unique value count and frequency count have identified that laptops often come in shades of gray or black and are commonly powered by either Intel or AMD processors. On the other hand, there are dozens of different brand names, giving this category the most unique values. Taken together, this suggests branding is the biggest distinguishing factor between laptops—albeit not definitively, as that will require deeper analysis.
Most large data sets will be far more complex than the hypothetical above, potentially featuring hundreds or even thousands of distinct categorical columns. Modern enterprises often use tools driven by artificial intelligence (AI) or machine learning (ML) that aid the exploration of these massive data volumes. But the purpose of this part of the exploration process remains the same regardless of scope—searching for variables that stand out for various reasons.
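As a concrete sketch of the unique value counts and frequency counts described above, the same checks can be written in a few lines of Python. The laptop records, brands, and column names here are invented toy data, not from any real catalog:

```python
from collections import Counter

# Hypothetical laptop records for illustration only
laptops = [
    {"brand": "Aurora", "color": "gray", "cpu": "Intel"},
    {"brand": "Borealis", "color": "black", "cpu": "AMD"},
    {"brand": "Cirrus", "color": "gray", "cpu": "Intel"},
    {"brand": "Drift", "color": "silver", "cpu": "Intel"},
]

def unique_value_count(rows, column):
    """Number of distinct values in a categorical column."""
    return len({row[column] for row in rows})

def frequency_count(rows, column):
    """How often each value appears in a column."""
    return Counter(row[column] for row in rows)

print(unique_value_count(laptops, "brand"))  # 4 (every brand is distinct)
print(frequency_count(laptops, "cpu"))       # Counter({'Intel': 3, 'AMD': 1})
```

In this toy sample, "brand" has the most unique values and "cpu" the fewest, mirroring the pattern the article describes.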
Finding correlations, anomalies, and more
Correlations—situations in which a variable's behavior depends on another variable—are among the most frequently examined aspects of data sets in data exploration. These key relationships in data sets may eventually illuminate greater truths about the business.
Returning to the laptops data set, imagine that it also contains sales and customer data. It reveals that sales of laptop models with hard drive space under 250 GB are highest among customers between the ages of 22 and 28, and that sales figures in this demographic drop for models with greater disk space. That correlation doesn't illustrate these buyers' preferences as a whole—because correlation doesn't imply causation—but knowing about it will help direct further analysis.
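The strength of a relationship like this one can be quantified with a correlation coefficient. Here is a minimal sketch that computes Pearson correlation from scratch; the disk sizes and sales figures are hypothetical numbers chosen only to illustrate a negative correlation:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented figures: hard drive size (GB) vs. units sold to 22-28-year-olds
disk_gb = [128, 250, 500, 1000]
units   = [900, 700, 400, 150]
print(round(pearson(disk_gb, units), 2))  # strongly negative
```

A coefficient near -1 flags the inverse relationship; as the article notes, it directs further analysis rather than proving causation.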
Other important pieces of information that can be found during data exploration include the following:
- Homogeneity of variance: Determining whether your data set has homogeneity of variance—i.e., the variances of independent groups on continuous variables are very similar or equivalent—is a good gauge of its reliability. If the set lacks this quality, it's probably due to outliers.
- Outliers: These values will be either notably larger or smaller than the majority of entries in their categories. Finding outliers in the exploration phase is critical. They may adversely affect data modeling if they stem from errors in data collection, leading to inaccurate conclusions. The presence of outliers also lets analysts know that comprehensive analysis of the data set may be a more complex task than they originally expected when data exploration started.
- Missing values: It's important to find null values in a data set before putting the set through any sort of statistical model, and then work to recover or account for the missing data. If left unaddressed, missing values will damage the credibility of reporting and prevent analysts and other business users from deriving actionable insights from the data.
- Skewing: In a skewed data distribution, there will be some significant deviation from normal distribution—e.g., a set with a mean value notably larger or smaller than its median or mode. Skewing within a data set is more troublesome than a small handful of outliers, and may require running a logarithmic transformation to identify patterns and put analysts back on the road to finding valuable insights.
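The mean-versus-median skew check from the last bullet can be sketched in a few lines of Python. The prices below are hypothetical, deliberately right-skewed values:

```python
from math import log
from statistics import mean, median

# Invented right-skewed prices: most values are low, a few are very large
prices = [300, 320, 350, 400, 450, 500, 2500, 4000]

print(mean(prices), median(prices))  # mean far above median signals right skew

# A logarithmic transformation compresses the long tail
log_prices = [log(p) for p in prices]
print(round(mean(log_prices) - median(log_prices), 2))  # gap shrinks markedly
```

After the log transformation, the mean and median sit much closer together relative to their scale, which is why the transformation helps reveal patterns hidden by skew.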
Bringing data to life with visualization
Data visualization has always been crucial to data exploration. "Visualization" can refer to scatter plots and histograms—i.e., graphs—or cutting-edge interactive visuals. The vast majority of data analytics platforms feature a visualization tool as one of their key selling points.
Visualization brings data to life using shapes and imagery that the human brain instinctively understands more readily than the rows of values in a basic structured data table. Analysts and data scientists can more quickly identify patterns and outliers, and get to the more in-depth phases of data modeling and analytics faster.
Data visualization also helps non-expert business users more readily embrace the value of analytics without requiring significant training. Innovations in this and related areas will play notable roles in the future of data analytics. Examples include real-time visualization, which updates charts as new data arrives, and AI- or ML-driven visualization tools that can generate graphics from spoken or typed natural language queries.
Critical data exploration methods
Although many data exploration processes are automated within a typical data exploration tool or platform, it's still worthwhile to examine their core tenets. These are some of the most common methods.
Univariate, bivariate, and multivariate analysis
These methods involve investigating variables individually, in pairs, or across multiple categories. The goal of univariate analysis is to determine the spread of a continuous variable or the distribution of values in a categorical one. Meanwhile, bivariate and multivariate analysis use relationships between variables to look for correlations that support probability inferences.
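A minimal sketch of these ideas in plain Python, using invented laptop values: univariate analysis summarizes one column at a time, while a simple cross-tabulation is a basic form of bivariate analysis:

```python
from collections import Counter
from statistics import mean, stdev

# Univariate: spread of one continuous variable (hypothetical prices)
prices = [799, 899, 999, 1099, 1299, 1499]
print(mean(prices), stdev(prices))

# Univariate for a categorical variable: a frequency table
colors = ["gray", "black", "gray", "silver", "gray"]
print(Counter(colors))

# Bivariate: cross-tabulate two categorical variables by counting pairs
rows = [("Intel", "gray"), ("AMD", "black"), ("Intel", "gray"), ("Intel", "black")]
crosstab = Counter(rows)
print(crosstab[("Intel", "gray")])  # 2
```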
Missing value and outlier treatments
Both of these data exploration techniques focus on addressing deviations from a data set's norm.
Missing values can be filled in with imputation—using the mean or median value in a category to project what a missing figure might have been. Alternatively, portions of the data with a significant number of missing values can be deleted, an approach most often seen when there's no clear pattern to identify which values are absent—and deletion wouldn't violate an organization's data management policies.
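Median imputation, as described above, can be sketched in a few lines; the RAM column below is hypothetical, with None standing in for missing entries:

```python
from statistics import median

# Invented column with missing entries (None)
ram_gb = [8, 16, None, 32, 8, None, 16]

observed = [v for v in ram_gb if v is not None]
fill = median(observed)  # median is more robust to outliers than the mean
imputed = [v if v is not None else fill for v in ram_gb]
print(imputed)  # [8, 16, 16, 32, 8, 16, 16]
```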
The right approach to outliers depends on what caused them. For example, if it becomes clear there were errors during data collection or extraction, deleting outliers may be the best move. Otherwise, outliers can be imputed, transformed through logarithms or binning, or treated separately from the original data set.
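One common convention for flagging outliers, not prescribed by the article itself, is the 1.5 x IQR fence; clipping values to the fence is one of the transformations mentioned above. The values here are invented, with one deliberate entry error:

```python
from statistics import quantiles

values = [450, 500, 520, 540, 560, 580, 600, 9500]  # 9500 looks like an entry error

# quantiles(n=4) returns the three quartile cut points
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < low or v > high]
clipped = [min(max(v, low), high) for v in values]  # treat by clipping to the fences
print(outliers)  # [9500]
```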
Histograms
Although one of the simplest data exploration techniques, histograms are still immensely valuable—they're just more likely to be computer-generated than hand-drawn these days. They allow for deep dives into individual categories within a data set and are particularly useful for finding skewed distributions.
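At its core, a histogram is just a count of values per bin. This sketch builds a text histogram of hypothetical prices with $500-wide bins; a plotting library would draw the same counts graphically:

```python
from collections import Counter

# Invented prices; the long upper tail will show up as sparse high bins
prices = [350, 420, 480, 510, 560, 620, 700, 1450, 2890]

# Bucket each price into a $500-wide bin, then count per bin
bins = Counter((p // 500) * 500 for p in prices)
for start in sorted(bins):
    print(f"{start:>5}-{start + 499}: {'#' * bins[start]}")
```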
Pareto analysis
Sometimes called the "80-20 rule," Pareto analysis in data exploration involves looking at where the majority of values in a category lie, split into an 80% segment and a 20% segment. The 80% represents the set's most common values, while the 20% represents its uncommon values.
Returning once more to the laptops data set, a retailer's data team could use Pareto to determine that the majority of laptops—the 80%—are priced under $1,500. This tells analysts many things, notably that they should look at commonalities among the products in the 20% segment.
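The 80-20 split described above can be sketched directly: sort the category and separate the bottom 80% of entries from the top 20%. The prices are invented for illustration:

```python
# Hypothetical laptop prices, sorted ascending
prices = sorted([400, 450, 500, 600, 700, 800, 900, 1100, 2200, 3500])

cutoff_index = int(len(prices) * 0.8)  # the cheapest 80% of models
majority = prices[:cutoff_index]
tail = prices[cutoff_index:]
print(max(majority), tail)  # the 20% tail is where the premium models sit
```

Here every model in the majority segment falls under $1,500, so, as in the article's example, the interesting follow-up question is what the premium 20% have in common.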
Data management and analytics tools for better exploration
Optimal data exploration doesn't only depend on obtaining the right data sets and knowing the best exploratory analysis techniques. It's also critical to have the right data management and analytics solutions, as they'll be important during exploration as well as more advanced steps of the process.
A platform like Teradata Vantage with data warehousing and advanced analytics capabilities will be particularly valuable for properly exploring structured data sets. Vantage can ingest and integrate data from any source within the enterprise, whether cloud-hosted or on-premises, regardless of format. Additionally, the solution's Analytics Library package has built-in functions for data preparation, exploratory analysis, hypothesis testing, descriptive stats, and other critical aspects of data exploration.
To learn more about Vantage's flexibility for data exploration projects, check out the video tutorials for Analytics Library. Or find out about how Vantage helps data scientists and data analysts use Python and Jupyter Notebook more effectively.
Learn more about data exploration