When dimensions are created to capture data using Whitelist + Observed Values, the volume of data can grow without limit, which can result in a very large number of individual dimension values or of combinations of values with other dimensions.
- Data management issues are most acute for Whitelist + Observed Values dimensions.
- Because Whitelist + Observed Values is the easiest way for a dimension to immediately gather values without additional configuration, Whitelist + Observed Values is the default method of recording for dimensions. As a result, however, data management issues can occur for any newly created dimension.
- In addition to data management issues, a high number of dimension values can become a bottleneck in the Data Service and the Portal, which attempt to pass these values upon request.
High-volume dimensions
Data management of dimensions is especially important for dimensions that capture a high volume of values, such as URL. Implementation of specific instructions for managing URL and other high-volume dimensions can help to prevent runaway database growth.
Storage of dimension values
When dimensions are created, by default they are defined with the following settings:
- Values to Record: Whitelist + Observed Values
- Max Values Per Hour: 1000
The above configuration means the following:
- Any observed value is saved to the dimension and thus is stored in the database.
- For each Canister, up to 1000 unique values can be detected and recorded per hour. The set of unique values is cleared each hour.
Under these settings, data volumes can grow very quickly for the following reasons:
- The uniqueness of values is reset each hour. Once per hour, the observed values are collected from the Canister and written into the database, and the list of values that are known to the Canister is cleared. As a result, the total number of values can be greater than the limit of unique values per hour. For example, if the limit is 1000 unique values per hour, and hour 1 produces 1000 unique values while hour 2 produces a completely different set of 1000 unique values, 2000 unique values are recorded.
- The limit to the number of values is applied per Canister. Each Canister is permitted to capture 1000 unique values per hour. In an environment with 20 Canisters, the maximum potential number of values is 20,000 per hour by default. While it is unlikely that each Canister captures 1000 unique values, it is important to remember that the data grows based on the number of Canisters capturing values.
- The default capture limit may not seem to be large, but over time it can be. In a one-Canister environment that is configured to capture 1,000 unique values per hour, the total number of values that could be captured in a day is 24,000 values. The number of values that are written to the database is then multiplied by the number of Canisters in the environment.
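The arithmetic behind these three points can be checked with a short sketch. It is a worst-case upper bound only, and it assumes that every Canister reaches its hourly limit and that no value repeats across hours or Canisters. The function name `max_recorded_values` is hypothetical and is not part of any Tealeaf interface.

```python
def max_recorded_values(max_values_per_hour: int, canisters: int, hours: int) -> int:
    """Worst-case count of values written to the database.

    Assumes every Canister hits its hourly limit and that no value
    repeats across hours or Canisters (an illustrative upper bound).
    """
    return max_values_per_hour * canisters * hours


# Default limit from the text: 1000 unique values per Canister per hour.
print(max_recorded_values(1000, canisters=1, hours=24))   # 24000
print(max_recorded_values(1000, canisters=20, hours=1))   # 20000
print(max_recorded_values(1000, canisters=20, hours=24))  # 480000
```

Real environments are unlikely to reach this bound, but the product of the three factors shows why a limit that looks small per hour can still produce large data volumes.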
The above data management issues are most significant in dimensions that are configured to use Whitelist + Observed Values. Since observed values are typically unfiltered or are highly dynamic, maximum permitted values for each hour can be reached quickly.
Note: When you create a dimension, immediately convert it to a Whitelist Only dimension and then use the recommended workflow to populate the dimension. This workflow is especially important for high-volume dimensions.
Recommended workflow
For high-volume dimensions, the following section provides a recommended workflow for populating the dimension with a data set that maintains data integrity while limiting database growth.
- If possible, validate the data before creating the dimension.
For some dimensions, the data is already recorded in the request. For example, the TLT_URL value is automatically inserted by the Tealeaf Reference session agent, which is included and enabled in the default pipeline configuration. URL normalization is enabled by default, too.
For other high-volume dimensions that extract from request or response data, you may want to verify that the data is being appropriately captured in a session through replay before you create the dimension. For example, you can search for specific event values or indexed request/response data.
If the values do not appear to be recorded properly, check if they are being inserted by Tealeaf or your web application:
- If the data is inserted by Tealeaf, verify that the appropriate component is inserting the data. Data may be inserted by the PCA, the Canister, or an event that is defined in the Event Manager.
- If the data is inserted by your web application, verify the data with your web development team.
- Create the dimension.
- Make sure to set the Values to Record to be Whitelist Only.
- You might want to adjust the Max Values Per Hour as needed.
Processed values include whitelisted values, which also count against this limit. Blacklisted values do not count.
Note: For testing purposes, you may want to add this dimension to a report group that is associated with an event that occurs in each session. Later, through the Report Builder, you can create a simple report with the event + dimension combination to review the captured values.
- Enable logging of values for the dimension. Dimension logging enables the capture of observed values for purposes of downloading and creating your whitelist. These values are captured in logs that are stored in the database, which are automatically cleared after a period of days.
- Let the log fill with a sufficient volume of values to be a meaningful cross-section of activity. For a high-volume dimension, you may have a representative data set by waiting a single hour.
Note: A downloaded log file can contain up to the top 250,000 values by occurrence over the duration that they were collected in the logs.
- Edit the log values to be your first pass at the whitelist.
- Download the logged values to your local desktop.
- Load the values into Microsoft™ Excel. Sort them based on the occurrences.
- Decide how many of the top values to insert into your whitelist, then copy and paste these values to a separate XLS sheet.
Retain the file that you used to upload for recordkeeping.
Note: A whitelist can contain up to 5,000 values.
- Load the values into your whitelist through the Dimension editor.
- Monitor the captured values.
- After you have loaded the dimension values into the whitelist, all subsequent observed values are checked against the whitelist.
- If the Max Values Per Hour limit is exceeded, an instance of the [Limit] value is recorded for the dimension.
- If an observed value does not appear in the whitelist and the Max Values Per Hour limit was not exceeded, an instance of the [Others] value is recorded for the dimension.
- Through the Report Builder, create a report:
- Add an event that occurs each session.
- Add the dimension, which should be available if you added it to a report group associated with the event.
- Each hour, you can track the count of occurrences of the [Others] and [Limit] values.
- Periodically, you should download a new set of log values and compare it to the set that you saved.
- Look for logged values that have a number of occurrences greater than 1 and that do not appear in the whitelist. These values should be added.
- Look for values in the whitelist that do not appear in the set of logged values. These values should be removed.
- In Microsoft Excel, the VLOOKUP function can be used to check the contents of one worksheet against another. For more information, see the documentation available inside Microsoft Excel.
Note: If there are significant changes to your web application, your dimension whitelists are likely to need rebuilding. Contact your web application development team for details on the changes.
When the values appear to stabilize, you can turn off logging of values.
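As an alternative to a spreadsheet comparison, the first-pass selection and the periodic comparison described in the workflow could be scripted. The following is a minimal sketch, not a Tealeaf tool. It assumes that each downloaded log has already been reduced to a mapping of value to occurrence count, since the format of the downloaded file is not described here. The function names and the sample URLs are hypothetical.

```python
WHITELIST_MAX = 5000  # whitelist size limit stated in the workflow notes


def first_pass_whitelist(log_counts: dict[str, int], top_n: int) -> list[str]:
    """Return the top_n logged values by occurrence, capped at the whitelist limit."""
    # Sort by descending occurrence count; break ties alphabetically.
    ranked = sorted(log_counts.items(), key=lambda item: (-item[1], item[0]))
    return [value for value, _ in ranked[:min(top_n, WHITELIST_MAX)]]


def compare_whitelist(whitelist: list[str], log_counts: dict[str, int]) -> tuple[list[str], list[str]]:
    """Return (values to add, values to remove) using the two checks in the workflow."""
    current = set(whitelist)
    # Logged more than once but missing from the whitelist: candidates to add.
    to_add = sorted(v for v, n in log_counts.items() if n > 1 and v not in current)
    # Whitelisted but absent from the new log: candidates to remove.
    to_remove = sorted(v for v in current if v not in log_counts)
    return to_add, to_remove


# Illustrative data only; these values are made up.
saved = {"/home": 900, "/cart": 400, "/old-promo": 35, "/search": 3}
whitelist = first_pass_whitelist(saved, top_n=3)
print(whitelist)  # ['/home', '/cart', '/old-promo']

newer = {"/home": 950, "/cart": 380, "/checkout": 120, "/typo-page": 1}
print(compare_whitelist(whitelist, newer))  # (['/checkout'], ['/old-promo'])
```

When successive comparisons return two empty lists, or nearly so, the whitelist is likely stable enough that logging can be turned off.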