Passive Capture is the process by which Tealeaf software captures the data that flows between your visitor's computer and your web servers.
The following terms apply to the passive capture process:
- Switch
- The switch is a hardware device that routes all incoming and outgoing data packets between your visitors' computers and your web servers. Typically, switches are configured using a hardware option called a SPAN port (https://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/10570-41.html), which delivers a copy of every HTTP packet to the capture server.
- Packet
- The TCP/IP protocol organizes interaction between computers into packets. An individual Web page can be broken down into many packets, each transmitted individually between computers. The capture server typically monitors millions of packets traveling nearly simultaneously between your Web servers and visitors' computers. These packets can arrive in any order and sometimes must be retransmitted. The capture server can be configured to ignore packets that are not of interest, such as email messages or packets sent to IP addresses of servers not hosting the website.
- Request
- The HTTP protocol defines a request as a message that one computer sends to another to ask for a response. The capture server collects all HTTP data to re-create the request and response traffic.
- Response
- A response is the message returned to the computer that made a request. After capturing a request, the capture server processes and assembles packets in search of the corresponding response.
- Hit
- A hit is defined as a request and the corresponding response to it.
After the hit has been collected, the Passive Capture software can scan the data to see if the hit is of interest. For example, images that are displayed on every web page are not very interesting and can be discarded. Also, sensitive information such as user names, passwords, and credit card numbers can be deleted.
After removing unwanted data, the Capture software securely transmits the hit data to the Processing Server.
- SSL
- Many website interactions are encrypted to protect the data from being read or manipulated by third parties. The Capture software has to decrypt the data in order to match requests and responses. Typically, the Capture software is configured to re-encrypt the data using SSL for transmission to the processing servers.
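The relationship between packets, requests, responses, and hits can be illustrated with a minimal sketch. It is not Tealeaf's implementation: the `Packet` fields, the connection identifiers, and the simplified per-packet sequence numbers are all assumptions made for illustration (real TCP sequence numbers are byte offsets). The sketch reorders packets, drops retransmitted duplicates, and pairs the reassembled request with its response.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    conn_id: str    # hypothetical identifier for one TCP connection
    seq: int        # simplified position of the packet within its direction
    direction: str  # "request" or "response"
    payload: bytes


def reassemble(packets, conn_id, direction):
    """Join payloads for one connection and direction in sequence order.

    A retransmitted packet repeats a sequence number, so only the first
    copy of each sequence number is kept.
    """
    chosen = {}
    for p in packets:
        if p.conn_id == conn_id and p.direction == direction:
            chosen.setdefault(p.seq, p.payload)
    return b"".join(chosen[seq] for seq in sorted(chosen))


def build_hit(packets, conn_id):
    """Return a (request, response) pair, or None when no request was seen."""
    request = reassemble(packets, conn_id, "request")
    if not request:
        return None  # a response without a request is not a hit
    return request, reassemble(packets, conn_id, "response")


# Packets arrive out of order, and one response packet is retransmitted.
packets = [
    Packet("c1", 2, "response", b"<html>ok</html>"),
    Packet("c1", 1, "request", b"GET /cart HTTP/1.1\r\n\r\n"),
    Packet("c1", 1, "response", b"HTTP/1.1 200 OK\r\n\r\n"),
    Packet("c1", 1, "response", b"HTTP/1.1 200 OK\r\n\r\n"),
    Packet("c2", 1, "response", b"HTTP/1.1 200 OK\r\n\r\n"),
]

hit = build_hit(packets, "c1")
orphan = build_hit(packets, "c2")
```

Here `hit` holds the request and the response reassembled in the correct order, while `orphan` is `None` because connection `c2` carried no request.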
Stream data
Stream data is the HTTP DataStream captured by Tealeaf. It includes request, response, hit, session, and event data. This data is a sequence of digitally encoded signals (data packets) that is used to transmit or receive information.
Tealeaf receives all packets copied by the switch and forwarded down the SPAN port to the CX Passive Capture Application server.
Stream data can be modified in the CX PCA server and in the Windows™ pipeline on the Processing Server. The data can be modified in these two places only.
Of the packets received by the CX PCA server, only the HTTP and HTTPS packets are re-assembled, processed, and forwarded for additional processing and storage. In most configurations, other types of packets are ignored.
Two modes determine what is processed:
- Business Mode: Retains only specified file extensions (such as .html and .asp) and encoding types.
- Business IT Mode: Retains all hits, including static objects.
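The difference between the two modes can be sketched as a simple filter. This is an illustration only: the function name, the mode strings, and the extension set are hypothetical, and the encoding-type check that Business Mode also applies is omitted.

```python
import posixpath
from urllib.parse import urlparse

# Illustrative extension list, taken from the examples in the text.
BUSINESS_EXTENSIONS = {".html", ".asp"}


def is_retained(url, mode, extensions=BUSINESS_EXTENSIONS):
    """Decide whether a hit is kept under the given capture mode."""
    if mode == "business_it":
        return True  # every hit is kept, including static objects
    if mode == "business":
        # Look only at the path so that query strings do not hide the extension.
        path = urlparse(url).path
        extension = posixpath.splitext(path)[1].lower()
        return extension in extensions
    raise ValueError(f"unknown capture mode: {mode}")


kept_page = is_retained("https://example.com/cart/view.asp?id=7", "business")
kept_image = is_retained("https://example.com/img/logo.png", "business")
kept_image_it = is_retained("https://example.com/img/logo.png", "business_it")
```

Under these assumptions the `.asp` page is retained in both modes, while the image is retained only in Business IT Mode.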
Request
In Tealeaf, a Request is the component of data that originates when a user visits your website.
Typically, the raw request data is not displayed anywhere in the Tealeaf system, although for debugging purposes the system can be configured to show it. Instead, the REQ buffer is used to store metadata about the hit, including everything contained in the original request.
Each HTTP request (and only HTTP) that is captured causes the Passive Capture software to look for an HTTP response.
- If there is no request, any response is ignored.
- If the HTTP request is encrypted, Tealeaf must decrypt it to understand what was requested.
Request and response data can be manipulated in the PCA pipeline through privacy rules and in the Windows pipeline via deployed and configured session agents.
You can use the Request view when you replay a visitor session to view the data contained in the Request.
Request record
In addition to storing the raw HTTP request data in Tealeaf, the request record is used to store additional attributes for the hit.
These hit attributes are extracted or computed from the request or response and include information such as the IP address of the sender and receiver, performance timing, and form field variables.
This buffer is generated by the CX Passive Capture Application after the request and response have been captured.
The request record is an unstructured text blob containing multiple delimited text sections of either name=value pairs or XML. The request record is always encoded as UTF-8. In addition to Tealeaf predefined sections, a custom [appdata] section can be populated during processing by configured session agents, usually to simplify downstream data processing or evaluation.
The hit attributes can be used as source data for events and reports. Hit attribute data can be exported to third-party systems via cxConnect for Data Analysis.
Sections of the Request Record
Hit attributes can be displayed in a list of sessions found through search, either in the Portal or the CX RealiTea Viewer.
The following table lists some of the request sections from which hit attribute data is extracted.
Request section | Description |
---|---|
[appdata] | Custom attributes populated by session agents. These attributes are automatically indexed for search. |
[env] | HTTP request environment variables such as the HTTP Referer and HTTP Status Code. |
[timestamp] | Time stamp of the request and performance timing for the hit, calculated by Tealeaf. |
[urlfield] | Parsed GET and POST data fields from the request. |
[TLFID_*] | Fact information derived from the hit. |
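To show how sections of name=value pairs can be read from a request record, the following sketch parses a small sample into a dictionary per section. It assumes that each section starts with a bracketed header on its own line. The sample field names (`HTTP_REFERER`, `item`, `qty`, `CustomerTier`) are hypothetical, and sections that contain XML are not handled.

```python
def parse_request_record(text):
    """Parse bracketed sections of name=value pairs into nested dictionaries."""
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A new section header such as [env] or [appdata].
            current = sections.setdefault(line[1:-1], {})
        elif current is not None and "=" in line:
            # Split on the first "=" only, because values may contain "=".
            name, _, value = line.partition("=")
            current[name.strip()] = value.strip()
    return sections


sample = """\
[env]
HTTP_REFERER=https://www.example.com/home
[urlfield]
item=1234
qty=2
[appdata]
CustomerTier=gold
"""

record = parse_request_record(sample)
```

After parsing, `record["urlfield"]` holds the form fields and `record["appdata"]` holds the custom attribute, which makes the hit attributes easy to look up by section and name.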
Hit Attribute Export
Lists of sessions that are found through search, together with their selected hit attributes, can be exported from the Portal and the Viewer into Excel.
The contents of the request record can be extracted from the Tealeaf system for import into other external systems by using ETL tools with the cxConnect for Data Analysis product.
Response
For each HTTP request, the corresponding HTTP response is also captured and, if necessary, decrypted. Typically only responses of content-type text/html are retained, except for the following items, which are also retained:
- Error code responses
- RIA requests (XML)
- Binary Files explicitly kept
HTML data is stored in the same encoding scheme as it was captured.
The response data can be viewed using session replay, either in the browser or the viewer. For HTML responses, a rendered view and a source view are provided.
Response data can be used for evaluating events.
Request and response data can be manipulated in the PCA pipeline through privacy rules and in the Windows pipeline via deployed and configured session agents.
Hits
Each request record/response pair is reassembled in Tealeaf to compose a hit.
Note: Tealeaf discards most hits as static objects that are not unique to a particular browser session. This approach is how Tealeaf keeps the data volume to a reasonable size.
For a typical web page, there can be many (20-50) hits, but most images, style sheets, and JavaScript includes are discarded, so typically only a few hits per page are retained. For example, if the web server provides an HTTP Redirect response to the browser, which then fetches a page to display, the page is recorded as two hits.
If a page contains JavaScript that requests XML data for display, there are at least two hits for the page.
Note: Tealeaf uses the term page to refer to hits that are not discarded.
Tealeaf hit counts often do not include every hit of the web server but sometimes do include more hits than page views recorded by other systems. While Tealeaf defaults to keeping hits of content-type = text/html, it can also be configured to keep other data types such as dynamically created images.
The attributes of the hit are displayed in the request record.
After a hit is discarded, the hit does not exist in Tealeaf. However, its existence can be uncovered through replay, where the hit is regenerated, or by looking at the response HTML in some cases. A record of dropped hits is reported in the statistics that are generated by the CX Passive Capture Application.
Event
For hits that are not discarded, the data in the request record or response can be processed by the Tealeaf Event Engine. An event is defined as a trigger, a condition, and an action that is specified by a Tealeaf user.
A trigger is a defined moment in the lifespan of a session when events can be evaluated. Each event is associated with a specific trigger and can only reference the event-related data available in that trigger.
A condition is either the occurrence of a text string in the hit data or the combination of other Events occurring in the hit.
The action for a hit event is to record the Event Identifier, actual Hit Time, event value, and more. These items are stored in separate records with the hit data.
Events are useful for modeling user interactions with the web application and to represent those interactions in structured reports. Events are similar to web page tags yet are added dynamically based upon the DataStream.
Events also help to manage session recording. Data triggered events can be used to monitor session, request record, and response text, which provides the basis for the event conditions and the basis for recording.
Hit event data can be seen in multiple places. Counts for active events are shown in the Portal in the Active Events page and can be used to trigger alerts.
Event Conditions
The following are sample conditions that can be used for event definitions:
- Hit is received. Hit attribute is found in the request or response. Hit attribute can be defined as a specific string or as the content between two specified tags.
- Other events are processed. Session attribute value: Exact Match, Contains, >, <, or a range of values.
- Session End. Session attribute value: Exact Match, Contains, >, <, or a range of values.
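The comparison types listed above can be sketched as a small condition evaluator. This is not the Tealeaf Event Engine: the event dictionary layout, the operator names, and the choice to treat ranges as inclusive are assumptions made for illustration.

```python
import operator

# Hypothetical names for the comparison types mentioned in the text.
OPERATORS = {
    "exact": operator.eq,
    "contains": lambda value, target: target in value,
    "gt": operator.gt,
    "lt": operator.lt,
    # Assumes an inclusive range given as a (low, high) pair.
    "range": lambda value, bounds: bounds[0] <= value <= bounds[1],
}


def evaluate_event(event, session_attributes):
    """Return a record of the fired event, or None if the condition is not met."""
    value = session_attributes.get(event["attribute"])
    if value is None:
        return None  # the attribute was never populated for this session
    if OPERATORS[event["op"]](value, event["target"]):
        return {"event_id": event["id"], "value": value}
    return None


attributes = {"cart_total": 127.5, "state": "CA", "landing_page": "/promo/summer"}

large_cart = {"id": 101, "attribute": "cart_total", "op": "gt", "target": 100.0}
promo_visit = {"id": 102, "attribute": "landing_page", "op": "contains", "target": "/promo"}
mid_cart = {"id": 103, "attribute": "cart_total", "op": "range", "target": (10.0, 50.0)}

fired = [evaluate_event(e, attributes) for e in (large_cart, promo_visit, mid_cart)]
```

In this sample the first two events fire and record the matched value, while the range condition does not match and yields `None`.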
Event Actions
Based upon the event conditions, one or more of the following actions can be taken:
- Make the event searchable: First occurrence or every occurrence
- Make the event reportable: First occurrence or every occurrence
- Store the detected value as text or as a number
- Trigger another event
- Scrape text between tags
- Store session state when the event occurred as dimensions
- Identify membership in a list/group
- Update session attributes
- Send event data to an external system through the Tealeaf Event Bus
- Close session in Tealeaf
- Extend Tealeaf session timeout
- Discard session
Event Data
Definitions of the events are stored in a common database location to which all Canisters in the environment refer. The data for each individual event instance is stored with the session with which it is associated. Aggregated counts and average numeric values for events are recorded into a database.
Event data fields include:
- Session Key
- Hit Attributes: Key, Index, or other metadata
- Detected values:
  - String that matched
  - String that is bounded by the match pattern, such as Name = Value
  - String that is converted to a number (for example, 27.50 as shopping cart total)
  - String identifier of value that is defined in an enumerated list (for example, List of OS types)
  - String identifier of group to which text found belongs (for example, CA belongs to West)
- Any reference dimensions that are associated with the event.
Event Data Export
Selected event data can be streamed across a TCP/IP socket to external systems in real time by using the Event Bus API for third-party analysis.
Session
In Tealeaf, a session is a series of hits between a specific browser and the web server, assembled to present a clear picture or representation of how a visitor interacted with the web site.
A typical session involves an individual user interacting with the web server to request (by sending an HTTP request) and retrieve (through the returned HTTP response) a series of web pages before leaving the site. These request/response pairings are stitched together into hits, and the sequence of hits in the session is stitched together to comprise the session data.
Sessions to which the visitor is continuing to add hits are known as active sessions. If the visitor is no longer adding hits for a predefined time period or triggers an action (such as logging out of the site), the active session may be closed. Sessions that have been closed are known as completed sessions.
Every hit belongs to a session. Sessions can fragment due to various factors.
Session Cookies
Since HTTP is a stateless protocol, Tealeaf requires a method of associating the hits of an individual session. In almost all deployments, this association is managed through a session cookie.
As each hit is received in the Processing Server (which manages the Canister) from the PCA server, the cookie is used to store the hit with previously captured hits. When no additional hits containing the cookie are received for the configured Idle Session Timeout period, the session is closed.
Session durations that exceed a preconfigured value can trigger closure.
Like a web server, Tealeaf cannot typically identify if hits are coming from different browser windows on the same requesting browser. Hits from different browser windows are integrated into the same session.
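Cookie-based sessionization with an idle timeout can be sketched as follows. The hit tuple layout and the 30-minute timeout are assumptions for illustration and are not Tealeaf defaults. Hits are grouped by cookie value, and a gap longer than the timeout closes the current session and starts a new one.

```python
from collections import defaultdict

IDLE_TIMEOUT_SECONDS = 30 * 60  # assumed value for illustration only


def sessionize(hits, idle_timeout=IDLE_TIMEOUT_SECONDS):
    """Group (timestamp_seconds, cookie, url) hits into sessions.

    Hits that share a cookie belong to the same session until the gap
    between two consecutive hits exceeds the idle timeout.
    """
    by_cookie = defaultdict(list)
    for hit in sorted(hits):  # tuples sort by timestamp first
        by_cookie[hit[1]].append(hit)

    sessions = []
    for cookie_hits in by_cookie.values():
        current = [cookie_hits[0]]
        for hit in cookie_hits[1:]:
            if hit[0] - current[-1][0] > idle_timeout:
                sessions.append(current)  # idle timeout reached: close the session
                current = [hit]
            else:
                current.append(hit)
        sessions.append(current)
    return sessions


hits = [
    (0, "A", "/home"),
    (60, "A", "/cart"),
    (30, "B", "/home"),
    (4000, "A", "/checkout"),  # arrives after the idle timeout has passed
]

sessions = sessionize(hits)
```

With this input, cookie `A` produces two sessions because the last hit arrives after the timeout, and cookie `B` produces one. Hits from different browser windows would carry the same cookie and therefore land in the same session.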
Session Fragmentation
Sessions can become fragmented. For example, the visitor can resume a session after a period of inactivity exceeding the timeout value. Even though the session cookie is the same, Tealeaf stores this visitor's experience as two session fragments. The following situations can cause session fragmentation:
- Tealeaf or web application timeout setting is exceeded
- Sessions that are stored across multiple data centers
- Sessions that are stored across multiple Canisters
- Large sessions can exceed maximum session size limits
- Poor sessionization
At search time, Tealeaf provides the ability to defragment such sessions. For replay and analysis of individual sessions, Tealeaf can connect the fragments.
Note: In reporting data, session fragments are counted as individual sessions. This can happen, for example, when the time gap between fragments is longer than the reporting data collection interval.
Session Attributes
Through events, you can populate session attributes with specified values. These variables and their values can be found through search in the Portal or the RTV application.
Session event
For sessions that are completed, Tealeaf can process event conditions for the entire session, such as the occurrence of certain hit events during the session.
Event Definitions
Events are defined and managed through the Tealeaf Event Manager, which is accessible through the portal. By defining events, the user can model the workflow through the monitored website and create markers for search and report aggregation.
Canister data
Canister data is derived data that is created from the HTTP DataStream. Canister data is stored in several places and forms.
A properly configured Tealeaf system attempts to store all hits for a session in the same Canister, which is a daily collection of sessions that are processed by one Processing Server and the indexes that are associated with them.
Although most users do not see canister data, it can be viewed through the Active menu in the Portal or by searching for active sessions through the Portal or the Viewer. Completed Canister data can be viewed through search of completed sessions through the Portal or the Viewer.
Note: Active sessions are not indexed, but the data structures allow fast searching of some hit attributes. Tealeaf can perform full scans of the response data, but this method can slow system performance.
Canister data is typically retained for 10 days, after which it is erased to make room for newer data. The cxVerify product allows for search-based subsets of each Canister to be stored for longer periods in a different Canister.
Sessions can sometimes be fragmented across multiple Canisters. Tealeaf can defragment sessions across multiple Canisters for replay in the Portal.
The Canister is divided into two parts:
- The Short Term Canister contains active sessions, where hits are still being received as they occur. Hit event records are created at this stage.
  - Unless the Canister is spooling, a new hit added to an active session is available for search and review through the Portal in a matter of seconds.
  - Search of active sessions is limited to full text search, which is slower than indexed search for completed session data.
- The Long Term Canister contains completed sessions, which are created when no hits are received for the Idle Timeout period or when another closing trigger is met. When a session is closed, all session events are processed. The session is recorded to disk, and the contents are indexed by a text search engine.
  - An active session becomes a completed session within approximately five minutes of the end of the session, unless the system is behind in indexing sessions.
  - Completed sessions are collected into a set of LSSN files for the day.
  - Counts for hit events and session events are collected and aggregated for the reporting database.
Dimensions
Dimensions are sets of reference data that can be captured and recorded when the event is triggered. Dimensions are associated with a defined event.
A dimension contains a set of values that are captured by a defined pattern or value recorded from an event. These values provide contextual information at the time when the event is recorded. They are stored in the request when the hit is processed by the Canister.
For example, Tealeaf provides the following reference dimensions. You can also define your own event dimensions.
- Server
- URL
- Host Name
- Application Name
If an event is associated with these dimensions, the values of these dimensions are recorded with the event when it is triggered. So, if an event is created to detect the presence of Status Code 500 errors in the response, the values of the above can be recorded with this event instance to facilitate debugging the issue.
Report Groups
Dimensions are organized into groups. A report group is a collection of dimensions. A dimension may belong to multiple report groups. When recording, the Tealeaf system collects aggregate counts for every combination of dimension values.
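The idea of counting every combination of dimension values in a report group can be sketched with a counter keyed by value tuples. The fact layout, the sample values, and the function name are hypothetical and do not reflect how Tealeaf stores its aggregates.

```python
from collections import Counter


def aggregate(event_instances, report_group):
    """Count event instances per combination of the report group's dimension values."""
    counts = Counter()
    for instance in event_instances:
        # The key is one value per dimension, in report-group order.
        key = tuple(instance["dimensions"][name] for name in report_group)
        counts[key] += 1
    return counts


# Hypothetical instances of one event, each recorded with its dimension values.
event_instances = [
    {"event_id": 80, "dimensions": {"URL": "/DEFAULTPAGE", "Host Name": "WWW.EXAMPLE.COM"}},
    {"event_id": 80, "dimensions": {"URL": "/DEFAULTPAGE", "Host Name": "WWW.EXAMPLE.COM"}},
    {"event_id": 80, "dimensions": {"URL": "/CHECKOUT", "Host Name": "WWW.EXAMPLE.COM"}},
]

report_group = ("URL", "Host Name")
counts = aggregate(event_instances, report_group)
```

Because one count is kept per combination, the number of stored aggregates grows with the product of the distinct values of each dimension, which is one reason to keep report groups small.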
An event may be associated with multiple report groups.
Facts
When an event is triggered in a hit, the Report Group data is recorded with the event in a structure that is called a fact in the REQ record. A fact contains the recorded event value and any dimension values for associated report groups and other data.
This internal data structure is used to facilitate searching on dimensional data that are related to the recorded event. Each dimension instance value is hashed to provide a more easily indexed value for searching. When a string is input through the search interface (for example, "/DEFAULTPAGE"), the same algorithm is used to create a hashed value that can be found in the search index.
The following code sample is an example fact that is recorded in the [TLFID_80] section of the request buffer:
[TLFID_80]
Searchable=True
TLFID=80
TLFactValue=1
TLDimHash1=38A7EF5D4FA961F712055D92FC56088A
TLDimHash2=BC3F1812E3C8837962A83226D4A30082
TLDimHash3=8606AC74FD2DECC1899004C49B226FAE
TLDimHash4=5E6D512952FFBB9673B1D0CB08EF33B0
TLDim1=/DEFAULTPAGE
TLDim2=WWW.TEALEAF.COM
TLDim3=OTHERS
TLDim4=63.194.158.200
In the above, the fact identifier (TLFID=80) and recorded event value (TLFactValue=1) are listed above the hashed values and plain text values for each dimension. Only the first 256 characters of the dimension value are recorded in plain text.
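The hash-based lookup can be sketched as follows. The hashing algorithm that Tealeaf uses is not stated here. The sample hashes are 32 hexadecimal characters long, which is the length of an MD5 digest, so MD5 is used purely as a stand-in. Upper-casing the value before hashing, and hashing the full value rather than the truncated one, are also assumptions.

```python
import hashlib

MAX_PLAIN_TEXT_CHARS = 256  # only this many characters are kept in plain text


def hash_dimension(value):
    """Hash a dimension value. MD5 and upper-casing are stand-in assumptions."""
    normalized = value.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest().upper()


def record_dimension(index, value):
    """Store the truncated plain text under the hash of the full value."""
    index[hash_dimension(value)] = value[:MAX_PLAIN_TEXT_CHARS]


def search(index, query):
    """Hash the query with the same function and look it up in the index."""
    return index.get(hash_dimension(query))


index = {}
record_dimension(index, "/DEFAULTPAGE")
record_dimension(index, "WWW.TEALEAF.COM")
record_dimension(index, "/" + "x" * 400)  # longer than the plain text limit

found = search(index, "/defaultpage")
missing = search(index, "/checkout")
```

Because the query is passed through the same hash function as the recorded values, the lookup is a single fixed-length key comparison regardless of how long the dimension value is.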
Index
In Tealeaf, an index is an arrangement of important words that appear in the HTTP DataStream. When a session is completed, it is written from the in-memory database (STC) to disk and marked for indexing.
Because retaining captured hit data is expensive in terms of disk space, for most Tealeaf deployments, only a subset of the captured hit data is indexed. Tealeaf indexes:
- The body of the response, without HTML tags.
- Selected sections stored in the request record.
- Selected event data such as the event identifier and event value.
Indexed data includes:
- Select data from the request record: [appdata], [urlfield], session attributes
- Event Data (ID, value)
- Response (HTML/Headers are excluded)
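Indexing the response body without HTML tags can be sketched as a small inverted index that maps each word to the sessions that contain it. This is an illustration using the Python standard library and is not the text search engine that Tealeaf uses. The session identifiers and sample pages are hypothetical.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_index(responses):
    """Map each lower-cased word to the set of session IDs whose response contains it."""
    index = defaultdict(set)
    for session_id, html in responses.items():
        for word in re.findall(r"[a-z0-9]+", visible_text(html).lower()):
            index[word].add(session_id)
    return index


responses = {
    "s1": "<html><body><h1>Order failed</h1><p>Card declined</p></body></html>",
    "s2": "<html><body><p>Order confirmed</p></body></html>",
}

index = build_index(responses)
```

A search for `order` finds both sessions, a search for `declined` finds only `s1`, and tag names such as `h1` never enter the index because only text between tags is collected.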
This index data is retained for the same length of time as the Canister data. The index data can be regenerated from the Canister data at any time.
Note: Depending on system load and configuration, canister data is typically indexed within 5 minutes of session completion.
A generated index cannot be viewed, although search results indicate the use of indexes. Using the same indexing algorithms as the Canister, the Viewer can create and display an index for the sessions that are currently loaded, although an exact match is not guaranteed.