HTTP Collector (Pull)

The HTTP Collector uses a pull-based mechanism, which offers advantages in flexibility, reliability, and scalability. In a pull-based setup, the monitoring server sends HTTP requests to the endpoints exposed by the monitored applications. These endpoints return their current data, such as metrics, allowing the monitoring system to collect it in near real time.

Purpose

The purpose of the HTTP Collector (Pull) source in Observo AI is to enable the platform to actively retrieve data from HTTP/S endpoints, such as Bulk APIs, by polling or making periodic requests to the specified URL. It pulls data (e.g., logs, metrics, or events in formats like JSON, CSV, or plain text) from external systems or applications into Observo AI for analysis and processing. This integration supports streamlined data pipelines, real-time monitoring, and analytics, allowing organizations to enhance observability, security, and data-driven decision-making by proactively fetching data from configured HTTP/S sources.

Prerequisites

Before configuring the HTTP Collector (Pull) source in Observo AI, ensure the following requirements are met to facilitate seamless data ingestion:

  • Observo AI Platform Setup:

    • The Observo AI platform must be installed and operational, with support for the HTTP Collector (Pull) as a data source.

    • Verify that the platform supports common data formats such as JSON, CSV, or plain text. Additional formats may require specific parser configurations.

  • HTTP/S Endpoint Access:

    • An active HTTP/S endpoint, such as a Bulk API, must be available for Observo AI to pull data from.

    • Obtain the endpoint URL and any required authentication credentials such as API key, token, or basic auth username/password from the data provider.

  • Authentication:

    • Prepare one of the following authentication methods:

      • API Key: Obtain an API key or token from the data provider for secure access.

      • Basic Authentication: Provide a username and password for HTTP basic auth, if required.

      • Secret Authentication: Use a stored secret within Observo AI's secure storage for credentials.

  • Network and Connectivity:

    • Ensure Observo AI can communicate with the HTTP/S endpoint such as api.example.com.

    • Check for proxy settings, firewall rules, or VPC endpoint configurations that may affect connectivity to the external endpoint.

| Prerequisite | Description | Notes |
| --- | --- | --- |
| Observo AI Platform | Must be installed and support HTTP Collector (Pull) | Verify support for JSON, CSV, etc.; additional parsers may be needed |
| HTTP/S Endpoint | Active endpoint to pull data from | Obtain URL and credentials from the data provider |
| Authentication | API Key, Basic Auth, or Secret | Prepare credentials as required by the endpoint |
| Network | Connectivity to the HTTP/S endpoint | Check VPC endpoints, proxies, and firewalls |

Integration

To configure the HTTP Collector (Pull) as a source in Observo AI, follow these steps to set up and test the data flow:

  1. Log in to Observo AI:

    • Navigate to the Sources tab.

    • Click the Add Source button and select Create New.

    • Choose HTTP Collector (Pull) from the list of available sources to begin configuration.

  2. General Settings:

    • Name: A unique identifier for the source, such as http-collector-source-1.

    • Description (Optional): Provide a description for the source.

    • Endpoint: HTTP endpoint to collect data from. Supports templating with $LAST_VALUE$ when using checkpointing.

      Examples

      http://logs.example.com

      http://example.com/logs?since=$LAST_VALUE$

    • Collection Interval (Optional): Duration between consecutive data collection requests. Default: 1m

      Examples

      10s

      1m

    • Headers (Add as needed): Headers to include in the HTTP request. Use the format {key: value}.

      | Key (Default) | Value (Default) |
      | --- | --- |
      | Netskope-Api-Token | <API Token> |

  3. Authentication (Optional):

    • Client ID: Client ID for OAuth2 authentication.

    • Client Secret: Client secret for OAuth2 authentication.

    • Token URL: URL to get the OAuth2 token.

    • Scopes (Add as needed): Scopes to request for OAuth2 authentication.

    • Headers (Add as needed): Headers to include in the OAuth2 authentication HTTP request. Use the format {key: value}.

  4. Checkpoint:

    • Enable Checkpoint (False): Enable incremental log collection using checkpointing.

    • Tracking Column: JSON path to the field used for tracking progress, such as 'timestamp'. The value from the last log entry will be used.

      Examples

      timestamp

      message.time

      data.created_at

    • Initial Value: Starting value for the tracking column. Will be used for the first collection.

      Example

      2024-01-01T00:00:00Z

  5. Pagination:

    • Enable Pagination (False): Enable pagination support for handling paginated responses.

    • Maximum Pages: Maximum number of pages to retrieve in one collection cycle. Set to 0 for unlimited. Default: 50

      Examples

      50

      100

      0

    • Request Interval: Time to wait between pagination requests. Use a duration string like '100ms' or '1s'.

      Examples

      100ms

      500ms

      1s

    • Pagination Type (Empty): Type of pagination to use. Default: Page-Based

      Select from dropdown:

      | Option | Description |
      | --- | --- |
      | Page-Based | For traditional page numbers |
      | Attribute-Based | For cursor or token-based pagination |
      | Cursor-Based | For unique pointer to retrieve next records |

    • Page Parameter Name: Query parameter name for the page number. Default: page Example: 'page' results in ?page=1

      Examples

      page

      page_number

      pageNum

    • Size Parameter Name: Query parameter name for the page size. Default: size Example: 'size' results in ?size=50

      Examples

      size

      limit

      page_size

    • Page Size: Number of records to request per page. Default: 50

      Examples

      50

      100

      200

    • Start Page: Page number to start pagination from. Works in conjunction with the Zero-Based Indexing setting. Default: 0

      Examples

      0

      1

    • Total Pages Path: JSON path to total pages count in response. Example: 'meta.total_pages' for {"meta": {"total_pages": 5}}

      Examples

      meta.total_pages

      pagination.pages

      page_info.total

    • Total Count Path: JSON path to total records count in response. Example: 'meta.total' for {"meta": {"total": 150}}

      Examples

      meta.total

      pagination.total_records

      count

    • Zero-Based Indexing (False): If true, page numbering starts at 0. If false, it starts at 1.

  6. TLS Configuration (Optional):

    • CA File: The CA certificate provided as an inline string in PEM format.

    • Include System CA Certs Pool (True): Include the system CA certificates pool in the list of CAs used to verify the server certificate.

    • Cert File: Path to the TLS cert to use for TLS required connections.

    • Key File: Path to the TLS key to use for TLS required connections.

    • Insecure (True): Skip TLS verification when connecting to the endpoint. This is insecure and should not be used in production.

    • Insecure Skip Verify (True): Enable TLS but do not verify the server certificate.

    • Server Name Override: The server name to use to verify the hostname on the returned certificates.

  7. Advanced Settings (Optional):

    • Proxy URL: URL of the proxy server to use when connecting to the endpoint.

    • Read Buffer Size: Size of the read buffer in bytes.

    • Write Buffer Size: Size of the write buffer in bytes.

    • Timeout: Timeout for the HTTP request. Use a number followed by a unit, such as '30s' or '1m'. Default: 10s

    • Compression: Compression algorithm to use for the request body. Select one.

    • Max Idle Connections: Maximum number of idle connections to keep open to the endpoint.

    • Idle Connection Timeout: Timeout for idle connections to the endpoint. Use a number followed by a unit, such as '30s' or '1m'.

    • HTTP 2 Read Idle Timeout: Timeout for HTTP/2 read idle connections to the endpoint. Use a number followed by a unit, such as '30s' or '1m'.

    • HTTP 2 Read Ping Timeout: Timeout for HTTP/2 read ping connections to the endpoint. Use a number followed by a unit, such as '30s' or '1m'.

    • Method: HTTP request method to use for requests. Supports GET and POST methods. Default: GET

    • Body: Request body for POST method. Supports templating with $LAST_VALUE$ when using checkpointing.

      Example

      {"query": "fetch_logs", "from": "$LAST_VALUE$"}

    • Response Log Path: JSON path to logs array in responses. Leave empty if the response is a direct array of logs.

      Examples

      data

      resource.logs

  8. Parser Config:

    • Enable Source Log Parser (False)

    • Toggle the Enable Source Log Parser switch to enable it:

      • Select appropriate Parser from the Source Log Parser dropdown

      • Add additional Parsers as needed

  9. Pattern Extractor:

    • Refer to Observo AI’s Pattern Extractor documentation for details on configuring pattern-based data extraction.

  10. Archival Destination:

    • Toggle the Enable Archival on Source switch to enable it.

    • Under Archival Destination, select from the list of Archival Destinations (Required)

  11. Save and Test Configuration:

    • Save the configuration settings in Observo AI.

    • Confirm that the configured endpoint is returning data, then verify ingestion and data flow in the Analytics tab.
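
The page-based pagination settings from step 5 can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the collector's actual implementation: the function name `page_urls` and its arguments are hypothetical, and it assumes the collector stops at whichever is smaller, Maximum Pages or the total page count read from Total Pages Path.

```python
from urllib.parse import urlencode


def page_urls(endpoint, page_param="page", size_param="size", page_size=50,
              start_page=0, max_pages=50, total_pages=None):
    """Yield the URLs one collection cycle might request (hypothetical helper).

    max_pages=0 means unlimited, so total_pages must then be known to stop.
    """
    limit = total_pages
    if max_pages:
        limit = max_pages if limit is None else min(limit, max_pages)
    if limit is None:
        raise ValueError("unbounded: set max_pages or provide total_pages")
    # Append to an existing query string if the endpoint already has one.
    sep = "&" if "?" in endpoint else "?"
    for i in range(limit):
        query = urlencode({page_param: start_page + i, size_param: page_size})
        yield f"{endpoint}{sep}{query}"


# One-based numbering (Start Page 1) against a response reporting 3 total pages.
for url in page_urls("http://logs.example.com", start_page=1, total_pages=3):
    print(url)
```

With the defaults above, the first request is `http://logs.example.com?page=1&size=50`. Changing Page Parameter Name or Size Parameter Name only changes the query keys.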

Example Scenarios

ConnectSphere Telecom, a fictitious regional telecommunications provider offering mobile, broadband, and enterprise connectivity services, relies on a third-party customer management system that exposes a Bulk API for retrieving customer activity logs, such as call records, data usage, and billing events. To improve operational visibility and customer experience analytics, ConnectSphere integrates this API with the Observo AI platform using the HTTP Collector (Pull) source. This integration enables real-time ingestion of JSON-formatted log data for monitoring network usage patterns, detecting service anomalies, and generating usage reports for regulatory compliance.

Standard HTTP Collector (Pull) Source Setup

Here is a standard HTTP Collector (Pull) Source configuration example. Only the required sections and their associated field updates are displayed in the tables below:

General Settings

| Field | Value | Description |
| --- | --- | --- |
| Name | connectsphere-http-collector | Unique identifier for the HTTP Collector (Pull) source. |
| Description | Ingest customer activity logs for usage and anomaly monitoring | Optional description of the source's purpose. |
| Endpoint | https://api.connectsphere.com/logs?since=$LAST_VALUE$ | HTTP endpoint for customer activity logs with checkpoint templating. |
| Collection Interval | 30s | Data collection every 30 seconds for near real-time monitoring. |
| Headers | { "X-Api-Key": "$API_KEY$" } | HTTP header with API key for secure access (key stored securely). |

Authentication

| Field | Value | Description |
| --- | --- | --- |
| Client ID | cst_oauth_client_2025 | Client ID for OAuth2 authentication with the API. |
| Client Secret | kF93nXNq8z5C9TdYpWv9G6LzXfG2sT7b | Client secret for OAuth2 authentication (securely stored). |
| Token URL | https://api.connectsphere.com/oauth/token | URL to obtain OAuth2 token for authentication. |
| Scopes | logs:read, analytics:read | Scopes requested for accessing log and analytics data. |
| Headers | { "Content-Type": "application/json" } | Headers for OAuth2 authentication HTTP request. |
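
To show how these Authentication fields could fit together, the following Python sketch builds (but does not send) a token request. It assumes the OAuth2 client credentials grant with a form-encoded body, which is what a Client ID, Client Secret, Token URL, and Scopes typically imply; the source does not state which grant the collector uses. The function name `build_token_request` is hypothetical and the secret is a placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(token_url, client_id, client_secret, scopes, headers=None):
    """Build an OAuth2 client-credentials token request without sending it."""
    form = {
        "grant_type": "client_credentials",  # assumed grant type
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": " ".join(scopes),  # OAuth2 scopes are space-delimited
    }
    all_headers = {"Content-Type": "application/x-www-form-urlencoded"}
    all_headers.update(headers or {})  # extra headers from the Headers field
    return Request(token_url, data=urlencode(form).encode("ascii"),
                   headers=all_headers, method="POST")


req = build_token_request(
    "https://api.connectsphere.com/oauth/token",
    "cst_oauth_client_2025",
    "example-secret",  # placeholder, not a real credential
    ["logs:read", "analytics:read"],
)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

The access token returned by such a request would then normally be sent as a bearer token on each collection request.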

Checkpoint

| Field | Value | Description |
| --- | --- | --- |
| Enable Checkpoint | True | Enables incremental log collection to avoid duplicate data ingestion. |
| Tracking Column | data.created_at | JSON path to the 'created_at' field for tracking progress of log collection. |
| Initial Value | 2025-07-09T00:00:00Z | Starting timestamp for the first collection cycle. |
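
The interaction between the checkpoint and the $LAST_VALUE$ placeholder in the endpoint can be sketched in Python as follows. This is a simplified model based only on the field descriptions (the value from the last log entry becomes the next $LAST_VALUE$); the helper names and the sample log entries are invented for illustration, and the collector may handle encoding and edge cases differently.

```python
def get_path(obj, path):
    """Follow a dotted path such as 'data.created_at' through nested dicts."""
    for key in path.split("."):
        obj = obj[key]
    return obj


def render(template, last_value):
    """Substitute the checkpoint value for the $LAST_VALUE$ placeholder."""
    return template.replace("$LAST_VALUE$", last_value)


def advance_checkpoint(logs, tracking_column, current):
    """Return the tracking value of the last log entry, or keep the current one."""
    if not logs:
        return current
    return str(get_path(logs[-1], tracking_column))


endpoint = "https://api.connectsphere.com/logs?since=$LAST_VALUE$"
checkpoint = "2025-07-09T00:00:00Z"  # Initial Value, used for the first cycle
first_url = render(endpoint, checkpoint)

# Hypothetical response entries, invented for illustration.
logs = [
    {"data": {"created_at": "2025-07-09T00:00:05Z", "event": "call"}},
    {"data": {"created_at": "2025-07-09T00:00:20Z", "event": "billing"}},
]
checkpoint = advance_checkpoint(logs, "data.created_at", checkpoint)
second_url = render(endpoint, checkpoint)

print(first_url)
print(second_url)
```

An empty response leaves the checkpoint unchanged, so the next cycle asks for the same range again rather than skipping ahead.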

Pagination

| Field | Value | Description |
| --- | --- | --- |
| Enable Pagination | True | Enables pagination to handle large datasets from the API. |
| Pagination Type | Cursor-Based | Uses cursor-based pagination for unique pointer-based data retrieval. |
| Maximum Pages | 100 | Limits retrieval to 100 pages per cycle to manage API load. |
| Request Interval | 150ms | 150ms delay between pagination requests to avoid rate limiting. |
| Page Parameter Name | cursor | Query parameter name for cursor (e.g., ?cursor=abc). |
| Size Parameter Name | limit | Query parameter name for page size (e.g., ?limit=100). |
| Page Size | 100 | Requests 100 records per page for efficient data retrieval. |
| Start Page | 0 | Starts pagination from cursor 0. |
| Zero-Based Indexing | True | Page numbering starts at 0 for cursor-based pagination. |
| Total Pages Path | meta.total_pages | JSON path to total pages count (e.g., {"meta": {"total_pages": 10}}). |
| Total Count Path | meta.total_records | JSON path to total records count (e.g., {"meta": {"total_records": 1000}}). |

TLS Configuration

| Field | Value | Description |
| --- | --- | --- |
| CA File | -----BEGIN CERTIFICATE-----... | Inline PEM-formatted CA certificate for verifying the API server. |
| Include System CA Certs Pool | True | Includes system CA certificates for broader certificate validation. |
| Cert File | /path/to/tls-cert.pem | Path to the TLS certificate for secure connections to the API endpoint. |
| Key File | /path/to/tls-key.pem | Path to the TLS key for secure connections to the API endpoint. |
| Insecure | False | Disables insecure connections (TLS verification is enforced). |
| Insecure Skip Verify | False | Ensures TLS certificate verification is performed. |
| Server Name Override | api.connectsphere.com | Specifies the server name for verifying the hostname on certificates. |
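
For readers who want to reason about what these TLS settings mean on the wire, here is a rough analogue using Python's standard `ssl` module. It is only a sketch of the general technique: the function `build_tls_context` and its parameter names are hypothetical, and the mapping to the collector's internal behavior is an assumption.

```python
import ssl


def build_tls_context(ca_pem=None, include_system_cas=True,
                      cert_file=None, key_file=None, insecure_skip_verify=False):
    """Build a client-side TLS context from settings like those in the table."""
    if include_system_cas:
        ctx = ssl.create_default_context()  # loads the system CA pool
    else:
        # Still verifies the server, but starts with an empty CA list.
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_pem:
        ctx.load_verify_locations(cadata=ca_pem)  # inline PEM string (CA File)
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)  # client cert
    if insecure_skip_verify:
        # Hostname checking must be disabled before relaxing verify_mode.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


strict = build_tls_context()
lax = build_tls_context(insecure_skip_verify=True)
print(strict.verify_mode == ssl.CERT_REQUIRED, lax.verify_mode == ssl.CERT_NONE)
```

In this analogue, Server Name Override corresponds to the `server_hostname` argument passed to `SSLContext.wrap_socket` when the connection is opened.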

Advanced Settings

| Field | Value | Description |
| --- | --- | --- |
| Proxy URL | http://proxy.connectsphere.com:8080 | Proxy server URL for routing API requests through the corporate network. |
| Read Buffer Size | 32768 | 32KB read buffer size for efficient data handling. |
| Write Buffer Size | 32768 | 32KB write buffer size for efficient data handling. |
| Timeout | 45s | 45-second timeout for HTTP requests to handle network latency. |
| Compression | Deflate | Uses Deflate compression for request body to reduce bandwidth usage. |
| Max Idle Connections | 15 | Limits to 15 idle connections to optimize resource usage. |
| Idle Connection Timeout | 90s | Closes idle connections after 90 seconds to free resources. |
| HTTP 2 Read Idle Timeout | 90s | 90-second timeout for HTTP/2 read idle connections. |
| HTTP 2 Read Ping Timeout | 30s | 30-second timeout for HTTP/2 read ping connections. |
| Method | GET | HTTP GET method used for requests to the API. |
| Body | | No request body required for the GET method. |
| Response Log Path | resource.logs | JSON path to the logs array in the API response (e.g., {"resource": {"logs": [...]}}). |
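
The effect of Response Log Path can be illustrated with a small Python sketch. It assumes a simple dotted-path lookup, as the examples in this document suggest; the function name `extract_logs` and the sample payloads are hypothetical.

```python
import json


def extract_logs(body, log_path=""):
    """Return the list of log entries from a JSON response body."""
    doc = json.loads(body)
    if log_path:
        # Walk the dotted path, e.g. 'resource.logs'.
        for key in log_path.split("."):
            doc = doc[key]
    if not isinstance(doc, list):
        raise ValueError(f"expected a list of logs at {log_path or '<root>'}")
    return doc


wrapped = '{"resource": {"logs": [{"id": 1}, {"id": 2}]}}'
direct = '[{"id": 1}]'
print(len(extract_logs(wrapped, "resource.logs")))  # logs nested under a path
print(len(extract_logs(direct)))  # empty path: response is the array itself
```

If no data appears after configuring a path, checking a sample response against the path in this way is a quick first diagnostic.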

Troubleshooting

If issues arise with the HTTP Collector (Pull) source in Observo AI, use the following steps to diagnose and resolve them:

  • Verify Configuration Settings:

    • Ensure all fields, such as Endpoint, authentication, and parser settings, are correctly entered and match the data provider's setup.

    • Confirm that the HTTP method (GET or POST) matches the endpoint's requirements.

  • Check Authentication:

    • Verify the authentication method:

      • For API Key, ensure the key or token is valid and not expired.

      • For Basic Auth, check that the username and password are correct.

      • For Secret Authentication, confirm the secret is accessible in Observo AI's secure storage.

  • Validate Network Connectivity:

    • Check for firewall rules, proxy settings, or VPC endpoint configurations that may block access to the HTTP/S endpoint.

    • Test connectivity using tools like curl or Postman with similar proxy configurations to verify access.

  • Common Error Messages:

    • "Inaccessible host": May indicate TLS version mismatches such as TLS 1.3 issues or DNS problems. Ensure the host supports the required TLS version and check DNS settings.

    • "Authentication failed": Verify that the API key, username, or password is correct and has the necessary permissions.

    • "Request timeout": Check the Timeout setting and network latency; consider increasing the timeout value.

  • Monitor Logs and Data:

    • Verify that data is being ingested by monitoring the HTTP Collector (Pull) endpoint activity.

    • Use the Analytics tab in the targeted Observo AI pipeline to monitor data volume and ensure expected throughput.

    • Check Observo AI logs for errors or warnings related to data ingestion from the HTTP Collector (Pull) source.

| Issue | Possible Cause | Resolution |
| --- | --- | --- |
| Data not ingested | Incorrect URL or parser configuration | Verify URL and parser settings |
| Authentication errors | Invalid or expired credentials | Check API key, username, or password validity |
| Connectivity issues | Firewall or proxy blocking access | Test network connectivity and VPC endpoints |
| "Inaccessible host" | TLS or DNS issues | Ensure TLS compatibility and check DNS |
| "Authentication failed" | Misconfigured credentials | Verify auth method and permissions |
| "Request timeout" | Network latency or low timeout setting | Increase the Timeout value or check network |
