Custom URL and sitemap scanning


Overview

Osano provides multiple scanning features to help customers identify cookies, scripts, and iframes used across their web applications. In addition to continuous scanning via Osano.js, customers can leverage:

  1. URL Scanning – Periodically scans individual customer-defined URLs.
  2. Sitemap Scanning – Automates URL discovery by scanning a provided sitemap XML file.

Both features report the trackers found across your web properties. The sections below explain how each one works and what to consider when setting it up.


URL Scanning

The URL Scanning feature runs a monthly scan (subject to change in future updates) on a customer-entered URL. This scan identifies cookies, scripts, and iframes present on the page at that URL.

  • This is an optional feature that provides a broader view of tracking technologies in use.
  • The scan operates using a headless browser (Puppeteer) within a containerized (Docker) environment.
  • Unlike the Osano.js script, which runs in an end-user’s browser, this scan can detect all cookies, including server-set cookies marked HttpOnly, which page JavaScript cannot read.
  • Any unmanaged cookies, scripts, or iframes discovered are reported similarly to findings from the Osano.js scanner.
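
To illustrate why an out-of-browser scan can see cookies that a page script cannot, the sketch below splits raw Set-Cookie header values by their HttpOnly flag using Python's standard library. It is only an illustration, not Osano's implementation: the function name split_by_js_visibility and the sample headers are hypothetical.

```python
from http.cookies import SimpleCookie


def split_by_js_visibility(set_cookie_headers):
    """Split cookie names into (js_visible, http_only) lists.

    Each item in set_cookie_headers is the value of one Set-Cookie header.
    Cookies flagged HttpOnly are hidden from document.cookie in the browser,
    so an in-page script would never report them.
    """
    js_visible, http_only = [], []
    for header in set_cookie_headers:
        jar = SimpleCookie()
        jar.load(header)
        for name, morsel in jar.items():
            if morsel["httponly"]:
                http_only.append(name)
            else:
                js_visible.append(name)
    return js_visible, http_only


# Hypothetical response headers, for illustration only.
headers = [
    "session_id=abc123; Path=/; HttpOnly; Secure",
    "theme=dark; Path=/",
]
visible, hidden = split_by_js_visibility(headers)
print("Readable by page JavaScript:", visible)
print("HttpOnly (server-side only):", hidden)
```

In this sample, only the theme cookie would be readable by a script running in the page; session_id would surface only to a scanner that inspects the response headers or the browser's cookie store directly.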

How the URL Scan Works

  1. Osano launches a headless browser to visit the specified URL.
  2. The browser scans the page for:
    • Cookies (including HttpOnly cookies)
    • Scripts
    • Iframes
  3. The scan results are compiled and presented in Osano’s dashboard.
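
As a rough picture of what "scanning the page" for scripts and iframes means, here is a minimal sketch that collects the src of every script and iframe tag from static HTML. It is a simplified stand-in under stated assumptions: the actual scan uses a headless browser, which also executes JavaScript and therefore sees tags injected at runtime, while this parser does not. The class TrackerTagCollector, the function collect_tags, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class TrackerTagCollector(HTMLParser):
    """Collects the src attribute of <script> and <iframe> tags."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self.iframes = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if not src:
            return  # inline scripts and src-less iframes are skipped here
        if tag == "script":
            self.scripts.append(src)
        elif tag == "iframe":
            self.iframes.append(src)


def collect_tags(html):
    """Return the external scripts and iframes referenced in static HTML."""
    collector = TrackerTagCollector()
    collector.feed(html)
    collector.close()
    return {"scripts": collector.scripts, "iframes": collector.iframes}


# Hypothetical page markup, for illustration only.
sample_html = """
<html><head>
<script src="https://cdn.example.com/analytics.js"></script>
<script>console.log("inline")</script>
</head><body>
<iframe src="https://video.example.com/embed/1"></iframe>
</body></html>
"""
findings = collect_tags(sample_html)
print(findings)
```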

 

The diagram below describes, at a high level, the URL scanning process.

[Figure: URL Scan Process]


Sitemap Scanning

For customers managing multiple web applications, Osano simplifies scanning by allowing the use of a sitemap XML file. Instead of entering individual URLs, customers can provide a sitemap that lists all relevant web pages.

  • The system automatically updates the scan roster based on URLs found in the sitemap.
  • Scans run monthly (subject to change in future updates).
  • Limitations:
    • Customers on the Premier package are limited to 500 scanned pages (URLs) per sitemap. Any URLs exceeding this limit will not be scanned.
    • Nested sitemaps are not supported.

What is a Nested Sitemap?

A nested sitemap is a sitemap that contains references to additional sitemap files rather than listing URLs directly. These are often used by large websites to organize URLs into multiple smaller sitemaps.

Example:

  • sitemap.xml → References sitemap1.xml, sitemap2.xml, etc.
  • sitemap1.xml → Contains a list of URLs to scan.
  • Osano does not process these nested references. Only the URLs listed in the primary sitemap file can be scanned.
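
Before submitting a sitemap, it can help to check whether the file is a nested sitemap. In the sitemaps.org protocol, a sitemap index has a sitemapindex root element, while a regular sitemap has a urlset root. The sketch below checks the root element; the function names and the sample XML are hypothetical, and this is a pre-check you might run yourself, not a description of Osano's own validation.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def is_sitemap_index(xml_text):
    """Return True if the document is a sitemap index (a nested sitemap)."""
    root = ET.fromstring(xml_text.strip())
    return local_name(root.tag) == "sitemapindex"


# Hypothetical sample files, for illustration only.
INDEX_XML = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap2.xml</loc></sitemap>
</sitemapindex>
"""
URLSET_XML = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
</urlset>
"""
print(is_sitemap_index(INDEX_XML))
print(is_sitemap_index(URLSET_XML))
```

If the check reports a sitemap index, one option is to submit a file that lists the page URLs directly.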

How the Sitemap Scan Works

  1. Provide the URL of a sitemap XML file (e.g., https://example.com/sitemap.xml).
  2. Osano retrieves and processes the sitemap, extracting all URLs listed.
  3. The system updates the scan roster to match the sitemap contents.
  4. The next scan processes only the URLs listed in the primary sitemap, up to the 500-page limit.
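
The steps above can be sketched roughly as follows: extract the URLs from the primary sitemap, apply the page limit, and work out how the scan roster would change. This is a minimal sketch under assumptions, not Osano's implementation. In particular, keeping the first URLs in document order when the limit is exceeded, and removing roster entries that are no longer in the sitemap, are assumptions; the function names and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
PAGE_LIMIT = 500  # per-sitemap limit stated for the Premier package


def urls_from_sitemap(xml_text, limit=PAGE_LIMIT):
    """Return (kept, skipped) page URLs from a primary sitemap.

    Only <url><loc> entries directly under the root are read, so a nested
    sitemap index yields no URLs. Assumption: URLs beyond the limit are
    dropped in document order.
    """
    root = ET.fromstring(xml_text.strip())
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls[:limit], urls[limit:]


def sync_roster(current_roster, sitemap_urls):
    """Compute the changes needed to make the roster match the sitemap."""
    current, wanted = set(current_roster), set(sitemap_urls)
    return {"add": sorted(wanted - current), "remove": sorted(current - wanted)}


# Hypothetical sitemap and roster; a limit of 2 keeps the demo small.
SAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
  <url><loc>https://example.com/blog</loc></url>
</urlset>
"""
kept, skipped = urls_from_sitemap(SAMPLE, limit=2)
changes = sync_roster(["https://example.com/", "https://example.com/old"], kept)
print("Scanned:", kept)
print("Not scanned (over limit):", skipped)
print("Roster changes:", changes)
```

Running a check like this against your own sitemap shows which URLs would fall outside the limit, so the most important pages can be listed first or split across separate sitemaps.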

 

Summary of Key Differences

| Feature        | URL Scanning                         | Sitemap Scanning                                            |
|----------------|--------------------------------------|-------------------------------------------------------------|
| Purpose        | Scans a single user-defined URL      | Scans multiple URLs from a provided sitemap                 |
| Scan Frequency | Monthly (subject to change)          | Monthly (subject to change)                                 |
| Detection      | Cookies, scripts, iframes            | Cookies, scripts, iframes (per sitemap URLs)                |
| Limitations    | N/A                                  | 500-page limit (Premier package), no nested sitemap support |
| Use Case       | Checking trackers on a specific page | Automating tracking audits across multiple web pages        |