What the scanner does
The DQM scanner is a configurable, general-purpose web crawler designed to fetch and analyze as much content as possible from a target website. It operates by:
- Starting from one or more initial URLs
- Parsing each fetched page for navigable links (<a href="...">)
- Recursively visiting each unique URL within the defined domain scope
- Capturing content once per unique URL, so dynamic updates on the same URL are not rescanned
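The steps above can be sketched as a breadth-first loop. This is a minimal illustration, not the DQM implementation: the `crawl` function, the `get_links` callback, the exact-hostname scope check, and the in-memory `SITE` dictionary (used in place of real HTTP requests) are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urlsplit


def crawl(start_urls: Iterable[str],
          get_links: Callable[[str], Iterable[str]],
          allowed_host: str) -> list[str]:
    """Visit each in-scope URL exactly once, breadth-first."""
    queue = deque(start_urls)
    visited: set[str] = set()
    order: list[str] = []
    while queue:
        url = queue.popleft()
        # Skip URLs already seen or outside the (assumed) single-host scope.
        if url in visited or urlsplit(url).hostname != allowed_host:
            continue
        visited.add(url)   # content is captured once per unique URL
        order.append(url)
        queue.extend(get_links(url))  # links found in <a href="..."> on that page
    return order


# Hypothetical in-memory site standing in for real pages.
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/", "https://other.example.org/x"],
    "https://example.com/b": ["https://example.com/a"],
}

pages = crawl(["https://example.com/"], lambda u: SITE.get(u, []), "example.com")
print(pages)
```

Each page appears once in the result even though several pages link back to it, and the link to `other.example.org` is dropped because it falls outside the scope.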
Limitations:
The scanner has limited interactive capabilities at the start of the crawl to bypass simple access barriers such as:
- Login forms
- Cookie consent banners
- Age verification gates
These interactions occur only during the initial phase to enable crawl access, not during the full site scan.
Prerequisites
To enable a successful and complete scan, the following conditions must be in place:
Website accessibility
- The scanner must be able to access the website without being blocked by:
  - CAPTCHA challenges
  - Anti-bot protections (e.g. WAFs with "shields-up" mode)
  - Any human-verification or interactive barriers
- The scanner:
  - Identifies itself as MagusBot 1.0
  - Operates from a known range of IP addresses. See the article "Do I need to whitelist DQM IP addresses?" for more information.
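For site owners who apply allow rules in application code, the check might look like the sketch below. Everything here is illustrative: `192.0.2.0/24` is a reserved documentation range, not a real DQM range, and matching the substring `MagusBot` in the User-Agent header is an assumption about the header format. Take the actual values from the IP address article.

```python
import ipaddress

# Placeholder values: 192.0.2.0/24 is a documentation range, NOT a real DQM range.
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
SCANNER_UA_TOKEN = "MagusBot"  # assumed substring of the User-Agent header


def is_dqm_scanner(remote_ip: str, user_agent: str) -> bool:
    """Return True when both the source IP and the user agent match the allowlist."""
    ip = ipaddress.ip_address(remote_ip)
    in_range = any(ip in net for net in ALLOWED_NETWORKS)
    return in_range and SCANNER_UA_TOKEN in user_agent


print(is_dqm_scanner("192.0.2.10", "MagusBot 1.0"))    # True
print(is_dqm_scanner("198.51.100.5", "MagusBot 1.0"))  # False
```

Requiring both conditions is a design choice for the example: a user agent string alone is easy to spoof, so pairing it with the source IP is the stricter option.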
Authentication constraints
The scanner can input credentials into login forms, but it cannot complete login flows that require:
- Multi-Factor Authentication (MFA)
- Email-based login confirmations (e.g. "Click the link we sent you")
If your login flow includes these, your team must work with us to configure alternative access.
Site structure consistency
- The site should support anchor-based navigation (<a href="...">) for link discovery
- Avoid relying on:
  - <button> elements for navigation
  - JavaScript-only routing (common in SPAs)
- Any changes to login or navigation processes can disrupt scanning and must be communicated in advance
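To see why anchor-based navigation matters, the sketch below extracts links the way an anchor-only crawler would. It uses Python's standard `html.parser`. The `AnchorCollector` class and the sample `PAGE` markup are hypothetical and only approximate how the scanner reads a page.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags only."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Only <a href="..."> counts; buttons and script handlers are ignored.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical navigation markup mixing anchors and a JavaScript button.
PAGE = """
<nav>
  <a href="/products">Products</a>
  <button onclick="location.href='/pricing'">Pricing</button>
  <a href="/contact">Contact</a>
</nav>
"""

collector = AnchorCollector()
collector.feed(PAGE)
print(collector.links)  # ['/products', '/contact']
```

The `/pricing` page is reachable for a human visitor but never appears in the collected links, which is how button-only navigation leads to gaps in a scan.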
How the scanner works
Crawl initiation
We support the use of a sitemap to initiate the crawl. Customers can provide a list of URLs up front; the crawler loads each page on the list and scans it for HTML anchor tags (<a href="...">).
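As an illustration of turning a sitemap into a seed list, the sketch below reads the `<loc>` entries from a standard sitemaps.org XML file. The `seed_urls_from_sitemap` helper and the sample document are assumptions for the example; the format DQM accepts for URL lists may differ.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def seed_urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the <loc> values of a sitemap as a list of seed URLs."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]


# Hypothetical two-page sitemap.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

seeds = seed_urls_from_sitemap(SAMPLE)
print(seeds)  # ['https://example.com/', 'https://example.com/about']
```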
Link discovery
- Each discovered link is evaluated to determine:
  - If it has already been visited
  - If it is within the defined domain/scope
- Valid links are added to the crawl queue
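The evaluation above could be expressed as a single function, sketched here with `urllib.parse`. The `evaluate_link` name, the exact-hostname scope rule, and the decision to strip `#fragment` parts before comparing URLs are assumptions; the article does not state how the scanner normalizes URLs.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def evaluate_link(page_url, href, visited, allowed_host):
    """Return the absolute URL to enqueue, or None if the link is rejected."""
    # Resolve relative links against the page, then drop any #fragment (assumed).
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    parts = urlsplit(absolute)
    if parts.scheme not in ("http", "https"):
        return None          # e.g. mailto: or javascript: links
    if parts.hostname != allowed_host:
        return None          # outside the defined domain/scope
    if absolute in visited:
        return None          # already visited
    return absolute


visited = {"https://example.com/about"}
base = "https://example.com/"
print(evaluate_link(base, "/contact#form", visited, "example.com"))
print(evaluate_link(base, "about", visited, "example.com"))
print(evaluate_link(base, "https://other.example.org/", visited, "example.com"))
```

In this example only the first link is accepted, as `https://example.com/contact`. The second has already been visited and the third is out of scope, so both return `None`.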
Content capture
- The scanner fetches and processes one piece of content per unique URL
- It does not handle dynamic content changes on the same URL
Initial phase interactions
- At the beginning of the crawl, the scanner can:
  - Input text into login fields
  - Click through cookie banners
  - Respond to age gates
- These are intended to unlock protected content, not to interact with the site during deep scanning
Limitations
The crawler does not:
- Trigger UI actions with unknown outcomes
- Navigate through JavaScript-only paths
- Handle multi-step flows requiring checkboxes or manual confirmation
Example: It cannot proceed through checkout flows that require accepting terms via a checkbox.
Best practices
To maximize crawl efficiency and data completeness:
- Allow/approve traffic from our IP range and avoid blocking MagusBot 1.0
- Disable or relax bot detection measures (e.g. CAPTCHA, rate-limiting, behavioral detection) for the scanner, or let it bypass them by approving our IP range or user agent
- Use anchor tags (<a href="...">) for internal navigation wherever possible
- Provide a sitemap if internal links are not easily discoverable
- Avoid dynamic, non-anchor-based navigational methods (e.g., JavaScript buttons)
For any questions related to the DQM scanner, please contact Crownpeak Support.