.. _crawling:

The Crawling Pipeline
================================================

Content Chimera can bring in assets in two ways: crawling and importing (see Data Sources).
Whereas most crawlers are optimized for either SEO or generating lists of URLs, Content
Chimera's crawler is optimized for gathering the information needed to make content decisions.
It is also built to handle large crawls and to give visibility into crawl status.

.. contents:: On this page
   :local:
   :depth: 2


Starting a Crawl
-----------------

.. admonition:: Web UI
   :class: tip

   Go to the Assets & Metadata page and click on the **Start Crawling** button. In the
   majority of cases the defaults are sufficient (see Crawl Options below).

   The web UI runs a focused three-step pipeline: **Crawl → Enhance → Optimize**. This
   fetches pages, extracts metadata, and prepares data for charting. Additional steps like
   loading the graph database, semantic database, or running LLM summarization are started
   separately from the web UI.

.. image:: ../images/start-crawling.png
   :scale: 50 %

.. admonition:: MCP
   :class: tip

   MCP runs **extended pipelines** that go well beyond the basic crawl. A single command
   can crawl a site *and then* automatically continue through additional processing:
   loading the semantic database, summarizing the crawl, loading the graph database, and
   running graph analysis.

   Three pipeline variants are available:

   - **crawl-extended** --- The standard pipeline. Crawls, enhances, optimizes, then loads
     the semantic and graph databases and summarizes the crawl.
   - **sprawl** --- Everything in crawl-extended, plus automatically generates a
     comprehensive Sprawl Report with PDF.
   - **ai-readiness** --- Crawls, then runs AI readiness metrics and E-E-A-T scoring, and
     generates an AI Readiness Report with PDF.

   Example prompts:

      *"Crawl example.com"*

      *"Run the sprawl pipeline on this site"*

   .. raw:: html

      <details>
      <summary>Tool reference</summary>
      <p>Tool: <code class="docutils literal notranslate">run-long-pipeline</code></p>
      </details>

.. note:: You don't need to know more about crawling than how to click
   **Start Crawling** (or how to ask your AI assistant to start a crawl). The information
   below is most useful for larger crawls or if you run into issues.


The Three Pipeline Steps
-------------------------

When you start a crawl from the web UI, Content Chimera runs a three-step pipeline. You
can monitor progress through three progress circles on the crawl page:

.. image:: ../images/crawl-circles.png

Each circle represents a step:

- **Crawl** — Fetching pages from the site
- **Enhance** — Extracting useful metadata from the fetched pages
- **Optimize** — Preparing the data for charting and analysis

For a very large site, the crawl itself could take hours or days (depending on
configuration), with the other steps each taking over an hour. Many of these steps save
you from spending time in spreadsheets or other tools.

Crawl
^^^^^^

This is the most visible step: Content Chimera follows links across the site to build a
list of URLs. Along the way, it stores a cache of each page so that subsequent processing
can work from the cached content rather than hitting the client's web server again.
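
The link-following and caching behavior described above can be sketched as a breadth-first traversal. This is a minimal illustration, not Content Chimera's actual crawler: the in-memory ``SITE`` dictionary, ``fetch``, and ``crawl`` are hypothetical stand-ins so the example runs without network access.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page body, outgoing links).
SITE = {
    "https://example.com/": ("home", ["https://example.com/a", "https://example.com/b"]),
    "https://example.com/a": ("page a", ["https://example.com/b", "https://example.com/"]),
    "https://example.com/b": ("page b", ["https://example.com/missing"]),
}


def fetch(url):
    """Stand-in for an HTTP request; raises KeyError for unknown URLs."""
    return SITE[url]


def crawl(start_url):
    """Follow links breadth-first, caching each page body once."""
    cache = {}    # url -> body, so later steps never refetch the page
    errors = []   # URLs that could not be fetched
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        try:
            body, links = fetch(url)
        except KeyError:
            errors.append(url)
            continue
        cache[url] = body
        for link in links:
            if link not in seen:  # "found but not yet crawled"
                seen.add(link)
                queue.append(link)
    return cache, errors


cache, errors = crawl("https://example.com/")
```

Later steps in this sketch would read from ``cache`` instead of calling ``fetch`` again, which is the point of storing each page during the crawl.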

.. _enhance:

Enhance
^^^^^^^^

The Enhance step pulls out information that is useful for making content decisions — for
example, the "folders" in each URL, content type indicators, and other structural
metadata. This step is separate from the crawl because it also runs on data imported from
other sources (so even if you import URLs from another crawler, you still get folder
information and other enhancements).
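
As a rough illustration of what "folders" in a URL could mean, the sketch below splits a URL path into its directory segments. The function name ``url_folders`` and the rule that the last segment is a page unless the path ends in a slash are assumptions for this example; Content Chimera's actual rules may differ.

```python
from urllib.parse import urlparse


def url_folders(url):
    """Return the folder segments of a URL path (assumed definition)."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    # Assumption: the final segment is a page, not a folder,
    # unless the path explicitly ends with a slash.
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    return segments


folders = url_folders("https://example.com/blog/2023/my-post.html")
```

Because the function only needs a URL string, it works the same way on URLs found by a crawl and on URLs imported from another source.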

.. _optimize:

Optimize
^^^^^^^^^

The Optimize step prepares the data for fast charting and analysis. Content Chimera stores
data across multiple databases to support flexible processing, but to make charting and
querying fast, it consolidates everything into a simplified form. This is virtually always
the last step of processing.
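
One way to picture this consolidation is merging per-URL records from separate stores into a single flat row per URL. The store names and fields below are hypothetical and only illustrate the idea; they do not reflect Content Chimera's actual databases or schema.

```python
# Hypothetical per-URL records held in separate stores.
crawl_store = {"https://example.com/a": {"status": 200}}
enhance_store = {"https://example.com/a": {"folders": ["blog"]}}


def consolidate(*stores):
    """Merge several url -> fields mappings into one flat row per URL."""
    flat = {}
    for store in stores:
        for url, fields in store.items():
            # Create the row on first sight, then add this store's fields.
            flat.setdefault(url, {"url": url}).update(fields)
    return list(flat.values())


rows = consolidate(crawl_store, enhance_store)
```

A chart or query can then scan ``rows`` directly instead of looking up each URL in several places.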


Crawl Statuses
---------------

When you first start a crawl, it moves through the following statuses (this may happen so
quickly that you do not see every one):

1. **Unknown**
2. **Queued**
3. **Running**

There are a couple of other normal statuses that occur after clicking Cancel:

4. **Requested Termination**
5. **Successfully Terminated**

If the crawl is running successfully, you will see activity in at least the first
progress circle, "Status: Running" displayed on the page, and an animated indicator next
to the Cancel button.
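
The status sequence in this section can be modeled as a small state machine. The status names come from the lists above; the transition table is an assumption made for illustration (for example, it assumes Cancel is only requested while a crawl is running), not a description of Content Chimera's internals.

```python
from enum import Enum


class CrawlStatus(Enum):
    UNKNOWN = "Unknown"
    QUEUED = "Queued"
    RUNNING = "Running"
    REQUESTED_TERMINATION = "Requested Termination"
    SUCCESSFULLY_TERMINATED = "Successfully Terminated"


# Assumed transitions, following the order the statuses are listed in.
ALLOWED = {
    CrawlStatus.UNKNOWN: {CrawlStatus.QUEUED},
    CrawlStatus.QUEUED: {CrawlStatus.RUNNING},
    CrawlStatus.RUNNING: {CrawlStatus.REQUESTED_TERMINATION},
    CrawlStatus.REQUESTED_TERMINATION: {CrawlStatus.SUCCESSFULLY_TERMINATED},
    CrawlStatus.SUCCESSFULLY_TERMINATED: set(),
}


def can_transition(current, new):
    """Return True if moving from `current` to `new` is allowed in this sketch."""
    return new in ALLOWED[current]
```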


Watching Crawls Live
---------------------

Content Chimera provides rich information while a crawl is in progress:

- **Progress circles** — Show how far along each of the three pipeline steps is.
- **Sample screenshots** — Taken during the crawl. You can click on these to make
  annotations even while crawling is underway.
- **Crawl chart** showing:

  - *Redirects* — Shown below the zero line (not considered problems, just tracked).
  - *Errors* — Pages that could not be fetched. Click the Error Report link for a
    breakdown.
  - *Crawled* — URLs already fetched by Content Chimera.
  - *Found but not yet crawled* — URLs discovered on pages but not yet fetched.

- **Live URL list** — A dynamic list of URLs currently being crawled.

.. image:: ../images/crawl-watching-animated.gif


Encountered Domains
--------------------

Once a crawl is complete, a new "Encountered Domains Report" link appears above the crawl
progress chart (next to the Error Report link). This report shows all the domains that
were encountered during the crawl — across *all* pages.

The report shows:

- **Root domains** (like "davidhobbsconsulting.com") with the count of encountered links
  per domain. The most common root domain is shown at the top of the list.
- **Subdomains** — Click on a root domain to expand and see all subdomains that were
  encountered. For each subdomain, an example link is shown (the page containing the link
  and the URL it was pointing to).

.. image:: ../images/encountered-domains.gif
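
The grouping behind a report like this can be sketched as follows. This is an illustrative approximation only: ``root_domain`` naively takes the last two labels of the hostname (which is wrong for suffixes such as ``co.uk``; a real implementation would consult the Public Suffix List), and the function names and sample links are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def root_domain(host):
    """Naive assumption: the root domain is the last two hostname labels."""
    return ".".join(host.split(".")[-2:])


def encountered_domains(links):
    """links: iterable of (source_page_url, target_url) pairs."""
    counts = defaultdict(int)     # root domain -> number of links seen
    examples = defaultdict(dict)  # root domain -> {subdomain: (source, target)}
    for source, target in links:
        host = urlparse(target).hostname
        if not host:
            continue  # skip links without a hostname (e.g. mailto:)
        root = root_domain(host)
        counts[root] += 1
        # Keep the first link seen for each subdomain as its example.
        examples[root].setdefault(host, (source, target))
    # Most common root domain first, as in the report.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return ranked, examples


links = [
    ("https://example.com/", "https://www.example.org/x"),
    ("https://example.com/a", "https://docs.example.org/y"),
    ("https://example.com/a", "https://example.net/z"),
]
ranked, examples = encountered_domains(links)
```

Here ``ranked`` lists root domains by link count, and ``examples`` holds one example link per subdomain, mirroring the two levels of the report.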