.. _crawling:

The Crawling Pipeline
================================================

Content Chimera can import assets in two ways: crawling and importing (see Data Sources). Whereas most crawlers are optimized for either SEO or generating lists of URLs, Content Chimera's crawler is optimized for gathering content useful for content decisions. Furthermore, it is built for large crawls and to give visibility into the crawl status.

.. contents:: On this page
   :local:
   :depth: 2

Starting a Crawl
-----------------

.. admonition:: Web UI
   :class: tip

   Go to the Assets & Metadata page and click on the **Start Crawling** button. In the majority of cases the defaults are sufficient (see Crawl Options below).

   The web UI runs a focused three-step pipeline: **Crawl → Enhance → Optimize**. This fetches pages, extracts metadata, and prepares data for charting. Additional steps like loading the graph database, semantic database, or running LLM summarization are started separately from the web UI.

   .. image:: ../images/start-crawling.png
      :scale: 50 %

.. admonition:: MCP
   :class: tip

   MCP runs **extended pipelines** that go well beyond the basic crawl. A single command can crawl a site *and then* automatically continue through additional processing: loading the semantic database, summarizing the crawl, loading the graph database, and running graph analysis.

   Three pipeline variants are available:

   - **crawl-extended**: The standard pipeline. Crawls, enhances, optimizes, then loads the semantic and graph databases and summarizes the crawl.
   - **sprawl**: Everything in crawl-extended, plus automatically generates a comprehensive Sprawl Report with PDF.
   - **ai-readiness**: Crawls, then runs AI readiness metrics and E-E-A-T scoring, and generates an AI Readiness Report with PDF.

   Example prompts:

   *"Crawl example.com"*

   *"Run the sprawl pipeline on this site"*

   **Tool reference**

   Tool: ``run-long-pipeline``

.. note::

   You don't necessarily have to know more about crawling than how to click Start Crawling (or ask your AI assistant to start one). The information below is especially useful for larger crawls or if you run into any issues.

The Three Pipeline Steps
-------------------------

When you start a crawl from the web UI, Content Chimera runs a three-step pipeline. You can monitor progress through three progress circles on the crawl page:

.. image:: ../images/crawl-circles.png

Each circle represents a step:

- **Crawl**: Fetching pages from the site
- **Enhance**: Extracting useful metadata from the fetched pages
- **Optimize**: Preparing the data for charting and analysis

For a very large site, the crawl itself could take hours or days (depending on configuration), with the other steps each taking over an hour. Many of these steps save you from spending time in spreadsheets or other tools.

Crawl
^^^^^^

This is the most visible step: Content Chimera follows links across the site to build a list of URLs. Along the way, it stores a cache of each page so that subsequent processing can work from the cached content rather than hitting the client's web server again.

.. _enhance:

Enhance
^^^^^^^^

The Enhance step pulls out information that is useful for making content decisions, for example the "folders" in each URL, content type indicators, and other structural metadata. This step is separate from the crawl because it also runs on data imported from other sources (so even if you import URLs from another crawler, you still get folder information and other enhancements).

.. _Optimize:

Optimize
^^^^^^^^^

The Optimize step prepares the data for fast charting and analysis. Content Chimera stores data across multiple databases to support flexible processing, but to make charting and querying fast, it consolidates everything into a simplified form. This is virtually always the last step of processing.

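To make the idea of URL "folders" from the Enhance step concrete, here is a minimal sketch of how folder segments could be derived from a URL. The function name ``url_folders`` and the rule that a trailing segment without a slash counts as a page rather than a folder are assumptions for illustration; Content Chimera's actual folder logic is not documented on this page.

```python
from urllib.parse import urlsplit


def url_folders(url: str) -> list[str]:
    """Return the folder-like path segments of a URL, in order.

    Hypothetical helper: this is an illustration, not Content Chimera's code.
    """
    path = urlsplit(url).path
    # Drop empty pieces produced by leading, trailing, or doubled slashes.
    segments = [segment for segment in path.split("/") if segment]
    # Assumption: the last segment is a page (not a folder) unless the
    # path ends with a slash.
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    return segments


if __name__ == "__main__":
    print(url_folders("https://example.com/blog/2023/post.html"))
```

Because a function like this needs only the URL string, it works equally well on crawled pages and on URLs imported from another source, which matches the reason the Enhance step is kept separate from the crawl.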
Crawl Statuses
---------------

When you first start the crawl, it will go through the following statuses (this may happen so fast that you do not see every status):

1. **Unknown**
2. **Queued**
3. **Running**

There are a couple of other normal statuses that occur after clicking Cancel:

4. **Requested Termination**
5. **Successfully Terminated**

If the crawl is running successfully, you will see activity in at least the first progress circle, "Status: Running" displayed on the page, and an animated indicator next to the Cancel button.

Watching Crawls Live
---------------------

Content Chimera provides rich information while a crawl is in progress:

- **Progress circles**: Show how far along each of the three pipeline steps is.
- **Sample screenshots**: Taken during the crawl. You can click on these to make annotations even while crawling is underway.
- **Crawl chart** showing:

  - *Redirects*: Shown below the zero line (not considered problems, just tracked).
  - *Errors*: Pages that could not be fetched. Click the Error Report link for a breakdown.
  - *Crawled*: URLs already fetched by Content Chimera.
  - *Found but not yet crawled*: URLs discovered on pages but not yet fetched.

- **Live URL list**: A dynamic list of URLs currently being crawled.

.. image:: ../images/crawl-watching-animated.gif

Encountered Domains
--------------------

Once a crawl is complete, a new "Encountered Domains Report" link appears above the crawl progress chart (next to the Error Report link). This report shows all the domains that were encountered during the crawl, across *all* pages.

The report shows:

- **Root domains** (like "davidhobbsconsulting.com") with the count of encountered links per domain. The most common root domain is shown at the top of the list.
- **Subdomains**: Click on a root domain to expand and see all subdomains that were encountered. For each subdomain, an example link is shown (the page containing the link and the URL it was pointing to).

.. image:: ../images/encountered-domains.gif
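
The grouping that the Encountered Domains Report describes can be sketched as follows. This is an illustration only, not Content Chimera's implementation: the names ``root_domain`` and ``summarize_domains`` are hypothetical, and the root domain is approximated as the last two host labels, which is wrong for suffixes such as ``co.uk`` (a real implementation would more likely consult the Public Suffix List).

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def root_domain(host: str) -> str:
    """Approximate the root domain as the last two labels of the host.

    Assumption for illustration; this mishandles suffixes like co.uk.
    """
    return ".".join(host.lower().split(".")[-2:])


def summarize_domains(links):
    """Group (source_page, target_url) pairs by root domain and subdomain.

    Returns a list of (root_domain, link_count) sorted most common first,
    and a dict mapping each root domain to its subdomains, each with one
    example (source_page, target_url) pair.
    """
    counts = Counter()
    examples = defaultdict(dict)
    for source_page, target_url in links:
        host = urlsplit(target_url).hostname
        if not host:
            continue  # skip relative links and malformed URLs
        root = root_domain(host)
        counts[root] += 1
        # Keep only the first example link seen for each subdomain.
        examples[root].setdefault(host, (source_page, target_url))
    return counts.most_common(), dict(examples)


if __name__ == "__main__":
    sample_links = [
        ("https://a.com/", "https://www.b.com/x"),
        ("https://a.com/p", "https://blog.b.com/y"),
        ("https://a.com/", "https://c.org/"),
    ]
    ranked, by_root = summarize_domains(sample_links)
    print(ranked)
    print(by_root)
```

Sorting by link count puts the most common root domain first, and keeping one example per subdomain mirrors the expandable view in which each subdomain shows the page containing the link and the URL it pointed to.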