Controlling SharePoint Crawl Performance: Impact Rules and Crawl Rules

Out of the box, SharePoint’s crawler runs at full throughput. It opens as many simultaneous connections to your Web Front End servers as the WFE will accept, downloads every file up to the default size limit, and writes continuously to the crawl database. On a small farm with a modest content corpus, this is invisible. On a large farm with 10 million items, 1GB+ documents, and 2,000 concurrent users, unconfigured crawl behaviour can become a farm-wide performance concern.

Two scripts address this directly: Set-SPSearchCrawlImpact.ps1 sets the rate at which the crawler hits your WFEs and the size threshold above which files are indexed for metadata only. Set-SPSearchCrawlRules.ps1 controls what the crawler visits at all. Together they give you precise control over the crawl pipeline without requiring a maintenance window.

What Crawl Impact Actually Means

A common misunderstanding is that the SharePoint crawler reads content databases directly. It does not. The crawler requests content from the WFEs over HTTP, just as an end user's browser does. Every crawl request is a real web request to a WFE: it consumes WFE CPU, memory, and IIS worker threads.

On a large farm:

No impact rules = the crawler opens as many connections as the WFE allows. During a full crawl, WFE CPU can spike to 80–100%, degrading page load time for real users.
Impact rules = the crawler is told the maximum number of simultaneous requests it can make per host. Setting this to 16 means no more than 16 concurrent HTTP requests to any given WFE at one time.

Crawl impact rules are applied per content source host (the WFE hostname or IP address). You create one rule per WFE in your farm.
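
As a rough illustration of "one rule per WFE", the sketch below builds one rule definition per host and applies it only when the SharePoint search cmdlets are loaded. It assumes the rules are created with the built-in New-SPEnterpriseSearchSiteHitRule cmdlet; the internals of Set-SPSearchCrawlImpact.ps1 are not shown in this post, so the helper name New-CrawlImpactPlan and the hostnames are hypothetical.

```powershell
# Hypothetical helper: builds one impact-rule definition per WFE host.
function New-CrawlImpactPlan {
    param(
        [Parameter(Mandatory)][string[]]$WFEList,
        # The upper bound of 64 is an assumption made for this sketch.
        [ValidateRange(1, 64)][int]$ThreadsPerHost = 16
    )
    foreach ($wfe in $WFEList) {
        # Keys mirror the parameter names of New-SPEnterpriseSearchSiteHitRule
        @{
            Name     = $wfe
            Behavior = 'SimultaneousRequests'
            HitRate  = $ThreadsPerHost
        }
    }
}

# Placeholder hostnames; replace with the WFEs the crawler actually targets.
$plan = @(New-CrawlImpactPlan -WFEList 'wfe01.contoso.local', 'wfe02.contoso.local' -ThreadsPerHost 16)

# Apply only on a SharePoint server where the search cmdlets are available.
if (Get-Command New-SPEnterpriseSearchSiteHitRule -ErrorAction SilentlyContinue) {
    foreach ($rule in $plan) {
        New-SPEnterpriseSearchSiteHitRule @rule
    }
}
```

Keeping the plan separate from the apply step makes it easy to review the intended rules before anything changes on the farm.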

Using Set-SPSearchCrawlImpact.ps1

What the Script Configures

The script configures three things:

Crawler Impact Rules — per-WFE simultaneous request limits.
MaxDownloadSize — the file size threshold above which full content extraction is skipped. Files above the threshold are still indexed, but only for metadata (title, author, file extension, size).
Connection and acknowledgement timeouts — how long the crawler waits before abandoning a slow WFE request.

The Large-File Threshold: Why It Matters

Consider a SharePoint document library that contains 1GB video files. Without a download size limit, the crawler downloads each file completely and passes it to Content Processing, which attempts to extract text via an IFilter. For a 1GB video:

The download alone takes minutes on a loaded network.
Content Processing holds the slot for the duration.
Other documents queue behind the video, stalling the entire content processing pipeline.
The video’s searchable text content is typically zero — video files contain no extractable text.

Setting MaxDownloadSizeMB = 128 means the crawler does not attempt full content extraction for any file larger than 128MB. For a 1GB video, only the file metadata is collected, and the crawler moves on to the next item.

Note: Files above the MaxDownloadSize threshold are still indexed and appear in search results. Users can still find them by filename, title, author, site, and other managed properties. They are simply not indexed for full-text content.
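
The threshold logic described above can be sketched as a small function. This is a minimal illustration, not the script's actual code: Get-CrawlTreatment is a hypothetical helper, and the guarded section assumes the threshold is stored as the "MaxDownloadSize" property (in MB) on the Search Service Application object.

```powershell
# Hypothetical helper: predicts how a file is treated for a given threshold,
# following the behaviour described in this post.
function Get-CrawlTreatment {
    param(
        [Parameter(Mandatory)][double]$FileSizeMB,
        [Parameter(Mandatory)][int]$MaxDownloadSizeMB
    )
    if ($FileSizeMB -gt $MaxDownloadSizeMB) { 'MetadataOnly' } else { 'FullText' }
}

Get-CrawlTreatment -FileSizeMB 1024 -MaxDownloadSizeMB 128   # MetadataOnly
Get-CrawlTreatment -FileSizeMB 12.5 -MaxDownloadSizeMB 128   # FullText

# Assumption: the limit is the SSA property "MaxDownloadSize", expressed in MB.
if (Get-Command Get-SPEnterpriseSearchServiceApplication -ErrorAction SilentlyContinue) {
    $ssa = Get-SPEnterpriseSearchServiceApplication -Identity 'Search Service Application'
    $ssa.SetProperty('MaxDownloadSize', 128)
    $ssa.Update()
}
```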

Running the Script

			
.\Set-SPSearchCrawlImpact.ps1 `
    -Target Team `
    -SSAName "Search Service Application" `
    -WFEList "<WFE-SERVER-1>", "<WFE-SERVER-2>" `
    -MaxDownloadSizeMB 128 `
    -ThreadsPerHost 16 `
    -ConnectionTimeout 120 `
    -DataTimeout 120

		

Parameter	Description	Recommended Value
-WFEList	DNS names or hostnames of WFE servers targeted by the crawler	List all WFEs in the farm
-MaxDownloadSizeMB	Maximum file content extraction size in MB	50–128MB depending on farm size
-ThreadsPerHost	Max simultaneous crawler requests per WFE host	8–24 depending on WFE capacity
-ConnectionTimeout	Seconds before the crawler abandons a connection attempt	120
-DataTimeout	Seconds before the crawler abandons a slow download	120

Recommended Settings by Farm Size

Farm Size	MaxDownloadSizeMB	ThreadsPerHost
Small (2 WFEs)	50	8
Medium (4 WFEs)	50–128	12–16
Large (80M+ items, 4+ WFEs)	128	16–24

Separate Impact Rules for Collaboration vs. My Sites

If you are running two SSAs (Collaboration and My Sites), run the script separately for each:

			
# Collaboration SSA — higher thread count for larger content volume
.\Set-SPSearchCrawlImpact.ps1 `
    -Target Collaboration `
    -SSAName "Enterprise Search Service" `
    -WFEList "<WFE-SERVER-1>", "<WFE-SERVER-2>" `
    -ThreadsPerHost 16
# My Sites SSA — lower thread count; My Sites content is lighter
.\Set-SPSearchCrawlImpact.ps1 `
    -Target "My Sites" `
    -SSAName "Personal Search Service" `
    -WFEList "<WFE-SERVER-1>", "<WFE-SERVER-2>" `
    -ThreadsPerHost 8

		

Understanding the Crawl Schedule Interaction

Crawl impact rules control concurrency. The crawl schedule controls when crawls run. The two work together, and misconfiguring the schedule negates the benefit of impact rules.

Full Crawl vs. Incremental Crawl

A full crawl visits every item in every content source from scratch, regardless of whether it has changed. It generates the highest crawl load. Full crawls should run during off-hours windows — typically weekends or late-night windows — when WFE capacity is available.

An incremental crawl visits only items that have changed since the last crawl (based on change log entries). It is much lighter and can run during business hours if crawl impact rules are configured correctly.

Recommended Schedule

Crawl Type	Timing	Impact Rules
Full crawl	Weekend overnight	All threads available
Incremental crawl	Every 15–60 minutes during business hours	50% of max thread count

To configure the incremental crawl schedule via PowerShell:

			
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
$source = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Team Sites"
# Incremental every 30 minutes for 720 minutes (12 hours), starting at 06:00 every day
Set-SPEnterpriseSearchCrawlContentSource `
    -Identity $source `
    -ScheduleType Incremental `
    -DailyCrawlSchedule `
    -CrawlScheduleStartDateTime "06:00" `
    -CrawlScheduleRepeatInterval 30 `
    -CrawlScheduleRepeatDuration 720
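
To see what the interval and duration values above imply, the following sketch lists the start times they would produce. It assumes both values are in minutes and that the repeat window is half-open (no run starts exactly at the end of the window); Get-CrawlRunTimes is a hypothetical helper and the date is arbitrary.

```powershell
# Hypothetical helper: lists the start times implied by a repeat interval and duration.
function Get-CrawlRunTimes {
    param(
        [Parameter(Mandatory)][datetime]$Start,
        [Parameter(Mandatory)][ValidateRange(1, 1440)][int]$RepeatIntervalMinutes,
        [Parameter(Mandatory)][ValidateRange(1, 1440)][int]$RepeatDurationMinutes
    )
    $end = $Start.AddMinutes($RepeatDurationMinutes)
    $t = $Start
    while ($t -lt $end) {
        $t                                        # emit this start time
        $t = $t.AddMinutes($RepeatIntervalMinutes)
    }
}

$runs = @(Get-CrawlRunTimes -Start ([datetime]'2024-01-01 06:00') `
                            -RepeatIntervalMinutes 30 `
                            -RepeatDurationMinutes 720)

$runs.Count                   # 24
$runs[0].ToString('HHmm')     # 0600
$runs[-1].ToString('HHmm')    # 1730
```

With a 30-minute interval over 720 minutes, the schedule yields 24 incremental crawls between 06:00 and 17:30, which is worth checking against how long a typical incremental crawl takes on the farm.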

		

Using Set-SPSearchCrawlRules.ps1

What Crawl Rules Are

Crawl rules are URL pattern–based rules that tell the crawler how to handle a specific path before it downloads the content. The crawler evaluates each URL it encounters against the list of rules (in priority order) before deciding whether to fetch the content.

Rule types:

Type	Behaviour
Exclude	Skip this URL entirely — do not download, do not index
Include	Force-include this URL even if a parent URL is excluded
Custom	Apply specific crawl behaviour
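
The "first matching rule wins" evaluation can be approximated locally with PowerShell's -like operator. This is only a sketch for reasoning about rule order: it is not SharePoint's matching engine, Resolve-CrawlRule is a hypothetical helper, and the URLs and the default of "Include" when nothing matches are assumptions.

```powershell
# Hypothetical helper: returns the type of the first rule whose pattern matches.
function Resolve-CrawlRule {
    param(
        [Parameter(Mandatory)][string]$Url,
        [Parameter(Mandatory)][object[]]$Rules   # ordered, highest priority first
    )
    foreach ($rule in $Rules) {
        # Escape '[' and '?' so that only '*' acts as a wildcard in -like
        $pattern = $rule.Path.Replace('[', '`[').Replace('?', '`?')
        if ($Url -like $pattern) { return $rule.Type }
    }
    'Include'   # assumption: a URL that matches no rule is crawled
}

$rules = @(
    [pscustomobject]@{ Path = '*/Archive/Legal/*'; Type = 'Include' }
    [pscustomobject]@{ Path = '*/Archive/*';       Type = 'Exclude' }
    [pscustomobject]@{ Path = '*?*SortField=*';    Type = 'Exclude' }
)

Resolve-CrawlRule -Url 'https://sp/sites/hr/Archive/Legal/policy.docx' -Rules $rules       # Include
Resolve-CrawlRule -Url 'https://sp/sites/hr/Archive/old.docx' -Rules $rules                # Exclude
Resolve-CrawlRule -Url 'https://sp/sites/hr/Docs/list.aspx?SortField=Title' -Rules $rules  # Exclude
Resolve-CrawlRule -Url 'https://sp/sites/hr/Docs/plan.docx' -Rules $rules                  # Include
```

Placing the narrower Include rule above the broader Exclude rule is what lets the Legal subfolder through; reversing the order would exclude it.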

What the Script Configures

Set-SPSearchCrawlRules.ps1 creates two categories of exclusion rules:

1. Redundant view exclusions — SharePoint generates dozens of URL variants for the same list content (sort parameters, filter parameters, view parameters). Without exclusion rules, the crawler indexes the same document multiple times under different URLs, wasting crawl cycles and inflating the index.

Examples of redundant view URLs the script excludes:

			
*/_layouts/15/viewlsts.aspx*
*/Forms/AllItems.aspx*
*?*SortField=*
*?*FilterField1=*
*/_vti_bin/*

		

2. Binary file exclusions — file types that contain no searchable text and should not consume crawl capacity:

*.exe  *.vob  *.iso  *.zip  *.bak  *.dll  *.tmp  *.log  *.tar  *.gz

Running the Script

			
.\Set-SPSearchCrawlRules.ps1 `
    -Target Collaboration `
    -SSAName "Search Service Application"
# Optional: add custom exclusion paths
.\Set-SPSearchCrawlRules.ps1 `
    -Target "My Sites" `
    -SSAName "Search Service Application" `
    -CustomExclusions "*/Archive/*", "*/Temp/*", "*/RecycleBin/*"

		

The script is idempotent: if a rule already exists, it skips creation and prints a yellow warning, so it is safe to re-run.
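
One way to get this skip-if-present behaviour is to compare the desired paths against the existing rules before creating anything. The sketch below is an assumption about the approach, not the script's actual code: Get-MissingCrawlRulePaths is a hypothetical helper, and the guarded section relies on the built-in Get-SPEnterpriseSearchCrawlRule and New-SPEnterpriseSearchCrawlRule cmdlets.

```powershell
# Hypothetical helper: returns the desired paths that do not exist yet.
# -notcontains compares case-insensitively by default.
function Get-MissingCrawlRulePaths {
    param(
        [string[]]$Desired,
        [string[]]$Existing
    )
    $Desired | Where-Object { $Existing -notcontains $_ }
}

# Example patterns taken from this post
$desired = '*/_vti_bin/*', '*/Forms/AllItems.aspx*', '*/Archive/*'

# Apply only on a SharePoint server where the search cmdlets are available.
if (Get-Command Get-SPEnterpriseSearchCrawlRule -ErrorAction SilentlyContinue) {
    $ssa      = Get-SPEnterpriseSearchServiceApplication -Identity 'Search Service Application'
    $existing = @(Get-SPEnterpriseSearchCrawlRule -SearchApplication $ssa | ForEach-Object { $_.Path })
    foreach ($path in (Get-MissingCrawlRulePaths -Desired $desired -Existing $existing)) {
        New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa -Path $path -Type ExclusionRule
    }
}
```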

Crawl Rules vs. Content Sources: Which to Use When

Approach	When to Use
Don’t add to content source	You never want this content crawled, and the start address clearly scopes it
Crawl exclusion rule	The crawler discovers the path via links from an already-crawled start address
Include override	A parent path is excluded, but a specific subdirectory must be included

Validating Crawl Performance After Changes

Monitoring the Crawl Log

After making changes to impact rules or crawl rules, start an incremental crawl and review the crawl log:

			
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
$source = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Team Sites"
# Start incremental crawl
$source.StartIncrementalCrawl()
# Monitor after a few minutes
Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
    Select-Object Name, CrawlState, SuccessCount, ErrorCount, DeleteCount
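
The counters returned above are easier to compare between crawls as a percentage. The following is a small sketch under stated assumptions: Get-CrawlErrorPercent is a hypothetical helper, and the 5% alert threshold is an arbitrary example value, not a recommendation from this post.

```powershell
# Hypothetical helper: error percentage from SuccessCount and ErrorCount.
function Get-CrawlErrorPercent {
    param(
        [long]$SuccessCount,
        [long]$ErrorCount
    )
    $total = $SuccessCount + $ErrorCount
    if ($total -eq 0) { return 0.0 }            # nothing crawled yet
    [math]::Round(100.0 * $ErrorCount / $total, 2)
}

Get-CrawlErrorPercent -SuccessCount 9800 -ErrorCount 200   # 2

# Flag content sources whose error rate exceeds an example threshold of 5%.
if (Get-Command Get-SPEnterpriseSearchCrawlContentSource -ErrorAction SilentlyContinue) {
    $ssa = Get-SPEnterpriseSearchServiceApplication -Identity 'Search Service Application'
    Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa | ForEach-Object {
        $pct = Get-CrawlErrorPercent -SuccessCount $_.SuccessCount -ErrorCount $_.ErrorCount
        if ($pct -gt 5) { Write-Warning "$($_.Name): $pct% of crawled items returned errors" }
    }
}
```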

		

Verifying the Large-File Threshold

To confirm the MaxDownloadSize threshold is triggering:

1. Open Central Admin → Search Service Application → Crawl → Crawl Log.
2. Filter by “Warning” status.
3. Look for entries indicating that a file exceeded the download size limit and was indexed for metadata only.

Summary

Crawl impact and crawl rules are the operational controls that separate a well-managed search deployment from one that creates support tickets. Impact rules prevent the crawler from saturating WFEs during business hours. The large-file threshold prevents multi-GB files from stalling the content processing pipeline. Crawl rules prevent redundant view URLs and binary files from consuming crawl capacity that should be spent on actual content.

Both scripts are idempotent and safe to adjust without a maintenance window. After applying changes, an incremental crawl plus a review of the crawl log is sufficient to validate correct behaviour.


Post #1: Designing the Right SharePoint Search Topology for Production SPSE Farms
Post #2: Deploying a Custom SharePoint Search Topology with PowerShell (End-to-End)
Post #3: Scaling SharePoint Search for Large Enterprise Farms: Index Distribution and Crawl Isolation
Post #5: Federated Search in SPSE: Searching Across Multiple Search Service Applications
