Out of the box, SharePoint’s crawler runs at full throughput. It opens as many simultaneous connections to your Web Front End servers as the WFE will accept, downloads every file up to the default size limit, and writes continuously to the crawl database. On a small farm with a modest content corpus, this is invisible. On a large farm with 10 million items, 1GB+ documents, and 2,000 concurrent users, unconfigured crawl behaviour can become a farm-wide performance concern.
Two scripts address this directly: Set-SPSearchCrawlImpact.ps1 sets the rate at which the crawler hits your WFEs and the size threshold above which files are indexed for metadata only. Set-SPSearchCrawlRules.ps1 controls what the crawler visits at all. Together they give you precise control over the crawl pipeline without requiring a maintenance window.
What Crawl Impact Actually Means
A common misunderstanding is that the SharePoint crawler accesses content databases directly. It does not. The crawler requests content from WFEs over HTTP, using the same protocol handlers that end users use. Every crawl request is a real web request to a WFE — it consumes WFE CPU, memory, and IIS worker threads.
On a large farm:
- No impact rules = the crawler opens as many connections as the WFE allows. During a full crawl, WFE CPU can spike to 80–100%, degrading page load time for real users.
- Impact rules = the crawler is told the maximum number of simultaneous requests it can make per host. Setting this to 16 means no more than 16 concurrent HTTP requests to any given WFE at one time.
Crawl impact rules are applied per content source host (the WFE hostname or IP address). You create one rule per WFE in your farm.
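For orientation, here is a minimal sketch of how a per-host impact rule could be created with the stock `New-SPEnterpriseSearchSiteHitRule` cmdlet. It is not the internals of Set-SPSearchCrawlImpact.ps1, which are not shown in this post. The host names are placeholders and 16 is simply the example limit used above.

```powershell
# Sketch only: run in the SharePoint Management Shell on a farm server.
# Host names are placeholders; 16 is the example limit discussed above.
$searchService = Get-SPEnterpriseSearchService
$wfeHosts = @("<WFE-SERVER-1>", "<WFE-SERVER-2>")

foreach ($wfe in $wfeHosts) {
    # One rule per host: cap simultaneous crawler requests at 16
    New-SPEnterpriseSearchSiteHitRule `
        -Name $wfe `
        -Behavior "SimultaneousRequests" `
        -HitRate 16 `
        -SearchService $searchService
}

# List the rules that now exist
Get-SPEnterpriseSearchSiteHitRule -SearchService $searchService
```

Try this on a test farm first and confirm the rules appear under Crawler Impact Rules in Central Admin before applying it to production.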
Using Set-SPSearchCrawlImpact.ps1
What the Script Configures
The script configures three things:
- Crawler Impact Rules — per-WFE simultaneous request limits.
- MaxDownloadSize — the file size threshold above which full content extraction is skipped. Files above the threshold are still indexed, but only for metadata (title, author, file extension, size).
- Connection and acknowledgement timeouts — how long the crawler waits before abandoning a slow WFE request.
The Large-File Threshold: Why It Matters
Consider a SharePoint document library that contains 1GB video files. Without a download size limit, the crawler downloads each file completely and passes it to Content Processing, which attempts to extract text via an IFilter. For a 1GB video:
- The download alone takes minutes on a loaded network.
- Content Processing holds the slot for the duration.
- Other documents queue behind the video, stalling the entire content processing pipeline.
- The video’s searchable text content is typically zero — video files contain no extractable text.
Setting MaxDownloadSizeMB = 128 means the crawler extracts full content only from files of 128MB or smaller. For a 1GB video, only the HTTP headers and file metadata are collected, and the crawler immediately moves on.
Note: Files above the MaxDownloadSize threshold are still indexed and appear in search results. Users can still find them by filename, title, author, site, and other managed properties. They are simply not indexed for full-text content.
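If you want to inspect or set the threshold by hand, one commonly used approach is the `MaxDownloadSize` property on the Search Service Application object. The sketch below assumes the SSA name used elsewhere in this post. Whether Set-SPSearchCrawlImpact.ps1 sets the value in exactly this way is an assumption.

```powershell
# Sketch only: assumes an SSA named "Search Service Application".
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"

# Show the current threshold (value is in MB)
$ssa.GetProperty("MaxDownloadSize")

# Set the threshold to 128 MB and persist the change
$ssa.SetProperty("MaxDownloadSize", 128)
$ssa.Update()

# Read it back to confirm
$ssa.GetProperty("MaxDownloadSize")
```

Reading the property back after the change is a quick way to confirm that the value you expect is the one the farm is actually using.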
Running the Script
.\Set-SPSearchCrawlImpact.ps1 `
    -Target Team `
    -SSAName "Search Service Application" `
    -WFEList "<WFE-SERVER-1>", "<WFE-SERVER-2>" `
    -MaxDownloadSizeMB 128 `
    -ThreadsPerHost 16 `
    -ConnectionTimeout 120 `
    -DataTimeout 120
| Parameter | Description | Recommended Value |
|---|---|---|
| -WFEList | DNS names or hostnames of WFE servers targeted by the crawler | List all WFEs in the farm |
| -MaxDownloadSizeMB | Maximum file content extraction size in MB | 50–128MB depending on farm size |
| -ThreadsPerHost | Max simultaneous crawler requests per WFE host | 8–24 depending on WFE capacity |
| -ConnectionTimeout | Seconds before the crawler abandons a connection attempt | 120 |
| -DataTimeout | Seconds before the crawler abandons a slow download | 120 |
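The two timeouts can also be checked or set directly on the farm's search service with `Set-SPEnterpriseSearchService`. This is a sketch under one assumption: that the script's -DataTimeout parameter corresponds to the acknowledgement timeout described earlier. The post does not show that mapping, so treat it as a guess.

```powershell
# Sketch only: assumes -DataTimeout maps to the acknowledgement timeout.
$searchService = Get-SPEnterpriseSearchService

# Both values are in seconds
Set-SPEnterpriseSearchService `
    -Identity $searchService `
    -ConnectionTimeout 120 `
    -AcknowledgementTimeout 120

# Confirm the values now in effect
Get-SPEnterpriseSearchService | Select-Object ConnectionTimeout, AcknowledgementTimeout
```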
Recommended Settings by Farm Size
| Farm Size | MaxDownloadSizeMB | ThreadsPerHost |
|---|---|---|
| Small (2 WFEs) | 50 | 8 |
| Medium (4 WFEs) | 50–128 | 12–16 |
| Large (80M+ items, 4+ WFEs) | 128 | 16–24 |
Separate Impact Rules for Collaboration vs. My Sites
If you are running two SSAs (Collaboration and My Sites), run the script separately for each:
# Collaboration SSA: higher thread count for larger content volume
.\Set-SPSearchCrawlImpact.ps1 `
    -Target Collaboration `
    -SSAName "Enterprise Search Service" `
    -WFEList "<WFE-SERVER-1>", "<WFE-SERVER-2>" `
    -ThreadsPerHost 16

# My Sites SSA: lower thread count; My Sites content is lighter
.\Set-SPSearchCrawlImpact.ps1 `
    -Target "My Sites" `
    -SSAName "Personal Search Service" `
    -WFEList "<WFE-SERVER-1>", "<WFE-SERVER-2>" `
    -ThreadsPerHost 8
Understanding the Crawl Schedule Interaction
Crawl impact rules control concurrency. The crawl schedule controls when crawls run. The two work together, and misconfiguring the schedule negates the benefit of impact rules.
Full Crawl vs. Incremental Crawl
A full crawl visits every item in every content source from scratch, regardless of whether it has changed. It generates the highest crawl load. Full crawls should run in off-hours windows, typically weekends or late at night, when WFE capacity is available.
An incremental crawl visits only items that have changed since the last crawl (based on change log entries). It is much lighter and can run during business hours if crawl impact rules are configured correctly.
Recommended Schedule
| Crawl Type | Timing | Impact Rules |
|---|---|---|
| Full crawl | Weekend overnight | All threads available |
| Incremental crawl | Every 15–60 minutes during business hours | 50% of max thread count |
To configure incremental crawl schedule via PowerShell:
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
$source = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Team Sites"

# Incremental every 30 minutes for 12 hours (720 minutes), starting at 06:00.
# Note: -DailyCrawlSchedule runs every day of the week, not only Mon-Fri.
Set-SPEnterpriseSearchCrawlContentSource `
    -Identity $source `
    -ScheduleType Incremental `
    -DailyCrawlSchedule `
    -CrawlScheduleStartDateTime "06:00" `
    -CrawlScheduleRepeatInterval 30 `
    -CrawlScheduleRepeatDuration 720
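The weekend full crawl from the table can be scheduled with the same cmdlet. The following is a sketch only: Saturday at 22:00 is an illustrative choice, not a value from this post, and the content source name is the one used in the example above.

```powershell
# Sketch only: day and start time are illustrative assumptions.
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
$source = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Team Sites"

# Full crawl once a week, Saturday at 22:00
Set-SPEnterpriseSearchCrawlContentSource `
    -Identity $source `
    -ScheduleType Full `
    -WeeklyCrawlSchedule `
    -CrawlScheduleDaysOfWeek Saturday `
    -CrawlScheduleStartDateTime "22:00" `
    -CrawlScheduleRunEveryInterval 1
```

A content source holds one full and one incremental schedule, so this call should sit alongside the incremental schedule rather than replace it. Check the result under Content Sources in Central Admin.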
Using Set-SPSearchCrawlRules.ps1
What Crawl Rules Are
Crawl rules are URL pattern–based rules that tell the crawler how to handle a specific path before it downloads the content. The crawler evaluates each URL it encounters against the list of rules (in priority order) before deciding whether to fetch the content.
Rule types:
| Type | Behaviour |
|---|---|
| Exclude | Skip this URL entirely — do not download, do not index |
| Include | Force-include this URL even if a parent URL is excluded |
| Custom | Apply specific crawl behaviour |
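To make the "priority order, first match wins" behaviour concrete, here is a small stand-alone model in plain PowerShell. It does not call SharePoint at all. `Resolve-CrawlRule`, the rule objects, and the `intranet` URLs are hypothetical, and the `-like` operator only approximates how the crawler matches wildcard paths.

```powershell
# Hypothetical helper: returns the type of the first rule whose
# wildcard pattern matches the URL, or a default when none match.
function Resolve-CrawlRule {
    param(
        [string]$Url,
        [object[]]$Rules,              # ordered by priority, highest first
        [string]$Default = "Include"   # simplification: unmatched URLs are crawled
    )
    foreach ($rule in $Rules) {
        if ($Url -like $rule.Pattern) {
            return $rule.Type
        }
    }
    return $Default
}

# The Include override is listed above the broader Exclude so it wins
$rules = @(
    [pscustomobject]@{ Pattern = "*/Archive/Keep/*"; Type = "Include" },
    [pscustomobject]@{ Pattern = "*/Archive/*";      Type = "Exclude" },
    [pscustomobject]@{ Pattern = "*.iso";            Type = "Exclude" }
)

Resolve-CrawlRule -Url "http://intranet/sites/hr/Archive/Keep/policy.docx" -Rules $rules  # Include
Resolve-CrawlRule -Url "http://intranet/sites/hr/Archive/old.docx" -Rules $rules          # Exclude
Resolve-CrawlRule -Url "http://intranet/sites/it/tools/setup.iso" -Rules $rules           # Exclude
Resolve-CrawlRule -Url "http://intranet/sites/it/readme.docx" -Rules $rules               # Include
```

If the two Archive rules were swapped, the broader exclusion would match first and the Keep folder would never be crawled, which is why rule order matters.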
What the Script Configures
Set-SPSearchCrawlRules.ps1 creates two categories of exclusion rules:
1. Redundant view exclusions — SharePoint generates dozens of URL variants for the same list content (sort parameters, filter parameters, view parameters). Without exclusion rules, the crawler indexes the same document multiple times under different URLs, wasting crawl cycles and inflating the index.
Examples of redundant view URLs the script excludes:
*/_layouts/15/viewlsts.aspx*
*/Forms/AllItems.aspx*
*?*SortField=*
*?*FilterField1=*
*/_vti_bin/*
2. Binary file exclusions — file types that contain no searchable text and should not consume crawl capacity:
*.exe *.vob *.iso *.zip *.bak *.dll *.tmp *.log *.tar *.gz
Running the Script
.\Set-SPSearchCrawlRules.ps1 `
    -Target Collaboration `
    -SSAName "Search Service Application"

# Optional: add custom exclusion paths
.\Set-SPSearchCrawlRules.ps1 `
    -Target "My Sites" `
    -SSAName "Search Service Application" `
    -CustomExclusions "*/Archive/*", "*/Temp/*", "*/RecycleBin/*"
The script is idempotent — if a rule already exists, it skips creation with a yellow warning. Safe to re-run.
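The check-then-create pattern behind that idempotency can be sketched with the stock crawl rule cmdlets. This is an illustration, not the script's actual source. The two paths are taken from the exclusion list above, and depending on your farm's URL scheme they may need a protocol-qualified form.

```powershell
# Sketch only: illustrates check-then-create; not the script's real code.
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
$paths = @("*/_vti_bin/*", "*/Forms/AllItems.aspx*")

# Collect the paths of rules that already exist on this SSA
$existing = @(Get-SPEnterpriseSearchCrawlRule -SearchApplication $ssa |
    ForEach-Object { $_.Path })

foreach ($path in $paths) {
    if ($existing -contains $path) {
        # Write-Warning is rendered in yellow in the console
        Write-Warning "Crawl rule already exists, skipping: $path"
    }
    else {
        New-SPEnterpriseSearchCrawlRule `
            -SearchApplication $ssa `
            -Path $path `
            -Type ExclusionRule
    }
}
```

Because existing rules are skipped rather than recreated, running this twice leaves the rule set unchanged.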
Crawl Rules vs. Content Sources: Which to Use When
| Approach | When to Use |
|---|---|
| Don’t add to content source | You never want this content crawled, and the start address clearly scopes it |
| Crawl exclusion rule | The crawler discovers the path via links from an already-crawled start address |
| Include override | A parent path is excluded, but a specific subdirectory must be included |
Validating Crawl Performance After Changes
Monitoring the Crawl Log
After making changes to impact rules or crawl rules, start an incremental crawl and review the crawl log:
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
$source = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Team Sites"

# Start incremental crawl
$source.StartIncrementalCrawl()

# Monitor after a few minutes
Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
    Select Name, CrawlState, SuccessCount, ErrorCount, DeleteCount
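Rather than checking back manually, you can poll the content source until the crawl finishes. The sketch below assumes a crawl is already running on the "Team Sites" content source. The 30-second interval is an arbitrary choice.

```powershell
# Sketch only: assumes a crawl on "Team Sites" is already in progress.
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
$source = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Team Sites"

# Poll until the content source returns to Idle
while ($source.CrawlStatus -ne "Idle") {
    Write-Host "Crawl status: $($source.CrawlStatus)"
    Start-Sleep -Seconds 30

    # Re-read the content source so the status is fresh
    $source = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Team Sites"
}

# Final counts once the crawl has finished
$source | Select-Object Name, SuccessCount, ErrorCount, DeleteCount
```

Comparing ErrorCount before and after a rule change is a simple first check that the new settings have not introduced crawl failures.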
Verifying the Large-File Threshold
To confirm the MaxDownloadSize threshold is triggering:
- Open Central Admin → Search Service Application → Crawl → Crawl Log.
- Filter by “Warning” status.
- Look for entries indicating files were skipped due to size limit.
Summary
Crawl impact and crawl rules are the operational controls that separate a well-managed search deployment from one that creates support tickets. Impact rules prevent the crawler from saturating WFEs during business hours. The large-file threshold prevents multi-GB files from stalling the content processing pipeline. Crawl rules prevent redundant view URLs and binary files from consuming crawl capacity that should be spent on actual content.
Both scripts are idempotent and safe to adjust without a maintenance window. After applying changes, an incremental crawl plus a review of the crawl log is sufficient to validate correct behaviour.
Related Posts
- Post #1: Designing the Right SharePoint Search Topology for Production SPSE Farms
- Post #2: Deploying a Custom SharePoint Search Topology with PowerShell (End-to-End)
- Post #3: Scaling SharePoint Search for Large Enterprise Farms: Index Distribution and Crawl Isolation
- Post #5: Federated Search in SPSE: Searching Across Multiple Search Service Applications