Series: SharePoint 2019 to Subscription Edition Migration — Post #3 of 12
Reading time: ~10 minutes
Why Large Files Are a Migration Risk
Migration windows fail for many reasons. One of the quietest is also one of the most preventable: undetected large files buried in content databases.
Here is how it plays out. You schedule a migration wave based on estimated content database sizes. Test restores pass in the lab. On execution day, a restore that should take ninety minutes drags past four hours. The root cause is a content database with several multi-gigabyte files stored inline — files no one inventoried, in a document library no one thought to flag.
Large files create four distinct risks in a migration context:
- Content database size inflation. A single 1.8 GB document stored inline adds its full weight to the content database, to every full backup, and, once the file is modified, to the differentials as well. Multiply that across a few dozen large files and your database sizes stop matching your migration estimates.
- Slow or failed restore operations. Backup and restore times scale non-linearly when large inline files are present. Lab restores pass because they use clean, recently-provisioned content databases. Production restores fail because databases grow quietly over years — and no one ran an inventory before the window opened.
- Broken Remote BLOB Storage configurations. If files have been partially externalised via Remote BLOB Storage (RBS) on the source farm, migrating without a clear picture of which files are affected — and which are still inline — creates data loss risk. RBS configurations do not migrate themselves.
- Mis-sequenced migration waves. Libraries with high large-file concentration need to be isolated or addressed before you finalise wave planning. Discovering them during execution is the wrong time.
This is a risk you can quantify before day one. You just need the right inventory.
What Counts as “Large” in SharePoint Context
The answer depends on what you are measuring against.
SharePoint’s configurable maximum upload size (set per web application under General Settings → Maximum Upload Size) has a hard ceiling of 15 GB in SP2019, up from 2 GB in SP2013 and 10 GB in SP2016. Whatever the configured limit, the boundary matters for content stored inline in content databases: a file approaching it is not just large, it is structurally significant in how the database handles it.
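If you want to verify the configured ceiling on your own farm before scanning, a quick check from a farm server reads the setting directly. A minimal sketch, assuming the SharePoint SSOM snap-in is available (MaximumFileSize is reported in megabytes):

```powershell
# Read the configured maximum upload size for each web application.
# MaximumFileSize is expressed in MB.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

Get-SPWebApplication | Select-Object DisplayName, Url, MaximumFileSize
```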
In practice, “large” has tiers:
- 50 MB is where migration tooling starts behaving differently. Many tools throttle, retry, or skip files above this threshold without warning.
- 250 MB is where backup and restore timelines become materially affected. A handful of files above this size can shift your restore window by an order of magnitude.
- 1 GB is where explicit wave isolation or pre-migration offload becomes the right conversation, not an optional one.
- 2 GB+ is a blocker category. These files require deliberate planning before a single content database is backed up for migration.
Context matters too. A 200 MB file in a 500 GB content database is noise. The same file in a 2 GB database is 10% of your total content — and a structural decision point.
The right approach is not a single threshold. It is size bucketing: classifying every large file by tier so you can see the full risk distribution across the farm — not just whether large files exist, but where they concentrate and how severe each concentration is.
How CAML-Based Scanning Identifies Large Files at Scale
CAML-Based Scanning
The Large File Scanner uses CAML queries to identify oversized files at the server level. This is the correct approach for scale.
The alternative — enumerating every item in every document library via PowerShell loops — works on small farms. On large ones, it times out, consumes excessive memory, and produces incomplete results because it cannot filter at the source. CAML queries push the filtering to the SharePoint server: only files above the size threshold come back, regardless of library size.
Here is a simplified version of the core query structure the scanner uses:
```xml
<Query>
  <Where>
    <Geq>
      <FieldRef Name="File_x0020_Size"/>
      <Value Type="Integer">104857600</Value>
    </Geq>
  </Where>
  <OrderBy>
    <FieldRef Name="File_x0020_Size" Ascending="FALSE"/>
  </OrderBy>
</Query>
```
104857600 bytes is 100 MB. In the production script, this threshold is a parameter — you set it at runtime. The File_x0020_Size field is the internal SharePoint column that stores the file size in bytes; it is available on all document libraries without any custom configuration.
Results come back ordered largest-first. Farm-wide enumeration is handled by Get-SiteCollections; the actual query runs per site collection via Get-LargeFilesFromSite.
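To make that concrete, here is a minimal sketch of what the per-site query could look like in server-side PowerShell. It is illustrative, not the production code: the real Get-LargeFilesFromSite adds paging, retry logic, and bucket classification, and its exact signature may differ.

```powershell
# Minimal sketch of a per-site-collection large-file query (SSOM).
# Illustrative only -- the production Get-LargeFilesFromSite differs.
function Get-LargeFilesFromSite {
    param(
        [string]$SiteUrl,
        [long]$ThresholdBytes = 100MB
    )
    $site = Get-SPSite $SiteUrl
    try {
        foreach ($web in $site.AllWebs) {
            $libraries = $web.Lists | Where-Object { $_.BaseType -eq 'DocumentLibrary' }
            foreach ($list in $libraries) {
                $query = New-Object Microsoft.SharePoint.SPQuery
                # SPQuery.Query takes the inner CAML only -- no outer <Query> element
                $query.Query = "<Where><Geq><FieldRef Name='File_x0020_Size'/>" +
                               "<Value Type='Integer'>$ThresholdBytes</Value></Geq></Where>" +
                               "<OrderBy><FieldRef Name='File_x0020_Size' Ascending='FALSE'/></OrderBy>"
                $query.ViewAttributes = "Scope='RecursiveAll'"   # descend into folders
                foreach ($item in $list.GetItems($query)) {
                    [pscustomobject]@{
                        SiteUrl   = $SiteUrl
                        Library   = $list.Title
                        FileUrl   = $item.Url
                        SizeBytes = $item.File.Length   # reliable byte count
                    }
                }
            }
            $web.Dispose()
        }
    }
    finally {
        $site.Dispose()
    }
}
```

Because the size filter executes inside SharePoint, only qualifying items come back over the object model; the loop never enumerates the full library.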
Size Bucketing
Finding files above a threshold is step one. Step two is understanding the shape of the problem — how many files are in each risk tier, not just a raw list sorted by bytes.
The scanner classifies every file it returns into a named size bucket:
```powershell
function Get-SizeBucket {
    param(
        [long]$FileSizeBytes,
        [long[]]$Thresholds
    )
    switch ($FileSizeBytes) {
        { $_ -ge $Thresholds[3] } { return ">= 1GB" }
        { $_ -ge $Thresholds[2] } { return "500MB–1GB" }
        { $_ -ge $Thresholds[1] } { return "200MB–500MB" }
        default                   { return "100MB–200MB" }
    }
}
```
Thresholds are passed in as a parameter array, so you can adjust the bucket boundaries to match your farm’s reality or your organisation’s migration standards.
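A quick usage example, assuming bucket boundaries of 100 MB, 200 MB, 500 MB, and 1 GB:

```powershell
# Bucket boundaries in bytes; index 0 is the scan threshold itself,
# indexes 1-3 are the boundaries the function compares against.
$thresholds = @(100MB, 200MB, 500MB, 1GB)

Get-SizeBucket -FileSizeBytes 750MB -Thresholds $thresholds   # "500MB–1GB"
Get-SizeBucket -FileSizeBytes 3GB   -Thresholds $thresholds   # ">= 1GB"
```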
Buckets matter for a specific reason: a document library with 200 files in the >= 1GB bucket demands a different migration response — wave isolation, content owner coordination, or cleanup — compared to a library with one file in the same bucket.
The Three Output Files
The scanner produces three structured CSV files per run:
- Per-file detail CSV — one row per oversized file, including file name, full URL, document library title, site collection URL, file size in bytes, and size bucket.
- Library summary CSV — one row per document library, totalling files scanned, total large-file size in GB, and the largest single file found.
- Site summary CSV — one row per site collection, totalling libraries scanned, files scanned, and total large-file storage in GB.
The script also produces a run log via Write-InvLog for every execution — useful when you need to hand off findings to a migration architect or include them in a risk register.
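The export step itself is ordinary PowerShell. A minimal sketch, assuming each scanned file came back as an object with SiteUrl, Library, FileUrl, SizeBytes, and SizeBucket properties (the column names are illustrative; the production script's headers may differ):

```powershell
# Sketch of the three-CSV export, assuming $largeFiles holds per-file
# objects (SiteUrl, Library, FileUrl, SizeBytes, SizeBucket).
$stamp = Get-Date -Format 'yyyyMMdd-HHmmss'

# 1. Per-file detail: one row per oversized file, largest first.
$largeFiles | Sort-Object SizeBytes -Descending |
    Export-Csv "LargeFiles-Detail-$stamp.csv" -NoTypeInformation

# 2. Library summary: one row per document library.
$largeFiles | Group-Object SiteUrl, Library | ForEach-Object {
    $sizes = $_.Group | Measure-Object SizeBytes -Sum -Maximum
    [pscustomobject]@{
        Library     = $_.Name   # "<site>, <library>" composite key
        FileCount   = $_.Count
        TotalSizeGB = [math]::Round($sizes.Sum / 1GB, 2)
        LargestMB   = [math]::Round($sizes.Maximum / 1MB, 1)
    }
} | Export-Csv "LargeFiles-Libraries-$stamp.csv" -NoTypeInformation

# 3. Site summary: one row per site collection.
$largeFiles | Group-Object SiteUrl | ForEach-Object {
    [pscustomobject]@{
        SiteUrl     = $_.Name
        FileCount   = $_.Count
        TotalSizeGB = [math]::Round((($_.Group |
                          Measure-Object SizeBytes -Sum).Sum) / 1GB, 2)
    }
} | Export-Csv "LargeFiles-Sites-$stamp.csv" -NoTypeInformation
```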
How to Turn the Scanner Output Into Migration Decisions

This is the section most “how to find large files” posts skip. The data is only useful if you know what question each file answers.
| Output File | Primary Question It Answers | Key Signal to Look For |
|---|---|---|
| Per-file detail CSV | Which specific files need manual action before migration? | A single content owner or department generating most of the largest files — that is a coordination target |
| Library summary CSV | Which document libraries are migration risks as units? | Low file count with disproportionately high total size; a few enormous files are harder to manage than many medium ones |
| Site summary CSV | Which site collections drive the farm’s large-file risk profile? | If the top five sites by large-file total account for more than 50% of the farm’s large-file storage, those five sites define your entire problem |
Per-File Detail CSV
Use the per-file detail output to identify files that need manual intervention before migration day. Filter by size bucket, starting with the highest tier. The URL column in the output allows direct navigation to the file in the SharePoint UI — useful when you need content owners to take action on their own files rather than having the migration team handle it.
The per-file CSV is also your cleanup filter input. Files in the 100–200 MB range that are several years old and have low view counts are candidates for archival before migration. That analysis lives in this file.
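In practice that triage is a short filter. A sketch, using the illustrative column names from the export above and a hypothetical file name:

```powershell
# Triage the highest tier first: every >= 1 GB file needs an owner.
Import-Csv "LargeFiles-Detail-20250101-090000.csv" |
    Where-Object { $_.SizeBucket -eq '>= 1GB' } |
    Sort-Object { [long]$_.SizeBytes } -Descending |
    Select-Object FileUrl, Library, SizeBytes
```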
Library Summary CSV
The library summary is where you make wave-sequencing decisions at the library level. A document library with aggregate large-file storage above 10 GB is a candidate for its own migration wave — or at minimum, explicit planning for the wave it belongs to.
If Remote BLOB Storage is in use on the source farm, cross-reference the library summary against your RBS configuration. Libraries using RBS need a different migration path than libraries with inline content, and the library summary will surface which libraries are the highest-risk for that distinction.
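The 10 GB screen is equally short (again using the illustrative columns and a hypothetical file name):

```powershell
# Libraries above the 10 GB wave-isolation threshold, worst first.
Import-Csv "LargeFiles-Libraries-20250101-090000.csv" |
    Where-Object { [double]$_.TotalSizeGB -gt 10 } |
    Sort-Object { [double]$_.TotalSizeGB } -Descending |
    Format-Table Library, FileCount, TotalSizeGB -AutoSize
```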
Site Summary CSV
The site summary is your wave-planning input. Sites with high large-file concentration should be scheduled in smaller, isolated waves, or migrated early to give the team runway to address surprises without holding up other content.
The site summary also maps directly to content database sizing decisions. If the summary shows 80 GB of large files concentrated in three site collections, you have concrete data for database split or resize planning before migration begins — not an estimate, an inventory.
If the top five sites by large-file storage account for more than half of the farm’s total large-file volume, those five sites own the risk profile. Triage them individually.
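The concentration check is a one-screen calculation against the site summary (a sketch, assuming the illustrative columns from earlier):

```powershell
# What share of large-file storage sits in the top five site collections?
$sites  = Import-Csv "LargeFiles-Sites-20250101-090000.csv" |
          Select-Object SiteUrl, @{ n = 'GB'; e = { [double]$_.TotalSizeGB } }
$farmGB = ($sites | Measure-Object GB -Sum).Sum
$top5GB = ($sites | Sort-Object GB -Descending | Select-Object -First 5 |
           Measure-Object GB -Sum).Sum

"Top five sites hold {0:P0} of large-file storage" -f ($top5GB / $farmGB)
```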
Migration Decisions Based on Findings
The scanner gives you data. What you do with it depends on your farm. Here are four decision paths the findings typically lead to:
1. Wave planning adjustment
If a site collection has more than 20 GB of large-file content, treat it as a candidate for its own migration wave. Do not stack it with other high-volume sites. The migration window math changes materially when large inline files are involved — plan for it explicitly rather than absorbing it as schedule risk.
2. Remote BLOB Storage evaluation
If a significant portion of your total farm storage (rough threshold: above 30%) is in the 500 MB or larger buckets, evaluate whether configuring RBS on the destination farm is worth the setup investment. Without RBS, these files bloat content databases on the destination permanently, and you carry that overhead forward. The per-file detail CSV gives you the bucket-level data to run this calculation; a sketch follows this list.
3. Pre-migration cleanup
The per-file detail CSV contains everything you need to identify cleanup candidates: file size, URL, library, and site. Files in the lower buckets that are old and rarely accessed are typically the easiest wins — they reduce content database sizes before the first backup without requiring content owner coordination. Frame cleanup as an option with clear benefit, not a migration prerequisite.
4. No action required — but document it
If the scanner returns a small number of large files in isolated locations, your migration plan may not need to change. That is a valid outcome. Document it anyway. Undocumented known risks become undocumented surprises if circumstances change between planning and execution.
A risk register entry that says “farm contains 12 files over 500 MB, all located in the Marketing Archive library on site collection /sites/marketing, included in Wave 3” is more useful than silence — even if no one ever acts on it.
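For decision path 2 above, the RBS threshold check runs against the per-file detail CSV. A sketch, assuming the illustrative columns from earlier; the farm total is a hypothetical figure you would pull from your farm inventory (see Post #2):

```powershell
# Share of total farm storage held in 500 MB+ files (RBS decision input).
$detail      = Import-Csv "LargeFiles-Detail-20250101-090000.csv"
$farmTotalGB = 900   # hypothetical: total content DB size from your inventory

$heavyGB = ($detail |
    Where-Object { $_.SizeBucket -in '500MB–1GB', '>= 1GB' } |
    ForEach-Object { [long]$_.SizeBytes } |
    Measure-Object -Sum).Sum / 1GB

"{0:P0} of farm storage is in 500 MB+ files" -f ($heavyGB / $farmTotalGB)
```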
Get the Large File Scanner Script
Stop guessing your storage profile. Know your farm before migration day.
The Get-SPLargeFileInventory.ps1 script is production-ready and fully parameterized. The full script includes everything shown here, plus:
- Farm-wide and per-site execution modes
- Configurable size thresholds at runtime (no hardcoding)
- All three CSV outputs, structured for Excel pivot analysis
- Run logging for audit trails and handoffs
- Error handling and retry logic for large farms
👉 Contact sudharsan_1985@live.in to get the script.
Also available as part of the Complete SharePoint Migration Toolkit — contact the same address for details.
Next in the Series
Once you have your large-file inventory, the next pre-migration risk area is workflows — specifically, identifying which ones are still running, which are tied to deprecated platforms, and which will break silently after cut-over.
Post #4 covers the workflow audit in detail: Auditing Workflows Before Migration — What Still Runs?
The large-file findings from this post also feed directly into the SQL log shipping strategy covered in Post #6. Content database sizes determine log shipping volume and restore time estimates — the site summary CSV is the input that makes that planning concrete rather than speculative.
Large-file findings also shape cutover timing — particularly the decision to schedule high-risk libraries as early waves rather than bundling them with general content. Post #8 covers the full cutover playbook.
← Post #2: Building a Complete Farm Inventory Before You Migrate
→ Post #4: Auditing Workflows Before Migration — What Still Runs?