docs: add StringExtensions refactor design document

Comprehensive design for splitting the 1,600-line StringExtensions.cs
into 5 focused partial class files with performance optimizations.

Key decisions:
- Partial class approach to prevent breaking changes
- 5 files: Culture, Manipulation, Encoding, Parsing, Sanitization
- Performance fixes for regex caching and char operations
- 4-phase implementation with individual plans and summaries

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
2025-12-07 20:43:57 +00:00
parent 475010148b
commit f384d9b332

View File

@@ -0,0 +1,342 @@
# StringExtensions Refactor Design
**Date**: 2025-12-07
**Branch**: `refactor/StringExtensions`
**Status**: Approved
---
## 1. Overview
### Problem Statement
The `StringExtensions.cs` file in `Umbraco.Core` is a 1,600-line "utility dumping ground" with:
- Many unrelated string operations mixed together
- Performance issues in several methods
- Difficult to navigate and maintain
### Goals
1. **Consolidate & Organize**: Split into logical partial class files
2. **Reduce Complexity**: Each file focuses on one category of operations
3. **Fix Performance**: Optimize methods with known inefficiencies
4. **Maintain Compatibility**: Zero breaking changes
### Scope
**In Scope:**
- `src/Umbraco.Core/Extensions/StringExtensions.cs` (1,600 lines) - split and optimize
**Out of Scope (no changes):**
- `src/Umbraco.Web.Common/Extensions/StringExtensions.cs` - internal, project-specific
- `src/Umbraco.Cms.Persistence.EFCore/StringExtensions.cs` - internal, project-specific
- Test project StringExtensions files - separate concern
---
## 2. File Structure
Using **partial class** approach to prevent breaking changes:
```
src/Umbraco.Core/Extensions/
├── StringExtensions.cs → DELETE
├── StringExtensions.Culture.cs → NEW
├── StringExtensions.Manipulation.cs → NEW
├── StringExtensions.Encoding.cs → NEW
├── StringExtensions.Parsing.cs → NEW
└── StringExtensions.Sanitization.cs → NEW
```
Each file follows this pattern:
```csharp
// Copyright (c) Umbraco.
// See LICENSE for more details.
namespace Umbraco.Extensions;
public static partial class StringExtensions
{
// Methods for this category
}
```
---
## 3. Method Assignments
### StringExtensions.Culture.cs (~80 lines)
**Methods:**
- `InvariantEquals`
- `InvariantStartsWith`
- `InvariantEndsWith`
- `InvariantContains` (both overloads)
- `InvariantIndexOf`
- `InvariantLastIndexOf`
- `InvariantFormat`
- `ToInvariantString` (int and long overloads)
- `EnsureCultureCode`
**Static Fields:** None
---
### StringExtensions.Manipulation.cs (~300 lines)
**Methods:**
- `Trim`, `TrimStart`, `TrimEnd` (string overloads)
- `EnsureStartsWith`, `EnsureEndsWith` (all overloads)
- `ToFirstUpper`, `ToFirstLower` (all overloads including culture/invariant)
- `ReplaceMany`, `ReplaceFirst`, `Replace` (StringComparison overload)
- `ReplaceNonAlphanumericChars` (both overloads)
- `ExceptChars`
- `Truncate`, `StripWhitespace`, `StripNewLines`, `ToSingleLine`
- `MakePluralName`, `IsVowel`
- `IsLowerCase`, `IsUpperCase`
- All `IShortStringHelper` wrappers:
- `ToSafeAlias` (3 overloads)
- `ToUrlSegment` (2 overloads)
- `ToCleanString` (4 overloads)
- `SplitPascalCasing`
- `ToSafeFileName` (2 overloads)
- `SpaceCamelCasing` (internal)
**Static Fields:**
- `Whitespace` (Lazy<Regex>) - shared, used by `StripWhitespace`
---
### StringExtensions.Encoding.cs (~200 lines)
**Methods:**
- `GenerateHash` (both overloads)
- `ToSHA1`
- `ToUrlBase64`, `FromUrlBase64`
- `UrlTokenEncode` (note: extends `byte[]`, not `string`)
- `UrlTokenDecode`
- `ConvertToHex`, `DecodeFromHex`
- `EncodeAsGuid`, `ToGuid`
- `CreateGuidFromHash` (internal)
- `SwapByteOrder` (internal)
- `ToCSharpString`
- `EncodeJsString`
- Private `GenerateHash(string, string?)` helper
**Static Fields:**
- `ToCSharpHexDigitLower`
- `ToCSharpEscapeChars`
- `UrlNamespace`
**Static Constructor:** Yes (initializes `ToCSharpEscapeChars`)
---
### StringExtensions.Parsing.cs (~150 lines)
**Methods:**
- `IsNullOrWhiteSpace`, `IfNullOrWhiteSpace`, `OrIfNullOrWhiteSpace`, `NullOrWhiteSpaceAsNull`
- `DetectIsJson`, `DetectIsEmptyJson`
- `ParseInto` (both overloads)
- `EnumTryParse`, `EnumParse`
- `ToDelimitedList`
- `EscapedSplit`
- `ContainsAny`, `CsvContains`, `CountOccurrences`
- `GetIdsFromPathReversed`
**Static Fields:**
- `JsonEmpties`
- `DefaultEscapedStringEscapeChar` (const)
---
### StringExtensions.Sanitization.cs (~150 lines)
**Methods:**
- `CleanForXss`, `StripHtml`, `ToValidXmlString`
- `EscapeRegexSpecialCharacters`
- `StripFileExtension`, `GetFileExtension`
- `NormaliseDirectoryPath`, `IsFullPath`
- `AppendQueryStringToUrl`
- `ToFriendlyName`
- `IsEmail`
- `GenerateStreamFromString` (internal)
**Static Fields:**
- `CleanForXssChars`
- `InvalidXmlChars` (Lazy<Regex>)
---
## 4. Performance Fixes
Applied during Phase 3:
### 4.1 Regex Caching
| Method | Current Issue | Fix |
|--------|---------------|-----|
| `StripWhitespace` | `Regex.Replace(txt, @"\s", ...)` each call | Create dedicated cached `Lazy<Regex>` |
| `GetFileExtension` | `new Regex(pattern)` each call | Create cached `Lazy<Regex>` |
| `StripHtml` | `Regex.Replace(..., Compiled)` each call | Create cached `Lazy<Regex>` |
### 4.2 Char Case Checks
```csharp
// Current (allocates string)
public static bool IsLowerCase(this char ch) =>
ch.ToString(CultureInfo.InvariantCulture) ==
ch.ToString(CultureInfo.InvariantCulture).ToLowerInvariant();
// Fixed (no allocation)
public static bool IsLowerCase(this char ch) => char.IsLower(ch);
public static bool IsUpperCase(this char ch) => char.IsUpper(ch);
```
### 4.3 ReplaceNonAlphanumericChars(string)
Only the `string` overload needs fixing (the `char` overload is already optimized):
```csharp
// Current (LINQ + string allocations in loop)
foreach (var c in mName.ToCharArray().Where(c => !char.IsLetterOrDigit(c)))
{
mName = mName.Replace(c.ToString(...), replacement);
}
// Fixed (single pass with StringBuilder for multi-char replacement)
```
---
## 5. Testing Strategy
### Baseline (Phase 1)
- Run existing tests, record pass/fail state
- Run existing benchmarks, record baseline numbers
- Add missing unit tests for methods being optimized
### Verification (Phase 4)
- Run all unit tests - must match Phase 1 results
- Run benchmarks - compare against baseline
- Expected improvements:
- Regex methods: significant (no repeated compilation)
- Char case checks: ~10-100x faster (no allocation)
- ReplaceNonAlphanumericChars: moderate
### New Test File
`tests/Umbraco.Tests.UnitTests/Umbraco.Core/Extensions/StringExtensionsRefactorTests.cs`
### Benchmark Additions
Add to `StringExtensionsBenchmarks.cs`:
- `StripWhitespace_Benchmark`
- `GetFileExtension_Benchmark`
- `StripHtml_Benchmark`
- `IsLowerCase_Benchmark`
- `ReplaceNonAlphanumericChars_Benchmark`
---
## 6. Implementation Phases
### Workflow for Each Phase
1. **Create Implementation Plan** - Detailed step-by-step plan
2. **User Review** - User approves the plan
3. **Save Plan** - Write to `docs/plans/` folder
4. **Execute** - Use subagent-driven development to implement
5. **Completion Summary** - Copy plan, update with results, save to `docs/plans/`
---
### Phase 1: Baseline Testing
**Objective:** Establish baseline test results and performance metrics before any changes.
**Deliverables:**
- Run existing StringExtensions unit tests
- Run existing benchmarks and record results
- Create new unit tests for methods being optimized
- Create new benchmarks for methods being optimized
- Document baseline metrics
**Plan Document:** `docs/plans/phase-1-baseline-testing-plan.md`
**Summary Document:** `docs/plans/phase-1-baseline-testing-summary.md`
---
### Phase 2: File Split
**Objective:** Split `StringExtensions.cs` into 5 partial class files with no functional changes.
**Deliverables:**
- Create 5 new partial class files
- Delete original `StringExtensions.cs`
- Verify all tests still pass (no behavioral changes)
- Single atomic commit
**Plan Document:** `docs/plans/phase-2-file-split-plan.md`
**Summary Document:** `docs/plans/phase-2-file-split-summary.md`
---
### Phase 3: Performance Fixes
**Objective:** Apply performance optimizations to identified methods.
**Deliverables:**
- Cache regex patterns for `StripWhitespace`, `GetFileExtension`, `StripHtml`
- Optimize `IsLowerCase`, `IsUpperCase`
- Optimize `ReplaceNonAlphanumericChars(string)`
- Verify all tests still pass
**Plan Document:** `docs/plans/phase-3-performance-fixes-plan.md`
**Summary Document:** `docs/plans/phase-3-performance-fixes-summary.md`
---
### Phase 4: Verification
**Objective:** Confirm refactor success through testing and benchmarking.
**Deliverables:**
- Run all unit tests - compare to Phase 1 baseline
- Run all benchmarks - compare to Phase 1 baseline
- Document performance improvements
- Final review and cleanup
**Plan Document:** `docs/plans/phase-4-verification-plan.md`
**Summary Document:** `docs/plans/phase-4-verification-summary.md`
---
## 7. Risk Mitigation
| Risk | Mitigation |
|------|------------|
| Breaking changes | Partial class approach preserves class name |
| Method duplication | Atomic commit: create new files + delete old file together |
| Behavioral changes | Comprehensive baseline tests before changes |
| Performance regression | Benchmark before/after comparison |
---
## 8. Files Unchanged
These files remain untouched:
| File | Reason |
|------|--------|
| `src/Umbraco.Web.Common/Extensions/StringExtensions.cs` | Internal, project-specific |
| `src/Umbraco.Cms.Persistence.EFCore/StringExtensions.cs` | Internal, project-specific |
| `tests/.../StringExtensions.cs` (various) | Test utilities, separate concern |
---
## Approval
- [x] Design reviewed and approved
- [ ] Phase 1 plan approved
- [ ] Phase 2 plan approved
- [ ] Phase 3 plan approved
- [ ] Phase 4 plan approved