# VPCE-Cleanup Decision Framework

Purpose: Data-driven VPCE-cleanup prioritization with MUST/SHOULD/Could recommendations based on costs, metadata, and usage patterns

Scope: X VPC Endpoints across Y AWS accounts with $Z last month/year actual spend (Cost Explorer validation)

Implementation: Runbooks CLI + JupyterLab Notebook Workflow
## 1. Business Decision Framework
### Two-Gate Scoring System
```mermaid
graph LR
    A[VPC Endpoints<br/><b>$Z</b> last month actual] --> B{1️⃣ Gate A<br/>Business/Security Filter}
    B -->|BLOCKED<br/>Regulatory/Critical| C[KEEP<br/>No action]
    B -->|PASS| D[2️⃣ Gate B<br/>Technical Scoring]
    D --> E[Cost 40%<br/>Usage 30%<br/>Overlap 15%<br/>DNS 15%]
    E --> F{Total Score}
    F -->|≥80 points| G[MUST<br/>Decommission]
    F -->|50-79 points| H[SHOULD<br/>Decommission]
    F -->|<50 points| I[Could<br/>Review]
    style A fill:#e1f5ff
    style B fill:#fff4e6
    style C fill:#90ee90
    style D fill:#ffe6e6
    style E fill:#f0e6ff
    style G fill:#ff6b6b
    style H fill:#ffa726
    style I fill:#66bb6a
```
### Scoring Rubric
| Component | Weight | Data Source | Conservative Default |
|---|---|---|---|
| Cost Percentile | 40% | Cost Explorer last month/year actual | Pandas P20/P50/P80/P95/P99 |
| Usage Activity | 30% | CloudTrail (future) | 15/30 points (moderate) |
| Overlap/Duplicates | 15% | Service+VPC grouping | 0 or 15 points |
| DNS/Audit Signals | 15% | Resolver+CloudTrail (future) | 0 points (no penalty) |
Conservative Default Principle: Missing data = neutral score (prevents false-positive MUST classifications)
### Classification Thresholds
| Category | Score Range | Confidence | Description |
|---|---|---|---|
| MUST Decommission | ≥80 points | High | Gate B ≥80 AND Gate A passes |
| SHOULD Decommission | 50-79 points | Medium | Strong evidence, review recommended |
| Could Review | <50 points | Low | Insufficient data, further analysis needed |
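As a sketch, the two gates and thresholds above reduce to a small pure function. The component point values in the example call are illustrative; the weights and cutoffs come from the rubric:

```python
def gate_b_score(cost_pts: float, usage_pts: float,
                 overlap_pts: float, dns_pts: float) -> float:
    """Sum the four weighted components (max 40 + 30 + 15 + 15 = 100 points)."""
    return cost_pts + usage_pts + overlap_pts + dns_pts

def classify(total: float, gate_a_blocked: bool) -> str:
    """Map a Gate B score to a tier; a Gate A block overrides all scoring."""
    if gate_a_blocked:          # regulatory/critical endpoints are kept
        return "KEEP"
    if total >= 80:
        return "MUST"
    if total >= 50:
        return "SHOULD"
    return "Could"

# Conservative defaults: usage 15/30 (moderate), DNS 0/15 (no penalty)
print(classify(gate_b_score(40, 15, 15, 0), gate_a_blocked=False))  # → SHOULD
```

Note how the conservative defaults cap an otherwise maximal score at 70 points, keeping it below the MUST threshold until telemetry is available.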
### Business Value
X Endpoints Analyzed:
- Last Month/Year Actual Cost: $21,557.59 (Cost Explorer validation)
- 4 AWS Accounts: Multi-tenant cleanup opportunity
- 79 Duplicates: 89.8% duplication rate (major optimization)
- Cost Distribution: P20=$10.89, P50=$21.13, P80=$31.64, P95=$51.33, P99=$89.67
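The percentile cutoffs can be computed with pandas `Series.quantile`; a sketch with illustrative costs. The band-to-points mapping below is an assumption, since the spec defines only the 40% weight and the percentile tiers:

```python
import pandas as pd

# Illustrative monthly costs; the real input is the Cost Explorer-enriched frame.
costs = pd.Series([10.89, 21.13, 31.64, 51.33, 89.67, 5.00, 18.50, 44.20])

# Percentile cutoffs feeding the 40%-weight cost component
cutoffs = costs.quantile([0.20, 0.50, 0.80, 0.95, 0.99])

def cost_points(monthly_cost: float) -> int:
    """Map a monthly cost into a 0-40 point band by percentile position."""
    if monthly_cost >= cutoffs[0.95]:
        return 40
    if monthly_cost >= cutoffs[0.80]:
        return 30
    if monthly_cost >= cutoffs[0.50]:
        return 20
    if monthly_cost >= cutoffs[0.20]:
        return 10
    return 0
```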
Current Recommendations (Conservative Defaults):
- 46 SHOULD Decommission: $15,324.85/year (medium confidence)
- 42 Could Review: $6,232.74/year (low confidence, needs investigation)
### Manager Decision Criteria
Approval Gates:
- Cost Accuracy: Cost Explorer last month/year actual (NOT projections)
- Conservative Defaults: Moderate usage assumed (15/30 points) without telemetry
- LEAN Format: ≤3 pages, <5 minute review time
- PDCA Validation: 98.7% AWS API accuracy (74/75 endpoints exist)
Decision Process:
- Review SHOULD recommendations (46 endpoints, $15,324.85/year)
- Validate Could classifications (42 endpoints, $6,232.74/year)
- Approve Phase 5-6 usage telemetry activation (CloudTrail, Resolver)
Decision Framework Architecture:
- Start with simple metadata (type, status, age, AZ count, tags)
- Layer in cost attribution (per-resource monthly/annual)
- Add usage metrics (Flow Logs, Resolver, CloudTrail)
## 2. Technical Architecture
### System Architecture
```mermaid
%%{init: {
  "themeVariables": {
    "primaryColor":"#1a376a",
    "edgeLabelBackground":"#f4f8ff",
    "secondaryColor":"#f9fafb",
    "tertiaryColor":"#f3f5fd",
    "background":"#f4f7fb",
    "nodeTextColor":"#1a2548",
    "fontFamily": "Inter, Segoe UI, Arial"
  }
}}%%
flowchart LR
    %% Inputs/policy - business colors
    Policy{{"Policy/Config: Idle ≥30d Regions | Allow/Deny"}}:::policy
    CostData[/"Cost Explorer CUR Monthly/Avg Cost"/]:::input
    CTrail[/"CloudTrail Events Access & Last Used"/]:::input
    VPCep[/"VPC Endpoints Type, State, Subnets, SGs"/]:::input
    AWSOrg[/"AWS Organizations Account, Tags, OU"/]:::input
    %% 6 stages in a row (horizontally)
    Step1["Step 1: Load Data"]:::step
    Step2["Step 2: Enrich Metadata"]:::step
    Step3["Step 3: Cost Analysis"]:::step
    Step4["Step 4: Validate & Guardrail"]:::step
    Step5["Step 5: Export & Audit"]:::step
    Step6["Step 6: Cleanup & Approval"]:::step
    %% Detail & output cards, directly under each step
    D1["Source all Org/Account, Endpoint, CloudTrail, Billing data. Apply Region/Policy filters. Preprocess to dataset."]:::card
    D2["Attach Org Tags, OU, Owner. Enrich: VPC, CIDR, last access/user, idle status."]:::card
    D3["Calculate VPC Endpoint cost, rollup OU/Account/Service. Estimate monthly savings."]:::card
    D4["Live AWS verification (avoid staleness). Enforce policies: ENI, DNS, safety. Detect anomalies, flag for review."]:::card
    D5["Export CSV/JSON report. Generate audit log."]:::card
    D6["Create cleanup script (dry run). Add rollback info, submit to manager for signoff."]:::card
    %% Outputs: rightmost column
    CleanScript["Cleanup Script (Runbooks-CLI/Terraform)"]:::output
    ManagerNote["Manager Approval with Rollback Plan"]:::output
    Exports["CSV/JSON Export"]:::output
    AuditLog["Audit Log"]:::output
    %% Connections - "vertical columns"
    Policy -.-> Step1
    Policy -.-> Step4
    CostData --> Step1
    CTrail --> Step1
    VPCep --> Step1
    AWSOrg --> Step1
    Step1 --> Step2
    Step2 --> Step3
    Step3 --> Step4
    Step4 --> Step5
    Step5 --> Step6
    Step1 --> D1
    Step2 --> D2
    Step3 --> D3
    Step4 --> D4
    Step5 --> D5
    Step6 --> D6
    D5 --> Exports
    D5 --> AuditLog
    D6 --> CleanScript
    D6 --> ManagerNote
    %% Class styles for clarity
    classDef step fill:#1a376a,stroke:#233e57,stroke-width:2.5px,color:#fff,rx:14,ry:14,font-size:17px,font-weight:bold;
    classDef input fill:#e7f1fb,stroke:#5ca8e8,stroke-width:1.5px,color:#062b5f,rx:10,ry:10;
    classDef policy fill:#ffd753,stroke:#ebbb38,stroke-width:2px,color:#373006,rx:13,ry:13;
    classDef card fill:#fafdff,stroke:#a1b1e7,stroke-width:1.6px,color:#132e59,rx:10,ry:10,font-size:13.6px,font-style:italic;
    classDef output fill:#edfff5,stroke:#47b47e,stroke-width:1.7px,color:#147838,rx:11,ry:11,font-size:15px,font-weight:bold;
    class Policy policy;
    class CostData,CTrail,VPCep,AWSOrg input;
    class Step1,Step2,Step3,Step4,Step5,Step6 step;
    class D1,D2,D3,D4,D5,D6 card;
    class CleanScript,ManagerNote,Exports,AuditLog output;
    linkStyle default stroke:#8da6eb,stroke-width:1.2px;
```
```mermaid
graph LR
    A[vpce-cleanup.csv<br/>88 endpoints] --> B[VPCECleanupManager<br/>Python Class]
    B --> C[Cost Explorer API<br/>Last month/year actual costs]
    C --> D[Scoring Engine<br/>Two-Gate Framework]
    B --> E[EC2 API<br/>Validation 74/75]
    D --> F[Recommendations<br/>MUST/SHOULD/Could]
    F --> G[Markdown Export<br/>mkdocs-compatible]
    style A fill:#e3f2fd
    style B fill:#fff3e0
    style C fill:#e8f5e9
    style D fill:#fce4ec
    style E fill:#f3e5f5
    style F fill:#fff9c4
    style G fill:#e0f2f1
```
### Conservative Defaults Matrix
| Component | Real AWS Integration | Conservative Default | Rationale |
|---|---|---|---|
| Usage Activity | CloudTrail data events | 15/30 points (moderate) | Assume moderate usage without telemetry |
| DNS Signals | Route 53 Resolver logs | 0/15 points (no penalty) | Missing data = neutral score |
| Overlap Detection | Service+VPC grouping | 0 or 15 points | Deterministic from CSV |
| Cost Percentile | Cost Explorer actual | Pandas percentile calculation | Real historical spend |
Design Philosophy: Conservative defaults prevent false-positive MUST classifications while enabling decision framework testing without full AWS telemetry.
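Since overlap detection is deterministic from the CSV (Service+VPC grouping), it can be sketched with a pandas `duplicated` check. The column names here are assumptions about the CSV schema:

```python
import pandas as pd

# Hypothetical rows; the real input is the 88-endpoint CSV.
df = pd.DataFrame({
    "endpoint_id": ["vpce-a", "vpce-b", "vpce-c"],
    "vpc_id": ["vpc-1", "vpc-1", "vpc-2"],
    "service_name": ["com.amazonaws.us-east-1.s3",
                     "com.amazonaws.us-east-1.s3",
                     "com.amazonaws.us-east-1.ecr.api"],
})

# Any endpoint sharing (service, VPC) with another endpoint is an overlap:
# 15 points toward decommissioning, otherwise 0.
dup_mask = df.duplicated(subset=["service_name", "vpc_id"], keep=False)
df["overlap_points"] = dup_mask.map({True: 15, False: 0})
```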
### API Integrations
AWS Services:

- Cost Explorer: `ce:GetCostAndUsage` for last month/year actual VPC Endpoint costs by service
- EC2 API: `ec2:DescribeVpcEndpoints` for metadata validation (74/75 validated, 98.7% accuracy)
- Billing Profile: `${AWS_BILLING_PROFILE}`
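A hedged sketch of the `ce:GetCostAndUsage` call for last-month actual spend. The client is injected so it works with any boto3 session or profile, and the `Amazon Virtual Private Cloud` service filter is an assumption about where endpoint charges land in your billing data:

```python
def last_month_vpce_cost(ce_client, start: str, end: str) -> float:
    """Sum unblended VPC Endpoint spend for [start, end) via Cost Explorer.

    Assumes endpoint charges land under the 'Amazon Virtual Private Cloud'
    service dimension; adjust the filter for your own billing data.
    """
    resp = ce_client.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},  # dates as "YYYY-MM-DD"
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        Filter={"Dimensions": {"Key": "SERVICE",
                               "Values": ["Amazon Virtual Private Cloud"]}},
    )
    # Without GroupBy, each period carries its metrics under "Total"
    return sum(float(r["Total"]["UnblendedCost"]["Amount"])
               for r in resp["ResultsByTime"])
```

Usage would be `last_month_vpce_cost(boto3.Session(profile_name=...).client("ce"), "2025-01-01", "2025-02-01")`.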
Future Integrations (Phase 5-6):
- CloudTrail: Data events for usage activity scoring (30% weight)
- Route 53 Resolver: DNS query logs for endpoint usage patterns (15% weight)
- VPC Flow Logs: Network traffic analysis for unused endpoint detection
### Implementation
Technology Stack:
- Language: Python 3.11+ with type hints
- Data Processing: Pandas for percentile calculations, grouping, aggregations
- Validation: Pydantic models for schema enforcement
- AWS SDK: boto3 for Cost Explorer + EC2 API calls
- CLI Output: Rich library for professional terminal formatting (tables, colors, status indicators)
Module Location: `src/runbooks/vpc/vpce_cleanup_manager.py`
Key Methods:

- `enrich_with_metadata()`: Collect endpoint metadata (type, status, age, AZ count, tags)
- `enrich_with_last_month_costs()`: Attribute costs per endpoint from Cost Explorer
- `get_decommission_recommendations()`: Apply two-gate scoring framework
- `generate_markdown_table()`: Export mkdocs-compatible markdown
## 3. Operational Workflow
### Notebook Execution Workflow
```mermaid
graph LR
    A[Cell 1-2: Initialize<br/>Load CSV 88 endpoints] --> B[Cell 5: Enrich<br/>Cost Explorer last month/year actual]
    B --> C[Cell 11: Validate<br/>EC2 API 74/75 exists]
    C --> D[Cell 18: Score<br/>Two-Gate Framework]
    D --> E[Cell 22: Export<br/>Markdown mkdocs]
    E --> F[Manager Review<br/><5 min approval]
    style A fill:#bbdefb
    style B fill:#c8e6c9
    style C fill:#fff9c4
    style D fill:#ffccbc
    style E fill:#d1c4e9
    style F fill:#c5e1a5
```
Notebook: `notebooks/vpc/vpce-cleanup-manager-operations.ipynb`
Workflow Steps:
- Initialize (Cells 1-2): Load CSV, configure AWS profile, initialize VPCECleanupManager
- Enrich (Cell 5): Call Cost Explorer API for last month/year actual spend by service
- Validate (Cell 11): Cross-validate 88 endpoints exist via EC2 API (98.7% accuracy)
- Score (Cell 18): Apply two-gate framework, generate MUST/SHOULD/Could recommendations
- Export (Cell 22): Generate mkdocs-compatible markdown with complete metadata
- Approve (Manager): Review <5 minutes, approve cleanup actions
### PDCA Validation Requirements
Completion Criteria:
- ✅ All 88 endpoints processed with two-gate scoring
- ✅ Conservative defaults applied (usage: 15/30, DNS: 0/15)
- ✅ AWS API validation: 74/75 endpoints exist (98.7% accuracy)
- ✅ Cost Explorer: Last month/year actual spend $21,557.59
- ✅ Recommendation Breakdown: 46 SHOULD + 42 Could = 88 total
- ✅ Percentile Distribution: P20=$10.89, P50=$21.13, P80=$31.64, P95=$51.33, P99=$89.67
- ✅ Markdown export: mkdocs-compatible format with complete metadata
Quality Gates:
- Cost accuracy: Cost Explorer actual (NOT projections)
- API validation: ≥95% accuracy (98.7% achieved)
- Manager review: <5 minutes (LEAN format)
- Evidence-based: Complete audit trail without SHA256 checksums
## Next Steps
Immediate Actions:
- Review this spec: Manager reviews business + technical alignment (<5 min)
- Approve SHOULD recommendations: 46 endpoints, $15,324.85/year opportunity
- Investigate Could classifications: 42 endpoints, $6,232.74/year (needs telemetry)
Optional Enhancements (Future Phases):
- Phase 5: Activate CloudTrail data events for usage activity (30% weight)
- Phase 6: Enable Route 53 Resolver logs for DNS signals (15% weight)
- Phase 7: Integrate VPC Flow Logs for network traffic analysis
- Phase 8: Design alternatives (Gateway vs Interface, hub-spoke architecture)
Expected Outcomes:
- With Conservative Defaults: 46 SHOULD, 42 Could (current state)
- With Usage Telemetry: Expect 5-10 MUST classifications (high confidence)
- With Complete Telemetry: Refined SHOULD/Could distribution based on real usage
Business Value: Data-driven VPCE cleanup with $21,557.59/year optimization opportunity across 88 endpoints
Technical Excellence: Conservative defaults + real AWS integration + professional mermaid diagrams
Manager Approval: PENDING review + approval for SHOULD recommendations
```python
def enrich_with_metadata(self) -> Dict:
    """Enrich endpoints with simple metadata for decision framework.

    Metadata Fields:
    - endpoint_type: Interface/Gateway/GatewayLoadBalancer
    - status: available/pending/deleting/deleted
    - service_name: Service endpoint connects to
    - age_days: Days since creation (datetime.now() - creation_time)
    - az_count: Number of availability zones
    - is_multi_az: Boolean (az_count > 1)
    - tags: Stage, Owner, CostCenter, EndpointId
    """
```
### Per-Resource Cost Attribution: `enrich_with_last_month_costs()`
- `monthly_cost`: Last month actual spend per endpoint
- `annual_cost`: Last 12 months actual spend per endpoint
- `annual_cost_estimate`: `monthly_cost` × 12 (conservative projection)
Distribution Logic: Equal distribution across endpoints by service (conservative approach)
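A minimal sketch of the equal-distribution logic: each service's monthly total is split evenly across that service's endpoints. Column names are assumptions:

```python
import pandas as pd

# Hypothetical frame: one row per endpoint, with its service's monthly total attached.
df = pd.DataFrame({
    "endpoint_id": ["vpce-a", "vpce-b", "vpce-c"],
    "service_name": ["s3", "s3", "ecr.api"],
    "service_monthly_total": [200.0, 200.0, 50.0],
})

# Equal split of each service's spend across its endpoints (conservative approach)
df["monthly_cost"] = (
    df["service_monthly_total"]
    / df.groupby("service_name")["endpoint_id"].transform("count")
)
df["annual_cost_estimate"] = df["monthly_cost"] * 12
```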
### 2.3 Decision Rubric Framework (lines 1017-1169)
```python
def get_decommission_recommendations(self) -> pd.DataFrame:
    """Generate MUST/SHOULD/Could decommission recommendations.

    Classification Logic:

    MUST Decommission (high confidence):
    - Status != "available" AND age_days > 30
    - monthly_cost > $500
    - Missing required tags (Stage/Owner/CostCenter)

    SHOULD Decommission (medium confidence):
    - age_days > 365 AND monthly_cost > $100
    - is_multi_az == False AND monthly_cost > $50
    - Service name contains test/dev/sandbox/temp

    Could Decommission (review recommended):
    - monthly_cost > $20 AND age_days > 180
    - Default for all others
    """
```
Output: Rich CLI summary table with color-coded tiers (red/yellow/green)
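The docstring rules reduce to a small pure function. Field names follow the metadata section above; `has_required_tags` is a hypothetical helper flag for the Stage/Owner/CostCenter check:

```python
def recommend(status: str, age_days: int, monthly_cost: float,
              is_multi_az: bool, service_name: str,
              has_required_tags: bool) -> str:
    """Apply the MUST/SHOULD/Could rubric sketched in the docstring above."""
    if ((status != "available" and age_days > 30)
            or monthly_cost > 500
            or not has_required_tags):
        return "MUST"
    if ((age_days > 365 and monthly_cost > 100)
            or (not is_multi_az and monthly_cost > 50)
            or any(k in service_name for k in ("test", "dev", "sandbox", "temp"))):
        return "SHOULD"
    # monthly_cost > $20 AND age_days > 180, plus the default for all others
    return "Could"
```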
### 2.4 Enhanced CSV Export
- `age_days`: Endpoint age in days
- `az_count`: Number of availability zones
- `recommendation`: MUST/SHOULD/Could
- `recommendation_reason`: Explanation for classification
- Decision Framework Summary (MUST/SHOULD/Could breakdown)
- Top 3 endpoints per recommendation tier
- Complete detailed table with all enriched columns
### Real AWS Integration - Prerequisites
- AWS SSO credentials: `aws sso login --profile [profile-name]`
- Validate permissions: `ce:GetCostAndUsage`, `ec2:DescribeVpcEndpoints`
Activation Steps:
```python
# In src/runbooks/vpc/vpce_cleanup_manager.py lines 792-802
# UNCOMMENT these lines:
response = ec2.describe_vpc_endpoints(VpcEndpointIds=[endpoint_id])
endpoint = response['VpcEndpoints'][0]
# CreationTimestamp is timezone-aware; subtract from an aware datetime
age_days = (datetime.now(timezone.utc) - endpoint['CreationTimestamp']).days
status = endpoint['State']
az_count = len(endpoint.get('SubnetIds', []))
```
metadata → costs → usage → recommendations
- Phase 2: Usage Metrics Collection (~8 hours)
    - VPC Flow Logs integration (CloudWatch Logs Insights)
    - Route 53 Resolver logs integration
    - CloudTrail data events analysis
    - Enhance decision rubric with usage-based rules
- Phase 3: Cost Attribution Enhancement (~4 hours)
    - Per-resource cost tagging via Cost Explorer USAGE_TYPE
    - Cost allocation tag enforcement
    - Top 20-30% spend focus
- Phase 4: Design Alternatives Analysis (~4 hours)
    - Gateway endpoints vs Interface endpoints cost comparison
    - Hub-spoke architecture evaluation
    - NAT Gateway alternatives
## Gap Analysis Methodology
When an implementation plan's cost baseline differs significantly from measured reality, use this structured methodology to detect the discrepancy, identify the root cause, and recalibrate scope before committing to remediation work.
### Plan-vs-Reality Audit Table
Compare every quantitative claim in the implementation plan against live AWS data collected by the runbooks CLI. Flag discrepancies greater than 20% for root cause investigation.
| Metric | Plan Claim | Measured Reality | Discrepancy |
|---|---|---|---|
| Monthly Cost Baseline | $X | $Y (Cost Explorer) | +/- % |
| Annual Savings Target | $X | $Y (extrapolated) | +/- % |
| Transit Gateways (owned) | N | N (describe-transit-gateways) | +/- % |
| NAT Gateways | N | N (describe-nat-gateways) | +/- % |
| VPC Endpoints | N | N (describe-vpc-endpoints) | +/- % |
| VPN Connections | N | N (describe-vpn-connections) | +/- % |
A discrepancy exceeding 50% across multiple resource types is a strong signal of an account scope mismatch: the plan was written against a different account (for example, a central networking hub account) than the one being analyzed (a spoke account).
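The 20% flagging rule can be sketched as:

```python
def discrepancy_pct(plan_claim: float, measured: float) -> float:
    """Signed gap of the plan claim relative to measured reality, in percent."""
    return (plan_claim - measured) / measured * 100.0

def needs_root_cause(plan_claim: float, measured: float,
                     threshold: float = 20.0) -> bool:
    """Flag any metric whose plan-vs-reality gap exceeds the threshold."""
    return abs(discrepancy_pct(plan_claim, measured)) > threshold
```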
### Root Cause Decision Tree
```text
Large cost discrepancy detected?
├── YES: Check resource counts (TGW, NAT GW, VPN)
│   ├── Plan claims N resources but 0 found in Account-A:
│   │   └── PROBABLE CAUSE: Plan targets Account-B (hub vs spoke)
│   │       Action: Identify the correct hub account and rerun inventory
│   └── Partial mismatch (60-80% fewer resources):
│       └── PROBABLE CAUSE: Plan aggregates multi-account totals
│           Action: Run org-wide Config Aggregator query to compare
└── NO: Proceed with original scope
```
### Phased Remediation Approach
Once the correct scope is established, sequence remediation from lowest risk to highest:
| Phase | Resource Type | Action | Risk |
|---|---|---|---|
| 1 | Idle EIPs | Release unattached elastic IPs | Very Low |
| 2 | NAT Gateways | Consolidate to minimum per-AZ | Low |
| 3 | VPC Endpoints | Remove duplicates (same service, same VPC) | Low-Medium |
| 4 | Data Transfer | Route traffic through existing endpoints | Medium |
Validate each phase independently before proceeding to the next. Use `--dry-run` to preview changes before execution.
### Stakeholder Expectation Reset Protocol
When measured savings are materially lower than the plan claimed:
- Document the gap: record plan claim, measured value, and percentage difference with evidence source (Cost Explorer date range, API call used).
- Identify the correct scope: determine whether the plan targeted a hub account, aggregated org-wide totals, or a different time period.
- Recalculate validated savings: use `runbooks finops dashboard` with the correct account profile to produce a defensible baseline.
- Communicate early: present the corrected business case before implementation begins, not after. A smaller but validated savings figure is more credible than a large figure that cannot be reproduced.
- Template for reuse: if the spoke-account optimization pattern is validated, document per-account savings and extrapolate conservatively across the organization. Present extrapolation separately from validated figures and label it clearly as a projection.
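A sketch of the conservative-extrapolation step; the 50% haircut is an assumption, not from this spec:

```python
def org_projection(validated_per_account: float, account_count: int,
                   haircut: float = 0.5) -> float:
    """Conservative org-wide projection from one validated spoke account.

    The 50% haircut (an assumption) discounts for accounts that may not
    match the validated optimization pattern; label the result as a
    projection, separate from validated figures.
    """
    return validated_per_account * account_count * haircut

# Present separately: validated figure vs clearly labeled projection
validated = 15_324.85                         # measured SHOULD savings ($/year)
projected = org_projection(validated, account_count=4)
```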
### 5S Audit Checklist for Network Inventory
Before building a remediation plan, run a 5S audit to establish a clean baseline:
| Step | Check | Tool |
|---|---|---|
| Sort | Identify resources with no traffic in 30+ days | VPC Flow Logs + `runbooks vpc` |
| Set in Order | Confirm each resource maps to a tagged workload | `ec2:DescribeVpcEndpoints` + tag audit |
| Shine | Remove duplicate endpoints (same service, same VPC) | `vpce_cleanup_manager.py` overlap detection |
| Standardize | Enforce naming convention and required tags (Stage, Owner, CostCenter) | Config rule + `runbooks security` |
| Sustain | Schedule quarterly review with Cost Explorer actual spend | Recurring runbook + DORA metric |