---
name: data-extraction-workflow
description: Builds data extraction workflows from documents, emails, or web sources. Use when automating data entry, parsing unstructured documents, or extracting information at scale.
metadata:
  category: ai-automation
  author: skillar
  version: "1.0"
---

# Data Extraction Workflow

> **Usage:** Copy this skill into Claude → replace [BRACKETS] with your details → get polished output.

## What You Get
A production-ready data extraction workflow specification with source mapping, extraction rules, validation logic, and output formatting, designed to turn unstructured data into clean, structured records.

## Instructions

You are a data engineering specialist focused on intelligent document processing and automated data extraction. You have designed extraction pipelines that process millions of documents using OCR, NLP, and LLM-based extraction, consistently achieving 95%+ accuracy rates while handling messy real-world data.

Given the following extraction requirements:
- **Source type(s):** [DESCRIBE_SOURCES, e.g., PDF invoices, email threads, web pages, scanned forms, spreadsheets]
- **Data fields to extract:** [LIST_SPECIFIC_FIELDS, e.g., company name, date, amount, line items, contact info]
- **Volume:** [DOCUMENTS_PER_DAY_OR_WEEK]
- **Current manual process:** [DESCRIBE_HOW_THIS_IS_DONE_TODAY]
- **Output destination:** [WHERE_EXTRACTED_DATA_GOES, e.g., database, spreadsheet, CRM, API]
- **Accuracy requirements:** [MINIMUM_ACCEPTABLE_ACCURACY, e.g., 99% for financial data, 95% for general]
- **Available tools:** [LIST_AVAILABLE_PLATFORMS, e.g., Python, Zapier, AWS, Google Cloud]

Complete the following:

## 1. SOURCE ANALYSIS AND CLASSIFICATION
- Categorize all source documents by type, format, and structural consistency
- Assess the variability within each source type (e.g., do invoices come from one vendor or hundreds?)
- Identify structured vs. semi-structured vs. fully unstructured content in each source
- Map the location of each target data field within typical document layouts
- Flag sources requiring OCR preprocessing (scanned documents, images, handwritten text)
- Estimate extraction difficulty per source type on a 1-5 scale

## 2. EXTRACTION STRATEGY DESIGN
- Select the optimal extraction method for each source-field combination: rule-based regex, template matching, LLM-based extraction, or hybrid
- Write specific extraction prompts for LLM-based fields including format instructions and examples
- Define regex patterns for structured fields like dates, amounts, email addresses, and phone numbers
- Design a document classification step that routes documents to the correct extraction pipeline
- Create entity resolution logic for matching extracted names and companies to existing records
- Build a confidence scoring mechanism for each extracted field
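
As a reference point for the level of detail expected here, the regex and confidence-scoring items could be sketched in Python roughly as follows. The pattern set, the `FieldResult` type, and the "fewer distinct candidates means higher confidence" heuristic are illustrative assumptions, not a prescribed method; replace them with patterns and scoring that fit the actual sources.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical patterns: tune these to the formats the real sources use.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "amount": re.compile(r"(?<![\d.,])\d+(?:,\d{3})*\.\d{2}\b"),
}


@dataclass
class FieldResult:
    name: str
    value: Optional[str]
    confidence: float


def extract_field(name: str, text: str) -> FieldResult:
    """Return the first match for a field plus a naive confidence score."""
    matches = PATTERNS[name].findall(text)
    unique = list(dict.fromkeys(matches))  # de-duplicate, keep document order
    if not unique:
        return FieldResult(name, None, 0.0)
    # Assumed heuristic: a single distinct candidate is unambiguous;
    # each additional distinct candidate lowers confidence.
    return FieldResult(name, unique[0], round(1.0 / len(unique), 2))
```

A field that comes back with `value=None` or a low score is a natural candidate for LLM-based extraction or manual review instead of being written straight to the output.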

## 3. DATA VALIDATION AND CLEANING
- Define validation rules for each extracted field (data type, format, range, required vs. optional)
- Create cross-field validation checks (e.g., line item totals must sum to invoice total)
- Design a fuzzy matching system for standardizing extracted values against known lists
- Build deduplication logic for records extracted from overlapping sources
- Create a data normalization layer that converts all dates, currencies, and units to standard formats
- Define exception handling for fields that fail validation with routing to manual review
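
For illustration, a required-field check combined with the cross-field rule above (line items must sum to the invoice total) might look like the sketch below. The record layout (`invoice_number`, `total`, `line_items`, `amount`) and the one-cent tolerance are assumptions made for the example; use the field names and tolerances from the stated requirements.

```python
from decimal import Decimal


def validate_invoice(record: dict, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return a list of validation errors; an empty list means the record passed."""
    errors = []
    # Required-field check (field names are hypothetical).
    for field in ("invoice_number", "total", "line_items"):
        if record.get(field) in (None, "", []):
            errors.append(f"missing required field: {field}")
    if errors:
        return errors  # cross-field checks need all required fields present

    # Cross-field check: use Decimal to avoid float rounding on currency.
    line_sum = sum(
        (Decimal(str(item["amount"])) for item in record["line_items"]),
        Decimal("0"),
    )
    total = Decimal(str(record["total"]))
    if abs(line_sum - total) > tolerance:
        errors.append(f"line items sum to {line_sum}, but invoice total is {total}")
    return errors
```

Records that return a non-empty error list would be routed to manual review rather than dropped.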

## 4. PIPELINE ARCHITECTURE
- Design the end-to-end pipeline flow from source ingestion to output delivery
- Specify the preprocessing steps: file conversion, OCR, text cleaning, language detection
- Define the processing mode (real-time, micro-batch, or scheduled batch)
- Map the integration points with source systems (email inbox, file share, API, web scraper)
- Design the output formatting and delivery mechanism for each destination system
- Include a staging area for human review of low-confidence extractions before final output
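
The staging step for low-confidence extractions can be as small as a threshold check per record. The sketch below is one possible shape, assuming each record carries a per-field confidence score between 0 and 1; the `Routing` type, the destination names, and the 0.90 default threshold are hypothetical.

```python
from typing import NamedTuple


class Routing(NamedTuple):
    destination: str           # "output" or "review" (hypothetical queue names)
    low_confidence: list[str]  # fields that fell below the threshold


def route_record(confidences: dict[str, float], threshold: float = 0.90) -> Routing:
    """Send a record to human review if any field is below the threshold."""
    low = sorted(name for name, score in confidences.items() if score < threshold)
    return Routing("review" if low else "output", low)
```

Returning the list of low-confidence fields lets the review interface highlight only what a person needs to check.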

## 5. ERROR HANDLING AND MONITORING
- Define failure categories: unreadable source, missing required fields, confidence below threshold, format mismatch
- Design retry logic for transient failures (retrying OCR with different settings, re-prompting the LLM)
- Create an exception queue with prioritized manual review workflow
- Build monitoring dashboards that track documents processed, extraction accuracy, field-level confidence, and error rates
- Set up alerting for accuracy drops or processing backlogs
- Design a feedback loop where manual corrections are used to improve extraction rules
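
As one way to express the retry item above, the sketch below wraps any extraction step in a retry loop with exponential backoff. `TransientError` is a hypothetical exception the pipeline would raise for retryable conditions such as timeouts or rate limits; the attempt count and delays are placeholder values.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def with_retries(operation, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on TransientError, wait and retry with doubling delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller send it to the exception queue
            sleep(base_delay * 2 ** (attempt - 1))
```

Non-transient failures, such as an unreadable source, propagate immediately and should go straight to the exception queue instead of being retried.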

## 6. SCALING AND OPTIMIZATION
- Estimate processing costs per document and monthly total at current and projected volumes
- Identify optimization opportunities to reduce API calls and processing time
- Design a caching strategy for frequently occurring patterns and entities
- Plan for volume spikes with autoscaling or queue management
- Create a model fine-tuning plan using accumulated labeled data from manual corrections
- Define quarterly review milestones for accuracy improvement and cost reduction targets
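
The cost estimate can be presented as simple arithmetic so the assumptions are easy to audit. The sketch below assumes a cost model with a per-page OCR charge and a per-document LLM charge; the function name and parameters are hypothetical, and real prices must come from the chosen providers.

```python
def estimate_costs(docs_per_day: float, pages_per_doc: float,
                   ocr_cost_per_page: float, llm_cost_per_doc: float,
                   days_per_month: int = 30) -> tuple[float, float]:
    """Return (cost per document, cost per month) under the assumed cost model."""
    per_doc = pages_per_doc * ocr_cost_per_page + llm_cost_per_doc
    per_month = per_doc * docs_per_day * days_per_month
    return per_doc, per_month
```

Running it once with current volume and once with projected volume gives the two monthly totals requested above.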

Deliver the complete workflow as a technical specification with extraction rules, validation logic, pipeline architecture diagrams described in text, and all LLM prompts ready to deploy. The output should enable a developer to build the pipeline with minimal ambiguity.

Be specific to my situation. No generic filler.
