Product specification extraction and bill of materials generation
Product specification extraction from technical documents is a task that kills productivity. You receive a PDF datasheet, a product manual, or a technical specification, and someone needs to manually read through it, identify key specifications, cross-reference component details, and build a bill of materials. This takes hours. It's error-prone. It's boring work that distracts from actual engineering.
The real frustration emerges when you need to do this repeatedly. Ten datasheets become fifty. Fifty becomes two hundred. By that point, you're either hiring someone to do manual data entry (expensive and slow) or building a custom script that breaks whenever document formats change slightly. Neither option is good.
This workflow solves that problem by chaining three AI tools together with zero manual handoff. You upload a document, and out the other end comes structured specifications and a complete bill of materials, ready to import into your inventory system or manufacturing planning software.
The Automated Workflow
This workflow runs in four processing stages: document ingestion, specification extraction, component identification, and bill of materials generation, followed by a delivery step that hands the result to its destination. We'll walk through using n8n as the orchestration layer, since it offers excellent support for file handling and provides good visibility into data flow.
How the workflow works
Here's what happens at each stage:
- A user uploads a PDF (product datasheet, technical manual, or specification sheet) to your workflow
- Chat with PDF by Copilotus extracts the text and key information from the document
- ParSpec AI parses the extracted specifications and identifies technical parameters
- PDnob Image Translator AI processes any embedded schematics or diagrams to extract component references
- The orchestration tool (n8n in this example) combines all outputs and formats them into a structured bill of materials
The critical insight here is that each tool handles what it does best. Chat with PDF is excellent at understanding document structure and answering questions about content. ParSpec specialises in parsing technical specifications into structured data. PDnob handles visual components. None of them require manual intervention between steps.
Setting up n8n for this workflow
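Start a new workflow in n8n. You'll need the following nodes, named so that the expressions later in this guide resolve:
- Webhook node to receive file uploads
- HTTP Request node named "Chat with PDF"
- HTTP Request node named "ParSpec AI"
- HTTP Request node named "PDnob Image Translator"
- Function node named "Format BOM" to format the bill of materials
- Respond to Webhook node to return the results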
Create the trigger first. Use a Webhook node set to accept POST requests:
POST /workflows/product-specs
Content-Type: multipart/form-data
This endpoint becomes the API consumers call to submit documents.
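From the caller's side, submitting a document might look like the sketch below. It assumes Node 18 or later (for the global `fetch`, `FormData`, and `Blob`), a hypothetical host `n8n.example.com`, and a form field called `file`; none of these come from the workflow itself, so adjust them to match your Webhook node.

```javascript
// Hypothetical URL: replace the host and path with your own n8n webhook URL.
const WEBHOOK_URL = 'https://n8n.example.com/webhook/workflows/product-specs';

// Build the multipart request for a PDF held in memory.
// The form field name "file" is an assumption; match it to your Webhook node.
function buildUploadRequest(pdfBytes, filename) {
  const form = new FormData();
  form.append('file', new Blob([pdfBytes], { type: 'application/pdf' }), filename);
  // No explicit Content-Type header: fetch adds the multipart boundary itself.
  return { url: WEBHOOK_URL, options: { method: 'POST', body: form } };
}

// Send the document and return the parsed JSON reply from the workflow.
async function submitDatasheet(pdfBytes, filename) {
  const { url, options } = buildUploadRequest(pdfBytes, filename);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Upload failed with status ${response.status}`);
  }
  return response.json();
}
```

Splitting the request construction from the network call keeps the first part easy to check without a running n8n instance.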
Step 1: Upload and extract via Chat with PDF
The Chat with PDF by Copilotus API accepts a PDF file and a query string. Your first n8n node should parse the incoming webhook and prepare the file.
Add an HTTP Request node configured like this:
Method: POST
URL: https://api.copilotus.com/v1/chat-with-pdf
Headers:
  Authorization: Bearer YOUR_COPILOTUS_API_KEY
  Content-Type: multipart/form-data
Body (form data):
  file: [incoming file from webhook]
  query: Extract all technical specifications, product name, version, and key features from this document
  response_format: json
The response includes extracted text and identified key sections. Store this in a variable for the next step.
Step 2: Parse specifications with ParSpec AI
ParSpec AI takes unstructured text and returns structured specification data. It understands electrical parameters, mechanical dimensions, operating conditions, and component details.
Add a second HTTP Request node:
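Method: POST
URL: https://api.parspec.ai/v1/parse
Headers:
  Authorization: Bearer YOUR_PARSPEC_API_KEY
  Content-Type: application/json
Body:
{
  "text": {{ JSON.stringify($node['Chat with PDF'].json.extracted_content) }},
  "document_type": "technical_specification",
  "include_metadata": true,
  "extract_components": true
}
Wrapping the extracted text in JSON.stringify (rather than plain quotes) keeps the body valid JSON when the text contains quotes or line breaks.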
ParSpec returns a JSON object containing structured specifications:
{
  "product_name": "Industrial Temperature Sensor",
  "product_version": "v2.1",
  "specifications": {
    "operating_temperature": {
      "min": -40,
      "max": 125,
      "unit": "celsius"
    },
    "accuracy": {
      "value": 0.5,
      "unit": "celsius"
    },
    "interface": "RS-485"
  },
  "components": [
    {
      "part_number": "NTC-103F",
      "description": "Thermistor",
      "quantity": 1
    }
  ]
}
Store this output for use in the next steps.
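Before passing the parsed output on, a quick sanity check can catch obviously broken extractions. The sketch below is one possible set of checks, assuming the response shape shown above; `validateParsedSpecs` is an illustrative name, not part of any ParSpec SDK.

```javascript
// Return a list of human-readable problems found in a ParSpec-style response.
// An empty list means none of these basic checks failed.
function validateParsedSpecs(parsed) {
  const problems = [];
  if (!parsed.product_name) problems.push('missing product_name');

  for (const [name, spec] of Object.entries(parsed.specifications || {})) {
    // Range-style specs carry min and max; a reversed range suggests a parsing slip.
    const isRange = spec && typeof spec === 'object' && 'min' in spec && 'max' in spec;
    if (isRange && spec.min > spec.max) {
      problems.push(`${name}: min ${spec.min} exceeds max ${spec.max}`);
    }
  }

  (parsed.components || []).forEach((comp, index) => {
    if (!comp.part_number) problems.push(`components[${index}]: missing part_number`);
    if (!Number.isInteger(comp.quantity) || comp.quantity < 1) {
      problems.push(`components[${index}]: quantity must be a positive integer`);
    }
  });

  return problems;
}
```

In an n8n Function node, a non-empty result could set a flag that routes the document to manual review instead of continuing down the pipeline.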
Step 3: Extract visual component data
Some documents contain schematics or diagrams showing component placement and connections. PDnob Image Translator AI extracts component references from images embedded in the PDF.
Add another HTTP Request node to process any images from the original document:
Method: POST
URL: https://api.pdnob.com/v1/translate-image
Headers:
  Authorization: Bearer YOUR_PDNOB_API_KEY
  Content-Type: multipart/form-data
Body (form data):
  image: [image extracted from PDF]
  analysis_type: component_identification
  output_format: json
The response identifies components visible in diagrams:
{
  "detected_components": [
    {
      "reference_designator": "R1",
      "component_type": "resistor",
      "value": "10k",
      "notes": "pull-up resistor"
    },
    {
      "reference_designator": "C1",
      "component_type": "capacitor",
      "value": "100uF",
      "voltage_rating": "16V"
    }
  ]
}
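Values such as "10k" and "100uF" arrive as strings, which makes them awkward to compare or sort. The sketch below converts them to plain numbers with a unit; it assumes values follow the common "number, optional SI prefix, optional unit" pattern seen in the response above, and `parseComponentValue` is an illustrative helper rather than anything PDnob provides.

```javascript
// Multipliers for the SI prefixes most often seen on schematics.
const SI_PREFIXES = { p: 1e-12, n: 1e-9, u: 1e-6, 'µ': 1e-6, m: 1e-3, k: 1e3, M: 1e6, G: 1e9 };

// Parse strings like "10k", "100uF" or "16V" into { value, unit }.
// Returns null when the text does not match the expected pattern.
// Caveat: a leading unit letter that is also a prefix (e.g. "5m" for metres)
// is read as the prefix, so this suits electrical values, not dimensions.
function parseComponentValue(text) {
  const match = /^\s*([0-9]*\.?[0-9]+)\s*([pnuµmkMG]?)\s*([A-Za-zΩ]*)\s*$/.exec(text);
  if (!match) return null;
  const [, digits, prefix, unit] = match;
  const multiplier = prefix ? SI_PREFIXES[prefix] : 1;
  return { value: Number(digits) * multiplier, unit };
}
```

Normalised values make it possible to treat "10k" and "10000" as the same resistance when merging component lists later on.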
Step 4: Combine and format the bill of materials
This is where you use an n8n Function node to combine all extracted data into a structured bill of materials. The function receives the outputs from ParSpec and PDnob and merges them into a single component list.
// Combine ParSpec structured specs with PDnob visual components.
// The names inside $node[...] must match the names of your HTTP Request nodes.
const parspec = $node['ParSpec AI'].json;
const parsedSpecs = parspec.specifications;
const parspecComponents = parspec.components || [];
const visualComponents = $node['PDnob Image Translator'].json.detected_components || [];

// Merge component lists into one map so each key appears once
const allComponents = {};

// Add components from ParSpec, keyed by part number
parspecComponents.forEach(comp => {
  const key = comp.part_number;
  allComponents[key] = {
    part_number: comp.part_number,
    description: comp.description,
    quantity: comp.quantity,
    source: 'specification_text'
  };
});

// Add components from visual analysis, keyed by reference designator
visualComponents.forEach(comp => {
  const key = comp.reference_designator;
  if (!allComponents[key]) {
    allComponents[key] = {
      reference_designator: comp.reference_designator,
      component_type: comp.component_type,
      value: comp.value,
      quantity: 1,
      source: 'schematic_diagram'
    };
  }
});

// Format final bill of materials.
// product_name and product_version sit at the top level of the ParSpec
// response, not inside `specifications`.
const bomLines = Object.values(allComponents);
const billOfMaterials = {
  product_name: parspec.product_name,
  product_version: parspec.product_version,
  generated_timestamp: new Date().toISOString(),
  specifications: parsedSpecs,
  bill_of_materials: bomLines,
  component_summary: {
    total_line_items: bomLines.length,
    total_parts: bomLines.reduce((sum, item) => sum + (item.quantity || 1), 0)
  }
};

// A Function node returns an array of items, each wrapping its data in `json`
return [{ json: billOfMaterials }];
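This function keys text-derived components by part number and schematic-derived components by reference designator, so repeats within each source collapse into a single line item. It does not recognise the same physical part reported by both sources; the fuzzy matching described under Pro Tips covers that case.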
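A purchasing-style bill of materials usually lists one line per distinct part with a quantity, rather than one line per reference designator. The sketch below shows one way to roll schematic components up that way, assuming the `detected_components` shape from Step 3; `consolidateByValue` is an illustrative name.

```javascript
// Group schematic components that share type, value and voltage rating
// into a single BOM line with a quantity and the designators it covers.
function consolidateByValue(detectedComponents) {
  const lines = new Map();
  for (const comp of detectedComponents) {
    // The value keeps its case on purpose: "1m" (milli) and "1M" (mega) differ.
    const key = [
      String(comp.component_type).toLowerCase(),
      comp.value,
      comp.voltage_rating || '',
    ].join('|');
    if (!lines.has(key)) {
      lines.set(key, {
        component_type: comp.component_type,
        value: comp.value,
        voltage_rating: comp.voltage_rating,
        reference_designators: [],
        quantity: 0,
      });
    }
    const line = lines.get(key);
    line.reference_designators.push(comp.reference_designator);
    line.quantity += 1;
  }
  return [...lines.values()];
}
```

With this in place, two 10k pull-up resistors labelled R1 and R2 become one line with a quantity of 2, which is closer to what an inventory system expects.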
Step 5: Output and delivery
Add a final node to return the structured bill of materials. You can either:
- Return it directly as a JSON response to the webhook caller
- Store it in a database (MongoDB, PostgreSQL) for later retrieval
- Send it to cloud storage (AWS S3, Google Cloud Storage)
- Forward it to your CAD system or inventory management software via API
For direct return, use a Respond to Webhook node (with the Webhook trigger set to respond via that node):
Response Code: 200
Response Body:
{
  "success": true,
  "data": {{ JSON.stringify($node['Format BOM'].json) }}
}
For storage in a database, use an n8n MongoDB node. The exact field layout depends on your n8n version, but the document to insert looks like this:
Collection: product_specifications
Operation: Insert
Data:
{
  "product_name": {{ JSON.stringify($node['Format BOM'].json.product_name) }},
  "bom": {{ JSON.stringify($node['Format BOM'].json.bill_of_materials) }},
  "created_at": "{{ $now.toISO() }}",
  "source_document": "{{ $node['Webhook'].json.filename }}"
}
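If the destination is an inventory or planning tool that imports spreadsheets rather than JSON, the BOM lines can be flattened to CSV first. This is a minimal sketch, assuming the tool accepts standard comma-separated files with double-quote escaping; `csvCell` and `bomToCsv` are illustrative names, and the column list is whatever your import template needs.

```javascript
// Quote a single cell when it contains a comma, a quote or a line break,
// doubling any embedded quotes.
function csvCell(value) {
  const text = value === undefined || value === null ? '' : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Turn an array of BOM line objects into CSV text with a header row.
// Missing fields become empty cells, so mixed line shapes are fine.
function bomToCsv(bomLines, columns) {
  const header = columns.map(csvCell).join(',');
  const rows = bomLines.map(line => columns.map(col => csvCell(line[col])).join(','));
  return [header, ...rows].join('\r\n');
}
```

Because lines from the specification text and lines from schematics carry different fields, passing an explicit column list keeps the output rectangular.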
The Manual Alternative
If you prefer more control over the extraction process, you can build a semi-automated workflow using Claude Code. This approach extracts the document and shows you a preview of the identified specifications before committing them to your system.
Upload your PDF to a Claude Code environment, then use the following prompt:
I have a technical specification PDF. Please:
1. Extract all product specifications and parameters
2. Identify all component references and part numbers
3. List any operating conditions or environmental requirements
4. Format the results as JSON
5. Highlight any fields where you're uncertain about the extraction
Claude Code parses the document interactively, and you can ask follow-up questions if anything looks incorrect. Once satisfied, you export the JSON and import it into your system manually. This takes longer but gives you confidence in the data before it enters your database.
This approach works well if you process documents infrequently or if your documents have highly variable formats that require human judgment.
Pro Tips
1. Handle rate limiting gracefully. ParSpec AI and PDnob both have rate limits. If you're processing many documents, add a Wait node between requests. Set it to 2-3 seconds between API calls:
n8n Wait Node Configuration:
  Resume: After Time Interval
  Wait Amount: 3
  Wait Unit: Seconds
This keeps you from hitting rate limits and getting errors. You can also implement exponential backoff in a custom Function node if you expect occasional overages.
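Exponential backoff means doubling the wait after each rate-limited attempt, up to a cap. The sketch below is plain Node.js and assumes the API call rejects with an error carrying a `status` property of 429 when rate limited; that property, and the names `backoffDelays` and `withBackoff`, are assumptions for illustration.

```javascript
// Delay schedule in milliseconds: base, 2x base, 4x base, ... capped at capMs.
function backoffDelays(maxRetries, baseMs = 1000, capMs = 30000) {
  return Array.from({ length: maxRetries }, (_, attempt) =>
    Math.min(capMs, baseMs * 2 ** attempt)
  );
}

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Call `callApi` and retry on rate-limit errors, waiting longer each time.
// Any other error, or running out of retries, rethrows the last error.
async function withBackoff(callApi, maxRetries = 4) {
  const delays = backoffDelays(maxRetries);
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await callApi();
    } catch (error) {
      if (error.status !== 429 || attempt >= delays.length) throw error;
      await sleep(delays[attempt]);
    }
  }
}
```

This version has no random jitter; if many workflow executions run in parallel, adding a small random offset to each delay helps stop them retrying in lockstep.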
2. Validate extraction confidence scores. Many AI extraction APIs return confidence scores. Where they are present, check them. If ParSpec returns a specification with confidence below 75%, flag it for manual review rather than importing it directly:
const specs = $node['ParSpec AI'].json.specifications;
const flaggedForReview = [];

Object.entries(specs).forEach(([key, value]) => {
  // `confidence` is assumed to be a 0-1 score attached to each specification.
  // The typeof check also catches a score of exactly 0, which a truthiness test would skip.
  if (value && typeof value.confidence === 'number' && value.confidence < 0.75) {
    flaggedForReview.push({ field: key, confidence: value.confidence });
  }
});

// Route on `review_required` downstream, e.g. an IF node that emails an engineer
return [{
  json: {
    review_required: flaggedForReview.length > 0,
    flagged_items: flaggedForReview
  }
}];
This prevents bad data from silently entering your system.
3. Deduplicate components intelligently. When you merge outputs from multiple tools, you'll sometimes get the same component referenced differently. Use fuzzy matching to catch these:
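// External modules must be allowed on self-hosted n8n
// (NODE_FUNCTION_ALLOW_EXTERNAL=levenshtein) before require() will load them.
const Levenshtein = require('levenshtein');

function isSameComponent(comp1, comp2) {
  // Check exact match first, but only when a part number is actually present;
  // otherwise two components with no part number would always "match"
  if (comp1.part_number && comp1.part_number === comp2.part_number) return true;

  // Schematic components may have no description, so there is nothing to compare
  if (!comp1.description || !comp2.description) return false;

  // Check fuzzy match on description
  const distance = new Levenshtein(
    comp1.description.toLowerCase(),
    comp2.description.toLowerCase()
  ).distance;
  return distance < 3; // Allow up to 2 character differences
}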
This prevents duplicate line items in your bill of materials.
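If installing an external module is not an option in your n8n setup, the edit distance itself is short enough to write inline. The sketch below is a standard two-row dynamic-programming version of Levenshtein distance with no dependencies; the function name `levenshtein` is illustrative.

```javascript
// Levenshtein distance: the minimum number of single-character insertions,
// deletions and substitutions needed to turn string a into string b.
function levenshtein(a, b) {
  // previous[j] = distance between the first i-1 characters of a
  // and the first j characters of b (initially i-1 = 0).
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);

  for (let i = 1; i <= a.length; i += 1) {
    const current = [i]; // distance from a's first i characters to the empty string
    for (let j = 1; j <= b.length; j += 1) {
      const substitution = previous[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      current[j] = Math.min(
        previous[j] + 1,    // delete a character from a
        current[j - 1] + 1, // insert a character into a
        substitution        // replace, or keep when the characters match
      );
    }
    previous = current;
  }
  return previous[b.length];
}
```

A fixed threshold of a few characters suits longer descriptions; for very short strings it can match unrelated parts, so a threshold relative to string length may be worth considering.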
4. Cache documents for faster re-processing. If you need to extract different data from the same document later, store the extracted content from Chat with PDF in your database. Re-processing the same file is much faster than re-uploading and re-parsing it:
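// Sketch using Node's crypto module and the official mongodb driver.
// `db` is assumed to be a connected Db instance and `documentContent`
// the raw PDF buffer; both are supplied by the surrounding code.
const { createHash } = require('crypto');
const sourceHash = createHash('md5').update(documentContent).digest('hex');

// Check if we've already extracted this document
const cachedExtraction = await db
  .collection('extracted_documents')
  .findOne({ source_hash: sourceHash });

if (cachedExtraction) {
  return cachedExtraction.data; // Skip re-extraction
}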
This saves API costs and processing time, especially valuable if you maintain a reference library of datasheets.
5. Set up error notifications. When a document fails to process, you need to know about it. Add a conditional node that sends an alert if any step fails:
Conditions:
  If Chat with PDF status !== 200
  OR ParSpec AI status !== 200
  OR PDnob Image Translator status !== 200
Then: Send Slack message to #engineering-alerts
  Message: "BOM extraction failed for {{ filename }}. Error: {{ error_message }}"
This prevents failed extractions from silently disappearing into the ether.
Cost Breakdown
| Tool | Plan Needed | Monthly Cost | Notes |
|---|---|---|---|
| Chat with PDF by Copilotus | Professional | £49 | Includes 5,000 page extractions; excess pages at £0.01 each |
| ParSpec AI | Standard | £79 | Covers 10,000 specifications; additional at £0.008 per spec |
| PDnob Image Translator AI | Business | £99 | Includes 2,000 image analyses; overage at £0.05 per image |
| n8n | Self-hosted (free) or Cloud Pro | £0–£100 | Self-hosted is free; Cloud Pro is £100/month for higher execution limits |
| Make (Integromat) | Alternative orchestration | £9.99–£299 | Pay-as-you-go if you prefer; generally cheaper for low volume |
| Zapier | Alternative orchestration | £19.99–£599 | More expensive but easier setup; good if you lack technical resources |
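For a typical workflow processing 50 documents per month, with an average of 10 specifications and 5 images per document, usage comes to 500 specifications and 250 image analyses, well inside the plan allowances (and the 5,000-page allowance holds as long as documents average under 100 pages). Expect about £227 per month (£49 + £79 + £99) if self-hosting n8n. If you opt for cloud orchestration or need higher processing volumes, costs scale accordingly.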
The real saving comes from not hiring someone to do this manually. A technical writer or engineer spending 2-3 hours per document extraction costs far more than these tools combined.
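To see how the bill changes with volume, the plan figures from the table can be dropped into a small calculator. This is a sketch only: prices are copied from the table above, orchestration cost is left out (self-hosted n8n), and the 20 pages per document in the usage note is an assumption, since the article does not give a page count.

```javascript
// Base price, included units and overage price per tool, from the cost table (GBP).
const PLANS = [
  { tool: 'Chat with PDF by Copilotus', unit: 'pages', base: 49, included: 5000, overage: 0.01 },
  { tool: 'ParSpec AI', unit: 'specs', base: 79, included: 10000, overage: 0.008 },
  { tool: 'PDnob Image Translator AI', unit: 'images', base: 99, included: 2000, overage: 0.05 },
];

// Monthly cost in pounds for a given usage, e.g. { pages, specs, images }.
// Units beyond each plan's allowance are billed at the overage rate.
function monthlyCost(usage) {
  const total = PLANS.reduce((sum, plan) => {
    const extraUnits = Math.max(0, (usage[plan.unit] || 0) - plan.included);
    return sum + plan.base + extraUnits * plan.overage;
  }, 0);
  return Math.round(total * 100) / 100; // round to whole pence
}
```

For 50 documents at an assumed 20 pages each, `monthlyCost({ pages: 1000, specs: 500, images: 250 })` stays at the three base prices; overage only starts to matter once a single allowance is exceeded, with images the most expensive unit per item.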