
Handoff: Website Content Scraper & Frontend Implementation

Date: November 18, 2025
From: Claude (CL)
To: Claude Web
Status: Website content extracted and ready for frontend implementation


Overview

A Puppeteer-based web scraper has been successfully created to extract and analyze the morethanadiagnosis.org website. All content, structure, navigation, and assets have been captured and saved for frontend replication.


What's Been Completed

Web Scraper Created

  • Location: /srv/containers/mtad-api/scraper.js
  • Technology: Puppeteer (headless browser automation)
  • Purpose: Dynamically render and extract JavaScript-heavy Wix website content

Content Successfully Extracted

  • Output: /srv/containers/mtad-api/website_content.json
  • Format: Structured JSON with all page elements

Data Captured

  • 13 Headings - All H1-H6 elements across the page
  • 24 Paragraphs - Body text and descriptions
  • 22 Buttons/CTAs - Call-to-action elements
  • 34 Links - Navigation and external links
  • 15 Images - Images with alt text and URLs
  • 10 Sections - Major content sections
  • Full text - Complete rendered page content
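
For orientation, here is a minimal sketch of the kind of extraction scraper.js performs. The real script's selectors, wait logic, and output fields (e.g. sections) may differ, so treat the selector choices below as illustrative assumptions rather than the actual implementation:

// scraper-sketch.js - illustrative only; see scraper.js for the real script
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://www.morethanadiagnosis.org', {
    waitUntil: 'networkidle2', // give the Wix scripts time to render
    timeout: 60000,
  });

  // Everything inside evaluate() runs in the page (browser) context
  const data = await page.evaluate(() => ({
    title: document.title,
    url: location.href,
    headings: [...document.querySelectorAll('h1,h2,h3,h4,h5,h6')]
      .map(h => ({ level: h.tagName, text: h.innerText.trim() })),
    paragraphs: [...document.querySelectorAll('p')]
      .map(p => p.innerText.trim()).filter(Boolean),
    buttons: [...document.querySelectorAll('button, a[role="button"]')]
      .map(b => ({ text: b.innerText.trim(), href: b.href || '', class: b.className })),
    links: [...document.querySelectorAll('a[href]')]
      .map(a => ({ text: a.innerText.trim(), href: a.getAttribute('href') })),
    images: [...document.querySelectorAll('img')]
      .map(img => ({ src: img.src, alt: img.alt, title: img.title })),
    fullText: document.body.innerText,
  }));

  fs.writeFileSync('website_content.json', JSON.stringify(data, null, 2));
  await browser.close();
})();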

Extracted Website Structure

Navigation Menu

Home
├── Podcast
├── Resources
├── Happy Mail
├── Support Group
├── Support Circle
├── The Journal
├── In Loving Memory
├── Connect With Us
└── Shop

Key Pages & Content Areas

1. Homepage Hero

  • Title: "You are more than a diagnosis"
  • Tagline: "Connecting Through Stories, Thriving Through Community"
  • Description: Community for folks with chronic illness and those touched by cancer
  • CTA: "Join Our Community"

2. Happy Mail Section

  • Description: Free joy-filled snail mail program
  • By: Nerisa (sends to folks navigating cancer/chronic illness)
  • Who Can Receive:
    • Cancer diagnosis or treatment
    • Chronic illness or rare disease
    • Medical limbo or recovery
  • CTA: "Order Happy Mail"

3. Support & Community

  • Connect Section with quote: "We're here to create a safe, supportive space..."
  • Support Circle Login
  • Featured Stories: Jes & Den's journey with cancer/FAP

4. Podcast

  • "More Than A Diagnosis" podcast
  • Hosts: Jes and Den
  • Content: Real stories about life beyond medical diagnosis

5. Resources

  • Curated list of helpful resources for diagnosis navigation
  • Financial guidance
  • Support materials
  • Regularly updated

6. Wings of Remembrance

  • Memorial tribute section
  • Honor those who shaped the journey

7. Shop Products

Products with purpose-driven stories:

  • "Worst Club Best Members" Shirt - Inspired by Nerisa's Happy Mail program

    • Features duck with tattooed cancer ribbon
    • Celebrates community resilience
  • "More Than A Diagnosis" Shirt - Reminder of strength beyond diagnosis

    • For cancer/chronic illness advocates
    • Proceeds support advocacy work
  • "I Don't Want To / I Get To" Shirt - Jes's motto during cancer treatment

    • Perspective shift: "I get to because some folks don't get to"
    • Personal empowerment message
  • Ribbon Collection - Multi-cancer awareness

    • Represents all cancer types equally
    • Community-focused design

Extracted Content File

Location: /srv/containers/mtad-api/website_content.json

Structure:

{
  "title": "Page title",
  "url": "Page URL",
  "headings": [
    { "level": "H1", "text": "..." }
  ],
  "paragraphs": ["..."],
  "buttons": [
    { "text": "...", "href": "...", "class": "..." }
  ],
  "links": [
    { "text": "...", "href": "..." }
  ],
  "images": [
    { "src": "...", "alt": "...", "title": "..." }
  ],
  "sections": [
    { "heading": "...", "content": "..." }
  ],
  "fullText": "Complete rendered page text"
}
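
A Next.js build step or plain Node script can read this file and pull out what a page needs. A sketch, using the field names shown in the structure above:

// Example: load the extracted content and select pieces for a page
const fs = require('fs');

const content = JSON.parse(
  fs.readFileSync('/srv/containers/mtad-api/website_content.json', 'utf8')
);

// The first H1 captured on the homepage is the hero title
const heroTitle = content.headings.find(h => h.level === 'H1')?.text;

// Relative hrefs are candidates for internal navigation; the live Wix site
// may emit absolute URLs instead, so adjust this filter as needed
const internalLinks = content.links.filter(l => l.href && l.href.startsWith('/'));

console.log(heroTitle, internalLinks.length);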

How to Use the Scraper

Run the Scraper

cd /srv/containers/mtad-api
node scraper.js

Output

  • Displays content summary to console
  • Saves full JSON to website_content.json
  • Shows preview of extracted content
  • Lists navigation links found

Modify the Scraper

Edit /srv/containers/mtad-api/scraper.js to:

  • Change target URL
  • Modify extraction selectors
  • Add/remove data fields
  • Adjust wait times for slower pages

Key Implementation Notes for Frontend

Design Approach

The website uses:

  • Clean, minimalist design
  • Community-focused messaging
  • Story-driven content
  • Purpose-driven product section
  • Warm, accessible tone

Content Themes

  1. Connection & Community - Central to brand identity
  2. Stories & Authenticity - Real people, real journeys
  3. Support & Resources - Practical help alongside emotional support
  4. Resilience & Empowerment - Beyond medical labels
  5. Inclusivity - All diagnoses, all experiences matter

Recommended Components

  • Hero section with tagline
  • Happy Mail card/section
  • Story/testimonial cards (Jes & Den)
  • Podcast section
  • Resources directory
  • Product showcase (with story narratives)
  • Memorial/tribute section
  • CTA buttons throughout
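
As a starting point for the list above, a minimal hero component sketch. The copy comes from the extracted homepage; the component name, file location, and Tailwind classes are placeholders, not the existing design system:

// components/Hero.jsx - illustrative sketch, not the final design
export default function Hero() {
  return (
    <section className="flex flex-col items-center gap-6 px-6 py-24 text-center">
      <h1 className="text-4xl font-semibold">You are more than a diagnosis</h1>
      <p className="text-lg text-gray-600">
        Connecting Through Stories, Thriving Through Community
      </p>
      {/* CTA route taken from the extracted navigation links */}
      <a href="/supportgroup" className="rounded-full bg-teal-600 px-8 py-3 text-white">
        Join Our Community
      </a>
    </section>
  );
}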

Copy Guidelines

  • Use warm, empathetic language
  • Focus on community and connection
  • Highlight real stories and experiences
  • Emphasize accessibility and support
  • Avoid medical jargon where possible

Dependencies

Already Installed

  • Puppeteer (v23.0.0+)
  • Node.js (v18+)

To Scrape Other Sites

npm install puppeteer
node scraper.js
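
scraper.js is currently pointed at morethanadiagnosis.org. To reuse it against another site, one option is to take the target URL from the command line; this is a sketch, assuming you fold the idea into scraper.js:

// Usage: node scraper.js https://example.com
const puppeteer = require('puppeteer');

const url = process.argv[2] || 'https://www.morethanadiagnosis.org';

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });
  console.log(await page.title());
  await browser.close();
})();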

Next Steps for Claude Web

  1. Review extracted content at /srv/containers/mtad-api/website_content.json
  2. Create page components based on extracted structure:
    • Home page with hero, Happy Mail, Connect sections
    • Podcast page
    • Resources page
    • Shop page with product cards
    • Support pages (group, circle, journal, in-loving-memory)
  3. Implement navigation based on extracted links
  4. Style with Tailwind - Use existing design system
  5. Add API integration - Connect to backend for dynamic content
  6. Test all pages - Verify navigation, CTAs, responsiveness

Files & Resources

  • /srv/containers/mtad-api/scraper.js - Puppeteer scraper script
  • /srv/containers/mtad-api/website_content.json - Extracted content (JSON)
  • /srv/containers/mtad-api/web/ - Frontend code (Next.js)
  • /srv/containers/mtad-api/HANDOFF_CLAUDE_WEB.md - Frontend deployment handoff

Quick Reference: Extracted Data

Total Content:

  • 13 headings (H1-H6)
  • 24 substantive paragraphs
  • 22 CTA buttons
  • 34 links (navigation and external)
  • 15 images (with alt text)
  • 10 major page sections

Navigation Links Extracted:

  1. Join Our Community → /supportgroup
  2. Order Happy Mail → /happymail
  3. Podcast → /podcast
  4. Resources → /resources
  5. Support Group → /supportgroup
  6. Support Circle → /groups
  7. The Journal → /thejournal
  8. In Loving Memory → /inlovingmemory
  9. Connect With Us → /meetus
  10. Shop → /category/all-products
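
These routes can be lifted directly into a navigation config for the frontend. A sketch; the file location and export name are placeholders:

// lib/navigation.js - route map built from the extracted links
export const NAV_LINKS = [
  { label: 'Podcast', href: '/podcast' },
  { label: 'Resources', href: '/resources' },
  { label: 'Happy Mail', href: '/happymail' },
  { label: 'Support Group', href: '/supportgroup' },
  { label: 'Support Circle', href: '/groups' },
  { label: 'The Journal', href: '/thejournal' },
  { label: 'In Loving Memory', href: '/inlovingmemory' },
  { label: 'Connect With Us', href: '/meetus' },
  { label: 'Shop', href: '/category/all-products' },
];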

Troubleshooting

Scraper Not Finding Content

  • Check internet connection
  • Verify morethanadiagnosis.org is accessible
  • Increase timeout in scraper.js (line 32)
  • Check browser console for JavaScript errors

Missing Images

  • Image URLs are stored in website_content.json
  • Download and host locally in /public/images/
  • Update image src paths in components
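
A sketch of that download step, using the image URLs already captured in the JSON. The web/public/images destination is an assumption; adjust it to the actual repo layout:

// download-images.js - save captured image URLs into the frontend's public folder
const fs = require('fs');
const path = require('path');

const content = JSON.parse(fs.readFileSync('website_content.json', 'utf8'));
const outDir = path.join('web', 'public', 'images');
fs.mkdirSync(outDir, { recursive: true });

(async () => {
  for (const [i, img] of content.images.entries()) {
    if (!img.src || !img.src.startsWith('http')) continue;
    const res = await fetch(img.src); // global fetch is available on Node 18+
    if (!res.ok) continue;
    const ext = path.extname(new URL(img.src).pathname) || '.jpg';
    const file = path.join(outDir, `image-${i}${ext}`);
    fs.writeFileSync(file, Buffer.from(await res.arrayBuffer()));
    console.log('saved', file);
  }
})();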

Layout Questions

  • Wix uses a grid-based mesh layout system
  • Responsive design adapts to mobile/desktop
  • See original site for exact spacing/sizing

Contact/Questions

If you need to:

  • Re-run the scraper: node scraper.js
  • Modify extraction logic: Edit /srv/containers/mtad-api/scraper.js
  • Add new pages: Use extracted content as template
  • Integrate with API: Check /srv/containers/mtad-api/backend/

All code is in the GitHub repo under the main branch. The scraper is production-ready and can be re-run at any time to update content.


Status: CONTENT EXTRACTION COMPLETE
Ready for: Frontend implementation in Next.js
Files Generated: website_content.json with full page structure and content

Good luck with the frontend implementation! The extracted content is comprehensive and ready to use. 🚀