morethanadiagnosis-hub/HANDOFF_WEBSITE_SCRAPER.md
admin da63a31c95 feat: add website scraper and handoff documentation for claude-web
- Create Puppeteer-based scraper for morethanadiagnosis.org
- Extract full page structure, content, navigation, and images
- Generate JSON output with 13 headings, 24 paragraphs, 22 CTAs, 34 links, 15 images
- Add comprehensive handoff doc with implementation guide for frontend
- Document all website sections: Happy Mail, Support, Podcast, Resources, Shop
- Include content themes and recommendations for Next.js components
2025-11-18 17:17:43 +00:00


# Handoff: Website Content Scraper & Frontend Implementation
**Date**: November 18, 2025
**From**: Claude (CL)
**To**: Claude Web
**Status**: Website content extracted and ready for frontend implementation
---
## Overview
A Puppeteer-based web scraper now renders and extracts the morethanadiagnosis.org website. Page content, structure, navigation links, and image references (URLs and alt text, not the image files themselves) have been captured and saved as JSON for frontend replication.
---
## What's Been Completed
### ✅ Web Scraper Created
- **Location**: `/srv/containers/mtad-api/scraper.js`
- **Technology**: Puppeteer (headless browser automation)
- **Purpose**: Dynamically render and extract JavaScript-heavy Wix website content
### ✅ Content Successfully Extracted
- **Output**: `/srv/containers/mtad-api/website_content.json`
- **Format**: Structured JSON with all page elements
### ✅ Data Captured
- **13 Headings** - All H1-H6 elements across the page
- **24 Paragraphs** - Body text and descriptions
- **22 Buttons/CTAs** - Call-to-action elements
- **34 Links** - Navigation and external links
- **15 Images** - Images with alt text and URLs
- **10 Sections** - Major content sections
- **Full text** - Complete rendered page content
---
## Extracted Website Structure
### Navigation Menu
```
Home
├── Podcast
├── Resources
├── Happy Mail
├── Support Group
├── Support Circle
├── The Journal
├── In Loving Memory
├── Connect With Us
└── Shop
```
### Key Pages & Content Areas
#### 1. **Homepage Hero**
- Title: "You are more than a diagnosis"
- Tagline: "Connecting Through Stories, Thriving Through Community"
- Description: Community for folks with chronic illness and those touched by cancer
- CTA: "Join Our Community"
#### 2. **Happy Mail Section**
- Description: Free joy-filled snail mail program
- By: Nerisa (sends to folks navigating cancer/chronic illness)
- Who Can Receive:
  - Cancer diagnosis or treatment
  - Chronic illness or rare disease
  - Medical limbo or recovery
- CTA: "Order Happy Mail"
#### 3. **Support & Community**
- Connect Section with quote: "We're here to create a safe, supportive space..."
- Support Circle Login
- Featured Stories: Jes & Den's journey with cancer/FAP
#### 4. **Podcast**
- "More Than A Diagnosis" podcast
- Hosts: Jes and Den
- Content: Real stories about life beyond medical diagnosis
#### 5. **Resources**
- Curated list of helpful resources for diagnosis navigation
- Financial guidance
- Support materials
- Regularly updated
#### 6. **Wings of Remembrance**
- Memorial tribute section
- Honor those who shaped the journey
#### 7. **Shop Products**
Products with purpose-driven stories:
- **"Worst Club Best Members" Shirt** - Inspired by Nerisa's Happy Mail program
  - Features duck with tattooed cancer ribbon
  - Celebrates community resilience
- **"More Than A Diagnosis" Shirt** - Reminder of strength beyond diagnosis
  - For cancer/chronic illness advocates
  - Proceeds support advocacy work
- **"I Don't Want To / I Get To" Shirt** - Jes's motto during cancer treatment
  - Perspective shift: "I get to because some folks don't get to"
  - Personal empowerment message
- **Ribbon Collection** - Multi-cancer awareness
  - Represents all cancer types equally
  - Community-focused design
---
## Extracted Content File
**Location**: `/srv/containers/mtad-api/website_content.json`
**Structure**:
```json
{
"title": "Page title",
"url": "Page URL",
"headings": [
{ "level": "H1", "text": "..." }
],
"paragraphs": ["..."],
"buttons": [
{ "text": "...", "href": "...", "class": "..." }
],
"links": [
{ "text": "...", "href": "..." }
],
"images": [
{ "src": "...", "alt": "...", "title": "..." }
],
"sections": [
{ "heading": "...", "content": "..." }
],
"fullText": "Complete rendered page text"
}
```
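Before building components, it can help to confirm that the file on disk matches the counts reported above. The following is a minimal sketch, assuming the JSON follows the structure shown; `summarizeContent` and `summarizeFile` are hypothetical helpers and are not part of `scraper.js`.

```javascript
// Hypothetical helpers (not part of scraper.js): count the array fields
// of an object shaped like website_content.json.
const fs = require("fs");

const ARRAY_FIELDS = ["headings", "paragraphs", "buttons", "links", "images", "sections"];

function summarizeContent(content) {
  const summary = {};
  for (const field of ARRAY_FIELDS) {
    // Treat a missing or non-array field as empty instead of throwing.
    summary[field] = Array.isArray(content[field]) ? content[field].length : 0;
  }
  return summary;
}

// Reads a JSON file from disk and returns its per-field counts.
function summarizeFile(filePath) {
  const content = JSON.parse(fs.readFileSync(filePath, "utf8"));
  return summarizeContent(content);
}
```

Calling `summarizeFile("/srv/containers/mtad-api/website_content.json")` should return an object whose values can be compared with the totals listed under "Data Captured".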
---
## How to Use the Scraper
### Run the Scraper
```bash
cd /srv/containers/mtad-api
node scraper.js
```
### Output
- Displays content summary to console
- Saves full JSON to `website_content.json`
- Shows preview of extracted content
- Lists navigation links found
### Modify the Scraper
Edit `/srv/containers/mtad-api/scraper.js` to:
- Change target URL
- Modify extraction selectors
- Add/remove data fields
- Adjust wait times for slower pages
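The modification points above (target URL, selectors, fields, wait times) can be pictured as one config object feeding a single extraction call. This is a sketch only: the actual `scraper.js` may be organized differently, and `DEFAULT_CONFIG`, `buildConfig`, `scrape`, the `https://` scheme, the selectors, and the 60-second timeout are all assumptions made for illustration.

```javascript
// Sketch only: names, selectors, and timeout values are assumptions,
// not a copy of the real scraper.js.
const DEFAULT_CONFIG = {
  url: "https://morethanadiagnosis.org", // scheme assumed
  timeoutMs: 60000, // navigation timeout; raise for slower pages
  selectors: {
    headings: "h1, h2, h3, h4, h5, h6",
    paragraphs: "p",
    links: "a[href]",
    images: "img",
  },
};

// Merge overrides into the defaults without mutating DEFAULT_CONFIG.
function buildConfig(overrides = {}) {
  return {
    ...DEFAULT_CONFIG,
    ...overrides,
    selectors: { ...DEFAULT_CONFIG.selectors, ...(overrides.selectors || {}) },
  };
}

async function scrape(overrides) {
  const config = buildConfig(overrides);
  // Required lazily so buildConfig works even where Puppeteer is not installed.
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // "networkidle2" waits for network activity to settle, which suits
    // JavaScript-rendered pages.
    await page.goto(config.url, { waitUntil: "networkidle2", timeout: config.timeoutMs });
    // This callback runs in the browser; selectors are passed in as an argument.
    return await page.evaluate((sel) => ({
      title: document.title,
      url: location.href,
      headings: [...document.querySelectorAll(sel.headings)].map((h) => ({
        level: h.tagName,
        text: h.textContent.trim(),
      })),
      paragraphs: [...document.querySelectorAll(sel.paragraphs)].map((p) => p.textContent.trim()),
      links: [...document.querySelectorAll(sel.links)].map((a) => ({
        text: a.textContent.trim(),
        href: a.href,
      })),
      images: [...document.querySelectorAll(sel.images)].map((img) => ({
        src: img.src,
        alt: img.alt,
        title: img.title,
      })),
    }), config.selectors);
  } finally {
    // Always release the browser, even if navigation or extraction fails.
    await browser.close();
  }
}
```

With this shape, a slower page would only need something like `scrape({ timeoutMs: 120000 })`, and a different site would only need a different `url`.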
---
## Key Implementation Notes for Frontend
### Design Approach
The website uses:
- Clean, minimalist design
- Community-focused messaging
- Story-driven content
- Purpose-driven product section
- Warm, accessible tone
### Content Themes
1. **Connection & Community** - Central to brand identity
2. **Stories & Authenticity** - Real people, real journeys
3. **Support & Resources** - Practical help alongside emotional support
4. **Resilience & Empowerment** - Beyond medical labels
5. **Inclusivity** - All diagnoses, all experiences matter
### Recommended Components to Build
- [ ] Hero section with tagline
- [ ] Happy Mail card/section
- [ ] Story/testimonial cards (Jes & Den)
- [ ] Podcast section
- [ ] Resources directory
- [ ] Product showcase (with story narratives)
- [ ] Memorial/tribute section
- [ ] CTA buttons throughout
### Copy Guidelines
- Use warm, empathetic language
- Focus on community and connection
- Highlight real stories and experiences
- Emphasize accessibility and support
- Avoid medical jargon where possible
---
## Dependencies
### Already Installed
- Puppeteer (v23.0.0+)
- Node.js (v18+)
### To Scrape Other Sites
```bash
npm install puppeteer
node scraper.js
```
---
## Next Steps for Claude Web
1. **Review extracted content** at `/srv/containers/mtad-api/website_content.json`
2. **Create page components** based on extracted structure:
   - Home page with hero, Happy Mail, Connect sections
   - Podcast page
   - Resources page
   - Shop page with product cards
   - Support pages (group, circle, journal, in-loving-memory)
3. **Implement navigation** based on extracted links
4. **Style with Tailwind** - Use existing design system
5. **Add API integration** - Connect to backend for dynamic content
6. **Test all pages** - Verify navigation, CTAs, responsiveness
---
## Files & Resources
| File | Purpose |
|------|---------|
| `/srv/containers/mtad-api/scraper.js` | Puppeteer scraper script |
| `/srv/containers/mtad-api/website_content.json` | Extracted content (JSON) |
| `/srv/containers/mtad-api/web/` | Frontend code (Next.js) |
| `/srv/containers/mtad-api/HANDOFF_CLAUDE_WEB.md` | Frontend deployment handoff |
---
## Quick Reference: Extracted Data
**Total Content**:
- 13 headings (H1-H6)
- 24 paragraphs
- 22 buttons/CTAs
- 34 links (navigation and external)
- 15 images (with alt text)
- 10 major page sections
**Navigation Links Extracted**:
1. Join Our Community → /supportgroup
2. Order Happy Mail → /happymail
3. Podcast → /podcast
4. Resources → /resources
5. Support Group → /supportgroup
6. Support Circle → /groups
7. The Journal → /thejournal
8. In Loving Memory → /inlovingmemory
9. Connect With Us → /meetus
10. Shop → /category/all-products
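For the navigation step, the routes above could be kept in one shared module that the header and footer both import. The following is a minimal sketch: `NAV_LINKS`, `CTA_LINKS`, and `hrefFor` are hypothetical names, the menu order follows the Navigation Menu tree earlier in this document, and pointing the "Happy Mail" menu item at `/happymail` is an assumption based on the "Order Happy Mail" CTA.

```javascript
// Routes copied from the extracted link list; names are hypothetical.
const NAV_LINKS = [
  { label: "Podcast", href: "/podcast" },
  { label: "Resources", href: "/resources" },
  { label: "Happy Mail", href: "/happymail" }, // assumed: same target as the CTA
  { label: "Support Group", href: "/supportgroup" },
  { label: "Support Circle", href: "/groups" },
  { label: "The Journal", href: "/thejournal" },
  { label: "In Loving Memory", href: "/inlovingmemory" },
  { label: "Connect With Us", href: "/meetus" },
  { label: "Shop", href: "/category/all-products" },
];

// Call-to-action buttons that reuse navigation routes.
const CTA_LINKS = [
  { label: "Join Our Community", href: "/supportgroup" },
  { label: "Order Happy Mail", href: "/happymail" },
];

// Case-insensitive lookup across menu items and CTAs; returns null if unknown.
function hrefFor(label) {
  const wanted = label.trim().toLowerCase();
  const match = [...NAV_LINKS, ...CTA_LINKS].find((l) => l.label.toLowerCase() === wanted);
  return match ? match.href : null;
}
```

Keeping the routes in one place means a path change in the new frontend only needs to be made once.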
---
## Troubleshooting
### Scraper Not Finding Content
- Check internet connection
- Verify morethanadiagnosis.org is accessible
- Increase timeout in scraper.js (line 32)
- Check browser console for JavaScript errors
### Missing Images
- Image URLs are stored in website_content.json
- Download and host locally in `/public/images/`
- Update image src paths in components
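The path update in the last step can be scripted once the files are downloaded. This is a minimal sketch, assuming each image is saved under the last path segment of its original URL; `localImagePath` and `rewriteImages` are hypothetical helpers. In Next.js, files placed in `public/` are served from the site root, so `public/images/photo.jpg` is referenced as `/images/photo.jpg`.

```javascript
const path = require("path");

// Hypothetical helper: map a remote image URL to a local path under /images/.
// Assumes the last path segment of the URL is a usable file name.
function localImagePath(src) {
  const { pathname } = new URL(src); // drops the query string and hash
  const fileName = path.posix.basename(pathname);
  return `/images/${fileName}`;
}

// Returns copies of the image entries with src pointing at the local path.
function rewriteImages(images) {
  return images.map((img) => ({ ...img, src: localImagePath(img.src) }));
}
```

Two remote URLs that end in the same file name would map to the same local path, so it is worth checking for collisions before overwriting anything.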
### Layout Questions
- Wix uses a grid-based mesh layout system
- Responsive design adapts to mobile/desktop
- See original site for exact spacing/sizing
---
## Contact/Questions
If you need to:
- Re-run the scraper: `node scraper.js`
- Modify extraction logic: Edit `/srv/containers/mtad-api/scraper.js`
- Add new pages: Use extracted content as template
- Integrate with API: Check `/srv/containers/mtad-api/backend/`
All code is in the GitHub repo on the main branch. The scraper can be re-run at any time to refresh the extracted content.
---
**Status**: CONTENT EXTRACTION COMPLETE ✅
**Ready for**: Frontend implementation in Next.js
**Files Generated**: website_content.json with full page structure and content
Good luck with the frontend implementation! The extracted content is comprehensive and ready to use. 🚀