Web Scrape Improvements #33

Open
opened 2020-03-20 04:34:05 -07:00 by Max · 1 comment
Owner
  • Better error catching on scrape
  • Default to favicon if image can't be found
  • Scrape large sections of text
    • Try to isolate article bodies
    • Try to isolate recipes by targeting keywords like tablespoon, tbsp, etc.
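
A rough sketch of the recipe-isolation idea, assuming the page has already been split into text blocks. The keyword list beyond "tablespoon" and "tbsp", the function names, and the `min_hits` threshold are all hypothetical, not taken from the SolidScribe code:

```python
import re
from typing import List, Optional

# Only "tablespoon" and "tbsp" come from the issue; the rest are assumed examples.
RECIPE_KEYWORDS = ("tablespoon", "tbsp", "teaspoon", "tsp", "cup")

# Whole-word match with an optional plural "s", so "cupboard" is not a hit.
_KEYWORD_RE = re.compile(r"\b(?:" + "|".join(RECIPE_KEYWORDS) + r")s?\b")


def recipe_score(text: str) -> int:
    """Count recipe keyword hits in one block of text."""
    return len(_KEYWORD_RE.findall(text.lower()))


def pick_recipe_block(blocks: List[str], min_hits: int = 2) -> Optional[str]:
    """Return the block with the most keyword hits, or None if no block
    reaches min_hits (hypothetical threshold)."""
    best = max(blocks, key=recipe_score, default=None)
    if best is None or recipe_score(best) < min_hits:
        return None
    return best
```

Returning `None` when nothing scores high enough lets the caller fall back to the plain article-body scrape instead of guessing.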
Max added the Enhancement label 2020-03-20 04:34:05 -07:00
Max changed title from Web Scrap Improvements to Web Scrape Improvements 2020-03-20 04:42:21 -07:00
Author
Owner

Scrape Improvement Notes

  • Separate out the bodies of logic that scrape certain sections

    • Keyword scrape logic
    • Image scrape logic
    • Article/recipe text content scrape logic
  • Only prepend URL to images that don't have a full URL

  • Don't try to process ico files

  • If a small image is scraped, don't try to resize it

  • Don't pass URL params like ?v=htu to the scrape
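
The image and URL notes above could look roughly like this. This is a sketch only: the helper names and the `max_side` limit are made up for illustration, and it assumes "URL params" means the query string should be dropped before the URL is handed to the scrape:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Drop the query string and fragment, e.g. '?v=htu'."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def resolve_image_url(page_url: str, src: str) -> str:
    """Prepend the page URL only when src is not already a full URL.
    urljoin returns absolute URLs unchanged."""
    return urljoin(page_url, src)


def is_ico(url: str) -> bool:
    """True for .ico files, which should be skipped by image processing."""
    return urlsplit(url).path.lower().endswith(".ico")


def should_resize(width: int, height: int, max_side: int = 800) -> bool:
    """Only resize when the image is larger than max_side (assumed limit)."""
    return max(width, height) > max_side
```

Usage, with example.com standing in for a real page:

```python
page = strip_query("https://example.com/blog/post?v=htu")
img = resolve_image_url(page, "/img/cover.png")
if not is_ico(img) and should_resize(1600, 900):
    pass  # hand off to the resize step here
```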

Max added the In Progress label 2020-04-09 15:45:23 -07:00
Reference: Max/SolidScribe#33