If you’ve tried scraping data from messy web pages, you know the pain: inconsistent layouts, random HTML, and elements that seem to shift every time you reload. Out-of-the-box tools usually choke on this stuff. If you’re looking to bend a tool like Scrapestorm to your will—extracting exactly the data you want, even from the ugliest sites—this guide is for you.
Below, I’ll walk you through setting up custom extraction rules in Scrapestorm for those chaotic, “unstructured” pages. We’ll skip the hype and stick to what actually works (and what doesn’t), so you can spend less time fighting with your scrapers and more time getting useful data.
1. Understand What “Unstructured” Really Means
Before you fire up Scrapestorm, get clear on what you’re dealing with. “Unstructured” web pages usually mean:
- There’s no tidy table or list.
- Item layouts are inconsistent—sometimes missing fields, sometimes reordered.
- The HTML might be full of inline styles, nested divs, or random junk.
- Elements might not have reliable IDs, classes, or data attributes.
Pro tip: Don’t assume AI or “auto-detect” features will magically solve this. They’re good for simple cases, but unstructured pages almost always need hand-crafted rules.
2. Prep: Set Up Your Environment
Let’s keep this simple:
- Make sure you’ve got the latest version of Scrapestorm installed (desktop is easier for custom rules than cloud).
- Open the site you want to scrape in your browser. Right-click and “Inspect” the elements you care about. Get cozy with the HTML—it’s your map.
- Gather a list of URLs you want to scrape. Scrapestorm can crawl, but for testing, start with just one or two.
3. Start a New Task and Enter the Target URL
- Open Scrapestorm and hit “New Task.”
- Paste your target URL. Let Scrapestorm load the page preview.
At this point, Scrapestorm will try to auto-extract “obvious” data. If your page is a mess, it’ll probably miss the mark. That’s normal.
4. Switch to Manual (Advanced) Extraction
- Look for the “Manual Mode” or “Custom Extraction” option (names change slightly between versions, but it’s there).
- Ignore the “AI Extraction” or “One-Click Extraction” for now—they’re decent for clean sites, but will almost always fail on unstructured layouts.
Now you’re in the driver’s seat. This is where you build extraction rules by hand.
5. Identify Patterns and Anchor Points
The biggest mistake? Trying to write one rule for everything. Instead:
- Scan the HTML. Look for anything consistent: maybe every item starts with an <h2>, or a certain phrase always appears before the price.
- Find anchor points. If there are fields you always want (like “Title”), see if they always have the same type of container or nearby sibling.
- Accept imperfection. Sometimes, a rule will grab a bit of extra junk. That’s okay—cleaning up later is easier than missing data entirely.
What NOT to do: Don’t waste time trying to build a perfect, all-in-one selector. It’s better to get “good enough” and iterate.
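If you want a quick, tool-agnostic way to spot those anchor points, a short script can help. The sketch below is an assumption-heavy example, not part of Scrapestorm: it reads a saved copy of the page (the filename "page.html" is hypothetical) and counts which tag/class combinations repeat, since frequently repeated classes are usually the item containers you want to anchor on. It assumes you have Python and lxml installed.

```python
# Minimal sketch: scan a saved copy of the page and count tag/class pairs.
# Classes that repeat many times are good candidates for item containers.
# "page.html" is a hypothetical filename; requires lxml (pip install lxml).
from collections import Counter
from lxml import html

tree = html.parse("page.html")
counts = Counter()
for el in tree.getroot().iter():
    if not isinstance(el.tag, str):  # skip comments and processing instructions
        continue
    cls = el.get("class")
    if cls:
        for name in cls.split():
            counts[(el.tag, name)] += 1

for (tag, name), n in counts.most_common(15):
    print(f"{tag}.{name}: {n} occurrences")
```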
6. Build Your Extraction Rules
This is the “meat” of the process:
a. Use XPath or CSS Selectors
Scrapestorm lets you specify extraction rules using XPath or CSS selectors. Here’s how to approach it:
- Right-click the element you want in the preview, and pick “Extract this element.” Scrapestorm will auto-fill a selector.
- Check the selector: Is it too specific (relies on dynamic IDs)? Too broad (grabs the whole page)? Edit as needed.
- Test the rule on several examples. Use the built-in preview to see what it grabs.
Quick guide:
- XPath is more powerful for irregular layouts but has a bit of a learning curve.
- CSS selectors are simpler but can be brittle if the page structure changes.
Example:
If you want to grab product titles, and they’re inside <div> elements with class product-title, try:
- XPath: //div[contains(@class, 'product-title')]
- CSS: div.product-title
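Before pasting a selector into Scrapestorm, it can be worth sanity-checking it against a saved copy of the page. This is just a rough sketch using the example XPath above; "page.html" is a hypothetical filename, and it assumes lxml is installed.

```python
# Sanity-check the example XPath against a saved page before using it in Scrapestorm.
# If this prints nothing or obvious junk, adjust the selector first.
from lxml import html

tree = html.parse("page.html")  # hypothetical saved copy of the target page
for div in tree.xpath("//div[contains(@class, 'product-title')]"):
    text = div.text_content().strip()
    if text:
        print(text)
```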
b. Handle Optional/Missing Fields
Unstructured pages often skip fields or change their order. Here’s how to deal:
- Write fallback selectors (Scrapestorm lets you chain rules or use “if exists” logic).
- Accept that some rows will be blank, and plan to clean up missing data later.
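To make the fallback idea concrete, here is a small sketch of the same "try the preferred selector, then a backup, else leave it blank" logic outside Scrapestorm. The selectors, field, and filename are all hypothetical; it only illustrates the pattern, not how Scrapestorm implements it internally.

```python
# Sketch of chained/fallback selectors: try each XPath in order,
# return the first non-empty result, else an empty string.
from lxml import html

def first_match(node, xpaths):
    """Return the first non-empty XPath result as stripped text, or ''."""
    for xp in xpaths:
        for value in node.xpath(xp):
            text = value if isinstance(value, str) else value.text_content()
            if text.strip():
                return text.strip()
    return ""  # blank field; clean it up downstream

tree = html.parse("page.html")  # hypothetical saved copy of the page
for item in tree.xpath("//div[contains(@class, 'product')]"):  # hypothetical item container
    price = first_match(item, [
        ".//span[@class='price']/text()",         # preferred layout
        ".//*[contains(@class, 'cost')]/text()",  # fallback when markup varies
    ])
    print(price or "<missing>")
```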
c. Extract Text, Attributes, or HTML
Specify exactly what you want:
- Text: Usually default—gets the visible text.
- Attribute: For links or images, grab the href or src.
- HTML: Sometimes you want the full block, not just text.
Be precise. If you only want the URL, don’t grab the full <a> tag, just the href.
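The difference between the three output types is easier to see side by side. This sketch is an illustration with a made-up selector and filename, not Scrapestorm's own mechanism:

```python
# Text vs. attribute vs. full HTML for the same element (hypothetical selector).
from lxml import html

tree = html.parse("page.html")
for link in tree.xpath("//a[contains(@class, 'product-link')]"):
    print(link.text_content().strip())              # Text: visible text only
    print(link.get("href"))                         # Attribute: just the URL
    print(html.tostring(link, encoding="unicode"))  # HTML: the whole <a> block
```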
7. Set Up Pagination (If Needed)
If your data spans multiple pages:
- Use Scrapestorm’s pagination tool. You’ll often have to point it to the “Next” button or link.
- For unstructured pages, avoid “infinite scroll” if you can—Scrapestorm’s handling is hit-or-miss.
- Test your rule on the second and third pages. Pagination links sometimes change or break.
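If you want to confirm that your selector survives pagination before committing to a long crawl, a rough external check looks something like the sketch below. The starting URL, the "Next" link text, and the item selector are all assumptions; it just follows a few pages and counts matches.

```python
# Follow a hypothetical "Next" link for a few pages and count matches per page.
# Requires requests and lxml; URL and selectors are assumptions.
import requests
from lxml import html
from urllib.parse import urljoin

url = "https://example.com/listings?page=1"  # hypothetical starting URL
for _ in range(3):  # only spot-check the first few pages
    page = html.fromstring(requests.get(url, timeout=10).text)
    items = page.xpath("//div[contains(@class, 'product-title')]")
    print(url, "->", len(items), "items")
    next_links = page.xpath("//a[contains(text(), 'Next')]/@href")
    if not next_links:
        break
    url = urljoin(url, next_links[0])
```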
8. Clean and Transform Your Data On the Fly
Scrapestorm has basic data cleanup tools:
- Trim whitespace—checkbox option.
- Find and replace (e.g., remove “USD” from prices).
- Regex extraction for pulling out patterns within text.
Don’t try to do all your data cleaning here. Just get the basics—major fixes are easier in Excel, Python, or wherever you process your data later.
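For reference, the trim / find-and-replace / regex-extraction steps map to very ordinary string handling if you end up doing them downstream instead. The sample strings below are made up:

```python
# Trim, find/replace, and regex extraction on messy price strings (sample data is made up).
import re

raw_prices = ["  USD 1,299.00 ", "USD 89.50\n", "Call for price"]

for raw in raw_prices:
    cleaned = raw.strip().replace("USD", "").strip()   # trim + find/replace
    match = re.search(r"\d[\d,]*\.?\d*", cleaned)      # regex extraction
    print(match.group(0) if match else "<no price found>")
```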
9. Test, Preview, and Fix Edge Cases
- Preview your extraction. Scrapestorm shows you what your rules pull out. Don’t skip this step.
- Test on a handful of pages—not just one. Unstructured sites love to throw curveballs.
- Adjust your selectors as needed when you spot weird or missing data.
Pro tip: Save and label versions of your rules. When (not if) the site changes, you’ll be glad you did.
10. Export and Review Your Data
- Export as CSV, Excel, or JSON.
- Open your export right away—don’t assume it’s all perfect.
- Look for:
- Missing or misaligned fields
- Junk data (e.g., HTML tags where you wanted text)
- Duplicates
Scrapestorm is fast at exporting, but don’t let that lull you into skipping the quality check.
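A quick script can do the first pass of that quality check for you. This is a sketch under assumptions: "export.csv" and the "title" column are hypothetical, so adjust them to whatever your export actually contains.

```python
# Post-export sanity check: blank fields, leftover HTML tags, duplicate titles.
# "export.csv" and the "title" column are hypothetical; adapt to your export.
import csv
from collections import Counter

with open("export.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

blank = sum(1 for r in rows if not all((v or "").strip() for v in r.values()))
html_junk = sum(1 for r in rows if any("<" in (v or "") for v in r.values()))
dupes = sum(n - 1 for n in Counter(r.get("title", "") for r in rows).values() if n > 1)

print(f"{len(rows)} rows: {blank} with blank fields, "
      f"{html_junk} with HTML tags, {dupes} duplicate titles")
```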
What Works, and What Doesn’t
- Works: Manual extraction with XPath/CSS, batch testing, iterative tweaking.
- Doesn’t work: Auto-extract for messy pages, assuming every item is structured the same, building one huge selector for everything.
- Ignore: Fancy “AI extraction” and “template” features until you’ve nailed down your custom rules. They’re fine for simple sites but will waste your time on anything unstructured.
When to Stop “Perfecting” and Just Ship It
Don’t get sucked into endless tweaking. If you’re getting 90% of the data cleanly and consistently, pull the trigger and export. Plan to revisit if the site changes or you spot big gaps.
Remember: Web pages change. Your rules will break eventually. The fastest way to stay productive is to keep your process simple, start small, and get comfortable with regular updates.
Bottom line: Setting up custom extraction rules in Scrapestorm for unstructured pages isn’t magic—it’s a bit of detective work, a bit of trial and error, and lots of patience. Start with small, testable rules, don’t chase “perfect,” and you’ll get the data you need without losing your mind.