this post was submitted on 18 Nov 2024

52 points (94.8% liked)

Piracy: ꜱᴀɪʟ ᴛʜᴇ ʜɪɢʜ ꜱᴇᴀꜱ

54669 readers

545 users here now

⚓ Dedicated to the discussion of digital piracy, including ethical problems and legal advancements.

Rules • Full Version

1. Posts must be related to the discussion of digital piracy

2. Don't request invites, trade, sell, or self-promote

3. Don't request or link to specific pirated titles, including DMs

4. Don't submit low-quality posts, be entitled, or harass others

Loot, Pillage, & Plunder

📜 c/Piracy Wiki (Community Edition):

💰 Please help cover server costs.


Ko-fi	Liberapay

founded 1 year ago

MODERATORS

db0@lemmy.dbzer0.com

sunbrothersco@lemmy.dbzer0.com

dataprolet@lemmy.dbzer0.com

Flatworm7591@lemmy.dbzer0.com

RandomLegend@lemmy.dbzer0.com

How do I use Open Source scrapers? (Selenium, Scrapy, etc.) (lemmy.dbzer0.com)

submitted 2 days ago* (last edited 2 days ago) by MaggotInfested@lemmy.dbzer0.com to c/piracy@lemmy.dbzer0.com

39 comments fedilink hide all child comments

I have been trying for hours to figure this out. From a building tutorial to just trying to find prebuilt ones, I can't seem to make it click.

For context I am trying to scrape books myself that I can't seem to find elsewhere so I can use and post them for others.

The scraper tutorial

Hackernoon tutorial by Ethan Jarell

I initially tried to follow this but I kept having a "couldn't find module" error. Since I have never touched python prior to this, I am unaware how to fix this and the help links are not exactly helpful. If there's someone who could guide me through this tutorial that would be great.

Selenium

Selenium Homepage

I don't really get what this is but I think its some sort of python pack and it tells me to download using the pip command but that doesn't seem to work (syntax error). I don't know how to manually add it in because, again, I have little idea of what I'm doing.

Scrapy

Scrapy Homepage

This one seemed like it'd be an out-of-box deal but not only does it need the pip command to download but it has like 5 other dependencies it needs to function which complicates it more for me.

I am not criticizing these wares, I am just asking for help and if someone could help with the simplification of it all or maybe even point me to an easier method that would be amazing!

Updates

Figured out that I am supposed to run the command for pip in the command prompt thing on my computer, not the python runner. py -m followed by the pip request

Got the Ethan Jarrell tutorial to work and managed to add in selenium, which made me realize that selenium isn't really helpful with the project. rip xP
Spent a bunch of time trying to workshop the basic scraper to work with dynamic sites, unsuccessful
Online self-help doesn't go in as much as I would like, probably due to the legal grey area

you are viewing a single comment's thread
view the rest of the comments

[–] aMockTie@beehaw.org 4 points 2 days ago (11 children)

Selenium is really more of a testing framework for frontend developers, and could theoretically be used for scraping, but that would be somewhat like buying a car based on the paint and not looking in detail under the hood.

I can't say I've ever worked with scrappy, but the tool I would use for web scraping with Python is BeautifulSoup. This tutorial seems decent enough, but you will need to understand basic web concepts like IDs, classes, tags, and tag attributes to get the most out of the tutorial: https://geekpython.medium.com/web-scraping-in-python-using-beautifulsoup-3207c038723b

W3Schools will also be your friend if you have questions about HTML/CSS selectors in general: https://www.w3schools.com/html/default.asp

Understanding regular expressions and/or xpath would also be very helpful, but are probably best considered to be extra credit in most cases.

I'll try to respond if you have any issues or questions, but hopefully that gives you enough to get started.

[–] chicken@lemmy.dbzer0.com 6 points 2 days ago (6 children)

The reason to use Selenium is if the website you want to scrape uses javascript in a way that inhibits getting content without a full browser environment. BeautifulSoup is just a parser, it can't solve that problem.

[–] aMockTie@beehaw.org 3 points 2 days ago (1 children)

In my experience, this scenario typically means that there is some sort of API (very likely undocumented) that is being used on the backend. That requires a bit more investigation and testing with browser developer tools, the JS Console, and often trial and error. But once you overcome that (admittedly very complex and technical) hurdle, you can almost always get away with just using the requests library at that point.

I've had to do that kind of thing more times than I'd like to admit, but the juice is almost always worth the squeeze.

[–] chicken@lemmy.dbzer0.com 1 points 2 days ago (1 children)

Well if I was doing it I probably would be trying to focus on browser emulation to avoid having to dig into those sorts of details. It sounds like OP is a beginner and needs a simple method.

[–] aMockTie@beehaw.org 2 points 2 days ago

I agree that OP sounds like a beginner, and what you've suggested is likely the best approach for someone who is familiar with frontend tools and frameworks. Selenium (and admittedly BeautifulSoup) is probably too low level for this particular user, but that doesn't mean they can't still learn some fundamentals while solving this problem without resorting to something as heavy and complicated as background browser emulation and rendering. I could be wrong though.

load more comments (4 replies)

load more comments (8 replies)