I initially tried to follow this but I kept having a "couldn't find module" error. Since I have never touched python prior to this, I am unaware how to fix this and the help links are not exactly helpful. If there's someone who could guide me through this tutorial that would be great.

Selenium

Selenium Homepage

I don't really get what this is but I think its some sort of python pack and it tells me to download using the pip command but that doesn't seem to work (syntax error). I don't know how to manually add it in because, again, I have little idea of what I'm doing.

Scrapy

Scrapy Homepage

This one seemed like it'd be an out-of-box deal but not only does it need the pip command to download but it has like 5 other dependencies it needs to function which complicates it more for me.

I am not criticizing these wares, I am just asking for help and if someone could help with the simplification of it all or maybe even point me to an easier method that would be amazing!

Updates

Figured out that I am supposed to run the command for pip in the command prompt thing on my computer, not the python runner. py -m followed by the pip request

Got the Ethan Jarrell tutorial to work and managed to add in selenium, which made me realize that selenium isn't really helpful with the project. rip xP
Spent a bunch of time trying to workshop the basic scraper to work with dynamic sites, unsuccessful
Online self-help doesn't go in as much as I would like, probably due to the legal grey area

you are viewing a single comment's thread
view the rest of the comments

[–] Kissaki@lemmy.dbzer0.com 4 points 10 months ago* (last edited 10 months ago)

Depending on what you want to scape, that's a lot of overkill and overcomplication. Full website testing frameworks may not be necessary to scrape. Python with it's tooling and package management may not be necessary.

I've recently extracted and downloaded stuff via Nushell.

Requirement: Knowledge of CSS Selectors
Inspect Website DOM in Webbrowser web developer tools
1. Identify structure
2. Identify adequate selectors; testable via browser dev tools console document.querySelectorAll()
Get and query data

For me, my command line terminal and scripting language of choice is Nushell:

let $html = http get 'https://example.org/'
let $meta = $html | query web --query '#infobox .title, #infobox .tags' |  | { title: $in.0.0 tags: $in.1.0 }
let $content = $html | query web --query 'main img' --attribute data-src
$meta | save meta.json

1..30 | each {|x| http get $'https://example.org/img/($x).jpg' | save $'($x).jpg'; sleep 100ms }

Depending on the tools you use, it'll be quite similar or very different.

Selenium is an entire web-browser driver meaning it does a lot more and has a more extensive interface because of it; and you can talk to it through different interfaces and languages.