Tiny image scraper for xkcd.com
$begingroup$
This came out on the spur of the moment, as a quick and dirty tool to get the job done.
This is a simple image scraper for the immensely popular and legendary comic website xkcd.com, inspired by a Python Easter egg.
For those who don't know it, run your Python interpreter, type import antigravity, and hit Enter. :)
As for the code below, I'd appreciate any feedback, particularly with regard to threading, as I'm new to this.
#!/usr/bin/python3
import os
import sys
import time
import threading
from pathlib import Path
from shutil import copyfileobj
import requests
from lxml import html
BASE_URL = "https://www.xkcd.com/"
ARCHIVE = "https://www.xkcd.com/archive"
SAVE_DIRECTORY = Path('xkcd_comics')
LOGO = """
_ _
tiny | | image | | downloader for
__ _| | _____ __| | ___ ___ _ __ ___
/ / |/ / __/ _` | / __/ _ | '_ ` _
> <| < (_| (_| || (_| (_) | | | | | |
/_/__|______,_(_)______/|_| |_| |_|
version 0.1
"""
def show_logo():
    print(LOGO)


def fetch_url(url: str) -> requests.Response:
    return requests.get(url)


def head_option(values: list) -> str:
    return next(iter(values), None)


def get_penultimate(url: str) -> int:
    page = fetch_url(url)
    tree = html.fromstring(page.content)
    newest_comic = head_option(
        tree.xpath('//*[@id="middleContainer"]/a[1]/@href'))
    return int(newest_comic.replace("/", ""))


def get_images_from_page(url: str) -> str:
    page = fetch_url(url)
    tree = html.fromstring(page.content)
    return head_option(tree.xpath('//*[@id="comic"]//img/@src'))


def get_number_of_pages(latest_comic: int) -> int:
    print(f"There are {latest_comic} comics.")
    print(f"How many do you want to download? Type 0 to exit.")
    while True:
        try:
            number_of_comics = int(input(">> "))
        except ValueError:
            print("Error: Expected a number. Try again.")
            continue
        if number_of_comics > latest_comic or number_of_comics < 0:
            print("Error: Incorrect number of comics. Try again.")
            continue
        elif number_of_comics == 0:
            sys.exit()
        return number_of_comics


def clip_url(img: str) -> str:
    return img.rpartition("/")[-1]


def make_dir():
    return os.makedirs(SAVE_DIRECTORY, exist_ok=True)


def save_image(img: str):
    comic_name = clip_url(img)
    print(f"Downloading: {comic_name}")
    f_name = SAVE_DIRECTORY / comic_name
    with requests.get("https:" + img, stream=True) as img, open(f_name, "wb") \
            as output:
        copyfileobj(img.raw, output)


def show_time(seconds: int) -> int:
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    time_elapsed = f"{hours:02d}:{minutes:02d}:{seconds:02d}"
    return time_elapsed


def get_xkcd():
    show_logo()
    make_dir()

    collect_garbage = []
    latest_comic = get_penultimate(ARCHIVE)
    pages = get_number_of_pages(latest_comic)

    start = time.time()
    for page in reversed(range(latest_comic - pages + 1, latest_comic + 1)):
        print(f"Fetching page {page} out of {latest_comic}")
        try:
            url = get_images_from_page(f"{BASE_URL}{page}/")
            thread = threading.Thread(target=save_image, args=(url, ))
            thread.start()
        except (ValueError, AttributeError, requests.exceptions.MissingSchema):
            print(f"WARNING: Invalid comic image source url.")
            collect_garbage.append(f"{BASE_URL}{page}")
            continue
    thread.join()
    end = time.time()

    print(f"Downloaded {pages} comic(s) in {show_time(int(end - start))}.")

    if len(collect_garbage) > 0:
        print("However, was unable to download images for these pages:")
        print("\n".join(page for page in collect_garbage))


def main():
    get_xkcd()


if __name__ == '__main__':
    main()
python python-3.x web-scraping
$endgroup$
asked 11 hours ago by baduker
edited 10 hours ago by baduker
1 Answer
$begingroup$
Globals
As it is, there's no point to LOGO being a global. It's only used by show_logo, so move it there.
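A minimal sketch of what that move could look like. The banner text below is a shortened placeholder that I made up for illustration, not the original ASCII art:

```python
def show_logo() -> None:
    """Print the banner; the text lives inside the only function that uses it."""
    # Placeholder banner (an assumption for this sketch, not the original art).
    logo = (
        "tiny image downloader for xkcd.com\n"
        "version 0.1"
    )
    print(logo)
```

With this shape nothing else in the module can accidentally depend on the banner, and the module namespace keeps only values that are genuinely shared.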
Base URLs
You correctly saved a base URL, but then didn't use it in the correct contexts. Particularly problematic:
ARCHIVE = "https://www.xkcd.com/archive"
This ignores the BASE_URL entirely, when it shouldn't.
fetch_url is currently useless - it doesn't add anything to requests.get. You could make it useful by making the argument a path relative to the base path.
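One way this could look, as a rough sketch rather than the definitive fix. The names resolve and ARCHIVE_PATH and the optional get parameter are my own additions for illustration (the parameter only exists so the function can be tried without network access); urljoin is the standard-library call:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.xkcd.com/"
ARCHIVE_PATH = "archive"  # hypothetical name: a path relative to BASE_URL


def resolve(path: str) -> str:
    """Turn a site-relative path such as 'archive' or '614/' into a full URL."""
    return urljoin(BASE_URL, path)


def fetch_url(path: str, get=None):
    """Fetch a path relative to BASE_URL.

    `get` is a hypothetical injection point; it defaults to requests.get.
    """
    if get is None:
        # Third-party dependency, imported lazily so the sketch loads without it.
        import requests
        get = requests.get
    return get(resolve(path))
```

Callers would then write fetch_url(ARCHIVE_PATH) or fetch_url(f"{page}/"), and the host name would appear in exactly one place.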
with requests.get("https:" + img
# ...
url = get_images_from_page(f"{BASE_URL}{page}/")
Naive string concatenation is not the right thing to do here. Python's standard library has a full-featured urllib.parse module to deal with URL parsing and construction.
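As an illustration of the kind of thing urllib.parse can do here (a sketch under my own assumptions: the helper names are hypothetical, and the image path is only an example value of the scheme-less src form the original code prepends "https:" to):

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlsplit

BASE_URL = "https://www.xkcd.com/"


def page_url(number: int) -> str:
    """Build the URL of a comic page from its number."""
    return urljoin(BASE_URL, f"{number}/")


def image_url(page: str, src: str) -> str:
    """Resolve an <img src> value against the page it was found on.

    urljoin fills in the scheme for '//host/path' values and leaves
    already-absolute URLs untouched.
    """
    return urljoin(page, src)


def file_name(url: str) -> str:
    """Take the last path segment, ignoring any query string or fragment."""
    return PurePosixPath(urlsplit(url).path).name


page = page_url(353)
full = image_url(page, "//imgs.xkcd.com/comics/python.png")
print(full)             # https://imgs.xkcd.com/comics/python.png
print(file_name(full))  # python.png
```

Resolving against the page URL also keeps working if a src ever comes back absolute or site-relative, which the "https:" + img approach does not.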
show_time
divmod on a numeric time interval is not the right thing to do. Use datetime.timedelta.
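For example, something along these lines (a sketch; note that str() on a timedelta does not zero-pad the hours, so the output differs slightly from the original hh:mm:ss format):

```python
from datetime import timedelta


def show_time(seconds: float) -> str:
    """Format an elapsed time in seconds as H:MM:SS."""
    # Round to whole seconds so no microsecond part shows up in the output.
    return str(timedelta(seconds=round(seconds)))


print(show_time(3725))  # 1:02:05
print(show_time(59))    # 0:00:59
```

This also fixes the return annotation: the function returns a string, not an int.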
$endgroup$
answered 3 hours ago by Reinderien