- builtins.object
  - Scraper

class Scraper(builtins.object)

    Scraper(url_strings: list[src.Scraping.URLBuilder.URLBuilder], page_limit: int) -> None
Methods defined here:

- __init__(self, url_strings: list[src.Scraping.URLBuilder.URLBuilder], page_limit: int) -> None
  Scraper class for scraping data from OLX.
  :param url_strings: List of URLBuilder objects for scraping data.
  :param page_limit: Limit of pages to scrape for each URL.
  (A usage sketch follows the method list.)
- add_url(self, url: src.Scraping.URLBuilder.URLBuilder) -> None
  Adds a URLBuilder object to the list of URLs to scrape.
  :param url: URLBuilder object to add.
  :return:

- find_count(self, soup: bs4.BeautifulSoup) -> int
  Finds the number of listings on the page.
  :param soup: Soup object to search for the count.
  :return: Number of listings on the page.

- load_scraping_history(self) -> list[dict[str, typing.Union[str, datetime.datetime]]]
  Loads the scraping history from the scraping history file.
  :return: List of scraping history entries.

- save_scrape_date(self) -> None
  Saves the date of the last scrape to the scraping history file.
  :return:

- async scrape_data(self, progress_callback: Callable[[int], NoneType] = None) -> dict[str, pandas.core.frame.DataFrame]
  Scrapes data from the URLs asynchronously.
  :param progress_callback: Callback function to update the progress bar.
  :return: Dictionary of data frames with scraped data.

- update_url_list(self, config: dict) -> None
  Updates the URL list with the given configuration.
  :param config: Configuration dictionary.
  :return:
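The listing below is a minimal usage sketch, not part of the generated documentation. It relies only on the signatures above; the URLBuilder keyword arguments, the Scraper module path, and the meaning of the progress value passed to the callback are assumptions.

```python
import asyncio

from src.Scraping.URLBuilder import URLBuilder
from src.Scraping.Scraper import Scraper  # module path for Scraper is assumed from the package layout


def show_progress(value: int) -> None:
    # Hypothetical callback; scrape_data passes it an int, per the signature above.
    print(f"scraping... {value}")


async def main() -> None:
    # URLBuilder keyword arguments are illustrative only; its real constructor may differ.
    flats = URLBuilder(category="mieszkania", city="warszawa")
    scraper = Scraper(url_strings=[flats], page_limit=5)

    # add_url appends another URLBuilder to the list passed to __init__.
    scraper.add_url(URLBuilder(category="mieszkania", city="krakow"))

    # scrape_data is a coroutine; it returns a dict mapping names to pandas DataFrames.
    frames = await scraper.scrape_data(progress_callback=show_progress)
    for name, df in frames.items():
        print(name, df.shape)

    # Record when this scrape happened so later runs can read it back
    # via load_scraping_history().
    scraper.save_scrape_date()


if __name__ == "__main__":
    asyncio.run(main())
```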
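A second sketch combines the two history helpers. The entries are documented only as list[dict[str, Union[str, datetime]]], so the "date" key used below is a hypothetical field name.

```python
from datetime import datetime
from typing import Optional

from src.Scraping.Scraper import Scraper  # module path assumed, as above


def last_scrape_date(scraper: Scraper) -> Optional[datetime]:
    # load_scraping_history() returns a list of dicts mixing str and datetime values;
    # the "date" key is an assumption about how each entry is laid out.
    history = scraper.load_scraping_history()
    dates = [entry["date"] for entry in history if isinstance(entry.get("date"), datetime)]
    return max(dates) if dates else None
```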
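Finally, find_count presumably reads the listings total out of the parsed page. The snippet below shows the general BeautifulSoup pattern only; the tag and attribute it targets are placeholders, not the selectors the class actually uses on OLX.

```python
import re

from bs4 import BeautifulSoup


def count_listings(soup: BeautifulSoup) -> int:
    # Placeholder selector: the element/attribute OLX really uses is an assumption.
    node = soup.find("span", attrs={"data-testid": "total-count"})
    if node is None:
        return 0
    match = re.search(r"\d+", node.get_text())
    return int(match.group()) if match else 0
```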
Data descriptors defined here:

- __dict__
  dictionary for instance variables (if defined)
- __weakref__
  list of weak references to the object (if defined)