Reddit webscraper

11/20/2023

Does anyone have any helpful links to tutorials/resources where people have written a massive web-scraping effort like this on AWS?

This caught my eye, as I may be scraping some data as well in the near future. I'll be tracking the price of 1,000 products across about 12 websites. My plan was to use Lambdas and queues as well.

TL;DR: all those links (after the 1st page, at a glance) appear to return the data you need without needing to run JavaScript. Find a lightweight way to extract your text, like jsdom or cheerio.

My approach would probably be to start with a list of links to each state. After that, it doesn't appear you'll need to scrape through any JavaScript, meaning you can just fetch the HTML and pull your data out of it. For each "type" of target page you have, make a rule. See if you can pull your data using a simple regexp, or run the data through something like "jsdom" or "cheerio" and use XPath or query selectors to pull the data.

I'd use a series of queues and Lambdas, or make it a little recursive. For each state, get all the townships (just 1 page hit) and push each township into a queue. The "township crawler" can pull each item from the queue, fetch every street, and dump those into a queue.
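A minimal sketch of the queue-per-stage idea described above, written in Python with only the standard library. Everything named here (FAKE_SITE, RULES, fetch, extract_links, crawl) is hypothetical and the URLs are made up; a real crawler would fetch pages over HTTP, and on AWS the in-memory deques would presumably be managed queues feeding Lambda functions.

```python
import re
from collections import deque

# Hypothetical stand-in for the real site: URL -> HTML.
# A real crawler would do an HTTP GET here instead.
FAKE_SITE = {
    "/states/ny": '<a href="/ny/albany">Albany</a><a href="/ny/troy">Troy</a>',
    "/ny/albany": '<a href="/ny/albany/main-st">Main St</a>',
    "/ny/troy": '<a href="/ny/troy/oak-ave">Oak Ave</a>'
                '<a href="/ny/troy/elm-st">Elm St</a>',
}

# One extraction rule per "type" of target page. Both rules happen to be
# the same simple regexp here; real pages would likely need different ones.
RULES = {
    "state": re.compile(r'href="([^"]+)"'),
    "township": re.compile(r'href="([^"]+)"'),
}


def fetch(url):
    """Return the HTML for a URL (assumed: no JavaScript needed)."""
    return FAKE_SITE[url]


def extract_links(page_type, html):
    """Apply the rule for this page type and return the links it finds."""
    return RULES[page_type].findall(html)


def crawl(state_urls):
    township_queue = deque()
    street_queue = deque()

    # Stage 1: one page hit per state, push every township into a queue.
    for state_url in state_urls:
        township_queue.extend(extract_links("state", fetch(state_url)))

    # Stage 2: the "township crawler" drains its queue and fills the next.
    while township_queue:
        township_url = township_queue.popleft()
        street_queue.extend(extract_links("township", fetch(township_url)))

    return list(street_queue)


streets = crawl(["/states/ny"])
print(streets)
```

Each stage only knows about its own input queue and its own rule, so adding a "street crawler" for house pages would just be one more queue and one more rule.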

Currently, I plan to go through each town's database page, then get the list of street pages, and for each street visit the page of each house. I then plan to write these to a MongoDB database. The first effort will be 1,000,000 pages, and after that a small subset of those pages (~60,000), as I can check when the pages were updated before scraping them.

I am planning to do this using a series of AWS Lambda functions, with a queue in place because my MongoDB instance has a limit of 500 connections:

1. Visit a town's street page, launch a Lambda function for each street.
2. Get the links for each street listing, launch a Lambda for each letter to see their streets (represented by letters A-Z).
3. Get the links for a street, launch a Lambda to visit that street page, and launch another Lambda for each housing link.
4. Lambda to save the data of a household, in a queue of 500 Lambdas at a time.

Currently, I am using puppeteer-cluster on my local machine, but it isn't possible to have it running for days on my machine.

Alternatively, should I run this entire job on an EC2 instance?
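The 500-connection limit is the part of this plan that a small example can make concrete. Below is a rough in-process analogue in Python, assuming the goal is simply "never more than 500 saves in flight at once". It uses an asyncio semaphore rather than Lambda and a queue, and save_household, save_all and the stats dictionary are hypothetical names; the database write is a placeholder.

```python
import asyncio

MAX_CONNECTIONS = 500  # the connection limit mentioned in the post


async def save_household(record, semaphore, stats):
    # Only MAX_CONNECTIONS coroutines can be inside this block at once.
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0)  # placeholder for the real database write
        stats["active"] -= 1
        stats["saved"] += 1


async def save_all(records, limit=MAX_CONNECTIONS):
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0, "saved": 0}
    await asyncio.gather(
        *(save_household(record, semaphore, stats) for record in records)
    )
    return stats


stats = asyncio.run(save_all(range(2000)))
print(stats["saved"], "saved, peak concurrency", stats["peak"])
```

The same shape applies whichever way the job runs: the writers are many, but the gate in front of the database is fixed at the connection limit.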

I am currently working on a massive web-scraping effort to scrape roughly 1,000,000 pages on a monthly basis using Puppeteer.

I've been given a dull task to do at my job and I was going to try to automate it, but I'm getting stuck. I've essentially been given a set of numbers (300) to look up on a website, then I need to record the address of the number and who owns it. My thought process was: input a CSV file with the numbers, use mechanize to get through the login page, then fill out the forms with each number. Then scrape 2 cells of data off the page with Beautiful Soup, and output another CSV file with address + owner. I've set up a small piece of code for it from data.

Soup = BeautifulSoup(br.response().read(), 'lxml')

I am getting an error when I try to run it. I think the for loop is the problem, but I'm not sure how to fix it.

  File "C:\Users\nthomson\Desktop\123.py", line 22, in
  File "build\bdist.win32\egg\mechanize_mechanize.py", line 499, in select_form
  File "build\bdist.win32\egg\mechanize_html.py", line 544, in _getattr_
  File "build\bdist.win32\egg\mechanize_html.py", line 557, in forms
  File "build\bdist.win32\egg\mechanize_html.py", line 237, in forms
  File "build\bdist.win32\egg\mechanize_form.py", line 844, in ParseResponseEx
  File "build\bdist.win32\egg\mechanize_form.py", line 981, in _ParseFileEx
  File "build\bdist.win32\egg\mechanize_form.py", line 758, in feed
    _sgmllib_(self, data)
  File "build\bdist.win32\egg\mechanize_sgmllib_copy.py", line 110, in feed
  File "build\bdist.win32\egg\mechanize_sgmllib_copy.py", line 144, in goahead
  File "build\bdist.win32\egg\mechanize_sgmllib_copy.py", line 302, in parse_starttag
  File "build\bdist.win32\egg\mechanize_sgmllib_copy.py", line 351, in finish_starttag
  File "build\bdist.win32\egg\mechanize_sgmllib_copy.py", line 387, in handle_starttag
  File "build\bdist.win32\egg\mechanize_form.
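The traceback runs from select_form into mechanize's form parser, so one hedged reading is that a page's HTML trips the parser rather than the for loop itself. Whatever the cause, a loop over 300 numbers is easier to debug if one bad page doesn't stop the whole run. Here is a rough sketch of the CSV in, two cells out, CSV out flow. It uses only the Python standard library (html.parser stands in for Beautiful Soup so it runs with no installs), and lookup_html is a hypothetical placeholder for the logged-in mechanize form submission, including one simulated failure.

```python
import csv
import io
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell on a page."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def lookup_html(number):
    # Hypothetical stand-in for logging in and submitting the form.
    if number == "1002":
        raise ValueError("simulated parse failure")
    return (f"<table><tr><td>{number} Main St</td>"
            f"<td>Owner {number}</td></tr></table>")


def run(input_csv, output_csv):
    writer = csv.writer(output_csv)
    writer.writerow(["number", "address", "owner", "error"])
    for row in csv.reader(input_csv):
        if not row:
            continue  # skip blank lines in the input file
        number = row[0].strip()
        try:
            parser = CellCollector()
            parser.feed(lookup_html(number))
            # Assumes the first two cells are address and owner.
            address, owner = (cell.strip() for cell in parser.cells[:2])
            writer.writerow([number, address, owner, ""])
        except Exception as exc:
            # Record the failure and keep going with the next number.
            writer.writerow([number, "", "", str(exc)])


source = io.StringIO("1001\n1002\n1003\n")
result = io.StringIO()
run(source, result)
rows = list(csv.reader(io.StringIO(result.getvalue())))
print(rows)
```

The error column shows afterwards which numbers failed and why, so those can be retried or looked up by hand.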
