have you ever wondered how websites collect massive amounts of information? or how businesses extract valuable insights from the vast expanse of the internet? look no further – we've got you covered in this comprehensive 5-minute guide to data scraping.
by the end of this video, you'll have a solid understanding of what data scraping is, the best tools and techniques, and anything you should be careful of.
if you have any questions you would like answered as part of our 'so' series, please get in touch and let us know.
our podcast, 'the digital marketing podcast', is now in video format on our youtube channel. make sure to subscribe for a brand new video podcast every week!
transcription:
so, what is data scraping?
data scraping is the process of taking information from existing websites and other places online and pulling it together. it’s a way of collecting information that you can't necessarily access very easily, or of putting it into a form that you can then manipulate, and that's the important bit going forward.
so, how can you then use this data?
say i want a list of the top 100 grossing movies of all time, and i want to know who the director was, who the lead actor was and how much money each one made. okay, so i would go and look, and if that's only on a web page, i can read it but i can't do anything else with it.
whereas if i can scrape that data off, i can then put it into excel and i can manipulate it. i could put it into power bi or a tool like looker studio, which is what google data studio is now called. i can manipulate it and visualize it and things like that. what's interesting is that suddenly with things like chatgpt, i can go through and say, analyze this data for me and manipulate this data and so on as well.
so suddenly i can start mixing these things up, but i need to get the data in the first place. so that's what we can start to do with it.
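to make the movie example concrete, here is a minimal python sketch of the "get it off the page and into a spreadsheet" step. it is only an illustration under stated assumptions: the html table, its column names and its figures are made-up placeholders (not real box-office data), and the names `TableParser` and `table_to_csv` are hypothetical. it uses only the python standard library and turns a table into csv text that excel, power bi or looker studio could open.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample page. The titles and figures are placeholders,
# not real box-office data.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Director</th><th>Lead actor</th><th>Gross</th></tr>
  <tr><td>Example Movie A</td><td>Director One</td><td>Actor One</td><td>1,000,000</td></tr>
  <tr><td>Example Movie B</td><td>Director Two</td><td>Actor Two</td><td>750,000</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every table cell, one list per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    out = io.StringIO()
    csv.writer(out).writerows(parser.rows)
    return out.getvalue()


if __name__ == "__main__":
    print(table_to_csv(SAMPLE_HTML))
```

in practice you would first download the page and pass its html to `table_to_csv`, then save the result as a `.csv` file. real pages are usually messier than this sample, which is why dedicated scraping tools exist.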
so, what about tools for helping you do this?
so there's a great chrome extension called data miner. and with data miner, you give it a recipe. you would go through and say, okay, here's the webpage. this bit's the title, this bit's the value etc. once it understands the page, you click go, and it'll go straight through, thousands and thousands of lines if you want it to. it can go to the next page, and you could go through a whole series of pages, and it'll put that data into a spreadsheet for you. so that's one way of doing it.
the other way is now chatgpt. because it has browsing and plugins built in, you can say, go to this website and get this data, and it will pull it for you, and then you can use another plugin to visualize it. so you can start to do some really exciting things between something like data miner and chatgpt with plugins, to get insights into data that you might not have had otherwise.
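the idea of a recipe ("this bit's the title, this bit's the value, here's the next page") can be sketched in a few lines of python. this is not data miner's actual recipe format, which isn't shown here. everything in it is an assumption for illustration: the two in-memory pages, the `RECIPE` dictionary and the `run_recipe` function are all hypothetical, and it uses simple regular expressions where a real tool would use a proper html parser.

```python
import re

# Hypothetical in-memory "site": two pages joined by a next link.
PAGES = {
    "/page1": (
        '<div class="item"><h2>Item A</h2><span class="price">10</span></div>'
        '<div class="item"><h2>Item B</h2><span class="price">20</span></div>'
        '<a rel="next" href="/page2">next</a>'
    ),
    "/page2": (
        '<div class="item"><h2>Item C</h2><span class="price">30</span></div>'
    ),
}

# The "recipe": what one row looks like, which bit is each column,
# and where the link to the next page is.
RECIPE = {
    "row": r'<div class="item">(.*?)</div>',
    "columns": {
        "title": r"<h2>(.*?)</h2>",
        "value": r'<span class="price">(.*?)</span>',
    },
    "next": r'<a rel="next" href="(.*?)">',
}


def run_recipe(start_url, pages, recipe):
    """Apply the recipe to each page in turn, following the next link."""
    rows = []
    seen = set()  # guards against a page that links back to itself
    url = start_url
    while url and url in pages and url not in seen:
        seen.add(url)
        html = pages[url]
        for block in re.findall(recipe["row"], html, flags=re.S):
            row = {}
            for name, pattern in recipe["columns"].items():
                match = re.search(pattern, block, flags=re.S)
                row[name] = match.group(1).strip() if match else ""
            rows.append(row)
        next_link = re.search(recipe["next"], html)
        url = next_link.group(1) if next_link else None
    return rows


if __name__ == "__main__":
    for row in run_recipe("/page1", PAGES, RECIPE):
        print(row)
```

the point of the design is that the recipe is just data. the same loop works on a different site if you swap in a different recipe, which is roughly what you are doing when you point and click to teach a scraping tool a new page.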
there are some quite interesting examples of tools within seo that'll go out and scrape your competitors' websites and tell you what's changed. things like that are quite useful for keeping an eye on the competition and seeing what they are updating, what new articles they have published and what's gone out through social media.
so i think you should definitely check out some of those, tools like competitor.app. very, very good for that.
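the "tell you what's changed" idea boils down to keeping a snapshot of a page and comparing it with a later one. here is a minimal sketch of that comparison in python, using only the standard library. it does not describe how any particular monitoring tool works. the function names `fingerprint` and `describe_changes` and the two sample snapshots are hypothetical, and a real tool would also need to download the pages and strip out the html first.

```python
import difflib
import hashlib


def fingerprint(text):
    """Short, fixed-size summary of a page's text; changes if the text does."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def describe_changes(old_text, new_text):
    """Return added ('+') and removed ('-') lines between two snapshots."""
    if fingerprint(old_text) == fingerprint(new_text):
        return []  # nothing changed, so skip the line-by-line comparison
    diff = list(
        difflib.unified_diff(
            old_text.splitlines(), new_text.splitlines(), lineterm="", n=0
        )
    )
    # The first two lines of a unified diff are file headers; skip them,
    # then keep only the added and removed lines.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


if __name__ == "__main__":
    # Hypothetical snapshots of a competitor's page, taken a week apart.
    last_week = "pricing\nblog: 10 seo tips\ncontact"
    this_week = "pricing\nblog: 10 seo tips\nblog: new guide to scraping\ncontact"
    for change in describe_changes(last_week, this_week):
        print(change)
```

storing only the fingerprint is enough to know that something changed. you need the full old snapshot if you also want to see what changed.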
so, is there anything to be careful of?
yeah, a lot of this stuff will be under copyright. you can't just take other people's stuff and use it elsewhere.
you can take it and analyse it and read it and all those kinds of things, so you might find it's good for getting insights. but be really careful if you're scraping other people's stuff and then reusing it, because there are all sorts of copyright issues you could potentially get into.
there are data scraping tools that will go out there and find content that's been plagiarized, and they'll check across thousands of different references. one of the common forms of this is when people have scraped photos.
if you scrape an article and publish it on your website and it's using a photo that you don't have permission for, you know, any one of the big photo libraries can slap you with an eight or nine hundred pound fine. and you'll find it difficult not to pay it. that’s per infringement by the way. so if you've got a data scraping tool that's pulled a whole bunch of images, you're in a whole world of pain. and with ai generated images now, there's no need to be doing this kind of stuff anymore anyway.
so, are there any other good resources?
yeah, if you just google “what is data scraping” you will find our article on targetinternet.com at the top there, and you've got a whole step by step guide, you've got some example videos, you've got a load of resources that will help you out. specifically written for marketers and how they might want to use it.