CAPTCHA avoidance
What is Captcha?
Captcha is a way for website owners to check whether the traffic on their website is genuine. It helps distinguish human visitors from automated traffic and, in some cases, protects data from web crawlers and other botting software.
When do I receive Captcha?
There are many ways to trigger Captcha, and most of them depend on the website's security settings. Captcha often appears when filling in a registration form, visiting certain domains from public networks, repeatedly refreshing the same page, and so on.
What different types of Captcha are there?
There are many different types of Captcha you may face while browsing the web. Most of them require you to enter the symbols shown on the screen, while others require you to select pictures or solve a puzzle. Google's reCAPTCHA is the most popular and most frequently seen type.
How do I check if I am receiving Captcha through my code/bot logs?
There are many ways to identify whether you are getting Captcha. Here are some common signs; a minimal detection sketch follows the list:
- You are not getting back the requested content, or it comes back incomplete.
- Your scraper/crawler returns a response with Captcha inside it.
- Your requests are timing out.
- Instead of 200 HTTP responses, you are getting status codes in the 4xx or 5xx range.
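A minimal sketch of such a check, assuming a requests-based scraper, a hypothetical target URL, and a hypothetical list of marker strings:

```python
import requests

# Hypothetical substrings that commonly appear in Captcha challenge pages.
CAPTCHA_MARKERS = ("captcha", "recaptcha", "are you a robot")

def looks_like_captcha(response: requests.Response) -> bool:
    """Check for the signs listed above: a non-200 status code or Captcha markup in the body."""
    if response.status_code != 200:
        return True
    body = response.text.lower()
    return any(marker in body for marker in CAPTCHA_MARKERS)

response = requests.get("https://example.com/", timeout=30)  # hypothetical target
if looks_like_captcha(response):
    print("This request was likely challenged with a Captcha")
```

Logging the result of such a check for every request makes it much easier to see when and how often a target website starts challenging your traffic.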
How do I avoid getting Captchas?
You may face many forms of Captcha, and many different combinations of actions can trigger them. It all depends on your setup, but here are some general tips to avoid Captchas while using a proxy network:
- If you are using a bot, try different endpoints or rotating ports for our service.
- Try randomizing your request times on the application if possible.
- If you are writing custom code for a scraper/crawler type of application, make sure you have a large pool of different user-agents, which will help cover your tracks while visiting the website (a sketch combining this with randomized request timing follows the list). A user-agent is a parameter sent with your request that identifies your client to the website you are visiting. Usually, it looks like the following:
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0
- Avoid using direct links in your bots that are not publicly reachable from the website's pages, that is, links you could only find by reading the page's source code.
- If possible, shape your traffic by visiting and following the navigation paths the website provides rather than constantly requesting a certain link directly.
- Make sure that you limit your request rate and are not putting excessive load on the website. Overloading it will instantly trigger more safety features than your code or application is prepared to handle, such as Cloudflare protection.
- If possible, use a headless browser provided by such frameworks as Selenium (a minimal sketch also follows this list).
- If writing custom code, check the headers you are sending and the ones you are receiving. Certain HTTP libraries add request headers that may give you away. Other parameters, such as cookies, are set by the target website to verify that your requests are genuine, so make sure your code handles them.
- Check the website source code and make sure that your bot/crawler renders all the necessary elements, such as JavaScript code.
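Here is a minimal sketch of the user-agent and timing advice above, assuming a requests-based scraper, a hypothetical target URL, and a deliberately short user-agent pool:

```python
import random
import time

import requests

# Hypothetical target and a deliberately short user-agent pool;
# in practice the pool should be much larger and kept up to date.
TARGET_URL = "https://example.com/products"
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

session = requests.Session()  # a session sends back cookies set by the target website

for page in range(1, 4):
    headers = {
        "User-Agent": random.choice(USER_AGENTS),  # rotate the identity sent with each request
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }
    response = session.get(TARGET_URL, params={"page": page}, headers=headers, timeout=30)
    print(page, response.status_code)
    time.sleep(random.uniform(2.0, 6.0))  # randomized pause so the traffic looks less mechanical
```

Reusing a session also covers the cookie advice above, since cookies set by the target website are sent back with later requests.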
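And a minimal headless-browser sketch with Selenium, assuming Firefox and geckodriver are installed and using a placeholder URL; the browser renders the page, including JavaScript, before you read it:

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")  # run the browser without a visible window

driver = webdriver.Firefox(options=options)
try:
    driver.get("https://example.com/")  # hypothetical target
    print(driver.title)                 # page is fully rendered, including JavaScript
finally:
    driver.quit()
```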
Will proxies help me solve Captcha?
If a Captcha is served by the website on pages such as checkout, registration, or password-change forms, avoiding it is most likely not possible even with a proxy. For such tasks, look into Captcha solver services or solve the Captcha on your own. A proxy network does not influence whether a Captcha appears and is definitely not a tool for solving one.