Form spam is a problem most web developers have to deal with at some point. Often within days of a site being indexed by Google, its forms start getting hit by spambots trying to advertise dodgy products, trick you into clicking dodgy links or get links back to dodgy sites.
This is often an issue when you manage a site for a business. After the site is marketed and gets ‘out there’, the client gets annoyed by spam and wants something done about it.
Many developers resort to the quick fix of a captcha, or a client requests one because they have seen it elsewhere. However, captchas are not user friendly and can be a real turn-off for visitors with a short attention span. How many times have you been frustrated by repeatedly having to complete a barely legible captcha test?
Instead, there are quite a few methods that don’t impact on normal users at all, especially when used in combination:
Simple server-side validation will eliminate a lot of automated spam. Spam bots don’t hang around to change how they format their submission; they just move on to the next, easier target.
Checking the format of an email address is probably one of the most common validation operations and, as a result, a lot of spam bots will recognise email fields and enter dummy addresses.
But what about the other fields in a form? What if we spent some time validating them?
For example, a name field should probably only contain characters from A to Z, spaces, hyphens and periods. It should also have a limited length, maybe 30 to 40 characters.
A phone number should only include digits and a small set of separators such as spaces, periods, hyphens and parentheses, depending on where in the world you are.
Validation even has a role to play in free-text fields such as a contact form message. Rejecting messages that contain HTML tags, contain URLs (or too many URLs) or are simply too long can be very effective.
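The checks described above could be sketched roughly as follows. This is only an illustration: the function name `validate_submission`, the exact character sets, the 40-character name limit, the 2000-character message limit and the one-URL allowance are all assumptions to tune for your own form and audience.

```python
import re

# Assumed rules: adjust the character sets and limits for your own visitors.
NAME_RE = re.compile(r"[A-Za-z .\-]{1,40}")
PHONE_RE = re.compile(r"[0-9 .\-+()]{5,20}")
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")

MAX_MESSAGE_LENGTH = 2000  # assumed limit
MAX_URLS = 1               # assumed allowance


def validate_submission(name, phone, message):
    """Return a list of failed checks; an empty list means nothing looked wrong."""
    problems = []
    if not NAME_RE.fullmatch(name.strip()):
        problems.append("name")
    if not PHONE_RE.fullmatch(phone.strip()):
        problems.append("phone")
    if TAG_RE.search(message):
        problems.append("html")
    if len(URL_RE.findall(message)) > MAX_URLS:
        problems.append("urls")
    if len(message) > MAX_MESSAGE_LENGTH:
        problems.append("length")
    return problems
```

Returning the list of failed checks, rather than a plain yes or no, makes it easier to log why a submission was rejected and to loosen a rule that turns out to catch real visitors.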
The speed of submissions is another factor that can identify spam. It is relatively straightforward to use sessions to store the time when a form is requested and the time when it is submitted.
Very fast submissions (less than 1 second) or submissions with no start time (i.e. data was posted straight to the form URL) indicate that it wasn’t a human completing the form.
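A minimal sketch of the timing check, assuming the session behaves like a dictionary (as it does in most frameworks). The key `form_served_at`, the function names and the optional `now` parameter (there only to make the logic easy to test) are hypothetical.

```python
import time

MIN_SECONDS = 1.0  # anything faster than this is treated as automated


def mark_form_served(session, now=None):
    """Call when the page containing the form is rendered."""
    session["form_served_at"] = time.time() if now is None else now


def looks_automated(session, now=None):
    """Call when the form is submitted; True means reject the submission."""
    served_at = session.pop("form_served_at", None)
    if served_at is None:
        # No start time: the data was posted straight to the form URL.
        return True
    now = time.time() if now is None else now
    return (now - served_at) < MIN_SECONDS
```

Because the start time is removed from the session once it has been checked, re-posting the same data without loading the form again is also treated as having no start time.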
Honeypot fields are fields that are hidden from human visitors by CSS but are found by spam bots. If a form submission has content in the honeypot fields it can be rejected as spam.
This is a very effective technique that is also simple to implement. There can be issues with accessibility (screen reading software may identify honeypot fields), but this can be overcome by clearly labelling the fields as being present to catch spam and hiding the labels as well.
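As a rough sketch, the honeypot markup and the server-side check might look like this. The field name `website` and the CSS class `hp-field` are made up for the example, and the submitted form data is assumed to arrive as a dictionary of strings.

```python
HONEYPOT_FIELD = "website"  # hypothetical field name, chosen to tempt bots

# Markup to include in the form. The (hypothetical) hp-field class is assumed
# to be hidden in the stylesheet, e.g. .hp-field { display: none; }
HONEYPOT_HTML = """
<div class="hp-field">
  <label for="website">Leave this field empty (it is here to catch spam)</label>
  <input type="text" id="website" name="website" tabindex="-1" autocomplete="off">
</div>
"""


def honeypot_triggered(form_data):
    """True if the hidden field came back with any content in it."""
    return bool(form_data.get(HONEYPOT_FIELD, "").strip())
```

A human never sees the field and so leaves it empty, while a bot that fills in every input it finds gives itself away.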
A session token is a random, hard-to-guess value (typically a cryptographic hash) that is stored in the session on the server when a page with a form is served and repeated in a hidden field in the form itself.
When a form is submitted, the two tokens are compared and if they don’t match, chances are it is a spam submission directed straight at the page.
The one downside of this approach is that the session can time out (e.g. if a user leaves the page open for a long time before submitting a form), but this can be handled by showing an error message, repopulating the form fields with what the user entered and asking them to re-submit.
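One way this could look in practice, again assuming a dictionary-like session. The key `form_token` and both function names are hypothetical; `secrets` and `hmac` are part of the Python standard library.

```python
import hmac
import secrets


def issue_form_token(session):
    """Generate a token, remember it server-side and return it for the hidden field."""
    token = secrets.token_hex(32)
    session["form_token"] = token
    return token


def token_is_valid(session, submitted_token):
    """Compare the token posted with the form against the one in the session."""
    expected = session.get("form_token")
    if not expected or not submitted_token:
        # Session expired, or the data was posted without loading the form.
        return False
    # Constant-time comparison on bytes, so odd input cannot raise an error.
    return hmac.compare_digest(expected.encode(), submitted_token.encode())
```

When `token_is_valid` returns `False` because the session has expired, that is the point at which to redisplay the form with the visitor’s input and a fresh token rather than silently discarding the submission.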
IP, Sender & Content Filtering
Maintaining a list of spammer IPs and detecting spam content by filtering is another effective method, but it takes a lot of work and a large volume of submissions to build up useful lists.
Luckily there are services that do it for you, like Akismet for WordPress.
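For a sense of what the do-it-yourself version involves, here is a minimal IP blocklist check using Python’s standard `ipaddress` module. The addresses are reserved documentation ranges standing in for real spammer IPs, and the function name `ip_is_blocked` is hypothetical. The hard part, keeping the list current, is not shown.

```python
import ipaddress

# Placeholder entries: documentation-only ranges, not real spammers.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(entry)
    for entry in ("192.0.2.0/24", "203.0.113.7/32")
]


def ip_is_blocked(remote_addr):
    """True if the submitter's address falls inside any blocked network."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a parseable IP address; treated as suspicious in this sketch.
        return True
    return any(address in network for network in BLOCKED_NETWORKS)
```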