On Rate Limiting and Abuse
NOTE: This is now an issue on the tootsuite/mastodon repository on GitHub: Issue #8575, "Abuse-prevention: rate limits / 'storm shield'".
Do note that this page is still the “full” version; GitHub issues aren’t really intended for a full discussion in a single post. That said, at this point, all technical comments should go to the issue itself, I think. That way discussion moving forward is in one place.
NOTE: All updates will appear at the bottom of this document. So far, three updates have been made, the most recent one at 2018-10-03 08:23:55 -0400 EDT.
So, for the second time, I suggested using Hashcash or a similar system in order to prevent abuse on the ActivityPub fediverse. And for the second time, the idea was shot down as being “wasteful”. Since I’m done debating this in the fediverse (too much to say, not enough space), I’m going to write my full argument here.
Also: I’m not averse to hearing better ideas. But those ideas have to accomplish at least as much as this one does, and ideally in a more performant manner. I’m all about efficiency, as anyone who has ever paid me for code can attest. And I don’t consider efficiency to be a single-tiered variable: it must be a global consideration. It encompasses development time, runtime, robustness, and ease of use. A proposal which is better on some or all of these fronts, and worse on none, is what I’m hoping to hear.
This all having been said, I fully intend to implement this in an ActivityPub server that I’m creating.
TL;DR: @firstname.lastname@example.org was quite literally forced out of the ActivityPub federated network by many users across several instances of ActivityPub implementations (mostly, if not entirely, Mastodon instances). The conditions of the network enabled this to happen, and no mechanisms were available to halt its progress. The root cause is easily identified, too: there was literally no way for @email@example.com to keep up with the assault, due to insufficient tooling and resources. All the manpower in the world would not have made a difference here, either: we need automated tools that can be put in the hands of the abused, and not torn apart by the abusers.
What is needed?
To that effect, what is needed is a system that satisfies the following requirements:
- It MUST NOT place an undue burden on the server, even if it is enabled for all of the users on the server.
- It MUST NOT be difficult to use or expose technical details to the end-user.
- It MUST be unavoidable: it must not be possible to circumvent or evade the system. This eliminates the use of limits on a per-IP, per-session, or per-client basis as viable options.
- It MUST be easily implemented: programmers will not be motivated to add the required support code in either server-side or client-side software if it results in a painful migration. Therefore, it must be a simple addition to the protocol, and not a complicated one.
- It MUST be as stateless as possible. The correct solution must apply equally to a single ActivityPub instance housed on a single ActivityPub server, as well as a single ActivityPub instance housed on a cluster of cooperating servers.
What about a simple server-side rate limit?
Since this is the first thing that is counter-suggested, I’ll address it early. Such a mechanism requires more state than the proposal that follows, and it requires more shared state between individual servers in a cluster. What does that mean? It means adding yet another component to a system which is already showing stresses, and guarding that state with distributed locks. It could be done, but it would absolutely fail to scale properly. Contention becomes an issue in a system like this when it grows beyond a few thousand concurrent actors, and there are in some cases 10,000+ actors in a single instance. So, this cannot be made to work.
Given that it would be opt-in, it is reasonable to assume that it would bring little impact. But what if an entire instance wants this turned on? What if a majority of the users of an instance want to benefit from the feature? This would result in aches and pains felt by the administrators (slow response issues everywhere without a ready explanation in RAM or processor usage), downtime, denial of service, and so forth. In fact, a denial of service attack would be wonderfully easy to carry out: create 2,000 accounts on an instance and bog it down with rate limits on all of them.
Clearly, that cannot work in a network of this size. We’re not talking about the email for a small business, which might transit 1,000 messages in a day. An ActivityPub implementation is a JSON document processor with cryptographic elements present within it. It’s a little more effort than parsing an Internet mail message (see RFC 5322 and RFC 6854 for the details of the format of an Internet mail message, if you are unfamiliar). And this is a good thing: it means that many of the primitives on which a reasonable solution can build are already present, and so the resulting addition to the code bases would be minimal.
So, then, what would work?
At the core of my proposal is to use some sort of proof of work; something like Hashcash, if not Hashcash itself. I say this because there are many types of proof-of-work, and not all of them are appropriate for use in a situation like this.
The most common type of proof of work familiar to developers is the challenge-response. It is commonly used in authentication systems, such as Kerberos or digest authentication. It is relatively low-overhead in isolation, but many high-traffic Web servers have already done away with digest authentication, particularly on API endpoints, in favor of authentication methods that require less state tracking and fewer network round trips. Credentials passed this way are also frequently used for authorization; the Web server knows who the user is, but now has to look up in a database of some sort whether or not the user is permitted to perform the action.
A challenge-response system could be used, but it would fail in three major ways:
- It would introduce undue burden to the server. The server would be required to track the state of every issued challenge, and due to various circumstances a non-zero number of these would live out their entire lifetime before expiring. The consequence is that a bad actor can render the system unusable by exhausting resources (outstanding unanswered challenges).
- It would introduce undue burden on client developers, because the entire workflow for posting a message to a user’s inbox would be modified. It would introduce extra undue burden on the server developers, for the same reason. The point: it would not be a trivial addition. Furthermore, it complicates other modes of failure: did the message get there, or not? So not only does it double the number of requests in the best case, but in the worst case (e.g., bad network connection) it can result in far more than that.
- It places a very strict upper bound on the number of concurrently active message recipients, particularly in the event that many of them opt-in to such protection.
The other type of proof of work is the problem-solution type. The problem-solution type works something like the following:
- Sender composes a message (data, JSON, email, whatever).
- Sender selects a problem from the agreed-upon problem space (which may be specified by mutual agreement, by protocol, by fiat, or by some other mechanism entirely).
- Sender computes a solution to the selected problem.
- Sender transmits (problem, solution, message) to the recipient.
- Recipient verifies that the problem, solution, and message are all valid and in agreement. If this verification fails, the message is simply ignored; else the message is passed through as having met the authorization criteria.
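The five steps above can be sketched concretely with a minimal Hashcash-style puzzle in Python. Here the “problem” is implicit in the protocol (find a nonce such that the SHA-256 hash of the message plus the nonce has a required number of leading zero bits), so only the solution travels with the message. All function and variable names are illustrative, not part of any protocol:

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()  # zeros within the first non-zero byte
        break
    return bits

def mint_stamp(message: str, difficulty: int) -> int:
    """Sender: search for a nonce meeting the difficulty (~2**difficulty hashes)."""
    for nonce in count():
        digest = hashlib.sha256(f"{message}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce

def verify_stamp(message: str, nonce: int, difficulty: int) -> bool:
    """Recipient: a single hash suffices to verify."""
    digest = hashlib.sha256(f"{message}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty

nonce = mint_stamp("hello fediverse", 12)          # costly for the sender
assert verify_stamp("hello fediverse", nonce, 12)  # cheap for the recipient
```

The asymmetry is the whole point: minting costs the sender about 2^difficulty hash operations on average, while verification costs the recipient exactly one.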
Unlike challenge-response, which is typically used both for authentication and authorization, problem-solution is typically used only for authorization (yes, there are ways to use it for authentication, and some of those methods are even somewhat common; but ActivityPub already has authentication of messages, and so we’re only considering authorization here). Perhaps the most well-known use of this type of authorization is in blockchains, where the “winning hash” is a bearer token to be appended to the blockchain with its associated block.
Also unlike challenge-response, problem-solution algorithms scale A LOT. And for a problem domain such as the one which provides the context for this article, it is an almost perfect solution. So close to it, in fact, that despite having spent a lot of time racking my brain and the Internet to find something better in the past week since I suggested this the first time, I am really unable to find something that provides the same sort of characteristics as this type of solution.
As you probably already guessed if you’ve read this far, Hashcash is one of the members of this family. It’s not the only member of this family, though; there are others. Many of them are extremely complicated systems which are overkill for something like this.
That is where Hashcash comes in.
But isn’t Hashcash Bad?
For blockchain applications where the blockchain is distributed globally and everyone wants to find the next block… it’s awful. Atrocious. Wasteful. But it was the first method used on a blockchain, and we’ve found better ways to handle that level of scale.
But let’s consider this: the ActivityPub federated universe will never scale to that type of size, for starters, and no entity within the federated universe will become so popular that they’d require exahash, petahash, or even terahash-level power. And that’s where the waste typically and wrongfully associated with Hashcash is found: in the fact that as of the time I’m writing this, the Bitcoin network is at 46.4 exahashes per second.
That’s insane. That’s an incomprehensibly large number of operations; to most people, it is nearly unfathomable. That number looks like this (to three significant digits): 46,400,000,000,000,000,000 hashes per second. That many hashes are being computed by the Bitcoin network miners on average per second in order to try to find one single block every ten minutes.
So it’s not hard to understand why someone would see Hashcash and knee-jerk about it if they do not have a complete understanding of what Hashcash is and how it is (and is not) related to Bitcoin.
It is also proof of just how well it works, despite its massive power usage in an application such as a cryptocurrency. It enforces a rate limit of one block per approximately ten-minute interval, globally.
Think about that for just a minute, and let it sink in.
Hashcash is used to limit the blockchain’s growth to one block per ten minutes, on average, world-freakin-wide. And the hash value required is adjusted once every 2,016 blocks in an effort to maintain that fixed rate of growth. And it works.
But it wastes power, doesn’t it?
As with literally anything: how it is used determines what it does, how it behaves. Let’s start with what we know already:
- A Bitcoin block takes approximately 10 minutes to appear.
- In that time, 27,840,000,000,000,000,000,000 hashes are collectively performed. (For scale, this is being written on a 16x core/32x thread 64-bit processor, which can do just 12,600,000 hashes per second using its CPU; it would take my computer many times longer than my lifetime to perform that many hashes using only its CPU.)
- Clearly, that’s a lot of electricity usage. Nobody can disagree with that.
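The arithmetic behind those figures can be checked directly (the numbers are the ones quoted in this article, circa 2018):

```python
# Figures quoted in this article (circa 2018).
network_rate = 46.4e18      # Bitcoin network hash rate, hashes per second
block_interval = 600        # target block time: ten minutes, in seconds
hashes_per_block = network_rate * block_interval
print(f"{hashes_per_block:.3e} hashes per block")   # 2.784e+22

# The author's desktop CPU, for scale.
cpu_rate = 12.6e6           # hashes per second
seconds = hashes_per_block / cpu_rate
years = seconds / (3600 * 24 * 365)
print(f"about {years:.0f} years on one desktop CPU")  # roughly 70 million years
```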
And what does that electricity usage generate?
- 12.5 BTC (at the time of writing, approximately equal to 87,800 USD)
- Plus transaction fees found within the block, based on network load.
- Official inclusion of whatever transactions are present in the submitted block on the blockchain. For these transactions, this is the first confirmation.
- All transactions which were recorded in previous blocks have their confirmation count increased by one.
- All nodes receive the new block; all miners begin work on a new block by taking transactions from the memory pool and beginning work again.
So, all that power isn’t being used because of Hashcash. It is actually being used because the Bitcoin blockchain does not want to have a new block every second or ten seconds; it wants to have a new block once every ten minutes, which means that the Bitcoin blockchain will only grow by about 144 blocks every 24 hours, on average, literally everywhere on planet Earth.
That necessarily takes a lot of something to provide for its security; in this case that something is electricity.
The Proposal, Formally
NOTE: Please see the updates section below after reading this section.
So, then, here is the proposal:
- Implement Hashcash or something similar.
- Method #1
- Implement a preference; it could be named “Mob Protection”, “Spam Resistance”, or something similar. Boolean, default off.
- Implement a preference named “Resistance”, represented using a slider. The slider would control a numeric value between 0 and 3600. Higher values create higher barriers to entry (require more “postage”).
- Method #2
- Have a “panic button” which activates the feature with a target delay of 1 second; each button press would increase the target delay by 2–4 seconds. User stops hitting the panic button when the assault is no longer felt.
- Add a field to represent the message’s “postage stamp”.
- Optionally, add an info bar at the top of the user’s page when they have logged in using the Web client, informing them that they have the feature enabled and should disable the feature as soon as is practical.
- There should be a self-cancellation limit on the feature; perhaps 72–144 hours. This would be required for the panic button version of the implementation.
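One open question in the proposal above is how a slider value expressed in seconds maps to a Hashcash difficulty. A hedged sketch, assuming the server picks a reference hash rate for a typical client device (the constant below is an assumption for illustration, not a measured figure): since a stamp takes about 2^bits hashes on average, solving 2^bits ≈ rate × delay gives the bit count.

```python
import math

# Assumed hash rate of a typical client device (illustrative, not measured).
REFERENCE_HASH_RATE = 1_000_000  # hashes per second

def difficulty_bits(target_delay_seconds: float) -> int:
    """Leading-zero bits such that a client at the reference rate
    spends about target_delay_seconds of work on average."""
    if target_delay_seconds <= 0:
        return 0
    return max(0, round(math.log2(REFERENCE_HASH_RATE * target_delay_seconds)))

print(difficulty_bits(1))    # 20 bits: about one second of work at the reference rate
print(difficulty_bits(10))   # 23 bits
```

Each additional bit doubles the sender’s average work, which is why a small slider range covers everything from “mild speed bump” to “effectively unreachable”.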
Since the default would be “off” there would be little to no impact at rollout, except for the new feature’s appearance post-upgrade.
If an account has enabled the boolean preference described above:
- A client submitting a message with no or insufficient postage would receive an error which signals that a particular “amount” of postage (leading zeroes) is required in order to make the delivery.
- The postage is computed and attached, and the client transparently attempts redelivery of the message.
- If the user has not increased the slider, the message is accepted because it now has sufficient “postage”.
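That transparent retry can be sketched from the client’s side. The `post` callable below stands in for the real HTTP delivery call, and its shape, along with the error carrying the required bit count, is invented here for illustration:

```python
import hashlib
from itertools import count

def mint(message: bytes, bits: int) -> int:
    """Search for a nonce giving `bits` leading zero bits (Hashcash-style)."""
    target = 1 << (256 - bits)
    for nonce in count():
        digest = hashlib.sha256(message + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def deliver(message: bytes, post) -> None:
    """post(message, nonce) -> (accepted: bool, required_bits: int).
    `post` stands in for the real delivery call; its shape is hypothetical."""
    accepted, required_bits = post(message, None)  # first attempt, no postage
    if accepted:
        return
    nonce = mint(message, required_bits)           # the only costly step, client-side
    accepted, _ = post(message, nonce)             # transparent retry with postage
    if not accepted:
        raise RuntimeError("delivery rejected even with postage")
```

The user never sees any of this; from their perspective the post merely takes a little longer to send.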
But this only limits the client-side posting rate, doesn’t it?
Yes, it does. But more importantly, it has an effect on the posting user. It is a well-known fact that users who think that something is “being slow” are going to give up out of frustration and move on to something else.
There are a few reasons why this makes Hashcash appealing:
- This gives the owner of an inbox the ability to control the receipt of messages. Currently, pretty much the entire ActivityPub federated universe has control of how flooded or not an individual user’s inbox becomes, and that is clearly not acceptable.
- It does not increase the server’s load, on the average, because:
- Clients which behave (honor the postage) will have to generate the hash before attempting resubmission of a message rejected for insufficient postage. A client which behaves and submits a proper postage with the message has done nearly all the work; the server can verify without actually redoing the work, in an expedient and efficient manner.
- Clients which do not behave are easily detected and can be automatically blocked, at the IP layer, because their pattern “sticks out” and can be considered unusual and indicative of potential abuse.
- The “scale” at which this operates is tiny: an individual must opt-in (manually toggle the preference on) before anything changes for that account. An individual must also change the slider from zero to a non-zero value for it to become effective. The impact is strictly limited to users who mention or direct message the user, and nobody else.
Perhaps the only unappealing thing is that there is no way to know what the slider’s setting should be for any given scenario: every one would be different. Would a 10-second cooling period have dissipated the mob in the case of @firstname.lastname@example.org? Maybe. What is for sure is that this feature, or something like it, does not exist now. If it did exist, it would grant users additional abilities in controlling their own inbox, at (nearly) no cost to the instance itself: the only additional cost to the server is rejecting a message when it has insufficient postage.
OK, but how is this better than server-enforced rate limits?
This is important to understand: this feature is being suggested in order to allow an individual to protect its inbox against assault by other entities within the entire federated universe.
It is not intended to be always-on.
It is intended to be used strictly as a response to a “storm” directed at a single user, as was the case for @email@example.com.
So this means that an instance should be able to keep its (very small, possibly zero) set of members who have turned this feature on in a small, in-memory table which contains each user’s local name and the number of zero bits required.
This also means that if an instance has a high percentage (like, more than 1 or 2% out of a population greater than 100) of people using this feature, something is wrong and this becomes a useful flag to the administrator that this is the case.
Essentially, if the feature incurs any noticeable burden on the server at all, it is because it is host to a large number of people who are either paranoid, or on an instance which is hostile, uncontrolled, or as the Mastodon blocklist says, is a “free-speech zone”. In that case, the instance administrator knows about the costs that it is incurring and likely has to do a lot to keep its personal entertainment running in the first place.
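The small in-memory table described above could be little more than a dict. A minimal sketch (names are illustrative), with the self-cancellation timeout from the updates at the end of this document folded in, so that expiry happens lazily on lookup and no timers are needed:

```python
import time

SELF_CANCEL_SECONDS = 72 * 3600  # lower end of the 72–144 hour window

# local username -> (required leading-zero bits, time the shield was enabled)
shielded: dict[str, tuple[int, float]] = {}

def enable(username: str, bits: int) -> None:
    shielded[username] = (bits, time.monotonic())

def required_bits(username: str) -> int:
    """Return the postage requirement for a local user; 0 means none.
    Stale entries self-cancel on lookup, so no timers are needed."""
    entry = shielded.get(username)
    if entry is None:
        return 0
    bits, enabled_at = entry
    if time.monotonic() - enabled_at > SELF_CANCEL_SECONDS:
        del shielded[username]  # self-cancellation
        return 0
    return bits
```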
So how is it better than using the database as a lookup source? Because:
- A mobile client may change cells or regions while waiting for the postage to be generated. While cell handoffs happen easily enough, crossing region boundaries will reset all TCP connections. But this is transparent if the client simply sends the message again, as opposed to holding a connection open.
- Erratic, improper, abusive, or pathological behaviors become more easily detected and handled by automated systems, reducing the burden on the server as well as the server’s administration.
- Abuse of the feature has only one result: nobody will talk to that person. Effectively, very little power will be wasted because unlike Bitcoin miners, humans at the keyboard/smartphone have little patience and move on quite quickly.
- No implementation will ever require the server to decide to wait, and no programmer will be driven to use timer resources for rate limits anywhere in the system, because they become unnecessary.
Any other solution, in order to scale, would require additional middleware to offload the work from the Mastodon application, increasing the management burden of maintaining an instance of any size that federates.
Simply put: it puts additional control in the hands of the receiver of messages, while at the same time only incurring any cost whatsoever if/when a user enables it.
The feature should be big and scary looking, like a big red button that shuts down a data center. It should be stupidly clear that it is enabled, and even if all that is ignored, the impact is limited to the user who doesn’t want anyone talking to it in the first place. Clients give up quickly; far more quickly than the maximum delay would be able to be set at.
I encourage feedback and discussion of this. I’d like to see this feature, or something like it that empowers the user to control its own inbox. If not this idea, then something else which scales as well as it or better (I don’t think that the impact on the server can be made any more minimal) and gives as much or more control to the user over the user’s own inbox.
2018-09-01T13:59:14-04:00: I think that the word “protection” carries with it the implication that the feature should be always on. Another name would be better suited; perhaps “Storm Shield” or something. I don’t know. That sounds cheesy.
2018-09-01T22:57:34-04:00: A side effect of this is that work is reduced on the administrators/moderators of an instance. I’m not entirely sure if this is a net positive or net negative in the even larger picture. But I remain convinced that the targeted user is the most important thing. I see this as a potentially useful side effect: it gives the people who are performing the harassment something of a chance to consider their behavior and maybe improve themselves before they themselves become reported. Thanks to @firstname.lastname@example.org for pointing this out!
Thanks to @email@example.com for the inspiration here. This has been integrated into the document above.
2018-09-02T01:05:20-04:00: Additionally, a self-cancellation timeout could be implemented. This should, to reduce burden on the server, be implemented as a simple integer count of seconds which is stored alongside the target value, together with a timestamp indicating when the function was enabled so that efficient checks for expiration are possible. A period of 72–144 hours would seem to be reasonable.
Again, thanks to @firstname.lastname@example.org for the inspiration here. This has been integrated into the document above.
2018-09-02T01:05:20-04:00: An alternative idea for the UI: instead of two widgets (on/off switch, slider) as proposed above, the implementation could have a “panic button”. When this panic button is hit, it would introduce a small delay (say, 3 seconds). If the effects of the storm subside, this is all that would then be needed. Each time the button is hit again, the target delay increases by three seconds, perhaps with an upper limit (at which point the button becomes disabled/insensitive/inactive). Do note, however, that this depends on the self-cancellation timeout described above.