I would be interested to see how others deal with this also.
As a travel company, we have a large number of bots / competitors price scraping our site. To maximise compatibility between Adobe tools (Workspace / Reports / Ad Hoc / Data Warehouse etc.) we have created a Virtual Report Suite that filters out as much bot traffic as possible. However, we do not actually filter on IP but rather on domain; using this technique, what we surface is troublesome domains rather than IP addresses.
Our main check looks at domains with >100 unique visitors in a week, a visits-per-visitor ratio of 1.0, and a conversion rate of 0%. This approach has also highlighted a few things for us:
User agent strings are often a good indicator of a bot
Adobe's domain list is not awful, but it is clearly not as up to date as others
Query string parameters can also help identify a bot if it is manipulating on-site search.
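For anyone wanting to replicate the main check above outside of Adobe, here is a minimal sketch in pandas. The DataFrame and its column names are assumptions for illustration, not actual Adobe Analytics export fields; the filter itself is the heuristic described in the post (>100 unique visitors in a week, visits per visitor exactly 1.0, zero conversions):

```python
import pandas as pd

# Hypothetical weekly rollup by domain; the domains and numbers are made up.
weekly = pd.DataFrame({
    "domain": ["example-isp.net", "scraper-host.com", "home-isp.co.uk"],
    "unique_visitors": [5000, 2400, 180],
    "visits": [9500, 2400, 310],
    "orders": [120, 0, 4],
})

weekly["visits_per_visitor"] = weekly["visits"] / weekly["unique_visitors"]
weekly["conversion_rate"] = weekly["orders"] / weekly["visits"]

# The heuristic: >100 unique visitors in a week, a visits-per-visitor
# ratio of exactly 1.0, and a conversion rate of 0%.
suspect = weekly[
    (weekly["unique_visitors"] > 100)
    & (weekly["visits_per_visitor"] == 1.0)
    & (weekly["conversion_rate"] == 0)
]
print(suspect["domain"].tolist())  # → ['scraper-host.com']
```

In practice you would feed the flagged domains back into a segment or exclusion rule rather than stop at the list.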
As for syncing this retrospectively with our data lake, well that's another problem 😕