What you experienced in your test case is probably the effect of length normalization: relevance is not based on raw term frequency but on term frequency divided by document length, which gives shorter documents a chance to rank as relevant.
In your test, that gives normalized frequencies of 1/1, 2/2 and 3/3. These all equal 1, so the documents score the same and their order looks random.
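To make the tie concrete, here is a small sketch that computes the weight for three hypothetical test documents. Each document consists only of the search term, repeated 1, 2 and 3 times. The function names are made up for illustration. The second function assumes the index uses Lucene's classic TF-IDF similarity, where the term-frequency factor is `sqrt(freq)` and the length norm is `1/sqrt(numTerms)`. Whether your repository's index is configured that way is an assumption, but both variants tie in the same way.

```python
import math


def simple_normalized_tf(term_count: int, doc_length: int) -> float:
    """Term frequency divided by document length, as described above."""
    return term_count / doc_length


def classic_lucene_weight(term_count: int, doc_length: int) -> float:
    """sqrt(freq) * 1/sqrt(length): the tf and length-norm factors of
    Lucene's classic TF-IDF similarity (idf, boosts etc. omitted)."""
    return math.sqrt(term_count) / math.sqrt(doc_length)


# Hypothetical test documents: the term repeated 1, 2 and 3 times with no
# other words, so each document's length equals its term count.
docs = ["jelly", "jelly jelly", "jelly jelly jelly"]

for text in docs:
    words = text.split()
    count = words.count("jelly")
    print(f"{text!r}: simple={simple_normalized_tf(count, len(words))}, "
          f"classic={classic_lucene_weight(count, len(words))}")
```

Every document gets the same weight under both formulas, so there is nothing for the engine to sort on. Adding unrelated words to one of the documents lowers its weight, which is why real-world content gives more meaningful rankings.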
If you want to validate the query, I suggest testing it with real-world content.
This is a tricky requirement. I believe it can be achieved with a custom predicate that sorts the results by the number of occurrences (count) of the search term. Here's another forum thread that is somewhat similar, but about pages, where the requirement was to find a search term that occurs exactly twice.
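The ordering such a custom predicate would have to implement can be sketched independently of any particular API. The sketch below is not the QueryBuilder predicate interface. It only shows the comparison logic: count how often the term occurs in each result's text and sort by that count, descending. The function names are hypothetical. Whole-word, case-insensitive matching is an assumption, and in a real predicate the text would come from the properties of each result node.

```python
import re
from typing import List


def count_occurrences(text: str, term: str) -> int:
    """Count whole-word, case-insensitive occurrences of term in text."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def sort_by_occurrences(texts: List[str], term: str) -> List[str]:
    """Order texts by descending occurrence count of term.

    sorted() is stable (also with reverse=True), so texts with equal
    counts keep their original relative order.
    """
    return sorted(texts, key=lambda t: count_occurrences(t, term), reverse=True)


results = ["a jelly sandwich", "jelly, jelly and more jelly", "jelly jelly"]
print(sort_by_occurrences(results, "jelly"))
# prints ['jelly, jelly and more jelly', 'jelly jelly', 'a jelly sandwich']
```

Note that this ignores document length entirely, which is exactly the opposite of the normalized relevance score. Counting in application code also means reading every result, so it may be expensive on large result sets.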
Another thought on the requirement itself: relevance is hard to derive from a single search term. However, you could try using boosts in the query, similar to the example below. Hope this helps!
    jcr:contains(., 'jelly sandwich^4')

In this example, the word "sandwich" carries four times the weight of the word "jelly".
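If the boosts come from configuration or user input, the clause can be assembled programmatically. The helper below is a hypothetical sketch, not part of any JCR library. It appends `^boost` to every term whose boost differs from the default of 1, and it doubles single quotes, which is how an XPath string literal escapes them. Escaping other full-text special characters is left out and would be an additional assumption to handle.

```python
from typing import Mapping


def build_contains(terms: Mapping[str, float], path: str = ".") -> str:
    """Build a jcr:contains() clause from a term -> boost mapping.

    Terms with boost 1 are emitted without a suffix (1 is the default
    boost); other terms get '^boost'. Single quotes are doubled so the
    expression stays a valid XPath string literal.
    """
    parts = []
    for term, boost in terms.items():
        parts.append(term if boost == 1 else f"{term}^{boost:g}")
    expression = " ".join(parts).replace("'", "''")
    return f"jcr:contains({path}, '{expression}')"


print(build_contains({"jelly": 1, "sandwich": 4}))
# prints jcr:contains(., 'jelly sandwich^4')
```

Passing a property path such as `@jcr:title` instead of `.` restricts the search to that property.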