@ayushn Adobe released a new feature last year, AEM Remote SPA with Next.js, and it works really well. Your Next.js app can use either React SSR or React CSR and still use the AEM editor functionality as-is for content management.
https://experienceleague.adobe.com/docs/experience-manager-learn/getti...
Please refer to the tokenizer section in the blog post below. There are already multiple tokenizers available, so why do you want to build a custom one?
https://www.bounteous.com/insights/2018/06/07/aem-search-indexing-synonyms-filters-and-stop-words-oh-my
@reno1 As posted by @aanchal-sikka and @sherinregi, this is not possible with custom Git; please use Adobe Git instead.
@reno1 Is there any reason why you don't want to use Adobe Git?
@j408 Have you seen https://www.aemcq5tutorials.com/tutorials/text-pre-extraction/ and https://www.tumblr.com/cq-ops/17803840654/how-to-extract-content-and-metadata-from-pdfs ?