Honestly, this seems almost an impossible task to do in realtime. Maybe, what could work is using some ML model for detecting text and then do a second pass trough the detected regions, so some preprocessing and run an OCR algorithm.
Even with this, i think the final result will be garbage anyway
Not sure what is your end goal, but if you want differentiate between players/stones/mobs, what i would do is train a YOLO model for it.
Label around 1000 images with a 2d box and train the Neural Network it will give awesome results in realtime and nowadays this is ferly easy do develop.