metadata
license: gpl-3.0
tags:
  - ui-automation
  - automation
  - agents
  - llm-agents
  - vision

Model card for PTA-Text - A Text Only Click Model

Table of Contents

  1. TL;DR
  2. Installation
  3. Running the model
  4. Contribution
  5. Citation

TL;DR

Details for PTA-Text:

-> Input: An image with a header containing the desired UI click command.

-> Output: An [x, y] click position in relative coordinates, with each value in the 0-1 range.

PTA-Text is an image encoder based on MatCha, which is an extension of Pix2Struct.

Installation

pip install askui-ml-helper

Download the ".pt" checkpoint from the files in this model card, or download it from your terminal:

curl -L "https://huggingface.co/AskUI/pta-text-0.1/resolve/main/pta-text-v0.1.1.pt?download=true" -o pta-text-v0.1.1.pt
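If curl is not available, the same download can be sketched with Python's standard library. This is a minimal sketch, not part of askui-ml-helper: the helper name `download_checkpoint` and the ".part" temporary-file convention are assumptions made for illustration, and only the URL comes from the command above.

```python
import shutil
import urllib.request
from pathlib import Path

# URL taken from the curl command above.
CHECKPOINT_URL = (
    "https://huggingface.co/AskUI/pta-text-0.1/resolve/main/"
    "pta-text-v0.1.1.pt?download=true"
)


def download_checkpoint(url: str, destination: str) -> Path:
    """Download `url` to `destination`; skip the download if the file exists.

    Hypothetical helper for illustration, not an askui-ml-helper function.
    """
    target = Path(destination)
    if target.exists():
        return target
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated checkpoint under the final name.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(target)
    return target


# Usage (performs a network request, so it is left commented out):
# download_checkpoint(CHECKPOINT_URL, "pta-text-v0.1.1.pt")
```

Skipping an existing file keeps repeated runs cheap; delete the local file to force a fresh download.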

Running the model

Get the annotated image

You can run the model in full precision on CPU:

import requests
from PIL import Image
from askui_ml_helper.utils.pta_text import PtaTextInference

pta_text_inference = PtaTextInference("pta-text-v0.1.1.pt")
url = "https://docs.askui.com/assets/images/how_askui_works_architecture-363bc8be35bd228e884c83d15acd19f7.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = 'click on the text "Operating System"'

render_image = pta_text_inference.process_image_and_draw_circle(image, prompt, radius=15)
render_image.show()
# Shows the image with a red dot at the predicted click position.


Get the coordinates

import requests
from PIL import Image
from askui_ml_helper.utils.pta_text import PtaTextInference

pta_text_inference = PtaTextInference("pta-text-v0.1.1.pt")
url = "https://docs.askui.com/assets/images/how_askui_works_architecture-363bc8be35bd228e884c83d15acd19f7.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = 'click on the text "Operating System"'

coordinates = pta_text_inference.process_image(image, prompt)
print(coordinates)
# Example output: [0.3981265723705292, 0.13768285512924194]
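Because the returned coordinates are relative, they have to be scaled by the size of the target image or screen before they can drive a click. The following is a minimal sketch of that conversion. The helper `to_pixels` is hypothetical (it is not part of askui-ml-helper), the 1920x1080 size is an assumed example, and rounding to the nearest pixel and clamping to the image bounds are choices made here rather than documented model behavior.

```python
from typing import Sequence, Tuple


def to_pixels(relative_xy: Sequence[float], width: int, height: int) -> Tuple[int, int]:
    """Convert relative [x, y] in the 0-1 range to integer pixel coordinates.

    Hypothetical helper for illustration; not an askui-ml-helper function.
    """
    rel_x, rel_y = relative_xy
    # Scale by the image size, then clamp so that a value of exactly 1.0
    # still maps to the last valid pixel index.
    pixel_x = min(max(round(rel_x * width), 0), width - 1)
    pixel_y = min(max(round(rel_y * height), 0), height - 1)
    return pixel_x, pixel_y


# Example with the coordinates shown above and an assumed 1920x1080 screenshot.
coordinates = [0.3981265723705292, 0.13768285512924194]
print(to_pixels(coordinates, 1920, 1080))  # (764, 149)
```

When working with a PIL image, `image.size` gives the `(width, height)` pair to pass in.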

Contribution

An AskUI open source initiative. This model was contributed to the Hugging Face ecosystem by Murali Manohar @ AskUI.

Citation

TODO