Designing a Vision Transformer based Enhanced Text Extractor for Product Images.

COMAD/CODS (2023)

Abstract
Product images, such as those that appear on e-commerce sites, exhibit characteristics that are typically not present in natural images. The primary distinguishing characteristic is the presence of text (e.g., brand names, price, constituents) together with high local entropy, i.e., a great deal of visual information, in the form of both text and brightly coloured pictures, condensed in a small region. Extracting the text from these images has multiple benefits: catalogue enrichment, product matching, offensive content identification, and more. However, the images are sometimes unclear and blurry, making the text difficult to recognise even for human perception; the text is often written in non-standard fonts (at times each character in a word has a different colour and/or style), oriented at odd angles, or placed on curved surfaces; moreover, many of these words, such as brand names, do not appear in dictionaries. In this work, we present a vision transformer based text extractor that handles the aforementioned challenges for product images effectively and outperforms our earlier model considerably. We further compare our new end-to-end text extraction solution with the Google and Azure text extraction cloud offerings, and showcase its efficacy in terms of both accuracy and latency.
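
The abstract does not include code, so the sketch below only illustrates the general idea of a vision-transformer-based text recogniser using an off-the-shelf ViT encoder-decoder OCR model (Microsoft's TrOCR via Hugging Face). This is not the authors' architecture; the model checkpoint, image path, and preprocessing choices are assumptions made purely for demonstration.

# Illustrative sketch only: a generic ViT encoder-decoder text recogniser (TrOCR),
# not the model proposed in the paper. Assumes `torch`, `transformers`, and
# `Pillow` are installed; the image path is a placeholder.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

# A cropped word/line region from a product image (hypothetical file).
image = Image.open("product_crop.png").convert("RGB")

# The ViT encoder consumes fixed-size image patches; the transformer decoder
# autoregressively emits text tokens, giving an end-to-end recogniser.
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values, max_new_tokens=32)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)

In practice, such a recogniser would be preceded by a text-detection stage that crops word or line regions from the full product image; the paper's end-to-end system and its comparison against the Google and Azure cloud OCR offerings are evaluated on accuracy and latency, as stated above.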