Miffyli/im2latex-dataset: Python tools for creating suitable dataset for OpenAI ...

Open-source project name:

Miffyli/im2latex-dataset

Open-source project URL:

https://github.com/Miffyli/im2latex-dataset

Programming language:

Python 100.0%

Project introduction:

im2latex-dataset

Python tools for creating a suitable dataset for OpenAI's im2latex task: https://openai.com/requests-for-research/#im2latex. You can download a prebuilt dataset from here. The data is split into train (~84k), validation (~9k) and test (~10k) sets, which may not be quite enough for this task.

Note: This code is very ad hoc and requires tinkering with the source.

Contents

  • /src/latex2formulas.py
    • Script for parsing downloaded LaTeX sources for formulas. Stores the formulas in a single .txt file (one formula per line).
  • /src/stackexchange2formulas.py
    • Similar to latex2formulas.py, but for parsing StackExchange XMLs.
  • /src/arxiv2formulas.py
    • Similar to latex2formulas.py, but for parsing arXiv .tar/.tar.gz files (source downloads).
  • /src/formula2image.py
    • Creates images and a dataset from a file of formulas.
  • /src/im2latex_utils.py
    • Collection of miscellaneous functions for handling these formulas.
  • latex_urls.txt
    • Text file containing URLs to the LaTeX dataset from here. Use wget -i latex_urls.txt to download these files.
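
To make the idea behind the ...2formulas.py scripts concrete, here is a minimal Python 3 sketch of pulling display-math formulas out of a LaTeX string and flattening each one onto a single line. It is not the repository's implementation: the name extract_formulas and the three patterns are assumptions chosen for illustration, and the real scripts may match different environments.

```python
import re

# Hypothetical patterns for illustration; the repository's scripts may differ.
FORMULA_PATTERNS = [
    re.compile(r"\\begin\{equation\*?\}(.*?)\\end\{equation\*?\}", re.DOTALL),
    re.compile(r"\\\[(.*?)\\\]", re.DOTALL),
    re.compile(r"\$\$(.*?)\$\$", re.DOTALL),
]


def extract_formulas(latex_source):
    """Return display-math formulas found in a LaTeX string, one line each."""
    formulas = []
    for pattern in FORMULA_PATTERNS:
        for match in pattern.findall(latex_source):
            # Collapse newlines and runs of spaces so each formula fits on one line.
            formula = " ".join(match.split())
            if formula:
                formulas.append(formula)
    return formulas


sample = r"""
Text before.
\begin{equation}
E = mc^2
\end{equation}
More text \[ a^2 + b^2 = c^2 \] and $$x_1 +
x_2$$ end.
"""
for formula in extract_formulas(sample):
    print(formula)
```

Results are grouped by pattern rather than by position in the source, and the \[ ... \] pattern can also match a line break with an optional argument such as \\[2pt], so a real extractor would need more careful handling.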

Dependencies

  • Python 2.x or 3.x (only tested on 2.x, but should work on 3.x too; not tested on Windows)
  • For running the script with current settings and generating full-page images:
    • Properly installed LaTeX-to-PDF chain (e.g. calling pdflatex outputs a .pdf for a .tex file)
    • ImageMagick installed so that the convert command works
  • For creating more compact images of formulas (image cropped so that the formula fits):
    • textogif and its dependencies
    • textogif needs to be placed in the same directory where the images are generated, otherwise it won't work.
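
Before running anything, it can help to check that the external commands are actually reachable. The sketch below is a Python 3 example (shutil.which and subprocess.run do not exist in Python 2) and is not part of the repository: missing_tools and run_with_timeout are hypothetical helper names. The second helper takes a time limit because an external command such as pdflatex can hang.

```python
import shutil
import subprocess
import sys


def missing_tools(tools=("pdflatex", "convert")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def run_with_timeout(command, timeout_seconds):
    """Run a command; return True on exit code 0, False on failure or timeout."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return False
    except OSError:
        # The executable could not be started (for example, it is not installed).
        return False
    return result.returncode == 0


print("Missing tools:", missing_tools())
# Stand-in command for demonstration. A real call might look like:
# run_with_timeout(["pdflatex", "-interaction=nonstopmode", "formula.tex"], 10)
print(run_with_timeout([sys.executable, "-c", "print('hello')"], 10))
```

The 10-second limit in the commented call is an arbitrary choice; a suitable value depends on the machine and on how complex the formulas are.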

Building your own dataset

  1. Download a bunch of LaTeX sources packed in .tar files (by using latex_urls.txt, for example)
  2. Run python latex2formulas.py [directory where .tars are stored]
  3. Run python formula2image.py [path to generated formula text file]
  4. Run python formula2image.py [dataset_file] [formula_file] [image_dir] to confirm the dataset is valid
  • The end result should have two files and one directory (names can be changed in formula2image.py):
    • im2latex.lst
      • Each line has the format formula_idx image_name render_type
        • formula_idx is the line number of the formula in im2latex_formulas.lst
        • image_name is the name of the image connected to this rendering (without '.png')
        • render_type is the name of the render setup used, defined in formula2image.py
    • im2latex_formulas.lst
      • Each line contains one formula
    • /formula_images
      • Directory where images are stored
  • Sometimes pdflatex gets stuck in an infinite loop when compiling an image.
    • To fix this you need to manually kill the stuck pdflatex processes, otherwise the script won't end.
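
The file layout described above can be read back with a few lines of Python. This is a sketch under stated assumptions, not code from the repository: parse_dataset_line and load_dataset are hypothetical names, formula_idx is assumed to count lines from 0, and the in-memory sample data is invented for the demonstration.

```python
import io


def parse_dataset_line(line):
    """Split one im2latex.lst line into (formula_idx, image_name, render_type)."""
    formula_idx, image_name, render_type = line.split()
    return int(formula_idx), image_name, render_type


def load_dataset(list_file, formula_file):
    """Yield (formula, image_filename, render_type) for every dataset entry."""
    formulas = [line.rstrip("\n") for line in formula_file]
    for line in list_file:
        if not line.strip():
            continue  # skip blank lines
        formula_idx, image_name, render_type = parse_dataset_line(line)
        # Assumption: formula_idx is 0-based; adjust if the files count from 1.
        yield formulas[formula_idx], image_name + ".png", render_type


# Invented sample data standing in for im2latex_formulas.lst and im2latex.lst.
formula_file = io.StringIO("E = mc^2\na^2 + b^2 = c^2\n")
list_file = io.StringIO("1 img_b basic\n0 img_a basic\n")
for entry in load_dataset(list_file, formula_file):
    print(entry)
```

With real files, the two io.StringIO objects would be replaced by files opened with open(); the right text encoding depends on how the formula file was written.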

Issues and possible TODOs

  • If pdflatex is used with convert, this generates pictures of the whole page.
    • While this might be a good thing (e.g. fixed input size), it might also severely slow down training.
  • textogif generates smaller images, but these will have varying dimensions.
  • Possible TODOs:
    • Finish the tokenizer function / output a list of tokens instead of the raw formula in the formula list.
    • Add an accuracy metric (e.g. word error rate or similar).
    • Combine the ...2formulas.py scripts into one, or at least make the system more sensible than a bunch of separate scripts.
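
As a rough illustration of the tokenizer TODO, a regex-based splitter could look like the sketch below. This is an assumption about one possible design, not the unfinished tokenizer in the repository; TOKEN_PATTERN and tokenize are hypothetical names, and the pattern ignores many TeX subtleties (for example, it splits multi-digit numbers into single digits).

```python
import re

TOKEN_PATTERN = re.compile(
    r"\\[a-zA-Z]+"  # control words such as \frac or \alpha
    r"|\\."         # control symbols such as \{ or \\
    r"|\S"          # any other single non-space character
)


def tokenize(formula):
    """Split a TeX math string into a list of tokens, dropping whitespace."""
    return TOKEN_PATTERN.findall(formula)


print(tokenize(r"\frac{a}{b} + \alpha_1"))
```

Joining the tokens with single spaces gives a normalized formula string, which is one possible way to store a token list in a one-formula-per-line file.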

Ultimate goals (Update: Likely not going to happen, but kept here as food for thought)

  • To provide dataset suitable for solving im2latex task
    • So people can compare performances between systems
  • To provide the tools used to generate said dataset
    • So people can generate different kinds of images (quality, size), different formulas (different fonts), etc.
  • Misc tools for handling the datasets
    • TeX Math tokenizer (possibly)
    • Performance metric (takes list of true formulas and list of estimated formulas, outputs performance/accuracy)
    • Tools for modifying the images in the desired way
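
One candidate for the performance metric mentioned above is a token-level error rate based on edit distance, similar in spirit to word error rate. The Python 3 sketch below is only a possible shape for such a tool: edit_distance and token_error_rate are hypothetical names, and formulas are assumed to be already tokenized with single spaces between tokens.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def token_error_rate(true_formulas, estimated_formulas):
    """Total edit distance divided by total reference length, over all pairs."""
    total_errors = 0
    total_tokens = 0
    for truth, estimate in zip(true_formulas, estimated_formulas):
        reference = truth.split()
        hypothesis = estimate.split()
        total_errors += edit_distance(reference, hypothesis)
        total_tokens += len(reference)
    return total_errors / total_tokens if total_tokens else 0.0


true_formulas = ["a + b", "x ^ 2"]
estimated_formulas = ["a + b", "x ^ 3"]
print(token_error_rate(true_formulas, estimated_formulas))
```

A lower value is better, and 0.0 means every estimated formula matches its reference token for token. Note that two formulas that render identically can still differ as token sequences, so this metric is stricter than comparing the rendered images.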


