C# or Java? TypeScript or JavaScript? Machine learning based programming language ...

... esoteric languages such as Befunge, only known to very small communities.

Figure 1: Top 10 programming languages hosted by GitHub by repository count
One of the necessary challenges that GitHub faces is recognizing these different languages. When code is pushed to a repository, it is important to recognize the type of code that was added for the purposes of search, security vulnerability alerting, and syntax highlighting, and to show the repository's content distribution to users.
Linguist is the tool we currently use to detect coding languages at GitHub. Linguist is a Ruby-based application that uses various strategies for language detection, leveraging naming conventions and file extensions and also taking into account Vim or Emacs modelines, as well as the content at the top of the file (shebang). Linguist handles language disambiguation via heuristics and, failing that, via a Naive Bayes classifier trained on a small sample of data.
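To make the idea of a strategy cascade concrete, here is a minimal Python sketch. It is not Linguist's code (Linguist is written in Ruby): the lookup tables, the regular expressions, the order of the strategies, and the `classifier` fallback argument are all assumptions chosen for illustration.

```python
import re
from pathlib import PurePosixPath

# Hypothetical, tiny lookup tables; a real tool would carry far larger ones.
EXTENSIONS = {".py": "Python", ".rb": "Ruby", ".js": "JavaScript"}
INTERPRETERS = {"python": "Python", "python3": "Python", "ruby": "Ruby", "node": "JavaScript"}
MODES = {"python": "Python", "ruby": "Ruby", "javascript": "JavaScript"}

# Simplified patterns for "vim: set ft=ruby:" and "-*- mode: ruby -*-".
VIM_MODELINE = re.compile(r"\bvim?:.*?\b(?:ft|filetype)=(\w+)")
EMACS_MODELINE = re.compile(r"-\*-\s*(?:mode:\s*)?([\w+#-]+)\s*(?:;.*?)?-\*-")


def from_modeline(content):
    """Look for an editor modeline in the first and last few lines."""
    lines = content.splitlines()
    for line in lines[:5] + lines[-5:]:
        match = VIM_MODELINE.search(line) or EMACS_MODELINE.search(line)
        if match:
            return MODES.get(match.group(1).lower())
    return None


def from_shebang(content):
    """Map the interpreter named on a '#!' first line to a language."""
    first = content.split("\n", 1)[0]
    if not first.startswith("#!"):
        return None
    parts = first[2:].split()
    if not parts:
        return None
    # "#!/usr/bin/env python3" names the interpreter in the second field.
    if PurePosixPath(parts[0]).name == "env" and len(parts) > 1:
        parts = parts[1:]
    return INTERPRETERS.get(PurePosixPath(parts[0]).name)


def from_extension(filename):
    return EXTENSIONS.get(PurePosixPath(filename).suffix.lower())


def detect(filename, content, classifier=None):
    """Try each cheap strategy in turn, then fall back to a classifier."""
    for guess in (from_modeline(content), from_shebang(content), from_extension(filename)):
        if guess:
            return guess
    return classifier(content) if classifier else None
```

With these tables, `detect("script", "#!/usr/bin/env python3\nprint(1)\n")` resolves through the shebang, while a file with no modeline, no shebang, and an unknown extension reaches the fallback classifier.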
Although Linguist does a good job making file-level language predictions (84% accuracy), its performance declines considerably when files use unexpected naming conventions and, crucially, when a file extension is not provided. This renders Linguist unsuitable for content such as GitHub Gists or code snippets within READMEs, issues, and pull requests.
In order to make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua based on an Artificial Neural Network (ANN) architecture which can handle language predictions in tricky scenarios. The current version of the model is able to make predictions for the top 50 languages hosted by GitHub and surpasses Linguist in accuracy and performance.

The Nuts and Bolts Behind OctoLingua

OctoLingua was built from scratch using Python and Keras with a TensorFlow backend, and is designed to be accurate, robust, and easy to maintain. In this section, we describe our data sources, model architecture, and performance benchmark for OctoLingua. We also describe what it takes to add support for a new language.

Data sources

The current version of OctoLingua was trained on files retrieved from Rosetta Code and from a set of quality repositories that were crowdsourced internally. We limited our language set to the top 50 languages hosted on GitHub.
Rosetta Code was an excellent starter dataset as it contained source code for the same task expressed in different programming languages. For example, the task of generating a Fibonacci sequence is expressed in C, C++, CoffeeScript, D, Java, Julia, and more. However, the coverage across languages was not uniform: some languages had only a handful of files, and some files were too sparsely populated. Augmenting our training set with additional sources was therefore necessary, and it substantially improved language coverage and performance.
Our process for adding a new language is now fully automated. We programmatically collect source code from public repositories on GitHub. We choose repositories that meet minimum qualifying criteria, such as having a minimum number of forks, covering the target language, and covering specific file extensions. For this stage of data collection, we determine the primary language of a repository using the classification from Linguist.
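The qualifying criteria can be pictured as a simple filter over repository metadata. The sketch below is only an illustration: the `Repo` record, the `qualifies` function, and the `min_forks=10` threshold are hypothetical, since the actual thresholds and data model are not stated here.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Hypothetical record of the metadata needed for filtering."""
    name: str
    forks: int
    primary_language: str  # e.g. as reported by Linguist
    extensions: set = field(default_factory=set)


def qualifies(repo, language, wanted_extensions, min_forks=10):
    """True if the repo has enough forks, the target primary language,
    and at least one file with a wanted extension."""
    return (
        repo.forks >= min_forks
        and repo.primary_language == language
        and bool(repo.extensions & set(wanted_extensions))
    )


repos = [
    Repo("a/popular", forks=120, primary_language="Kotlin", extensions={".kt", ".md"}),
    Repo("b/tiny", forks=1, primary_language="Kotlin", extensions={".kt"}),
    Repo("c/other", forks=300, primary_language="Java", extensions={".java"}),
]
selected = [r.name for r in repos if qualifies(r, "Kotlin", {".kt", ".kts"})]
```

Only repositories that pass all three checks are kept for data collection.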

Features: leveraging prior knowledge

Traditionally, for text classification problems with Neural Networks, memory-based architectures such as Recurrent Neural Networks (RNN) and Long Short-Term Memory networks (LSTM) are often employed. However, given that programming languages differ in vocabulary, commenting style, file extensions, structure, library import style, and other minor details, we opted for a simpler approach that leverages all this information by extracting some relevant features in tabular form to be fed to our classifier. The features currently extracted are as follows:

Top five special characters per file
Top 20 tokens per file
File extension
Presence of certain special characters commonly used in source code files such as colons, curly braces, and semicolons
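As a rough illustration of the four feature groups listed above, the following sketch extracts them from a single file. The tokenizer (identifier-like words only), the definition of a "special character", and the function and key names are assumptions; how these features are then encoded into a numeric table is not covered.

```python
import re
from collections import Counter
from pathlib import PurePosixPath

# Assumption: a "token" is an identifier-like word.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Characters whose mere presence is recorded as a boolean feature.
FLAGGED = {"colon": ":", "curly_brace": "{", "semicolon": ";"}


def extract_features(filename, source):
    """Return a dict holding the four feature groups for one file."""
    # Assumption: "special" means neither alphanumeric, whitespace, nor "_".
    specials = Counter(
        ch for ch in source
        if not ch.isalnum() and not ch.isspace() and ch != "_"
    )
    tokens = Counter(TOKEN.findall(source))
    features = {
        "extension": PurePosixPath(filename).suffix.lower(),
        "top_special_chars": [ch for ch, _ in specials.most_common(5)],
        "top_tokens": [tok for tok, _ in tokens.most_common(20)],
    }
    for name, ch in FLAGGED.items():
        features[f"has_{name}"] = ch in source
    return features


example = extract_features("main.c", "int main() { return 0; }")
```

For the one-line C file above, the extension is `.c`, the curly brace and semicolon flags are set, and the colon flag is not.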

The Artificial Neural Network (ANN) model

We use the above features as input to a two-layer Artificial Neural Network built using Keras with a TensorFlow backend.
The diagram below shows that the feature extraction step produces an n-dimensional tabular input for our classifier. As the information moves along the layers of our network, it is regularized by dropout and ultimately produces a 51-dimensional output which represents the predicted probability that the given code is written in each of the top 50 GitHub languages plus the probability that it is not written in any of those.

Figure 2: The ANN structure of our initial model (50 languages + 1 for "other")
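The flow of data through such a network can be sketched as a forward pass: a dense layer, dropout, and a final dense layer whose softmax output has 51 entries. The sketch is written in plain Python rather than Keras so that it runs without dependencies, and it is not the production model: the feature count, hidden size, ReLU activation, dropout rate, and random weights are all assumptions.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output unit."""
    return [sum(x * w for x, w in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def dropout(values, rate, rng, training):
    """Inverted dropout: active only while training."""
    if not training or rate == 0.0:
        return values
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (untrained, for illustration)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(features, layers, rng, training=False, rate=0.5):
    (w1, b1), (w2, b2) = layers
    hidden = dropout(relu(dense(features, w1, b1)), rate, rng, training)
    return softmax(dense(hidden, w2, b2))


rng = random.Random(0)
N_FEATURES, N_HIDDEN, N_CLASSES = 30, 16, 51  # 50 languages + "other"
layers = [init_layer(N_FEATURES, N_HIDDEN, rng),
          init_layer(N_HIDDEN, N_CLASSES, rng)]
probs = predict([1.0] * N_FEATURES, layers, rng)
```

The output `probs` is a list of 51 probabilities that sum to one; the index of the largest entry would be the predicted class.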
We used 90% of our dataset for training over approximately eight epochs. Additionally, we removed a percentage of file extensions from our training data at the training step, to encourage the model to learn from the vocabulary of the files, and not overfit on the file extension feature, which is highly predictive.
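The training-time treatment described above can be sketched as a split followed by random masking of the extension feature. The sample layout (dicts with an `extension` key), the function name, and the `drop_fraction=0.5` default are assumptions; only the 90% training share comes from the text.

```python
import random


def split_and_mask(samples, train_fraction=0.9, drop_fraction=0.5, seed=0):
    """Shuffle, split into train/test, then blank the extension of a
    random share of the training samples. Test samples are untouched."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    masked = [
        {**s, "extension": ""} if rng.random() < drop_fraction else dict(s)
        for s in train
    ]
    return masked, test


samples = [{"extension": ".py", "language": "Python"} for _ in range(100)]
train, test = split_and_mask(samples)
```

Because the masked samples are copies, the original records keep their extensions and can be reused with a different `drop_fraction`.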

Performance benchmark

OctoLingua vs. Linguist
In Figure 3, we show the F1 score (harmonic mean of precision and recall) of OctoLingua and Linguist calculated on the same test set (10% of our initial data source).
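For reference, the F1 score for a single language can be computed from true and predicted labels as below. This is a generic sketch; how the per-language scores were aggregated into the reported figures is not stated, so no averaging is shown.

```python
def f1_score(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = ["Java", "Java", "Python", "C#"]
y_pred = ["Java", "C#", "Python", "C#"]
java_f1 = f1_score(y_true, y_pred, "Java")  # precision 1.0, recall 0.5
```

In this toy example one of the two Java files is mistaken for C#, so Java has perfect precision but only half the recall, giving an F1 of 2/3.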
Here we show three tests. The first test is with the test set untouched in any way. The second test uses the same set of test files with file extension information removed, and the third test also uses the same set of files but this time with file extensions scrambled so as to confuse the classifiers (e.g., a Java file may have a ".txt" extension and a Python file may have a ".java" extension).
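The three test variants can be derived from one list of file names, as in the sketch below. The function names and the extension pool are hypothetical, and the scrambling rule (pick any extension other than the file's own) is an assumption about what "scrambled" means here.

```python
import random
from pathlib import PurePosixPath


def strip_extension(filename):
    """Remove the final suffix, if any."""
    return str(PurePosixPath(filename).with_suffix(""))


def scramble_extension(filename, pool, rng):
    """Replace the suffix with a different one drawn from the pool."""
    path = PurePosixPath(filename)
    wrong = [ext for ext in pool if ext != path.suffix]
    return str(path.with_suffix(rng.choice(wrong)))


def build_variants(filenames, pool, seed=0):
    """Return the untouched, extension-free, and scrambled test sets."""
    rng = random.Random(seed)
    return {
        "original": list(filenames),
        "no_extension": [strip_extension(f) for f in filenames],
        "scrambled": [scramble_extension(f, pool, rng) for f in filenames],
    }


files = ["Main.java", "app.py"]
pool = [".java", ".py", ".txt"]
variants = build_variants(files, pool)
```

File contents stay identical across the three variants; only the name seen by the classifier changes.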
The intuition behind scrambling or removing the file extensions in our test set is to assess the robustness of OctoLingua in classifying files when a key feature is removed or is misleading. A classifier that does not rely heavily on extension would be extremely useful to classify gists and snippets, since in those cases it is common for people not to provide accurate extension information (e.g., many code-related gists have a .txt extension).
The table below shows how OctoLingua maintains a good performance under various conditions, suggesting that the model learns primarily from the vocabulary of the code, rather than from meta information (i.e. file extension), whereas Linguist fails as soon as the information on file extensions is altered.

Figure 3: Performance of OctoLingua vs. Linguist on the same test set

Effect of removing file extension during training time
As mentioned earlier, during training time we removed a percentage of file extensions from our training data to encourage the model to learn from the vocabulary of the files. The table below shows the performance of our model with different fractions of file extensions removed during training time.

Figure 4: Performance of OctoLingua with different percentages of file extensions removed on our three test variations
Notice that with no file extension removed during training time, the performance of OctoLingua on test files with no extensions and randomized extensions decreases considerably from that on the regular test data. On the other hand, when the model is trained on a dataset where some file extensions are removed, the model performance does not decline much on the modified test set. This confirms that removing the file extension from a fraction of files at training time induces our classifier to learn more from the vocabulary. It also shows that the file extension feature, while highly predictive, had a tendency to dominate and prevented more weights from being assigned to the content features.

Supporting a new language

Adding a new language in OctoLingua is fairly straightforward. It starts with obtaining a large batch of files in the new language (we can do this programmatically as described in Data sources). These files are split into a training set and a test set and then run through our preprocessor and feature extractor. The new training and test sets are added to our existing pool of training and testing data. The new test set allows us to verify that the accuracy of our model remains acceptable.
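These steps can be sketched as one function that splits the new files, runs a feature extractor over them, and appends the results to the existing pool. The pool layout, the `featurize` callable, and the 10% test share are assumptions made for illustration; retraining and the accuracy check are left out.

```python
import random


def add_language(pool, files, language, featurize, test_fraction=0.1, seed=0):
    """Return a new pool with the new language's samples added.

    pool: {"train": [(features, label), ...], "test": [...]}
    files: iterable of (filename, source) pairs for the new language
    featurize: hypothetical callable (filename, source) -> features
    """
    rng = random.Random(seed)
    shuffled = list(files)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    rows = [(featurize(name, source), language) for name, source in shuffled]
    return {
        "train": pool["train"] + rows[n_test:],
        "test": pool["test"] + rows[:n_test],
    }


pool = {"train": [("x", "Python")] * 9, "test": [("x", "Python")]}
kotlin_files = [(f"f{i}.kt", "fun main() {}") for i in range(20)]
new_pool = add_language(pool, kotlin_files, "Kotlin", lambda name, src: len(src))
```

The existing pool is left unmodified, so the model can be evaluated on both the old and the extended test sets.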

Figure 5: Adding a new language with OctoLingua

Our plans

As of now, OctoLingua is at the "advanced prototyping stage". Our language classification engine is already robust and reliable, but does not yet support all coding languages on our platform. Aside from broadening language support, which would be rather straightforward, we aim to enable language detection at various levels of granularity. Our current implementation already allows us, with a small modification to our machine learning engine, to classify code snippets. It wouldn't be too far-fetched to take the model to the stage where it can reliably detect and classify embedded languages.
We are also contemplating the possibility of open sourcing our model and would love to hear from the community if you’re interested.

Summary

With OctoLingua, our goal is to provide a service that enables robust and reliable source code language detection at multiple levels of granularity, from file level or snippet level to potentially line-level language detection and classification. Eventually, this service can support, among others, code searchability, code sharing, language highlighting, and diff rendering, all aimed at supporting developers in their day-to-day development work in addition to helping them write quality code. If you are interested in leveraging or contributing to our work, please feel free to get in touch on Twitter @github!