R for Stats and Visualizations


R is an extremely popular environment among statisticians, data analysts, and computer scientists. It began as a project between two colleagues interested in statistical computing and has grown into one of the most popular programming environments, used in business settings as well as in academia. Soon after its initial release, R was adopted by statisticians and engineers without programming backgrounds. R runs in numerous environments and is known both for statistical calculation and for flexible graphics. It has become a standard in industry and academia for computation and visualization, and it can also be used for machine learning algorithms and distributed computing systems.

R’s History

Two colleagues in the Statistics Department at the University of Auckland in New Zealand, Ross Ihaka and Robert Gentleman, saw a need for a better software environment. Both were interested in statistical computation and wanted functionality like that of Scheme, so they began work on a small Scheme-like interpreter for statistical computations. Consisting of around 1,000 lines of C code, the initial interpreter provided a good deal of R’s functionality.15 Because of their familiarity with S, they chose to use an S-like syntax and, according to Ihaka, “this decision, more than anything else, has driven the direction that R development has taken.”15

Open Sourcing

In August of 1993, after an initial version of R was implemented, Ihaka and Gentleman placed some binary copies on StatLib, a system for distributing statistical software, datasets, and information. Several people picked up the binaries and provided feedback.15 Martin Mächler of ETH Zurich insisted that the two release R as “free software” and, in June of 1995, Ihaka and Gentleman released R under the GPL. By 1997, a larger “Core Group” had joined the main development of R, and the Comprehensive R Archive Network (CRAN) was established. Since 1997, write access to the R source code has been limited to the R Core Team, and bug reports are accepted from R users via email.13

The Language

The R language is a statistical computation environment licensed under the GNU General Public License (GPL) with syntax based on the S language, another high-level statistical computation language.9 The R environment also allows interfacing with procedures written in C, C++, or FORTRAN. The base distribution of R contains functionality for many statistical procedures as well as access to numerous packages through CRAN.9
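As a rough illustration of the statistical procedures available in the base distribution, the sketch below fits a linear model and runs a two-sample t-test. The choice of the built-in mtcars dataset and of these two procedures is an assumption made for the example; it is not drawn from the sources cited here.

```r
# mtcars ships with base R (datasets package); lm() and t.test() are in stats.

# Linear regression: fuel economy (mpg) as a function of weight (wt).
fit <- lm(mpg ~ wt, data = mtcars)
slope <- coef(fit)[["wt"]]   # estimated change in mpg per unit of weight

# Welch two-sample t-test: mpg for automatic (am = 0) vs. manual (am = 1) cars.
test <- t.test(mpg ~ am, data = mtcars)

print(summary(fit))
print(test)
```

No package installation is needed for this sketch, since everything it calls is attached by default when R starts.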

Differences Between R and S

While the syntax of R is heavily influenced by S, two fundamental differences between R and S result from R’s Scheme heritage: the use of a garbage collector for memory management, and lexical scoping.15

Memory Management

On startup, R allocates a block of memory and uses a garbage collection strategy to manage internal objects. At the end of a session, the current objects within R’s memory can be saved.3
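A minimal sketch of both ideas, assuming a throwaway temporary file and illustrative object names (scores, restored): gc() asks the collector to run and reports memory use, while save() and load() write objects out and bring them back. save.image() is the variant that saves the whole workspace, which is what R offers to do when a session ends.

```r
# Allocate a large vector, drop the only reference to it, and collect.
x <- numeric(1e6)
before <- gc()   # gc() returns a matrix summarizing memory in use
rm(x)
after <- gc()    # the memory held by x is now eligible for reuse

# Save an object to a file and restore it into a fresh environment.
scores <- c(90, 85, 77)
workspace_file <- tempfile(fileext = ".RData")   # hypothetical location
save(scores, file = workspace_file)

restored <- new.env()
load(workspace_file, envir = restored)
print(restored$scores)
```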

Lexical Scoping

Variables within R are lexically scoped.9 Because of this, a function can access the variables of the context in which it was defined. In the example below, the inner function sq uses n from the enclosing call to cube.

# cube() returns n^3; the inner sq() finds n in the enclosing function.
cube <- function(n) {
    sq <- function() n * n
    return(n * sq())
}
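Lexical scoping also means a function keeps access to the environment it was created in after the outer function has returned. The counter below is a small sketch of that behavior; make_counter and its variable names are hypothetical and chosen only for illustration.

```r
# make_counter() returns a function that remembers its own `count`.
make_counter <- function() {
    count <- 0
    function() {
        # `<<-` assigns to `count` in the enclosing (defining) environment.
        count <<- count + 1
        count
    }
}

counter_a <- make_counter()
counter_b <- make_counter()

first  <- counter_a()   # uses counter_a's private count
second <- counter_a()   # same count, incremented again
other  <- counter_b()   # counter_b has a separate count
print(c(first, second, other))
```

Each call to make_counter creates a new environment, so the two counters do not share state.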

Environments

R consists of the R language, a run-time environment, and a debugger.15 R can be run interactively in read-eval-print loop (REPL) mode, or programs can be written to files and run later. R is also supported by multiple third-party environments, such as Jupyter Notebooks, RStudio, Apache Spark, and R Tools for Visual Studio (RTVS).
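To make the file-based workflow concrete, here is a minimal sketch that writes a two-line script to a temporary file and then runs it with source(). The file location and the variable names are assumptions made for the example; the same file could also be run from a terminal with Rscript.

```r
# Write a small R program to a file.
script_path <- tempfile(fileext = ".R")   # hypothetical script location
writeLines(c(
    "values <- c(2, 4, 6, 8)",
    "average <- mean(values)"
), script_path)

# Run the file later; evaluating it in its own environment keeps the
# script's variables separate from the calling session.
script_env <- new.env()
source(script_path, local = script_env)
print(script_env$average)
```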

Jupyter Notebooks

Named for the core supported languages Julia, Python, and R, Jupyter Notebook is an open-source web application that allows the user to create and store documents containing live code, equations, visualizations, and text.12 Jupyter Notebooks are also gaining adoption outside of statistics and computer science curricula due to the proliferation of large data sets.6

Apache Spark

Apache Spark is a cluster computing framework for large-scale data processing that focuses on efficiency and fault tolerance.7 SparkR is an R package that allows R to integrate with the Apache Spark unified analytics engine. SparkR also supports distributed machine learning through MLlib.

RStudio

RStudio is an Integrated Development Environment (IDE) for R consisting of a console and an editor supporting direct code execution as well as tools for plotting, history, debugging, and workspace management.16

R Tools for Visual Studio

In Visual Studio 2017, Microsoft introduced R Tools for Visual Studio (RTVS), which gives data analysts tooling for developing R within Visual Studio. RTVS works alongside Microsoft R Open, Microsoft’s distribution of R, and the Microsoft R Server libraries for accelerated computation on datasets that do not fit into system memory.5

Packages and Graphics

Early in the life cycle of R, Ihaka started experimenting with R’s graphics, beginning with its color models and line textures. R uses a device-independent 24-bit model for colors that can be specified in hexadecimal, X Windows, and S-compatible modes.15 R can also render lines in several ways. Line textures can be defined through common names (e.g., dotted, dashed).15 They can also be defined as a string of segment lengths. For example, a segment definition of 52 translates to a pen down of 5 points (or pixels) followed by a pen up of 2 points (or pixels).15 As with colors, R has an S-compatible mode for line rendering that allows textures to be defined through an index into a set of line types.15

R’s functionality can also be extended with software packages that add almost anything from Markdown and ODBC integration to visualization utilities. Packages such as ggplot2 add static visualizations, while packages like leaflet support dynamic visualizations.5 R is known industry-wide for its graphics rendering capabilities.
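The color and line-texture options described above can be sketched with base R graphics alone. The plotted curves, the output file, and the specific colors are assumptions chosen for illustration; the plot is drawn to a temporary PDF so the sketch does not need a display.

```r
# A named color and a hexadecimal string resolve to the same RGB triple.
red_by_name <- col2rgb("red")
red_by_hex  <- col2rgb("#FF0000")

# Draw to a temporary PDF file (hypothetical location).
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

x <- seq(0, 10, length.out = 50)
# Line texture given by a common name.
plot(x, sin(x), type = "l", lty = "dashed", col = "#1F77B4",
     xlab = "x", ylab = "y")
# Line texture given as segment lengths: 5 units on, 2 units off.
lines(x, cos(x), lty = "52", col = "red")
# Line texture given as an index into the built-in set of line types.
lines(x, sin(x) / 2, lty = 3, col = "darkgreen")

invisible(dev.off())   # close the device so the file is written
```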

Popularity

In 2020, the Institute of Electrical and Electronics Engineers (IEEE) ranked R as the number 6 programming language.4 In addition, software professionals who responded to the 2020 Stack Overflow Developer Survey placed R among the most loved languages.17 R has become so popular with graduate-level students that Max Kuhn, the Associate Director of Nonclinical Statistics at Pfizer, says, “R has become a second language for people coming out of grad school now, and there’s an amazing amount of code being written for it.”2

Applications

R has found a devoted following of statisticians, engineers, and scientists without computer programming skills, and organizations like Google and Pfizer are using R for many additional workloads.2 In addition to professional workloads, R is very common in academic research and has been used to produce algorithms for text mining1, machine learning with Support Vector Machines (SVM)11, and the design and test of digital logic DNA systems.14

In 2019, de Lima et al. developed an algorithm with R to conduct patent analysis in order to identify the stage of technological development of photovoltaic panels. The team’s study found an increase in patent deposits resulting in increased availability of lower-cost and more efficient panels.1

In 2021, Sanghar et al. used R to implement an SVM in order to predict diabetes. An SVM is a supervised machine learning model, meaning that it learns its task from labeled training data. SVMs are most often used for classification and regression.11

In 2021, Marks et al. began looking at a new method of circuit design using synthetic DNA molecules as the substrate. The team extended an existing R package, DNAr, to aid in the construction of DNA strand displacement (DSD) systems that build circuits from synthetic DNA molecules. The aim of the extended package was to make the design and simulation of these DSD circuits more accessible.14

Wrapping Up

Though it started as a very small project with two main developers, R has gained vast popularity in the roughly 30 years since its introduction. Almost immediately, statisticians and engineers without programming backgrounds began using R for statistical computation and analysis. It quickly rose to favor in academia and has become a standard tool for data analysts and data scientists in the workforce. In addition to being a robust statistics environment, R can create and customize visualizations that represent the outcomes of these calculations. R has recently become one of the most desired languages among professional developers.

References

  1. Bogacz, M. (2021, March 1). COVID-19: Continents in relation to time. Kaggle. https://www.kaggle.com/michau96/covid-19-continents-in-relation-to-time/output
  2. de Lima, A., Argenta, A., Zattar, I., & Kleina, M. (2019). Applying Text Mining to Identify Photovoltaic Technologies. IEEE Latin America Transactions, 17(05), 727–733. https://doi.org/10.1109/tla.2019.8891940
  3. Hildebrandt, K., Panse, F., Wilcke, N., & Ritter, N. (2020). Large-Scale Data Pollution with Apache Spark. IEEE Transactions on Big Data, 6(2), 396–411. https://doi.org/10.1109/tbdata.2016.2637378
  4. Hornik, K. (1997, April 23). ANNOUNCE: CRAN. https://stat.ethz.ch/pipermail/r-announce/1997/000001.html
  5. Hornik, K. (2020, February). R FAQ. https://cran.r-project.org/doc/FAQ/R-FAQ.html
  6. IEEE. (2020). IEEE Spectrum. IEEE Spectrum: Technology, Engineering, and Science News. https://spectrum.ieee.org/
  7. Ihaka, R. (1998, May). Past and Future History. R. https://cran.r-project.org/doc/html/interface98-paper/paper.html
  8. Lam, J. (2019, March 20). Introducing R Tools for Visual Studio. Visual Studio Blog. https://devblogs.microsoft.com/visualstudio/introducing-r-tools-for-visual-studio-3/
  9. Marks, R. A., Vieira, D. K., Guterres, M. V., Oliveira, P. A., Fonte Boa, M. C., & Vilela Neto, O. P. (2021). Design and Test of Digital Logic DNA Systems. IEEE Design & Test, 1–1. https://doi.org/10.1109/mdat.2021.3069369
  10. Project Jupyter. (n.d.). Project Jupyter. https://jupyter.org/
  11. The R Foundation. (n.d.). R Project Contributors. R Contributors. https://www.r-project.org/contributors.html
  12. Reades, J. (2020). Teaching on Jupyter. REGION, 7(1), 21–34. https://doi.org/10.18335/region.v7i1.282
  13. RStudio. (2021). RStudio. https://www.rstudio.com/products/rstudio/
  14. Sanghar, M., Shukla, V. K., Verma, A., & Sharma, P. (2021). Implementation of Support Vector Machines Algorithm through R-Language for Diabetes Database Testing. 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence). https://doi.org/10.1109/confluence51648.2021.9377124
  15. Stack Overflow. (2021). Stack Overflow Developer Survey 2020. Stack Overflow Developer Survey. https://insights.stackoverflow.com/survey/2020
  16. Vance, A. (2009, January 7). Data Analysts Captivated by R’s Power. The New York Times. https://www.nytimes.com/2009/01/07/technology/business-computing/07program.html
  17. Varghese, L. S. (2021, April 9). Netflix Shows in R_CW. Kaggle. https://www.kaggle.com/lisasvarghese2037041/netflix-shows-in-r-cw
  18. Venables, B., Smith, D., & The R Core Team. (2021, April 1). An Introduction to R. Comprehensive R Archive Network. https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf