Stata 16 introduces tight integration with Python allowing users to embed and execute Python code from within Stata. In this talk, I will demonstrate new functionality we have been working on: calling Stata from within Python. Note that this functionality is not available yet and is still a work in progress.
The Python package pystata provides two ways to interact with Stata:
The magic commands can be used to access Stata and Mata interactively in an IPython kernel-based environment.
The API functions can be used to interact with Stata and Mata in a command-line Python environment.
The API functions can also be used together with the magic commands in the IPython environment. Both of them can be used with Stata's Function Interface (sfi) module to access Stata and Mata.
In Python, with this integration, you can now:
To get started, we need to configure the pystata package within Python so that it can be found and imported by Python. Suppose we have Stata installed in C:/Program Files/Stata/, it then can be initialized in Python as follows:
import sys
sys.path.append("C:/Program Files/Stata/utilities")
from pystata import config
config.init()
To illustrate the general usage of calling Stata from Python, we use the automobile data. The data has mileage rating and weight of 74 automobiles. The variables of interest in the data are mpg, weight, and foreign. The foreign variable assumes the value 1 for foreign and 0 for domestic automobiles. We wish to analysis the relationship among the mileage rating, weight, and whether the automobile is foreign or domestic.
%%stata
use https://www.stata-press.com/data/r16/auto, clear
describe
Then we obtain summaries of mpg and weight for the foreign and domestic cars.
%stata by foreign: summarize mpg weight
We visualize mpg and weight for each group of cars using the scatter plot.
%%stata
scatter mpg weight, by(foreign, total)
Then we fit a linear regression model of mpg on weight and foreign.
%%stata
regress mpg weight i.foreign
Next, we use margins to calculate the mean predicted values for various values of weight in increments of 1,000 between 2,000 and 5,000 and each group of cars. We then use marginsplot to show the results graphically.
%%stata
margins, at(weight=(2000(1000)5000)) over(foreign)
marginsplot, by(foreign) xlabel(, angle(forty_five))
Last, we create a partial-regression leverage plot for all the regressors using the avplots command.
%%stata
set scheme plottigblind
avplots
There are many ways to load data from Python into Stata's current dataset in memory. For example
We have data from the National Longitudinal Survey on young women’s wages reported from 1968 through 1988. This dataset is stored in a csv file named nlswork.csv.
The goal is to use Stata to fit a model of wage as a function of each woman’s age, job tenure, birth year, and level of education. We believe that random shocks that affect a woman’s wage also affect her job tenure, so we treat tenure as endogenous. As additional instruments, we use her union status, number of weeks worked in the past year, and a dummy indicating whether she lives in a metropolitan area.
The plan is to load the data using Pandas dataframe and fit a single-equation instrumental-variables regression via the ivregress command.
import pandas as pd
nlswork = pd.read_csv('nlswork.csv')
nlswork
Then we load the dataframe into Stata as current dataset and specify the labels to the variables of interest within Stata.
%%stata -d nlswork -force
label variable ln_wage "ln(wage/GNP deflator)"
label variable age "age in current year"
label variable birth_yr "birth year"
label variable grade "current grade completed"
label variable tenure "job tenure, in years"
label variable union "weeks unemployed last year"
label variable wks_work "weeks worked last year"
label variable msp "1 if married, spouse present"
describe
Next, we fit the model and push Stata's estimation results into Python, such as the coefficent vector e(b) and variance-covariance matrix e(V). The estimation results is stored in steret, which is a Python dictionary.
%%stata -eret steret
// fit the model using the gmm estimator
ivregress gmm ln_wage age c.age#c.age birth_yr grade ///
(tenure = union wks_work msp), wmatrix(cluster idcode)
// e() results
ereturn list
Below, we list the entire contents of the Python dictionary steret.
steret
You can also access specific elements of the dictionary. For example, you can access e(b) and e(V) by typing steret['e(b)'] and steret['e(V)'] in Python.
steret['e(V)']
You can also access the above matrix using the get() method of the Matrix class in the sfi module.
from sfi import Matrix
import numpy as np
np.array(Matrix.get('e(V)'))
Stata 16 introduced frames, allowing you to simultaneously work with multiple datasets in memory. The following example illustrates simple usage of multiple frames in Stata and how to switch frames between Stata and Python.
First, we load the iris datateset in Stata. This dataset is used in Fisher's (1936) article. Fisher obtained the iris data from Anderson (1935). The data consist of four features measured on 50 samples from each of three Iris species.
%%stata
use http://www.stata-press.com/data/r16/iris, clear
describe
label list species
Our goal is to build a classifier using those four features to detect the Iris type. We will use the Random Forest classification model within the scikit-learn Python package to achieve this goal. First, we split the data into the training and test datasets, and store the datasets in the training and test frames in Stata. The training dataset contains 80% of the observations and the test dataset contains 20% of the observations. Then we store the two frames into Python as Pandas dataframes with the same name.
%%stata -fouta training,test
// Split the original dataset into training and test
// dataset which contains 80% and 20% of observations respectively
splitsample, generate(svar, replace) split(0.8 0.2) show rseed(16)
// create two frames holding the the two datasets
frame put iris seplen sepwid petlen petwid if svar==1, into(training)
frame put iris seplen sepwid petlen petwid if svar==2, into(test)
Now we have 2 NumPy arrays in Python, training and test. Below we split each array into two sub-arrays to store the features and labels separately.
X_train = training[:, 1:]
y_train = training[:, 0]
X_test = test[:, 1:]
y_test = test[:, 0]
Then we use X_train and y_train to train the classification model.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=2, random_state=16)
clf.fit(X_train, y_train)
Next we use X_test and y_test to evaluate the performace of the training model. We also predict the species type of each flower and the probabilities that it belongs to the three species in the test dataset.
from sklearn import metrics
y_pred = clf.predict(X_test)
y_pred_prob = clf.predict_proba(X_test)
Next, in the test frame, we create a Byte variable irispr to store the predicted species types and three float variables to store the probabilities that each flower belongs to the three species types from the array y_pred_prob.
from sfi import Frame
fr = Frame.connect('test')
fr.addVarByte('irispr')
fr.addVarFloat('pr1')
fr.addVarFloat('pr2')
fr.addVarFloat('pr3')
fr.store('irispr', None, y_pred)
fr.store('pr1 pr2 pr3', None, y_pred_prob)
In Stata, we change the current working frame to test. We attach the value label species to irispr, and use the tabulate command to display a classification table. We also list the flowers that have been misclassified.
%%stata
frame change test
label values irispr species
label variable irispr predicted
tabulate iris irispr, row
list iris irispr pr1 pr2 pr3 if iris!=irispr
This example will illustrate how to use the %%mata magic command to combine Python’s capabilities with features of Mata, Stata’s matrix programming language.
For illustrative purposes, we generate two random NumPy arrays in Python, X and y, representing the observations of the independent variables and the response variable. Our goal is to fit a linear regression and get the coefficients $b$, their standard errors, $t$ statistics and $p$-values.
import numpy as np
np.random.seed(16)
# 100,000 observations and 7 variables
X = np.random.random((100000, 7))
y = np.random.random((100000, 1))
Then we push X and y to Mata using the magic command %%mata, and fit the model.
%%mata -m X,y
//add the constant term to X
cons = J(100000,1,1)
X = (X, cons)
// calculate b
b = invsym(X'X)*X'y
// calcuate residuals and square of s
e = y - X*b
n = rows(X)
k = cols(X)
s2 = (e'e)/(n-k)
// calculate V
V = s2*invsym(X'X)
Next we calculate the standard errors, $t$ statistics and $p$-values. Note that the standard errors are just the square roots of the diagonals of V.
%%mata
se = sqrt(diagonal(V))
tstat = b:/se
pval = 2*ttail(n-k, abs(b:/se))
Next, we store the above results into a Mata matrix ols_est and push it back to Python.
%%mata -outm ols_est
ols_est = (b, se, tstat, pval)
ols_est
In Python, we now have a NumPy array named ols_est with the same contents as the above Mata matrix.
ols_est
We can compare ols_est with the results reported by Stata's regress command. Below, we will use the %%stata magic command to execute it. Before we do that, we combine X and y into one NumPy array called data.
data = np.concatenate((X, y), axis=1)
Then we push data to Stata. When reading a NumPy array into Stata, the variables are named v1, v2, ... by default, so we rename them. Then we fit the regression model.
%%stata -d data -force
describe
// rename the last variable to y
rename v8 y
// rename the other variables, prefixing them with an x
rename v* x*
regress y x1-x7
In addition to the magic commands, you can also call Stata using the API functions defined in the stata module. For illustration purpose, we use the Boston housing price dataset from the scikit-learn package.
from sklearn import datasets
bos = datasets.load_boston()
print(bos.DESCR)
We store the dataset into a Pandas dataframe named boston.
import pandas as pd
boston = pd.DataFrame(bos.data)
boston.columns = bos.feature_names
boston['MEDV'] = bos.target
boston.head()
Then we load the Pandas dataframe into Stata using the dataframe_to_data() function of the stata module.
from pystata import stata
stata.dataframe_to_data(boston, force=True)
Next, we specify the variable labels and attach a value label to CHAS, which indicates whether the tract bounds the Charles River, using the run() method.
stata.run('''
label variable MEDV "Median value of owner-occupied homes in $1000s"
label variable CRIM "Town crime rate, per capita"
label variable RM "Average number of rooms per dwelling"
label variable CHAS "Tract bounds the Charles River (= 1 if tract bounds river; 0 otherwise)"
label define bound 1 "Yes" 0 "No", replace
label values CHAS bound
''')
Afterwards, we fit a linear regression model of the median home value (MEDV) on the town crime rate (CRIM), the average number of rooms per dwelling (RM), and whether the house is close to the Charles River (CHAS). Then we predict the median home values, storing these values in the variable midval.
stata.run('''
regress MEDV CRIM RM i.CHAS
predict midval
''')
Then we use the get_ereturn() function to push all the e() results to Python as a dictionary named steret. And we use the data_to_array() function to store the predicted values (midval) in a NumPy array named stpred.
steret = stata.get_ereturn()
stpred = stata.data_to_array('midval')