mle_utils_nd.py
- InputData(ListofDictionaries)
Compile a dictionary of the data and metadata given a list of dictionaries.
- Parameters:
ListofDictionaries (list[dict]) – A list of dictionaries of length ‘d’, where each dictionary corresponds to a dimension of the data. For example,
ListofDictionaries = [RadiusDict, MassDict,...].
The input list should contain dictionaries, each of which contains the following fields:
Data: Data (observables) that are modelled assuming an asymmetric normal distribution. Length ‘L’. The normal distribution consists of two half normals, unnormalized, where the observed measurement sits at the 50% percentile
LSigma: Sigma value for the lower normal distribution. Same scale as Data (log/linear). Length ‘L’
USigma: Sigma value for the upper normal distribution. Same scale as Data (log/linear). Length ‘L’
Max: log10 of the upper bound for the dimension to be fit. Can leave as np.nan, in which case = np.log10(1.1*np.max(ndim_data[d]))
Min: log10 of the lower bound for the dimension to be fit. Can leave as np.nan, in which case = np.log10(0.9*np.min(ndim_data[d]))
Label: Axes label (string) to be used for this dimension. E.g. ‘Radius ($R_{oplus}$)’ or ‘Pl Insol ($S_{oplus}$)’
Char: Symbol (character) to be used for this dimension. E.g. ‘r’ or ‘s’
- Returns:
DataDict – A dictionary containing all of the data and metadata.
- Return type:
dict
The output dictionary
DataDictcontains the following fields:ndim_data: A 2-d array of size (d = number of dimensions, L = number of data points) containing the data.
ndim_sigma: A 2-d array of size (d, L) containing the uncertainties of each data point (assuming symmetric error bars, taken as the average of the upper and lower uncertainties).
ndim_LSigma: A 2-d array of size (d, L) containing the lower uncertainties of each data point.
ndim_USigma: A 2-d array of size (d, L) containing the upper uncertainties of each data point.
ndim_bounds: A 2-d array of size (d, L) of size (number of dimensions, 2) containing the bounds of the data in each dimension.
ndim_char: A list of character strings representing the variable in each dimension.
ndim_label: A list of character strings for labeling the variable in each dimension.
ndim: The number of dimensions.
DataLength: The number of data points.
DataSequence: A 2-d array of size (d, 50) with a uniform sequence for each dimension between the lower and upper bounds. Note: This is uniform in log10-space since the bounds are in log10 space.
- MLE_fit(DataDict, deg_per_dim, OutputWeightsOnly=False, CalculateJointDist=False, save_path=None, verbose=2, UseSparseMatrix=False)
Perform maximum likelihood estimation to find the weights for the beta density basis functions. Also, use those weights to calculate the conditional density distributions.
See Ning et al. 2018, Sec 2.2, Eq 9.
- Parameters:
DataDict (dict) – A dictionary containing the data, as returned by
InputData().deg_per_dim (array[int]) – The number of degrees per dimension.
OutputWeightsOnly (bool, default=False) – Whether to only output the (unpadded) weights (and log likelihood).
CalculateJointDist (bool, default=False) – Whether to also calculate the join distribution.
save_path (str, optional) – The folder name (including path) to save results in.
verbose ({0,1,2}, default=2) – Integer specifying verbosity for logging: 0 (will not log in the log file or print statements), 1 (will write log file only), or 2 (will write log file and print statements).
UseSparseMatrix (bool, default=True) – Whether to use a sparse matrix for the C_pdf calculation (to reduce memory for large problem sizes).
- Returns:
outputs (dict) – The dictionary containing the outputs, if
OutputWeightsOnly=False.unpadded_weight (array[float]) – A 1-d array of unpadded weights (of length = product(deg_i - 2) where i in indexes through the dimensions), if
OutputWeightsOnly=True.n_log_lik (array[float]) – A 1-d array of log likelihood values, if
OutputWeightsOnly=True.
The output dictionary
outputscontains the following fields:UnpaddedWeights: Same as the
unpadded_weightabove.Weights: A 1-d array of weights that has been padded with zeros around the edges of each dimension (i.e. the first and last element of each 1-d slice).
loglike: Same as the
n_log_likeabove.deg_per_dim: Same as the input.
DataSequence: The ‘DataSequence’ field of
DataDict; seeInputData().C_pdf: See the output of
calc_C_matrix().EffectiveDOF: The effective number of degrees of freedom (i.e. number of important weights) based on Kish’s effective sample size.
aic: The Akaike information criterion (based on the effective degrees of freedom).
aic_fi: The Akaike information criterion (based on the rank of the Fisher information matrix).
JointDist: The joint distribution (only returned if
CalculateJointDist=True); see the output ofCalculateJointDistribution().
- calc_C_matrix(DataDict, deg_per_dim, save_path, verbose, SaveCMatrix=False, UseSparseMatrix=False)
Integrate the product of the normal and beta distributions and then take the Kronecker product.
Refer to Ning et al. 2018 Sec 2.2 Eq 8 and 9.
- Parameters:
DataDict (dict) – A dictionary containing the data, as returned by
InputData().deg_per_dim (array[int]) – The number of degrees per dimension.
save_path (str) – The folder name (including path) to save results in.
verbose ({0,1,2}) – Integer specifying verbosity for logging (see
utils_nd._logging()).SaveCMatrix (bool, default=False) – Whether to save the C_pdf to a text file.
UseSparseMatrix (bool, default=False) – Whether to use a sparse matrix for the C_pdf calculation (to reduce memory for large problem sizes).
- Returns:
C_pdf – A 2-d matrix (or ‘lil_matrix’, if UseSparseMatrix=True) of shape = (N x product(d-2)), where ‘N’ is the number of data points and ‘d’ is the number of degrees in each dimension. For example in two dimensions this would be (N x ((d1-2)x(d2-2))).
- Return type:
array[float] or lil_matrix
- InvertHalfNormalPDF(x, p, loc=0.0)
Invert the (half-)normal PDF to find the standard deviation that gives ‘1 - (1-p/2)’ as the cumulative probability at ‘x’ (i.e., P(> x) = 1-p).
For example, if you care about 99.7% upper limits, set p=0.997. The function will invert the normal distribution to find the quantile at ‘1 - (1-0.997/2) = 0.9985’, because the upper limits are being approximated as a half-normal, centered at
loc=0.- Parameters:
x (float) – The point at which ‘P(>x) = 1-p’ for the normal distribution.
p (float) – The probability of being less than
xfor the normal distribution. Must be between 0 and 1.loc (float, default=0.) – The location (mean) of the normal distribution.
- Returns:
scale – The scale (standard deviation) of the normal distribution that gives probability
1-p/2abovex.- Return type:
float
- IntegrateNormalBeta(data, data_Sigma, deg, degree, a_max, a_min, Log=False)
Numerically integrate the product of the normal and Beta distributions.
Refer to Ning et al. 2018 Sec 2.2, Eq 8.
- Parameters:
data (float) – The observed data point.
data_Sigma (float) – The uncertainty (1-sigma) in the data point.
deg (float) – The degree parameters. Transforms to the shape parameters of the Beta distribution (see
_PDF_NormalBeta()) viashape1 = degree,shape2 = deg - degree + 1.degrees (float) – The degree parameters. Transforms to the shape parameters of the Beta distribution (see
_PDF_NormalBeta()) viashape1 = degree,shape2 = deg - degree + 1.a_max (float) – The maximum and minimum values defining the integration bounds.
a_min (float) – The maximum and minimum values defining the integration bounds.
Log (bool, default=False) – Whether to integrate in log space (for the normal distribution).
- Return type:
The integral of the product of the normal and Beta distributions.
- IntegrateDoubleHalfNormalBeta(data, data_USigma, data_LSigma, deg, degree, a_max, a_min, Log=False)
Numerically integrate the product of the two half normal and Beta distributions.
Includes the same parameters as
IntegrateNormalBeta`(), exceptdata_Sigma(the 1-sigma uncertainty in the data point) is replaced withdata_USigmaanddata_LSigma(the upper and lower 1-sigma uncertainties, respectively).- Return type:
The integral of the product of the two half normal and Beta distributions.
- calculate_conditional_distribution(ConditionString, DataDict, weights, deg_per_dim, JointDist, MeasurementDict)
Calculate a conditional distribution given a joint distribution.
- Parameters:
ConditionString (str) – A string specifying the dimensions to be conditioned on, e.g. in the form of ‘X|Z’ where X is conditioned on Z. More than one dimension can be conditioned on (e.g., ‘X,Y|Z’, ‘X|Y,Z’).
DataDict (dict) – The dictionary containing the data. See the output of
InputData().weights (array[float]) – A 1-d array of weights.
deg_per_dim (array[int]) – The number of degrees per dimension.
JointDist (array[float]) – The joint distribution.
MeasurementDict (dict) – A dictionary containing the data points in the conditioned dimensions (e.g., ‘Z’) at which to compute the conditional distributions, of the form ‘{‘Z’: [[z1,…], [[z1_l, z1_u],…]]}’ where z1 is a data point (log value) and z1_l, z1_u are the lower and upper uncertainties on z1 (can also be set to nan).
- Returns:
ConditionalDist (array[float]) – The conditional distributions at each point in MeasurementDict.
MeanPDF (array[float]) – The mean of the PDF at each point in MeasurementDict.
VariancePDF (array[float]) – The variance of the PDF at each point in MeasurementDict.
- CalculateJointDistribution(DataDict, weights, deg_per_dim, save_path, verbose)
Calculate the joint distribution in up to 4D.
Refer to Ning et al. 2018 Sec 2.1, Eq 7.
- Parameters:
DataDict (dict) – The dictionary containing the data. See the output of
InputData().weights (array[float]) – A 1-d array of weights.
deg_per_dim (array[int]) – The number of degrees per dimension.
save_path (str) – The folder name (including path) to save the logging outputs in.
verbose ({0,1,2}) – Integer specifying verbosity for logging (see
utils_nd._logging()).
- Returns:
Joint (array[float]) – The joint distribution.
indv_pdf_per_dim (list[array[float]]) – The individual PDF for each point in the sequence in
DataDict. This is a list of 1D vectors, where the number of vectors in the list is ‘ndim’ and the number of elements in each vector is the number of degrees for that dimension (given bydeg_per_dim).
- TensorMultiplication(A, B, Subscripts=None)
Calculate the tensor contraction of A and B.
- Parameters:
A (array[float]) – The n-dimensional arrays/tensors.
B (array[float]) – The n-dimensional arrays/tensors.
Subscripts (str, optional) – A string specifying the indices to contract, e.g. ‘ijk,klm -> ijlm’. If None, will calculate along the last dimension of
Aand the first dimension ofB, which must match in size. I.e.,C[i,j,...,l,...,n] = A[i,j,...,k] * B[k,l,...n].
- Return type:
The tensor product of
AandB.
Note
If dimension ‘n’ of the resulting tensor product is length 1, it is reshaped to reduce that dimension.
- rank_FI_matrix(C_pdf, w)
Compute the Rank of the Fisher Information Matrix as an estimate of the degrees of freedom in calculating the AIC.
- Parameters:
C_pdf (array[float]) – The 2-d array of as returned by
calc_C_matrix(), with dimensions [n, (deg-2)^2].w (array[float]) – The 2-d array of weights, with dimensions [deg-2, deg-2].
- Returns:
Rank – The rank of the Fisher Information matrix.
- Return type:
int
- NumericalIntegrate2D(xarray, yarray, Matrix, xlimits, ylimits)
Numerically integrate a 2-d array (e.g. a joint probability distribution, for normalization). First fits a bivariate spline to the 2-d grid of sampled points (
Matrix) before integrating.- Parameters:
xarray (array[float]) – The 1-d arrays defining the points along each dimension where the 2-d array
Matrixis sampled.yarray (array[float]) – The 1-d arrays defining the points along each dimension where the 2-d array
Matrixis sampled.Matrix (array[float]) – The 2-d array to be integrated over.
xlimits (tuple[float]) – The integration bounds (min, max) along each dimension.
ylimits (tuple[float]) – The integration bounds (min, max) along each dimension.
- Returns:
Integral – The integral over the
Matrixbetween the bounds given byxlimitsandylimits.- Return type:
float