# DataFrame Column Descriptions

This document serves as a guide for understanding the columns in the provided datasets. Each column is explained below:

## names
- Represents the name of each entry.
- "comp" refers to compounds from the Bordwell dataset.
- "shen" refers to compounds from the Shen dataset.
- "ibond" refers to compounds from the iBonD dataset.

## smiles
- Contains SMILES notation for the neutral molecule.

## pka_exp
- The experimental pKa value.

## ref
- Reference for the experimental pKa from the Bordwell dataset.

## comment
- Additional comments.

## method
- The experimental method from which the pKa was obtained, information extracted from iBonD.

## ref_link
- Link to the reference, information extracted from iBonD.

## ref_title
- Title of the reference, information extracted from iBonD.

## name_shen
- Name of the molecule as it appears in the Shen dataset.

## outlier
- True if the entry is considered an outlier. Used to filter out compounds for LFER.

## outlier_note
- Notes about why an entry is considered an outlier.

## lst_names_deprot
- List of names for each deprotonation site.

## ref_mol_smiles_map
- Mapped SMILES of the reference/neutral molecule.

## lst_smiles_map_deprot
- List of mapped SMILES for each deprotonation site.

## lst_smiles_deprot
- List of SMILES for each deprotonation site.

## lst_atomsite_deprot
- List of atom sites for each deprotonation site.

## lst_atomindex_deprot
- List of atom indices for each atom in the molecule.

## cm5
- List of CM5 charges for each atom in the molecule.

## descriptor_vector
- List of descriptor vectors for each atom in the molecule.

## mapper_vector
- List of mapper vectors for each atom in the molecule.

## train_test
- Indicates if the molecule is part of the training set or test set.

## fold1 - fold5
- Fold assignments for cross-validation, with each fold representing a different subset of the data used for validation.

## gfn_method_xtb
- Method used for the xTB calculation.

## solvent_model_xtb
- Solvent model used for the xTB calculation.

## solvent_name
- Name of the solvent.

## e_xtb_neutral
- Energy for the neutral molecule calculated with xTB.

## e_dft_neutral
- Energy for the neutral molecule calculated with DFT.

## lst_e_xtb_deprot
- List of energies for each deprotonation site calculated with xTB.

## lst_e_rel_xtb
- List of relative energies for each deprotonation site calculated with xTB.

## lst_e_dft_deprot
- List of energies for each deprotonation site calculated with DFT.

## lst_e_rel_dft
- List of relative energies for each deprotonation site calculated with DFT.

## e_rel_min_dft
- Minimum relative energy calculated with DFT.

## lst_pka_lfer
- pKa values calculated with LFER.

## pka_min_lfer
- Minimum pKa value calculated with LFER.

## atomsite_min_lfer
- Atom site for the lowest pKa value calculated with LFER.

## atom_lowest_lfer - atom_lowest_lfer_2
- Lists containing binary indicators for the lowest pKa values and those within an absolute difference of 1 or 2 from the lowest pKa value, calculated with LFER.

## lst_pka_pred_reg - atom_lowest_pred_reg2
- Similar to the LFER columns, but for pKa values predicted with a regression model (for both the test set and when trained with the full dataset).

## lst_pka_pred_reg_full - atom_lowest_pred_reg_full2
- Lists and indicators for pKa predictions using regression models trained with the full dataset, including minimum values and proximity to the lowest predicted values.
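
The binary indicator columns described above (for example atom_lowest_lfer and its variants) can be illustrated with a minimal sketch. It assumes that each indicator list is aligned with the per-site pKa list (such as lst_pka_lfer) and that "within an absolute difference of 1 or 2" means an inclusive threshold; neither detail is confirmed by this document. The function name `lowest_pka_indicators` and the example values are hypothetical.

```python
from typing import List


def lowest_pka_indicators(pkas: List[float], tolerance: float = 0.0) -> List[int]:
    """Return 1 for each site whose pKa lies within `tolerance` of the minimum, else 0.

    Assumption: the comparison is inclusive (<=) and the output is aligned
    with the input list of per-site pKa values.
    """
    if not pkas:
        return []
    pka_min = min(pkas)
    return [1 if abs(p - pka_min) <= tolerance else 0 for p in pkas]


# Hypothetical pKa values for four deprotonation sites of one molecule.
example_pkas = [18.5, 19.25, 20.0, 25.0]

exact = lowest_pka_indicators(example_pkas)          # lowest site only
within_1 = lowest_pka_indicators(example_pkas, 1.0)  # sites within 1 pKa unit
within_2 = lowest_pka_indicators(example_pkas, 2.0)  # sites within 2 pKa units

print(exact)     # [1, 0, 0, 0]
print(within_1)  # [1, 1, 0, 0]
print(within_2)  # [1, 1, 1, 0]
```

If the data are loaded into a pandas DataFrame and lst_pka_lfer holds Python lists, the same function could be applied row by row, for example with `df["lst_pka_lfer"].apply(lowest_pka_indicators)`. Whether the stored columns were generated this way is an assumption.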