Analyzing a set of already assigned peak-lists for the protein S6, it is clear that there's more outliers than a normal distribution predicts, which justifies a deeper analysis of the data. However even if I assume that the error for e.g. all amid protons in the HNCA spectrum follows the same distribution, the samples variance might be different for the HNcaCO experiment.
A full analysis leaves me in this case with 45 different samples that most likely follow the same kind of distribution with differing parameters. So the question was how to continue from here. Luckily I found this blog post.
What the author did was to brute force all 80 distributions that scipy.stats offers and list the p and D values from a Kolmogorov-Smirnov test for a single sample. I ended up improving the code to support multiple samples, as well as displaying the log likelihood of the fit. (Any distribution that fits extremely poorly with a single sample is removed from the output)
The code can be found here: distribution-check.py
If all your samples are in the folder samples/ with the extension .txt, then simply run
python distribution-check.py samples/*.txt
As an example, here's the output for running the script on my 45 samples:
Apparently a normal distribution is not a very good 2-parameter representation of my samples, even though the simplicity of it might still be beneficial, so I will probably end up testing a few others. The -inf values probably occurs due to asymptotic behaviour at the mean and I would suggest just ignoring them, and use the KS test instead to determine if its a good fit
No comments:
Post a Comment