What is the meaning of “E-value” in the BLAST search?

After reading many pages, I still do not understand the definition. Can someone explain it to me in simple words? The BLAST results report an expectation or expect value "E" (often called an E score, E-value or e-value), which assesses the significance of the HSP score for an ungapped local alignment.

If a lower E value means results closer to the query sequence, how does it differ from the "Sequence Identity Cutoff"?

You can find the definition here:

the number of hits one can "expect" to see by chance when searching a database of a particular size.

Also read some of the background information given to understand the meaning:

It decreases exponentially as the Score (S) of the match increases. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. The lower the E-value, or the closer it is to zero, the more "significant" the match is. However, keep in mind that virtually identical short alignments have relatively high E values. This is because the calculation of the E value takes into account the length of the query sequence. These high E values make sense because shorter sequences have a higher probability of occurring in the database purely by chance. For more details please see the calculations in the BLAST Course. The Expect value can also be used as a convenient way to create a significance threshold for reporting results. You can change the Expect value threshold on most BLAST search pages. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.
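The behaviour described in the quote follows from the Karlin-Altschul formula for ungapped local alignments, E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the raw score. Here is a minimal sketch of that formula. The values of `lam` and `k` are illustrative assumptions (roughly the ungapped BLOSUM62 parameters), `DB_LEN` is a hypothetical database size, and real BLAST uses edge-corrected effective lengths, so its reported E values will differ:

```python
import math

def expect_value(score, query_len, db_len, lam=0.318, k=0.134):
    """Karlin-Altschul E value: E = K * m * n * exp(-lambda * S).

    lam and k are illustrative assumptions; BLAST chooses them per
    scoring system and uses effective (edge-corrected) lengths.
    """
    return k * query_len * db_len * math.exp(-lam * score)

DB_LEN = 1_000_000_000  # hypothetical database size in residues

# E shrinks exponentially as the raw score grows ...
for s in (30, 40, 50, 60):
    print(f"score={s:3d}  E={expect_value(s, 300, DB_LEN):.3g}")

# ... and grows linearly with the size of the database searched.
for n in (DB_LEN, 10 * DB_LEN):
    print(f"db_len={n:.0e}  E={expect_value(50, 300, n):.3g}")
```

A short alignment can only reach a small score S, even when it is a perfect match, which is why its E value stays high.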

If you want to understand the calculation, I highly recommend reading the BLAST course.

Identity is not the same as the E value. I will try to explain this as simply as possible, so let's suppose a sequence:

query sequence: ATGCCGTGC

When you BLAST this (of course your real sequence would be much longer), you can get the hit ATGCCGTGC, which has 100% identity. You would probably think: nice, that's the sequence I was looking for. However, this would be a mistake! Maybe this sequence is present many times in the database and this is just a random match. This is the point where the E value comes in. The E value here would be pretty high, because you have a short sequence and the chance of matching it in a big database is really high, so you could get an E value of, let's say, 10. This means you would expect about 10 matches with a similar score purely by chance when searching a database of the current size. In other words, this match is probably random!

(NOTE: most of the time, 100% identity does mean you have the right sequence; I picked 100% only to make the difference clear. At lower identities the E value is much more important. For example, when several different matches all have 70% identity, it is important to look at the E value to decide which match is the "right" one for your purpose.)

So it's much better to decide whether a match is "significant" based on an E value cut-off than on the identity. Further, the choice of the E value cut-off is completely yours, as WYSIWYG♦ said in answer to a cut-off question:

E-value refers to the expected number of random hits for a given alignment score. The smaller it is, the more reliable your match is. There is no hard and fast rule for an e-value cutoff. You can choose whatever you want depending on the level of stringency that you require. But you should note that for short sequences (< 30 nt) there is always a higher likelihood of random matches. In such cases it is practical to relax the e-value cutoff.
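To get a feel for why short sequences match by chance so easily, here is a rough back-of-the-envelope sketch. It is not how BLAST computes E values. It only counts the expected number of exact occurrences of one fixed query in random sequence, assuming independent and equally frequent bases (a simplification) and a hypothetical database of 1e9 bases:

```python
def expected_chance_matches(query_len, db_len, alphabet_size=4):
    """Expected exact occurrences of one fixed query in random sequence.

    Assumes independent, equally frequent letters (real databases are
    not uniform), so treat the result as an order-of-magnitude guide.
    """
    positions = max(db_len - query_len + 1, 0)  # possible start positions
    return positions / alphabet_size ** query_len

DB_LEN = 1_000_000_000  # hypothetical database size in bases

for length in (9, 15, 20, 30):
    print(f"length={length:2d}  expected chance matches="
          f"{expected_chance_matches(length, DB_LEN):.3g}")
```

Under these assumptions a 9-base query like the one above is expected to occur thousands of times by chance, while a 30-base query is expected far less than once.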

Short simple example: both of these sequences have the same identity but a different E value. What does this mean? Both sequences match the same proportion (57%) of the amino acids; however, based on the E value you can say, just as you said, that the first match is less likely to be a false positive.

I want to note one thing: always take a look at the biological significance of your result. E.g. you extract proteins from a red blood cell and find hits to red blood cell proteins with a high E value and hits to nerve cell proteins with a low E value. It's clear that, despite the high E value, the red blood cell proteins are probably more biologically significant.