# 
2. Hash

By [rojeck](https://paragraph.com/@rojeckluo) · 2021-12-24

---

    Using hash to assign randomly, a given row can have the same origin when it is salted. Locally, this disperses the grip between RegionServers and allows prediction during read operations. The deterministic hash (deterministic hash) can reconstruct the complete row key on the client side, and can use the get operation to directly obtain the desired row.
    
    For example, after processing the above three row keys (unsalted), here we use the MD5 hash algorithm, and the results are as follows
    
    f18a79a8eb39267173fd0d113e3282f4
    
    277ba32a1610268cdb7733192010c127
    
    bd1258481401ea1be6377c5aaeae3a1f
    

!\[In the 16 partitions in the above figure, the data basically achieves a balanced distribution. When the amount of data becomes larger and larger, the balance of the partitions will be more balanced. The two commonly used rowkey design methods introduced above. Since data storage in HBase is in the form of kv, if the same rowkey (except for the built-in version) is inserted into the same column of the same table in HBase, the original data will be overwritten, so in order to ensure the uniqueness of the rowkey, in the actual design We may combine multiple design methods to achieve the maximum optimization design of rowkey, such as MD5+salt.

Data reading of HBase mainly uses rowkey to query and obtain data. We can use Hive, Phoenix, etc. to obtain data through SQL, or store index data in Solr, ElestaticSearch, and query data indirectly, but this method is not passed. Rowkey has high performance for directly querying data, but the real-time performance is not high, so we should include more available information when designing rowkey. Here we refer to the design rules mentioned in the book "The Road to Big Data-Alibaba Big Data Practice"\]([https://images.mirror-media.xyz/publication-images/LSQwjcetdK9Fi0u1d26y7.png?height=357&width=549](https://images.mirror-media.xyz/publication-images/LSQwjcetdK9Fi0u1d26y7.png?height=357&width=549))

Design rule: MD5+main dimension+dimension identifier+sub-dimension 1+time dimension+sub-dimension 2

For example: MD5 top four digits of seller ID + seller ID + app + first-level catalog + date + second-level catalog

Using the first four digits of MD5 as the first part of the rowkey, the data can be hashed to balance the server load and avoid hot issues. The seller ID is required for query, which narrows the scope of our scan and can query the data we need faster.

Finally, the length of the rowkey will also affect our performance. Since HBase is a columnar database, if the length of the rowkey is doubled, the overall storage capacity will increase exponentially. Therefore, the design of rowkey does not have a fixed pattern. In actual practice, we need to refer to various factors to achieve the optimization of rowkey.

---

*Originally published on [rojeck](https://paragraph.com/@rojeckluo/2-hash)*
