Shopee - Price Match Guarantee

Link : https://www.kaggle.com/competitions/shopee-product-matching/data

Dataset Description

 Finding near-duplicates in large datasets is an important problem for many online businesses. In Shopee’s case, everyday users can upload their own images and write their own product descriptions, adding an extra layer of challenge. Your task is to identify which products have been posted repeatedly. The differences between related products may be subtle while photos of identical products may be wildly different!

 As this is a code competition, only the first few rows/images of the test set are published; the remainder are only available to your notebook when it is submitted. Expect to find roughly 70,000 images in the hidden test set. The few test rows and images that are provided are intended to illustrate the hidden test set format and folder structure.

File and Data Field Descriptions

[train/test].csv - the training and test set metadata. Each row contains the data for a single posting. Multiple postings might have the exact same image ID but different titles, or vice versa.

  • posting_id - the ID code for the posting.

  • image - the image ID/md5sum.

  • image_phash - a perceptual hash of the image.

  • title - the product description for the posting.

  • label_group - ID code for all postings that map to the same product. Not provided for the test set.
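
As a rough illustration of how label_group relates to the matches a submission must predict, the sketch below builds the target match list for each training posting. It uses only the Python standard library. The sample rows are made up for illustration and are not taken from the real train.csv, and the helper name build_target_matches is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample in the train.csv column layout (values are illustrative only).
SAMPLE_TRAIN_CSV = """posting_id,image,image_phash,title,label_group
train_1,a.jpg,94974f937d4c2433,Paper Bag Victoria Secret,100
train_2,b.jpg,af3f9460c2838f0f,PAPER BAG VICTORIA SECRET,100
train_3,c.jpg,b94cb00ed3e50f78,Maling Ham Pork Luncheon Meat,200
"""


def build_target_matches(rows):
    """Map each posting_id to the sorted posting_ids that share its label_group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["label_group"]].append(row["posting_id"])
    # Every posting belongs to its own group, so it always matches itself.
    return {
        row["posting_id"]: sorted(groups[row["label_group"]])
        for row in rows
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_TRAIN_CSV)))
targets = build_target_matches(rows)
for posting_id, matches in targets.items():
    print(posting_id, "->", " ".join(matches))
```

With the real data, the in-memory sample would be replaced by opening train.csv; the grouping logic stays the same.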

[train/test]_images - the images associated with the postings.

sample_submission.csv - a sample submission file in the correct format.

  • posting_id - the ID code for the posting.

  • matches - Space-delimited list of all posting IDs that match this posting. Posts always self-match. Group sizes were capped at 50, so there’s no need to predict more than 50 matches.
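
To make the submission format concrete, here is a minimal sketch that writes a file with the posting_id and matches columns. It assumes a naive baseline in which two postings match only when their image_phash values are identical. That baseline is an assumption chosen for illustration, not a method prescribed by the competition. The sample rows and the helper names (format_matches, phash_baseline, write_submission) are hypothetical.

```python
import csv
import io
from collections import defaultdict

MAX_MATCHES = 50  # group sizes were capped at 50

# Made-up test rows; only the columns this sketch needs are included.
SAMPLE_ROWS = [
    {"posting_id": "test_1", "image_phash": "ecc292392dc7687a"},
    {"posting_id": "test_2", "image_phash": "ecc292392dc7687a"},
    {"posting_id": "test_3", "image_phash": "e9968f60d2699e2c"},
]


def format_matches(posting_id, candidates, limit=MAX_MATCHES):
    """Return a space-delimited match string that lists posting_id itself first."""
    ordered = [posting_id]
    for candidate in candidates:
        if candidate not in ordered:
            ordered.append(candidate)
    return " ".join(ordered[:limit])


def phash_baseline(rows):
    """Pair each posting with every posting that has an identical image_phash."""
    by_hash = defaultdict(list)
    for row in rows:
        by_hash[row["image_phash"]].append(row["posting_id"])
    return [
        (row["posting_id"],
         format_matches(row["posting_id"], by_hash[row["image_phash"]]))
        for row in rows
    ]


def write_submission(pairs, handle):
    """Write (posting_id, matches) pairs in the sample_submission.csv layout."""
    writer = csv.writer(handle)
    writer.writerow(["posting_id", "matches"])
    writer.writerows(pairs)


pairs = phash_baseline(SAMPLE_ROWS)
buffer = io.StringIO()
write_submission(pairs, buffer)
print(buffer.getvalue())
```

Placing the posting's own ID first guarantees the self-match survives the 50-item cap. A real solution would likely replace the exact-hash rule with image or text similarity.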