GSOC'19 TensorFlow Datasets
Summary
I worked on TensorFlow Datasets (TFDS) with TensorFlow developers over the summer, and working with such good developers was a great experience for me.
I worked both on my own objectives and on their ideas.
Objectives
- Kaggle downloader visual plugin for Jupyter Notebook. (#447)
- Fake data generator. (#785)
- gsutil support for downloading GCS datasets. (#627)
- Automatic dataset visualization. (#714)
- More detailed documentation.
- Follow project issues, fix bugs, add requested datasets.
- If I still have free time after all of these, help my fellow GSOC students with their projects.
Note: The objectives were completed, but they were not merged for reasons outside my control.
Some Resources
TFDS GSOC Meeting Notes - We were in touch every day and held video calls to tell each other what we were doing, discuss ideas, and go through repo issues.
GSOC TIMELINE - With the advice of my mentor, I kept a weekly schedule that contained almost everything I did.
- Researching - I learned a lot of things I had not come across in my studies.
- Issues - I opened, assigned, and managed a lot of issues.
- Docs - I wrote documents for the TFDS team to present and discuss my ideas in detail.
- Pull Requests - I worked on many pull requests, large and small.
- Reviews - I reviewed new pull requests to help fix bugs.
Researching
- Jupyter Notebooks Extensions
- Gsutil
- TFRecords
- Data validation
- FACETS
- Data Visualization
- Auto-generation of fake data
- Threading
- Requests: HTTP for Humans (python lib)
- Urllib3 (python lib)
Issues
- Better Sequence display in documentation #689
- Allow TFDS datasets defined outside TFDS #704
- Dataset Versioning #721
- day2night #752
- Direct onedrive url (for Resisc45) #631
- download mnist using python 3.7.4 #769
- Loading "super_glue/copa" fails #797
- Cannot download all datasets except mnist using command tfds.load(). #807
- Can't download IMDB dataset from Microsoft Windows #800
- [data request] binary-mnist #758
- Unit testing does not work on Windows #817
- Kaggle silently compresses some archives #844
Docs
Pull Requests
- Add kaggle downloader extension to jupyter notebook #447
- Add gsutil support to downloader. #627
- Add the missing links in index. #698
- Add a comment warning at the top of the generated .md files #697
- Expand Sequence list #692
- FACETS Visualization #714
- Delete unfound reference `from __future__ import google_type_annotations` #718
- Broken link on add_dataset.md #730
- Auto fake data generator #785
- Fix checksum input #786
- Add dataset-name to download_dir. #795
- Optimizing Imports with Pycharm #796
- Change `tf.io.gfile` instead of os functions #804
- Add tfds.add_checksums_dir #834
- Launch S3 on all text datasets. #864
- Update translate dataset to s3 #870
- Add usage of checksum_dir and EXAMPLE_DIR doc. #873
- Fix DatasetBuilderTestCase fake example dir path #879
- Fix broken link on README.md. #888
- Add versions to list builders. #897
- Add translate/wmt S3 Version #923
- Better Download Experience #921
Reviews
- coco2017 and coco2017_panoptic #716
- Read directly from archives on datasets with many records. #701
- mnist dataset falls back to its original website #713
- Implemented Mozilla Common Voice Dataset (Multilingual) #161
- Added Stanford Online Products Dataset #238
- FACETS Visualization #714
- Expand Sequence list #692
- Improving documentation in text_feature.py #750
- Add PatchCAMELYON dataset #258
- Small fix for assert statement in Flores BuilderConfig #763
- Adding examples to DatasetBuilder.as_dataset() #784
- Use TFDV to compute statistics in TFDS #782
- Added a new Dataset in translate directory #793
- Add Quickdraw Sketch RNN Dataset #361
- Added Dep of Tensorflow_io.lmdb to LSUN #468
- Add `show_examples(ds)` option for image datasets. #670
- Include __version__ in __all__ #798
- add mini_imagenet dataset #263
- Add binarized_mnist to TFDS. #809
- add dataset: yelp_polarity_review #582
- Would like to contribute a new Dataset to translate directory #814
- Add AFLW2000-3D Dataset #359
- Added Mura Dataset #397
- Added dataset for Cars196 –Issue 202 #294
- Fixes #207: Added Support for Visual Dialog Dataset. #307
- Added Deepweeds dataset #370
- Add initial unit test and test data #391
- Add lfw #379
- MIT Scene Parse 150 dataset #398
- Added support for Cartoon Set #436
- Add Fruits360 #835
- add cifar-10.1 to tfds #839
- Add english-tamil parallel corpus to translate #818
- Fixes #317: Added Support for Stanford Question Answering Dataset - 2.0 #323
- add wider_face dataset #935
Future Work
I still have some work in progress, and I'm looking forward to future releases so I can continue with some of my PRs.
Special Thanks
Working with the TensorFlow developers on TensorFlow Datasets was a great experience for me. Being assigned tasks that were sometimes quite difficult gave me the chance to expand my knowledge and build confidence in my abilities. I will never forget this summer, and I will remember it with love.
Thank you {
Etienne Pot,
Pierre Ruyssen,
Marcin Michalski
}
and thank you to Google for giving us these great opportunities!
Recep Ahmet SARITEKIN