Importing data with genfromtxt ============================== https://numpy.org/doc/2.3/user/basics.io.genfromtxt.html NumPy provides several functions to create arrays from tabular data. We focus here on the genfromtxt function. In a nutshell, genfromtxt runs two main loops. The first loop converts each line of the file in a sequence of strings. The second loop converts each string to the appropriate data type. This mechanism is slower than a single loop, but gives more flexibility. In particular, genfromtxt is able to take missing data into account, when other faster and simpler functions like loadtxt cannot. Note When giving examples, we will use the following conventions: import numpy as np from io import StringIO Defining the input The only mandatory argument of genfromtxt is the source of the data. It can be a string, a list of strings, a generator or an open file-like object with a read method, for example, a file or io.StringIO object. If a single string is provided, it is assumed to be the name of a local or remote file. If a list of strings or a generator returning strings is provided, each string is treated as one line in a file. When the URL of a remote file is passed, the file is automatically downloaded to the current directory and opened. Recognized file types are text files and archives. Currently, the function recognizes gzip and bz2 (bzip2) archives. The type of the archive is determined from the extension of the file: if the filename ends with '.gz', a gzip archive is expected; if it ends with 'bz2', a bzip2 archive is assumed. Splitting the lines into columns The delimiter argument Once the file is defined and open for reading, genfromtxt splits each non-empty line into a sequence of strings. Empty or commented lines are just skipped. The delimiter keyword is used to define how the splitting should take place. Quite often, a single character marks the separation between columns. For example, comma-separated files (CSV) use a comma (,) or a semicolon (;) as delimiter: data = "1, 2, 3\n4, 5, 6" np.genfromtxt(StringIO(data), delimiter=",") array([[1., 2., 3.], [4., 5., 6.]]) Another common separator is "\t", the tabulation character. However, we are not limited to a single character, any string will do. By default, genfromtxt assumes delimiter=None, meaning that the line is split along white spaces (including tabs) and that consecutive white spaces are considered as a single white space. Alternatively, we may be dealing with a fixed-width file, where columns are defined as a given number of characters. In that case, we need to set delimiter to a single integer (if all the columns have the same size) or to a sequence of integers (if columns can have different sizes): data = " 1 2 3\n 4 5 67\n890123 4" np.genfromtxt(StringIO(data), delimiter=3) array([[ 1., 2., 3.], [ 4., 5., 67.], [890., 123., 4.]]) data = "123456789\n 4 7 9\n 4567 9" np.genfromtxt(StringIO(data), delimiter=(4, 3, 2)) array([[1234., 567., 89.], [ 4., 7., 9.], [ 4., 567., 9.]]) The autostrip argument By default, when a line is decomposed into a series of strings, the individual entries are not stripped of leading nor trailing white spaces. This behavior can be overwritten by setting the optional argument autostrip to a value of True: data = "1, abc , 2\n 3, xxx, 4" # Without autostrip np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5") array([['1', ' abc ', ' 2'], ['3', ' xxx', ' 4']], dtype=',<. excludelist Gives a list of the names to exclude, such as return, file, print… If one of the input name is part of this list, an underscore character ('_') will be appended to it. case_sensitive Whether the names should be case-sensitive (case_sensitive=True), converted to upper case (case_sensitive=False or case_sensitive='upper') or to lower case (case_sensitive='lower'). Tweaking the conversion The converters argument Usually, defining a dtype is sufficient to define how the sequence of strings must be converted. However, some additional control may sometimes be required. For example, we may want to make sure that a date in a format YYYY/MM/DD is converted to a datetime object, or that a string like xx% is properly converted to a float between 0 and 1. In such cases, we should define conversion functions with the converters arguments. The value of this argument is typically a dictionary with column indices or column names as keys and a conversion functions as values. These conversion functions can either be actual functions or lambda functions. In any case, they should accept only a string as input and output only a single element of the wanted type. In the following example, the second column is converted from as string representing a percentage to a float between 0 and 1: convertfunc = lambda x: float(x.strip("%"))/100. data = "1, 2.3%, 45.\n6, 78.9%, 0" names = ("i", "p", "n") # General case ..... np.genfromtxt(StringIO(data), delimiter=",", names=names) array([(1., nan, 45.), (6., nan, 0.)], dtype=[('i', '