
[–]guepier 117 points118 points  (16 children)

I disagree with this incredibly strongly. I use Python extensively, and I mostly like it, but whenever I need to do data analysis I bend over backwards to avoid Pandas. Mostly this means using R instead. Pandas is nowhere near the state of the art of data analytics. Even Python has better libraries (namely, Polars). Pandas is atrociously slow and has a terrible API. — And to head off potential responses: I have used Pandas extensively, and I am absolutely qualified to judge its merits compared to other solutions.

So, no, I disagree with the premise: there are lots of reasons to learn Python, but Pandas is emphatically not one of them.

[–]unski_ukuli 23 points24 points  (0 children)

This. I vehemently hate how pandas likes to throw stuff into the index. Polars is nice because it has no index, is fast, and has a logical API. Also, immutability is a nice addition.

[–]ZirePhiinix 20 points21 points  (1 child)

Pandas has really weird syntax that is nearly impossible to remember. There are seemingly random differences in behavior depending on how the data is structured, and I always have to Google like crazy to figure it out.

[–]Breadinator 3 points4 points  (0 children)

Thank you. The syntax is esoteric and drives me nuts sometimes. I sometimes have to guess WTH is happening by starting at the references made and working backwards on intent. Then pulling out my LLM assistant anyway and still asking it what the convoluted thing actually is.

[–]florinp 24 points25 points  (0 children)

Polars was written as a replacement for Pandas.

[–]QuickQuirk 5 points6 points  (0 children)

I've just started learning R, and am pleasantly surprised.

I mean, I like Python, in general, but R matches my preferences more. The things in Pandas that feel like they're bending over backwards to make work are a natural part of R itself. And it's also more on the functional language side of things, which I appreciate.

Helps that the tensorflow support seems pretty good these days too, for ML.

[–]SV-97 17 points18 points  (9 children)

AFAIK pandas has actually improved quite a bit with its most recent major release. I haven't checked it out yet since polars is just so good and I doubt that even this new version of pandas is as nice as polars; but I think it *is* substantially better than it used to be.

And personally I'd take even old pandas over R any day. The dev experience with R is just atrocious.

[–]huge_clock 2 points3 points  (0 children)

Pandas keeps getting worse for the simple things you want to use it for. I used to be able to take a data frame and go df.sum() and get the sum of each numerical category. Now the same operation will concatenate every string object in the data frame.
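For anyone hitting the same thing, a minimal sketch of one way to keep the sum restricted to numeric columns, assuming a reasonably recent pandas. The column names and values are made up for illustration; `numeric_only` and `select_dtypes` are existing pandas options, but whether they match the old default exactly for your data is something to check yourself.

```python
import pandas as pd

# Hypothetical data: one string column and two numeric columns.
df = pd.DataFrame(
    {"name": ["a", "b"], "price": [1.5, 2.5], "qty": [2, 3]}
)

# Ask explicitly for numeric columns only, so string columns are skipped.
numeric_totals = df.sum(numeric_only=True)

# Alternative with the same intent: pick the numeric columns first, then sum.
selected_totals = df.select_dtypes(include="number").sum()

print(numeric_totals)
print(selected_totals)
```

Both calls leave the `name` column out of the result instead of reducing it.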

[–]guepier 12 points13 points  (7 children)

> The dev experience with R is just atrocious

Yes, but the data analytics experience isn’t. R is miles ahead of Python in that space, and not just because of the libraries.

[–]SV-97 6 points7 points  (6 children)

If you only consider ecosystem size for your "data analytics experience" (and equate data analysis with mostly calculating statistics on some data): sure, for the most part that's true.

However when taking a more holistic view (i.e. setting up a dev environment in the first place, data extraction and cleaning, actually getting data in and out of the system, data exploration, writing core analyses and debugging those, publishing, ...) this isn't really true in my opinion. In my experience you end up wasting so much time dealing with all those pain points and idiosyncrasies around R that it's altogether faster to use Python and just implement the things that don't already exist yourself (although this is of course not viable for everyone) or interop with other languages for those parts.

And in particular when you don't do anything overly niche (as is really the case for what OP is talking about here) the python ecosystem is perfectly workable, and in some fields even miles ahead of R. For example for me a lot of data analysis involves optimization and some geometry processing / a bunch of maths. And for those python really has the strictly better ecosystem and larger community.

[–][deleted]  (4 children)

[deleted]

    [–]PillowFortressKing 1 point2 points  (0 children)

    Because doing iterative data analysis in a compiled language is even worse. Grabbing a Python package that's written in a high performance language gives you the best of both worlds.

    [–]youcangotohellgoto 4 points5 points  (2 children)

    If someone is worried about speed why are they reaching for python at all?

    Of course neither Pandas nor Polars are really "Python" - that's just the API to a C or Rust implementation.

    [–]guepier 1 point2 points  (0 children)

    I’m definitely not just considering the ecosystem size, I’m also considering ergonomics of the other aspects you mention. I agree that setting up a reproducible dev environment in R is frustrating. And “getting data in and out” of the system can be more convoluted than in Python, depending on the type of the data and storage and/or ingress/egress mechanism (e.g. JSON data, or data hosted on S3: botocore/s3fs is vastly better than anything R has to offer). But, honestly, in most cases it’s seamless.

    I disagree with the rest: data extraction, cleaning, exploration, core analysis and publishing are all things that R excels at. Troubleshooting is occasionally made harder due to the lack of any type safety, but type annotations are also much less helpful for data analysis than for most other software engineering applications. And interactive debugging (and, importantly, interactive exploration of data) works very well.

    And modern R IDE integration (be it via dedicated IDEs such as Positron or RStudio, or via plugins such as Nvim-R or ESS) provides best-in-class interactive data exploration REPLs, and these integrate very well with report generation via Quarto, which in many regards is also strictly superior to Jupyter (but if you prefer the latter, there is an R kernel for it).

    [–]ManySugar5156 2 points3 points  (0 children)

    same, pandas is usually the thing i avoid first. polars or r feels less annoying most of the time

    [–]HiPhish 26 points27 points  (2 children)

    Pandas has an atrociously un-pythonic API that makes me hate it to its core. I guess you have to use it if you are dealing with large amounts of data, but otherwise just give me regular lists and dicts. Pandas feels too much like "magic" where things just work until they don't. The documentation is pretty bad as well, it's as if you are meant to study the examples and then form a mental model of how the API works on your own. Oh, and good luck finding out what the data types are and dealing with Pandas's automatic type conversion.

    At least that was the case last time I had to use it. Maybe it has gotten better since, but I have no desire to come back.

    [–]squashed_fly_biscuit 7 points8 points  (1 child)

    Mainly because pandas is trying to be like R, which is a pretty weird language with strange norms written by and for scientists

    [–]WannaBeStatDev 2 points3 points  (0 children)

    At least R is good for science :)

    [–]billsil 31 points32 points  (0 children)

    I wrote a tool with straight numpy and it’s 50x faster than the pandas implementation. Pandas is severely overused and that’s before you start talking about polars, which is basically fast pandas.

    [–]RedEyed__ 4 points5 points  (0 children)

    ~~pandas~~ polars

    [–]zemega 6 points7 points  (0 children)

    I would say, if you need a little operation here and there, pandas is fine. But if you are serious, use polars.

    [–]turbothy 2 points3 points  (0 children)

    If you don't know Pandas by now, count your lucky stars and pick up something actually useful instead.

    [–]lood9phee2Ri 1 point2 points  (7 children)

    I mean, I don't actually mind pandas particularly, but another thing you can do - if you want - is use sqlalchemy against a transient in-memory sqlite. Then use the same sqlalchemy stuff directly, as you would against real database. Faster than you might think (in-memory, duh).

    import sqlalchemy
    sql_engine = sqlalchemy.create_engine('sqlite+pysqlite:///:memory:')
    with sql_engine.connect() as sql_conn:
        sql_result = sql_conn.execute(sqlalchemy.text("SELECT 'Hello, World!';"))
        print(sql_result.all())
    

    =>

    [('Hello, World!',)]
    

    Anyway.
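To make the idea a bit more concrete, here is a rough sketch of a pandas-style group-and-sum done in SQL against a transient in-memory SQLite database. It uses the standard-library `sqlite3` module instead of SQLAlchemy so that it runs without extra installs; the `sales` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample rows: (category, amount).
rows = [("fruit", 3.0), ("fruit", 4.5), ("veg", 2.0)]

# ":memory:" gives a transient database that never touches disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Roughly what a group-by followed by a sum would do in a data frame library.
totals = conn.execute(
    "SELECT category, SUM(amount) FROM sales "
    "GROUP BY category ORDER BY category"
).fetchall()
conn.close()

print(totals)
```

The same query text should work unchanged through `sqlalchemy.text(...)` on the engine shown above.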

    [–]dannuic 0 points1 point  (0 children)

    I tend to reach for duckdb to create memory models in programs (and before that, I used sqlite). I've never needed to replace those implementations due to poor performance and it lets me get down to writing actual program logic faster. SQL, even in limited implementations, is way better than trying to roll your own memory model every single time. Plus it's really easy to pickle state.

    [–]huge_clock -3 points-2 points  (5 children)

    I would like to see a test. SQLite’s SQL implementation is incredibly limited and in-memory databases are notoriously slow, and I have tested it.

    [–]elh0mbre 2 points3 points  (2 children)

    >  in-memory databases are notoriously slow

    wut? do you mean products that are specifically designed as in memory databases? Otherwise, "in memory" is as fast as a database gets.

    [–]huge_clock -2 points-1 points  (1 child)

    Single file “serverless” databases that run server side requests on the client machine. Notably SQLite but also MSAccess and DuckDB. Incredibly poor performance for business analytics. Might be fine for a small website with a limited number of users.

    [–]lood9phee2Ri 0 points1 point  (0 children)

    > Single file “serverless” databases

    Are you sure you were testing in transient in-memory mode? "Single-file" kind of suggests you weren't, and are misunderstanding things - if you're hitting a file on persistent storage, of course it's slower than in-memory, even ssd/nvme is still slower than ram for now.

    https://sqlite.org/inmemorydb.html

    > An SQLite database is normally stored in a single ordinary disk file. However, in certain circumstances, the database might be stored in memory.

    > The most common way to force an SQLite database to exist purely in memory is to open the database using the special filename ":memory:".

    https://duckdb.org/docs/current/connect/overview#in-memory-database

    > DuckDB can operate in in-memory mode. In most clients, this can be activated by passing the special value :memory: as the database file
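One quick way to check which mode a connection is actually in, sketched with the standard-library `sqlite3` module: `PRAGMA database_list` reports the backing file for each attached database. The helper name `backing_file` and the file name `example.db` are made up for this illustration.

```python
import os
import sqlite3
import tempfile


def backing_file(conn):
    """Return the file SQLite reports for the main database of `conn`."""
    # Each row of PRAGMA database_list is (seq, name, file).
    for _seq, name, file in conn.execute("PRAGMA database_list"):
        if name == "main":
            return file
    return None


# Transient in-memory database: no file should be reported.
mem_conn = sqlite3.connect(":memory:")
mem_file = backing_file(mem_conn)
mem_conn.close()

# File-backed database in a temporary directory, for comparison.
with tempfile.TemporaryDirectory() as tmp:
    disk_conn = sqlite3.connect(os.path.join(tmp, "example.db"))
    disk_file = backing_file(disk_conn)
    disk_conn.close()

print(repr(mem_file))
print(repr(disk_file))
```

If the first value is non-empty in your own benchmark setup, the test was hitting persistent storage and not RAM.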

    [–]Ralwus 3 points4 points  (1 child)

    Duckdb is incredibly fast. What have you tested that was slow?

    [–]lood9phee2Ri 0 points1 point  (0 children)

    Vaguely worth noting in context that despite it also having its own official python binding, there's also an SQLAlchemy driver for it

    https://duckdb.org/docs/current/clients/python/overview

    https://pypi.org/project/duckdb-sqlalchemy/

    =>

    from sqlalchemy import create_engine
    sql_engine = create_engine("duckdb:///:memory:")
    

    [–]elh0mbre 1 point2 points  (1 child)

    Not a reason to reach for python or pandas, IMO.

    I would reach for SQL, if it's all in one DB. If it's in microservices, I'd either be looking to consolidate the data for reporting like this in a data warehouse, or stitch the data together myself in a service (given that dotnet is my daily driver, LINQ would replace pandas aptly for me) if I have a good reason for it to not come from a warehouse (low latency requirements, as one example).

    I still don't understand the fascination with microservices, nor do I understand a lot of people's aversion to learning/understanding SQL. /shrug

    [–]dannuic 1 point2 points  (0 children)

    SQL has an incredibly stable and sensible syntax, but still gets constant improvement under the hood (especially if you're using postgres). I have no idea why software developers are so afraid of just learning SQL to do anything with data, either.